This directory contains a large patch with numerous functionality 
enhancements to Palacios.   These enhancements add NUMA guest support 
to Palacios.

This work has been provided by Alexander Kudryavtsev
(a.o.kudryavtsev@gmail.com) of the Institute for System Programming of
the Russian Academy of Sciences (ISP-RAS: http://ispras.ru/en/).  Note
that this patch may also depend on previous patches within contrib/isp-ras,
contrib/MSI Support, and contrib/quix86-decoder.   

The V3VEE project is thankful for these patches, and we are in the
process of incorporating them.  If you are planning to apply these
patches to the devel branch, you should be aware that as of 3/23/11,
the quix86 decoder and several of the isp-ras and MSI Support patches
have already been incorporated.

Alexander's original description is included below:

======

I attach the patch which enables E820, SEABIOS, NUMA and large (>3.5GB)
memory support in Palacios. Below I will try to explain some key moments
about this patch.

1. It is now possible to select BIOS type from config menu. SEABIOS
files are located at bios/seabios folder and include binary,
configuration to build it, and patch required to run SEABIOS inside
Palacios's VM. VGABIOS is the same for both BIOSes.
To support SEABIOS, some changes are made inside NVRAM, including
setting the amount of high (>4GB) memory and the number of AP cores.

Currently, SEABIOS occupies 128KB of memory, and its code should be
mapped to the end of 4GB address space. So it starts from 0xe0000 and
not 0xf0000 as BOCHS. I'm not sure whether it will affect something like
vmxassist, it should be checked. Also it is possible that 64KB of code
is sufficient for first MB, I hope to check it later.
The new thing is Firmware Configuration (vmm_fw_cfg.c), it is used to
provide some useful data about system to SEABIOS. Initially it was taken
from QEMU and adopted for our case. Now it performs some setup for BOCHS
BIOS too.
With SEABIOS, MPTABLE device is disabled since this BIOS generates all
tables by self.

2. Memory configuration is changed in the following way:
<memory> <chunk>3072</chunk> <chunk>4096</chunk> </memory>
This means that we should try to allocate 3072 and 4096 MB from host to
use as a guest memory. Old format is also supported. Each chunk is
allocated from contiguous memory pool, but different chunks may be not
contiguous in memory. This helps to deal with 4GB memory hole (and other
memory holes in upper memory). Chunks are allocated in the same order as
they are listed in configuration. Note that allocated chunks may get
non-consequent addresses, so we sort chunks after allocating them.
A special case is when we use shadow paging. In this case, we cannot use
host memory above 4GB for guest memory below 4GB since 32bit mode page
tables cannot map such addresses. We check this case after allocating
and sorting chunks and free host memory which cannot be used.
Note that because of the shadow paging issue, I had to move the
determine_paging_mode call into pre_config_vm from pre_config_core. The
quick fix for now is to duplicate "shdw_pg_mode" and "flags" fields in
struct v3_vm_info, which are initialized by determine_paging_mode and
used to set the same fields in struct guest_info later.
Also note that checkpoints (vmm_checkpoint.c) are broken with chunks,
since memory image becomes sparse and should be stored accordingly.

3. E820 memory map is always generated according to memory chunk
configuration. E820 is supported by both bochs and sea BIOSes. E820
partially replaces the base memory region. It is constructed according
to allocated chunks, with necessary holes between them. There is a
tricky thing with 4GB memory hole in guest address space: some chunks
may overlap with the hole, but we can determine this hole bounds only
after all devices are initialized. This is the reason for additional
v3_finalize_e820 function which calculates the hole and cuts it from E820.
Also we need to know the amount of contiguous memory in low and high
(>4GB) address spaces. This data is required by NVRAM. We calculate this
amount early, when initializing E820, because later it will be used by
NVRAM to set memory fields. But we can create final E820 table only
after we initialize all devices, so there may be some inconsistency.

4. NUMA support. SEABIOS can generate ACPI tables, including SRAT table
describing NUMA configuration. The data is passed inside via vmm_fw_cfg
interface. NUMA configuration is possible using the following config:
<nodes count="4"></nodes> -  at top level
Cores can be attached to nodes, by default core goes to node 0:
<cores count="2"><core node="3"/><core node="1"/></cores>
Memory chunks can be attached to nodes in the same way:
<chunk node="0">4096</chunk>

Note that the guest should have ACPI/ACPI_NUMA support to see the NUMA
configuration.

There are a few problems with NUMA configuration, which I tried to
solve. At first, after allocating and sorting chunks, I check that node
memory is not interleaving since it is incorrect. If something fails,
the user have to change the configuration. For example, the following
config will likely fail:
<chunk node="1">512</chunk>
<chunk node="0">512</chunk>
<chunk node="1">512</chunk>
The next thing I have to do is to make node numbers starting from 0 and
ascending according to node memory order. This is required by SEABIOS
interface restrictions and, possibly, by ACPI standard itself. Another
requirement is that node 0 must have some memory, so I rearrange node
numbers such that nodes with memory start from number 0 and increase
according to memory order, and after all nodes with memory, the nodes
without memory get their numbers. I tried to make all this stuff as
flexible as possible, so that you can write:
<memory alignment="2MB">
<chunk node="8">512</chunk>
<chunk node="3">1024</chunk>
<chunk node="5">1024</chunk>
<chunk node="5">128</chunk>
<chunk node="6">512</chunk>
<chunk node="7">128</chunk>
<chunk node="0">128</chunk>
</memory>
<nodes count="10"></nodes>
<cores count="2"><core node="2"/><core node="0"/></cores>
and it will work, with appropriate messages about node renaming.
If no node configuration is found, SRAT will not be generated and we
will have one-socket guest.
Also I added the field "target_node" to memory chunk configuration
entry. It should allow to allocate memory from specified host node, but
currently it is not implemented since host interface should be extended
to support it. With Kitten we could locate chunks in right order with
right sizes to get memory from selected nodes, because Kitten memory
allocation is simple and predictable.

5. More about vmm_fw_cfg and SEABIOS. They can be used to do a lot of
things among NUMA support. We can pass SMBIOS table entries, additional
ACPI tables, even files to SEABIOS via fw_cfg. Also we can pass
specified kernel/initrd to SEABIOS and run it inside VM in the same way
as QEMU does it (with -kernel -initrd -append parameters). Finally, we
can pass device's Option ROM when passing device inside VM and allow
SEABIOS to run it.


Finally I want to note that the patch state is far from finished, from
my opinion. It would be nice to receive some feedback and advice from
Palacios developers about the code and introduced changes to
architecture. I tested it only with Kitten as a host, due to time
limitations - hope that later I will test it with Linux host too. Also I
only used QEMU as a target machine. Later I will test it on the real
hardware, I need to build simple Linux guest with busybox and ACPI/NUMA
support to do it.


Alexander