This directory contains a large patch with numerous functionality enhancements to Palacios. Together, these enhancements add NUMA guest support. This work has been provided by Alexander Kudryavtsev (a.o.kudryavtsev@gmail.com) of the Institute for System Programming of the Russian Academy of Sciences (ISP-RAS: http://ispras.ru/en/). Note that this patch may also depend on previous patches within contrib/isp-ras, contrib/MSI Support, and contrib/quix86-decoder. The V3VEE project is thankful for these patches, and we are in the process of incorporating them. If you are planning to apply these patches to the devel branch, you should be aware that, as of 3/23/11, the quix86 decoder and several of the isp-ras and MSI Support patches have already been incorporated.

Alexander's original description is included below:

======

I am attaching a patch that enables E820, SEABIOS, NUMA, and large (>3.5GB) memory support in Palacios. Below I will try to explain some key points about this patch.

1. It is now possible to select the BIOS type from the config menu. The SEABIOS files are located in the bios/seabios folder and include the binary, the configuration used to build it, and the patch required to run SEABIOS inside a Palacios VM. VGABIOS is the same for both BIOSes. To support SEABIOS, some changes were made inside NVRAM, including setting the amount of high (>4GB) memory and the number of AP cores. Currently SEABIOS occupies 128KB of memory, and its code should be mapped to the end of the 4GB address space, so it starts at 0xe0000 rather than 0xf0000 as BOCHS does. I am not sure whether this will affect something like vmxassist; it should be checked. It is also possible that 64KB of code is sufficient for the first MB; I hope to check that later. A new component is Firmware Configuration (vmm_fw_cfg.c), which is used to provide some useful data about the system to SEABIOS. It was initially taken from QEMU and adapted for our case. It now performs some setup for the BOCHS BIOS too. With SEABIOS, the MPTABLE device is disabled, since this BIOS generates all tables by itself.

2. Memory configuration is changed so that guest memory is described as a list of chunks, for example one chunk of 3072 MB and one of 4096 MB. This means that we should try to allocate 3072 MB and 4096 MB from the host to use as guest memory (see the illustrative configuration sketch in section 4 below). The old format is also supported. Each chunk is allocated from the contiguous memory pool, but different chunks may not be contiguous in memory. This helps to deal with the 4GB memory hole (and other memory holes in upper memory). Chunks are allocated in the same order as they are listed in the configuration. Note that allocated chunks may get non-consecutive addresses, so we sort the chunks after allocating them. A special case is when we use shadow paging: in that case we cannot use host memory above 4GB for guest memory below 4GB, since 32-bit mode page tables cannot map such addresses. We check for this case after allocating and sorting chunks and free the host memory which cannot be used. Note that because of the shadow paging issue, I had to move the determine_paging_mode call from pre_config_core into pre_config_vm. The quick fix for now is to duplicate the "shdw_pg_mode" and "flags" fields in struct v3_vm_info; they are initialized by determine_paging_mode and later used to set the same fields in struct guest_info. Also note that checkpoints (vmm_checkpoint.c) are broken with chunks, since the memory image becomes sparse and should be stored accordingly.

3. The E820 memory map is always generated according to the memory chunk configuration. E820 is supported by both the BOCHS BIOS and SEABIOS, and it partially replaces the base memory region. The map is constructed according to the allocated chunks, with the necessary holes between them. There is a tricky issue with the 4GB memory hole in the guest address space: some chunks may overlap with the hole, but we can determine the hole's bounds only after all devices are initialized. This is the reason for the additional v3_finalize_e820 function, which calculates the hole and cuts it out of the E820 map. We also need to know the amount of contiguous memory in the low and high (>4GB) address spaces; this data is required by NVRAM. We calculate this amount early, when initializing E820, because NVRAM later uses it to set its memory fields. But we can create the final E820 table only after we initialize all devices, so there may be some inconsistency.

4. NUMA support. SEABIOS can generate ACPI tables, including the SRAT table that describes the NUMA configuration. The data is passed in via the vmm_fw_cfg interface. NUMA configuration is specified in the VM config: nodes are declared at the top level, cores can be attached to nodes (by default a core goes to node 0), and memory chunks can be attached to nodes in the same way (for example, a 4096 MB chunk placed on a particular node); a rough sketch of such a configuration is shown below. Note that the guest needs ACPI/ACPI_NUMA support to see the NUMA configuration.
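As an illustration only, a configuration combining memory chunks and NUMA node assignments might look roughly like the sketch below. The element and attribute names used for the new settings (in particular the per-chunk "region", "size", and "node" names) are assumptions made for this sketch, not names taken from the patch, so the syntax accepted by the patched configuration parser may differ:

    <!-- Hypothetical sketch: the element and attribute names used for the
         chunk/node settings are illustrative guesses, not necessarily the
         patch's real configuration syntax. -->
    <vm class="PC">
        <!-- Two guest memory chunks: 3072 MB on NUMA node 0 and 4096 MB on
             node 1. The old single-value form (e.g. <memory>1024</memory>)
             is described as still being supported. -->
        <memory>
            <region size="3072" node="0" />
            <region size="4096" node="1" />
        </memory>

        <!-- Four cores: two attached to node 0 and two to node 1. A core
             with no node given would default to node 0. -->
        <cores count="4">
            <core node="0" />
            <core node="0" />
            <core node="1" />
            <core node="1" />
        </cores>
    </vm>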
There are a few problems with NUMA configuration which I tried to solve. First, after allocating and sorting chunks, I check that node memory does not interleave, since an interleaved layout is incorrect. If this check fails, the user has to change the configuration. For example, a configuration of three 512 MB chunks whose node assignments interleave (say, node 0, then node 1, then node 0 again) will likely fail. The next thing I have to do is make node numbers start from 0 and ascend according to the order of the nodes' memory. This is required by SEABIOS interface restrictions and possibly by the ACPI standard itself. Another requirement is that node 0 must have some memory, so I rearrange node numbers such that nodes with memory start from number 0 and increase according to memory order, and the nodes without memory get their numbers after all nodes with memory. I tried to make all of this as flexible as possible, so that you can write, for example, chunks of 512, 1024, 1024, 128, 512, 128, and 128 MB spread across nodes in an arbitrary order, and it will work, with appropriate messages about node renumbering. If no node configuration is found, the SRAT is not generated and we get a one-socket guest. I also added a "target_node" field to the memory chunk configuration entry. It should make it possible to allocate memory from a specified host node, but this is currently not implemented, since the host interface would have to be extended to support it. With Kitten we could place chunks in the right order and with the right sizes to get memory from the selected nodes, because Kitten's memory allocation is simple and predictable.

5. More about vmm_fw_cfg and SEABIOS. They can be used to do a lot of things besides NUMA support. We can pass SMBIOS table entries, additional ACPI tables, and even files to SEABIOS via fw_cfg. We can also pass a specified kernel/initrd to SEABIOS and run it inside the VM the same way QEMU does (with the -kernel, -initrd, and -append parameters). Finally, we can pass a device's Option ROM when passing the device into the VM and allow SEABIOS to run it.

In closing, I want to note that, in my opinion, the patch is far from finished. It would be nice to receive some feedback and advice from Palacios developers about the code and the changes introduced to the architecture. Due to time limitations I tested it only with Kitten as the host; I hope to test it with a Linux host later as well. I also only used QEMU as the target machine. Later I will test it on real hardware; to do that, I need to build a simple Linux guest with busybox and ACPI/NUMA support.

Alexander