QEMU Internal: Memory Region, Address Space and QEMU IO

Introduction

In this post, I introduce two significant data structures in QEMU: MemoryRegion and AddressSpace. Building on them, I give more details on memory initialization in QEMU and on address_space_rw, which is, from my perspective, the core function of QEMU. Finally, I give examples to explain how STDIO and MMIO (memory-mapped I/O) work.
Before reading this post, I strongly recommend reading /qemu/docs/memory.txt first. It gives a basic view of what I discuss here.

Background: MemoryRegion and AddressSpace

First of all, I give the definitions of two significant data structures in QEMU. Both appear throughout this post. From my point of view, MemoryRegion is responsible for handling operations on the emulated target memory, while AddressSpace describes the mapping from addresses to the MemoryRegion objects reachable from its root.

struct MemoryRegion {
    Object parent_obj;
    /* All fields are private - violators will be prosecuted */
    const MemoryRegionOps *ops;
    const MemoryRegionIOMMUOps *iommu_ops;
    void *opaque;
    MemoryRegion *container;
    Int128 size;
    hwaddr addr;
    void (*destructor)(MemoryRegion *mr);
    ram_addr_t ram_addr;
    uint64_t align;
    bool subpage;
    bool terminates;
    bool romd_mode;
    bool ram;
    bool skip_dump;
    bool readonly; /* For RAM regions */
    bool enabled;
    bool rom_device;
    bool warning_printed; /* For reservations */
    bool flush_coalesced_mmio;
    bool global_locking;
    uint8_t vga_logging_count;
    MemoryRegion *alias;
    hwaddr alias_offset;
    int32_t priority;
    bool may_overlap;
    QTAILQ_HEAD(subregions, MemoryRegion) subregions;
    QTAILQ_ENTRY(MemoryRegion) subregions_link;
    QTAILQ_HEAD(coalesced_ranges, CoalescedMemoryRange) coalesced;
    const char *name;
    uint8_t dirty_log_mask;
    unsigned ioeventfd_nb;
    MemoryRegionIoeventfd *ioeventfds;
    NotifierList iommu_notify;
};

/**
 * AddressSpace: describes a mapping of addresses to #MemoryRegion objects
 */
struct AddressSpace {
    /* All fields are private. */
    struct rcu_head rcu;
    char *name;
    MemoryRegion *root;

    /* Accessed via RCU.  */
    struct FlatView *current_map;

    int ioeventfd_nb;
    struct MemoryRegionIoeventfd *ioeventfds;
    struct AddressSpaceDispatch *dispatch;
    struct AddressSpaceDispatch *next_dispatch;
    MemoryListener dispatch_listener;

    QTAILQ_ENTRY(AddressSpace) address_spaces_link;
};

MemoryRegion holds a pointer to another structure, MemoryRegionOps. Its function pointers read and write are the callbacks that handle I/O operations on the emulated memory.

/*
 * Memory region callbacks
 */
struct MemoryRegionOps {
    /* Read from the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    uint64_t (*read)(void *opaque,
                     hwaddr addr,
                     unsigned size);
    /* Write to the memory region. @addr is relative to @mr; @size is
     * in bytes. */
    void (*write)(void *opaque,
                  hwaddr addr,
                  uint64_t data,
                  unsigned size);

    MemTxResult (*read_with_attrs)(void *opaque,
                                   hwaddr addr,
                                   uint64_t *data,
                                   unsigned size,
                                   MemTxAttrs attrs);
    MemTxResult (*write_with_attrs)(void *opaque,
                                    hwaddr addr,
                                    uint64_t data,
                                    unsigned size,
                                    MemTxAttrs attrs);

    enum device_endian endianness;
    /* Guest-visible constraints: */
    struct {
        /* If nonzero, specify bounds on access sizes beyond which a machine
         * check is thrown.
         */
        unsigned min_access_size;
        unsigned max_access_size;
        /* If true, unaligned accesses are supported.  Otherwise unaligned
         * accesses throw machine checks.
         */
         bool unaligned;
        /*
         * If present, and returns #false, the transaction is not accepted
         * by the device (and results in machine dependent behaviour such
         * as a machine check exception).
         */
        bool (*accepts)(void *opaque, hwaddr addr,
                        unsigned size, bool is_write);
    } valid;
    /* Internal implementation constraints: */
    struct {
        /* If nonzero, specifies the minimum size implemented.  Smaller sizes
         * will be rounded upwards and a partial result will be returned.
         */
        unsigned min_access_size;
        /* If nonzero, specifies the maximum size implemented.  Larger sizes
         * will be done as a series of accesses with smaller sizes.
         */
        unsigned max_access_size;
        /* If true, unaligned accesses are supported.  Otherwise all accesses
         * are converted to (possibly multiple) naturally aligned accesses.
         */
        bool unaligned;
    } impl;

    /* If .read and .write are not present, old_mmio may be used for
     * backwards compatibility with old mmio registration
     */
    const MemoryRegionMmio old_mmio;
};
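
To make the callback contract concrete, here is a minimal sketch of a device that supplies read and write callbacks of the same shape as MemoryRegionOps. Everything prefixed with Toy/toy_ is hypothetical and only mimics the QEMU types so that the sketch compiles on its own; it is not QEMU code.

```c
#include <stdint.h>

/* Hypothetical stand-ins for QEMU's types, so the sketch builds on its own. */
typedef uint64_t hwaddr;

typedef struct ToyOps {
    uint64_t (*read)(void *opaque, hwaddr addr, unsigned size);
    void (*write)(void *opaque, hwaddr addr, uint64_t data, unsigned size);
} ToyOps;

/* A toy device with eight one-byte registers. */
typedef struct ToyDevice {
    uint8_t regs[8];
} ToyDevice;

/* addr is relative to the start of the region, as in MemoryRegionOps. */
static uint64_t toy_read(void *opaque, hwaddr addr, unsigned size)
{
    ToyDevice *d = opaque;
    (void)size;                       /* this toy only models 1-byte access */
    return addr < sizeof(d->regs) ? d->regs[addr] : 0xff;
}

static void toy_write(void *opaque, hwaddr addr, uint64_t data, unsigned size)
{
    ToyDevice *d = opaque;
    (void)size;
    if (addr < sizeof(d->regs)) {
        d->regs[addr] = (uint8_t)data;
    }
}

/* The opaque pointer handed to the callbacks is the device state itself. */
static const ToyOps toy_ops = {
    .read = toy_read,
    .write = toy_write,
};
```

Calling toy_ops.write(&dev, 3, 0x57, 1) followed by toy_ops.read(&dev, 3, 1) returns the stored byte; the opaque argument plays the same role as the opaque field of MemoryRegion.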

Memory Initialization

To explain memory initialization in QEMU, I must first introduce the four global variables below:

static MemoryRegion *system_memory;
static MemoryRegion *system_io;

AddressSpace address_space_io;
AddressSpace address_space_memory;

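system_memory is the root MemoryRegion that emulates the memory space of the guest machine. system_io is the root MemoryRegion that emulates the guest's I/O port space. address_space_memory and address_space_io are the AddressSpace objects built on top of these two roots.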

static void memory_map_init(void)
{
    system_memory = g_malloc(sizeof(*system_memory));

    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
    address_space_init(&address_space_memory, system_memory, "memory");

    system_io = g_malloc(sizeof(*system_io));
    memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
                          65536);
    address_space_init(&address_space_io, system_io, "I/O");
}

Function memory_map_init is invoked directly by cpu_exec_init_all (defined in exec.c), which runs during QEMU startup. It initializes system_memory, address_space_memory, system_io and address_space_io.

RAM Initialization

After this initialization, system_memory is only an empty container; it is not used as the RAM of the guest machine directly. Instead, RAM is set up by pc_init1 in pc_piix.c, which calls pc_memory_init.

/* PC hardware initialisation */
static void pc_init1(MachineState *machine)
{
    PCMachineState *pc_machine = PC_MACHINE(machine);
    MemoryRegion *system_memory = get_system_memory();
    MemoryRegion *system_io = get_system_io();
    //some code
    /* allocate ram and load rom/bios */
    if (!xen_enabled()) {
        pc_memory_init(machine, system_memory,
                       below_4g_mem_size, above_4g_mem_size,
                       rom_memory, &ram_memory, guest_info);
    } else if (machine->kernel_filename != NULL) {
        /* For xen HVM direct kernel boot, load linux here */
        xen_load_linux(machine->kernel_filename,
                       machine->kernel_cmdline,
                       machine->initrd_filename,
                       below_4g_mem_size,
                       guest_info);
    }
    //some other code
}

FWCfgState *pc_memory_init(MachineState *machine,
                           MemoryRegion *system_memory,
                           ram_addr_t below_4g_mem_size,
                           ram_addr_t above_4g_mem_size,
                           MemoryRegion *rom_memory,
                           MemoryRegion **ram_memory,
                           PcGuestInfo *guest_info)
{
    int linux_boot, i;
    MemoryRegion *ram, *option_rom_mr;
    MemoryRegion *ram_below_4g, *ram_above_4g;
    FWCfgState *fw_cfg;
    PCMachineState *pcms = PC_MACHINE(machine);

    assert(machine->ram_size == below_4g_mem_size + above_4g_mem_size);

    linux_boot = (machine->kernel_filename != NULL);

    /* Allocate RAM.  We allocate it as a single memory region and use
     * aliases to address portions of it, mostly for backwards compatibility
     * with older qemus that used qemu_ram_alloc().
     */
    ram = g_malloc(sizeof(*ram));
    memory_region_allocate_system_memory(ram, NULL, "pc.ram",
                                         machine->ram_size);
    *ram_memory = ram;
    ram_below_4g = g_malloc(sizeof(*ram_below_4g));
    memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", ram,
                             0, below_4g_mem_size);
    memory_region_add_subregion(system_memory, 0, ram_below_4g);
    //some other code
}
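
The alias mechanism used by memory_region_init_alias can be sketched as a simple address translation: one backing RAM block, plus windows that forward guest addresses to offsets inside it. This is a toy model under my own assumptions; ToyAlias and toy_alias_translate are hypothetical names, and the real implementation resolves aliases while building the flat view rather than on every access.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of an alias: a window of `size` bytes that
 * appears at guest address `guest_base` and forwards to `alias_offset`
 * inside one backing RAM block. */
typedef struct ToyAlias {
    uint64_t guest_base;
    uint64_t alias_offset;
    uint64_t size;
} ToyAlias;

/* Translate a guest physical address into an offset within the backing RAM.
 * Returns false when the address falls outside the window. */
static bool toy_alias_translate(const ToyAlias *a, uint64_t guest_addr,
                                uint64_t *ram_offset)
{
    if (guest_addr < a->guest_base || guest_addr - a->guest_base >= a->size) {
        return false;
    }
    *ram_offset = a->alias_offset + (guest_addr - a->guest_base);
    return true;
}
```

With a window { 0, 0, below_4g_mem_size }, which corresponds to the ram-below-4g alias added at address 0 above, guest address 0x1000 maps to offset 0x1000 of pc.ram. A second window that starts at a higher guest address with alias_offset equal to below_4g_mem_size would model the remaining RAM in the same way.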

Function address_space_rw

Although I already mentioned the definition of address_space_rw in my write-up on HITB BabyQEMU, I put the source together with the declaration here.

/**
 * address_space_rw: read from or write to an address space.
 *
 * Return a MemTxResult indicating whether the operation succeeded
 * or failed (eg unassigned memory, device rejected the transaction,
 * IOMMU fault).
 *
 * @as: #AddressSpace to be accessed
 * @addr: address within that address space
 * @attrs: memory transaction attributes
 * @buf: buffer with the data transferred
 * @is_write: indicates the transfer direction
 */
MemTxResult address_space_rw(AddressSpace *as, hwaddr addr,
                             MemTxAttrs attrs, uint8_t *buf,
                             int len, bool is_write);


MemTxResult address_space_rw(AddressSpace *as, hwaddr addr, MemTxAttrs attrs,
                             uint8_t *buf, int len, bool is_write)
{
    hwaddr l;
    uint8_t *ptr;
    uint64_t val;
    hwaddr addr1;
    MemoryRegion *mr;
    MemTxResult result = MEMTX_OK;
    bool release_lock = false;

    rcu_read_lock();
    while (len > 0) {
        l = len;
        mr = address_space_translate(as, addr, &addr1, &l, is_write);

        if (is_write) {
            if (!memory_access_is_direct(mr, is_write)) {
                release_lock |= prepare_mmio_access(mr);
                l = memory_access_size(mr, l, addr1);
                /* XXX: could force current_cpu to NULL to avoid
                   potential bugs */
                switch (l) {
                case 8:
                    /* 64 bit write access */
                    break;
                case 4:
                    /* 32 bit write access */
                    break;
                case 2:
                    /* 16 bit write access */
                    break;
                case 1:
                    /* 8 bit write access */
                    break;
                default:
                    abort();
                }
            } else {
                addr1 += memory_region_get_ram_addr(mr);
                /* RAM case */
                ptr = qemu_get_ram_ptr(addr1);
                memcpy(ptr, buf, l);
                invalidate_and_set_dirty(mr, addr1, l);
            }
        } else {
            if (!memory_access_is_direct(mr, is_write)) {
                /* I/O case */
                release_lock |= prepare_mmio_access(mr);
                l = memory_access_size(mr, l, addr1);
                switch (l) {
                case 8:
                    /* 64 bit read access */
                    break;
                case 4:
                    /* 32 bit read access */
                    break;
                case 2:
                    /* 16 bit read access */
                    break;
                case 1:
                    /* 8 bit read access */
                    break;
                default:
                    abort();
                }
            } else {
                /* RAM case */
                ptr = qemu_get_ram_ptr(mr->ram_addr + addr1);
                memcpy(buf, ptr, l);
            }
        }

        if (release_lock) {
            qemu_mutex_unlock_iothread();
            release_lock = false;
        }

        len -= l;
        buf += l;
        addr += l;
    }
    rcu_read_unlock();
    return result;
}

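This function invokes address_space_translate to find the target MemoryRegion in the AddressSpace. For a region that is not directly accessible RAM, memory_region_dispatch_read or memory_region_dispatch_write is then called with the chosen access size (these calls are elided in the switch cases above). For RAM, the data is simply copied with memcpy.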
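
The write half of this loop can be sketched in a few dozen lines. The model below is a simplification under stated assumptions: regions live in a flat array instead of a tree, the host is little-endian, and the access size is clamped only by a per-region maximum (the real memory_access_size also looks at alignment). All Toy/toy_ names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct ToyRegion {
    uint64_t base, size;
    uint8_t *ram;                 /* non-NULL: plain RAM, accessed directly */
    void (*write)(void *opaque, uint64_t addr, uint64_t val, unsigned size);
    void *opaque;
    unsigned max_access;          /* power of two; largest callback access */
} ToyRegion;

/* Example device callback: records every access it receives. */
typedef struct ToyLog {
    unsigned count;
    uint64_t addr[8], val[8];
    unsigned size[8];
} ToyLog;

static void toy_log_write(void *opaque, uint64_t addr, uint64_t val,
                          unsigned size)
{
    ToyLog *log = opaque;
    if (log->count < 8) {
        log->addr[log->count] = addr;
        log->val[log->count] = val;
        log->size[log->count] = size;
        log->count++;
    }
}

/* Stand-in for address_space_translate: linear search over the regions.
 * On success, *addr1 is region-relative and *l is clipped to the region. */
static ToyRegion *toy_translate(ToyRegion *rs, int n, uint64_t addr,
                                uint64_t *addr1, uint64_t *l)
{
    for (int i = 0; i < n; i++) {
        if (addr >= rs[i].base && addr - rs[i].base < rs[i].size) {
            uint64_t left = rs[i].size - (addr - rs[i].base);
            *addr1 = addr - rs[i].base;
            if (*l > left) {
                *l = left;
            }
            return &rs[i];
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if part of the range is unassigned. */
static int toy_address_space_write(ToyRegion *rs, int n, uint64_t addr,
                                   const uint8_t *buf, uint64_t len)
{
    while (len > 0) {
        uint64_t l = len, addr1 = 0;
        ToyRegion *mr = toy_translate(rs, n, addr, &addr1, &l);
        if (!mr) {
            return -1;
        }
        if (mr->ram) {
            memcpy(mr->ram + addr1, buf, l);          /* RAM case */
        } else {
            /* I/O case: largest power of two <= min(l, max_access) */
            unsigned size = mr->max_access ? mr->max_access : 1;
            while (size > l) {
                size >>= 1;
            }
            uint64_t val = 0;
            memcpy(&val, buf, size);     /* assumes a little-endian host */
            mr->write(mr->opaque, addr1, val, size);
            l = size;
        }
        len -= l;
        buf += l;
        addr += l;
    }
    return 0;
}
```

An 8-byte write to a device whose max_access is 4 reaches the callback as two 4-byte writes at consecutive offsets. The same effect is visible in the MMIO trace later in this post, where address_space_write_continue has len=8 but hitb_mmio_write is called with size=4.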

At this point, we have a general view of how memory is initialized in QEMU, and we know that every access eventually ends up in address_space_rw. However, we have not yet seen how these data structures are used by the emulator in practice.
In the next two sections, I show how STDIO and MMIO work with the allocated MemoryRegion and AddressSpace.

STDIO

In this section, I use the following code to explain how STDIO works.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <assert.h>
#include <inttypes.h>
 
int main()
{
	char *ptr;
	ptr = malloc(256);
	if (ptr == NULL)
		return 1;
	strcpy(ptr, "Where am I?");
	printf("%s\n", ptr);
	free(ptr);
	return 0;
}

After some debugging, we can find the place where the string is printed. Surprisingly, the string is written out one byte at a time.

Thread 3 "qemu-system-x86" hit Breakpoint 1, address_space_rw (as=0x560902a68980 <address_space_io>, addr=addr@entry=1016, attrs=attrs@entry=..., buf=buf@entry=0x7fcb6b10e000 "W\004\004\020", len=len@entry=1, is_write=is_write@entry=true) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c:3062
3062	in /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c
$1557 = 0x560902a68980
$1558 = 0x3f8
$1559 = 0x7fcb6b10e000
$1560 = 0x57
#0  address_space_rw (as=0x560902a68980 <address_space_io>, addr=addr@entry=1016, attrs=attrs@entry=..., buf=buf@entry=0x7fcb6b10e000 "W\004\004\020", len=len@entry=1, is_write=is_write@entry=true) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c:3062
#1  0x000056090213aba0 in kvm_handle_io (count=1, size=1, direction=<optimized out>, data=<optimized out>, attrs=..., port=1016) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/accel/kvm/kvm-all.c:1806
#2  kvm_cpu_exec (cpu=cpu@entry=0x5609030e3de0) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/accel/kvm/kvm-all.c:2046
#3  0x0000560902118704 in qemu_kvm_cpu_thread_fn (arg=0x5609030e3de0) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/cpus.c:1128
#4  0x00007fcb691f77fc in start_thread (arg=0x7fcb623d7700) at pthread_create.c:465
#5  0x00007fcb68f24b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Then we set a breakpoint right after the call to address_space_translate and inspect the MemoryRegion it returns:

Thread 3 "qemu-system-x86" hit Breakpoint 2, 0x00005609020e58fb in address_space_write (as=0x560902a68980 <address_space_io>, addr=1016, attrs=..., buf=0x7fcb6b10e000 "W\004\004\020", len=1) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c:2960
2960	in /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c
(gdb) p/x $rax
$1561 = 0x56090359aaf0
(gdb) p/x *(struct MemoryRegion*)($rax)
$1562 = {parent_obj = {class = 0x560903087260, free = 0x0, properties = 0x5609042e30c0Python Exception <class 'gdb.error'> There is no member named keys.: 
, ref = 0x1, parent = 0x56090359a990}, romd_mode = 0x1, ram = 0x0, subpage = 0x0, readonly = 0x0, rom_device = 0x0, flush_coalesced_mmio = 0x0, global_locking = 0x1, dirty_log_mask = 0x0, ram_block = 0x0, owner = 0x56090359a990, iommu_ops = 0x0, ops = 0x5609028e6100, opaque = 0x56090359aa20, container = 0x560903060800, size = 0x00000000000000000000000000000008, addr = 0x3f8, destructor = 0x560902128950, align = 0x0, terminates = 0x1, ram_device = 0x0, enabled = 0x1, warning_printed = 0x0, vga_logging_count = 0x0, alias = 0x0, alias_offset = 0x0, priority = 0x0, subregions = {tqh_first = 0x0, tqh_last = 0x56090359ab98}, subregions_link = {tqe_next = 0x5609042e2210, tqe_prev = 0x56090434d958}, coalesced = {tqh_first = 0x0, tqh_last = 0x56090359abb8}, name = 0x560904318e70, ioeventfd_nb = 0x0, ioeventfds = 0x0, iommu_notify = {lh_first = 0x0}, iommu_notify_flags = 0x0}
(gdb) x/s 0x560904318e70
0x560904318e70:	"serial"
(gdb) x/2gx 0x5609028e6100
0x5609028e6100 <serial_io_ops>:	0x0000560902265120	0x00005609022657c0
(gdb) x/i 0x0000560902265120
   0x560902265120 <serial_ioport_read>:	lea    0x268d89(%rip),%rax        # 0x5609024cdeb0
(gdb) x/i 0x00005609022657c0
   0x5609022657c0 <serial_ioport_write>:	push   %r13

From the region name and its ops, I can quickly locate the code that initializes the serial I/O.

SerialState *serial_init(int base, qemu_irq irq, int baudbase,
                         CharDriverState *chr, MemoryRegion *system_io)
{
    SerialState *s;
    Error *err = NULL;

    s = g_malloc0(sizeof(SerialState));

    s->irq = irq;
    s->baudbase = baudbase;
    s->chr = chr;
    serial_realize_core(s, &err);
    if (err != NULL) {
        error_report_err(err);
        exit(1);
    }

    vmstate_register(NULL, base, &vmstate_serial, s);

    memory_region_init_io(&s->io, NULL, &serial_io_ops, s, "serial", 8);
    memory_region_add_subregion(system_io, base, &s->io);

    return s;
}

We can clearly see that the “serial” MemoryRegion is a subregion of system_io, the root MemoryRegion of address_space_io.
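
The byte-by-byte behaviour seen in the trace can be modelled with a toy serial device. This is a sketch under my own assumptions: ToySerial and the toy_ functions are hypothetical, the port range is hard-coded instead of being looked up in a region tree, and only the transmit register at offset 0 is modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SERIAL_BASE 0x3f8   /* port seen in the trace above (addr=1016) */
#define TOY_SERIAL_SIZE 8       /* size passed to memory_region_init_io */

typedef struct ToySerial {
    char out[64];               /* bytes "transmitted" so far */
    size_t len;
} ToySerial;

/* Region-relative write callback: offset 0 is the transmit register. */
static void toy_serial_write(void *opaque, uint64_t addr, uint64_t val,
                             unsigned size)
{
    ToySerial *s = opaque;
    (void)size;
    if (addr == 0 && s->len + 1 < sizeof(s->out)) {
        s->out[s->len++] = (char)val;
        s->out[s->len] = '\0';
    }
}

/* Stand-in for an I/O-space access: route a port write to the device. */
static int toy_port_write(ToySerial *s, uint16_t port, uint8_t val)
{
    if (port < TOY_SERIAL_BASE || port >= TOY_SERIAL_BASE + TOY_SERIAL_SIZE) {
        return -1;                              /* unassigned port */
    }
    toy_serial_write(s, (uint64_t)(port - TOY_SERIAL_BASE), val, 1);
    return 0;
}

/* What a guest-side driver effectively does: one port write per byte. */
static int toy_puts(ToySerial *s, const char *str)
{
    int writes = 0;
    for (; *str; str++) {
        if (toy_port_write(s, TOY_SERIAL_BASE, (uint8_t)*str) == 0) {
            writes++;
        }
    }
    return writes;
}
```

Each call to toy_port_write corresponds to one address_space_rw hit with len=1 in the trace; the first byte of "Where am I?" is 'W', which is the 0x57 printed by gdb.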

MMIO

In this section, I use a simplified version of the BabyQEMU script to explain how MMIO works. Since part of MMIO overlaps with PCI devices, I do not cover how MMIO is initialized for a PCI device here. More details will be given in my next post on PCI devices.

#include <assert.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

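#define DMA_BASE 0x40000

/* Report a failed system call and exit. */
static void die(const char *msg)
{
	perror(msg);
	exit(1);
}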

unsigned char* iomem;
unsigned char* dmabuf;
uint64_t dmabuf_phys_addr;

uint64_t virt2phys(void* p)
{
	uint64_t virt = (uint64_t)p;
	// Assert page alignment
	assert((virt & 0xfff) == 0);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd == -1)
		die("open");
	// Each page has one 8-byte entry in pagemap
	uint64_t offset = (virt / 0x1000) * 8;
	lseek(fd, offset, SEEK_SET);

	uint64_t phys;
	if (read(fd, &phys, 8) != 8)
		die("read");
	close(fd);
	// Assert page present (bit 63)
	assert(phys & (1ULL << 63));
	// Bits 0-54 hold the page frame number
	phys = (phys & ((1ULL << 55) - 1)) * 0x1000;
	return phys;
}


void iowrite(uint64_t addr, uint64_t value)
{
	*((uint64_t*)(iomem + addr)) = value;
}

int main(int argc, char *argv[])
{
	int fd = open("/sys/devices/pci0000:00/0000:00:04.0/resource0", O_RDWR | O_SYNC);
	if (fd == -1)
		die("open");
	iomem = mmap(0, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (iomem == MAP_FAILED)
		die("mmap");
	
	printf("iomem @ %p\n", iomem);
	dmabuf = mmap(0, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (dmabuf == MAP_FAILED)
		die("mmap");
	mlock(dmabuf, 0x1000);
	dmabuf_phys_addr = virt2phys(dmabuf);
	iowrite(128, DMA_BASE+0x1000);
	return 0;
}
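
The pagemap arithmetic inside virt2phys can be isolated into two pure functions, which makes it easier to check. This is a sketch based on the layout described in the Linux pagemap documentation (bit 63 marks a present page, bits 0-54 hold the page frame number, hence a 55-bit mask); the toy_ names and TOY_PAGE_SIZE are hypothetical, and a 4 KiB page size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 0x1000ULL   /* assumes 4 KiB pages */

/* Decode one 64-bit /proc/self/pagemap entry.
 * Returns false if the page is not present (bit 63 clear). */
static bool toy_pagemap_decode(uint64_t entry, uint64_t *phys)
{
    if (!(entry & (1ULL << 63))) {
        return false;
    }
    /* Bits 0-54 are the page frame number. */
    *phys = (entry & ((1ULL << 55) - 1)) * TOY_PAGE_SIZE;
    return true;
}

/* Byte offset of the pagemap entry for a page-aligned virtual address:
 * one 8-byte entry per page. */
static uint64_t toy_pagemap_offset(uint64_t virt)
{
    return (virt / TOY_PAGE_SIZE) * 8;
}
```

Note that on recent kernels the page frame number reads as zero unless the process is privileged, so the program is expected to run as root inside the guest.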

With a helper debugging script, we can break at hitb_mmio_write and obtain the following information:

Thread 3 "qemu-system-x86" hit Breakpoint 1, address_space_rw (as=0x560902a68860 <address_space_memory>, addr=4271898752, attrs=attrs@entry=..., buf=buf@entry=0x7fcb6b10d028 "", len=8, is_write=true) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c:3062
3062	in /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c
$389 = 0x560902a68860
0x5609030bc0b0:	"memory"

Thread 3 "qemu-system-x86" hit Breakpoint 2, hitb_mmio_write (opaque=0x560904a73770, addr=128, val=266240, size=4) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/hw/misc/hitb.c:276
276	/mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/hw/misc/hitb.c: No such file or directory.
(gdb) bt 5
#0  hitb_mmio_write (opaque=0x560904a73770, addr=128, val=266240, size=4) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/hw/misc/hitb.c:276
#1  0x000056090212b4c8 in memory_region_write_accessor (mr=0x560904a74160, addr=128, value=<optimized out>, size=4, shift=<optimized out>, mask=<optimized out>, attrs=...) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/memory.c:528
#2  0x00005609021288cd in access_with_adjusted_size (addr=addr@entry=128, value=value@entry=0x7fcb623d69f8, size=size@entry=4, access_size_min=<optimized out>, access_size_max=<optimized out>, access=0x56090212b450 <memory_region_write_accessor>, mr=0x560904a74160, attrs=...) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/memory.c:594
#3  0x000056090212c92c in memory_region_dispatch_write (mr=mr@entry=0x560904a74160, addr=128, data=266240, size=size@entry=4, attrs=attrs@entry=...) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/memory.c:1334
#4  0x00005609020e5b09 in address_space_write_continue (mr=0x560904a74160, l=4, addr1=128, len=8, buf=0x7fcb6b10d028 "", attrs=..., addr=4271898752, as=0x560902a68860 <address_space_memory>) at /mnt/hgfs/eadom/workspcae/projects/hitbctf2017/babyqemu/qemu/exec.c:2904
(More stack frames follow...)
(gdb) p/x ((struct MemoryRegion*)(0x560904a74160))->name
$390 = 0x560904ab55b0
(gdb) x/s 0x560904ab55b0
0x560904ab55b0:	"hitb-mmio"

An important conclusion from the information above is that the “hitb-mmio” MemoryRegion is reached through address_space_memory, which means it sits in the MemoryRegion tree rooted at system_memory.
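
The relationship between the absolute address in the trace (addr=4271898752, i.e. 0xfea00080) and the region-relative address seen by the callback (addr=128) follows from the container and addr fields of MemoryRegion. The sketch below models only those two fields. The intermediate "pci" container and the exact nesting are my assumptions for illustration; only the resulting base, 0xfea00080 - 0x80 = 0xfea00000, is derived from the trace.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced MemoryRegion: just a name, an offset inside the
 * container, and a pointer to the container. */
typedef struct ToyMR {
    const char *name;
    uint64_t addr;               /* offset inside the container */
    struct ToyMR *container;     /* NULL for a root region */
} ToyMR;

/* Absolute base of a region: sum of the offsets up the container chain. */
static uint64_t toy_absolute_base(const ToyMR *mr)
{
    uint64_t base = 0;
    for (; mr != NULL; mr = mr->container) {
        base += mr->addr;
    }
    return base;
}

/* Root region, i.e. the one an AddressSpace must point at to reach `mr`. */
static const ToyMR *toy_root(const ToyMR *mr)
{
    while (mr->container != NULL) {
        mr = mr->container;
    }
    return mr;
}
```

With a chain such as system -> pci -> hitb-mmio, where hitb-mmio sits at offset 0xfea00000, toy_root returns the system region and subtracting toy_absolute_base from the traced address gives 128, the offset passed to hitb_mmio_write.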

Conclusion

In this post, I first explain two significant data structures in QEMU: MemoryRegion and AddressSpace. Secondly, I introduce four important global variables: system_memory, address_space_memory, system_io and address_space_io. Thirdly, I walk through one core function, address_space_rw. Finally, I give two examples to show how STDIO (through system_io) and MMIO (through system_memory) are initialized and how they work in QEMU.
