QEMU, a Fast and Portable Dynamic Translator

引言

Qemu是以前做实验用过的一个虚拟化的工具。Qemu最大的一个特点就是它的动态翻译的机制，处理这些之外，还要负责模拟一些其它的设备，比如网络设备、块设备等。

... It emulates several CPUs (x86, PowerPC, ARM and Sparc) on several hosts (x86, PowerPC, ARM, Sparc, Alpha and MIPS). QEMU supports full system emulation in which a complete and unmodified operating system is run in a virtual machine and Linux user mode emulation where a Linux process compiled for one target CPU can be run on another CPU.

这篇Paper是一篇比较短的文章，只有6页，比如动辄十多页的Paper少多了。

基本思路

在一个CPU的机器上运行另外一个架构的CPU的软件一种常用的方法就是模拟执行这些指令。很显然这样做的缺点就是其极低性能。Qemu为的就是解决这个性能的问题，它使用的方式就将这些CPU的指令动态翻译为运行机器上面的指令。每一个目标CPU的指令都会被分割成一些micro operations的操作。每一个micro operations都是一小段的C代码实现的，这些C源代码又会被GCC编译成目标文件。然后在运行的时候一个dyngen的工具会使用前面生产的目标文件来对执行的文件进行翻译。下面以一个PowerPC上面的例子来说明它是如何被翻译为x86上面运行的代码的，

addi r1,r1,-16   # r1 = r1 - 16

这条指令会被拆分为三个micro operations，

movl_T0_r1 # T0 = r1
addl_T0_im -16 # T0 = T0 - 16
movl_r1_T0 # r1 = T0

以第一个微操作为例，它对应的实现为：

void op_movl_T0_r1(void)
{
    T0 = env->regs[1];
}

对于实现加法的为操作，它有一个固定操作量constant parameter，会在运行的时候决定，

extern int __op_param1;
void op_addl_T0_im(void)
{
    T0 = T0 + ((long)(&__op_param1));
}

一小段翻译的逻辑就是下面这样，可以看作是这些微操作组合的操作，

//[...]
for(;;) {
  switch(*opc_ptr++) {
  //[...]
  case INDEX_op_movl_T0_r1: {
    extern void op_movl_T0_r1();
    memcpy(gen_code_ptr, (char *)&op_movl_T0_r1+0,3);
    gen_code_ptr += 3;
    break;
  }
  case INDEX_op_addl_T0_im: {
    long param1;
  	extern void op_addl_T0_im();
  	memcpy(gen_code_ptr, (char *)&op_addl_T0_im+0, 6);
  	param1 = *opparam_ptr++;
  	*(uint32_t *)(gen_code_ptr + 2) = param1;
  	gen_code_ptr += 6;
  	break;
}
// [...]
} }
//[...] 

然后生成的代码大概就是下面这样，

# movl_T0_r1
# ebx = env->regs[1]
mov    0x4(%ebp),%ebx
# addl_T0_im -16
# ebx = ebx - 16
add    $0xfffffff0,%ebx
# movl_r1_T0
# env->regs[1] = ebx
mov    %ebx,0x4(%ebp)
# On x86, T0 is mapped to the ebx register and the CPU state context to the ebp register.

Implementation details

Dyngen

对QEMU可以看出Dyngen是其的一个核心的组件。对前面的目标文件，它要做下面的一些工作：

解析object file得到符号表；
定位微操作。这里使用的方法是利用拷贝这些已经生成的机器码来生成。对于微操作函数的进入的操作和退出操作是不必要的；
获取参数的信息；
拷贝微操作；

一些其它的处理，比如在ARM上面处理常量的问题，

For some hosts such as ARM, constants must be stored near the generated code because they are accessed with PC relative loads with a small displacement. A host specific pass is done to relocate these constants in the generated code.

QUMU动态生成代码的时候是基于Translated Blocks (TBs)的。对于寄存器分配，使用的是固定分配的方式，这意味着目标CPU的寄存器是固定地映射到运行CPU的某个寄存器or内存位置的，这样做的好处是简单以及易于移植。QEMU也可以使用动态分配的方式来提高性能。其它的几个内容：

分支代码优化，在实际的CPU中，比如eflag中的信息是执行中CPU自动设置的，这样对性能没什么允许。在QEMU中就要处理这些flag。这里为了提高性能，使用的方法是lazy condition code evaluation，不死每一条指令之后的flag都会被使用，只在需要的时候才处理就可以了。
Direct block chaining，TB被执行的时候QEMU使用了一个模拟的PC，如果执行的之后跳到了下一个TB。而这个TB还没有被翻译，这个时候就需要先翻译。为了优化性能，这个可以通过修改TB的方式实现直接跳转，虽然比间接调整兼容性要差，但是性能更加好；

Memory management，QEMU使用mmap syscall来模拟一个MMU，软件管理。也使用了一些优化方式，

To avoid flushing the translated code each time the MMU mappings change, QEMU uses a physically indexed translation cache. It means that each TB is indexed with its physical address.// 使用间接

Exception support，Hardware interrupts，User mode emulation这些的讨论可以参考[1].

评估

这里具体可以参考[1]

User mode QEMU (version 0.4.2) was measured to be about 4 times slower than native code on integer code. On floating point code, it is 10 times slower. This can be understood as a result of the lack of the x86 FPU stack pointer in the static CPU state. In full system emulation, the cost of the software MMU induces a slowdown of a factor of 2.
In full system emulation, QEMU is approximately 30 times faster than Bochs [4].

参考

QEMU, a Fast and Portable Dynamic Translator, ATC’05.