Mathy Vanhoef: December 2013

As ARM is becoming more and more popular, the need to reverse engineer ARM binaries is increasing as well. For example, nearly all mobile phones have at least one ARM processor. In this post I show how to set up a virtual ARM environment using Qemu, give an introduction to ARM assembly (while highlighting the differences with x86), show how to reverse ARM binaries, and finally demonstrate how to write basic exploits for ARM. We will use the trafman challenge of rwthCTF as an example.

Virtual ARM Environment

To start we need an environment capable of running ARM binaries. Since I didn't have an ARM machine I created a virtual ARM environment using Qemu. Qemu is similar to VirtualBox or VMWare, except that it can support multiple architectures. This allows you the emulate ARM on your default x86 or x64 machine.

First we need to know which ARM architecture to pick. Most Linux distributions support two architectures: armel and armhf. Armel supports the ARMv4 instruction set and emulates floating point operations in software, while armhf supports the ARMv7 instruction set and uses hardware floating point operations. At least that's the case for Debian, Ubuntu uses the term "armel" differently [Ubuntu FAQ, ARM FOSDEM]. In this post I will stick to Debian. Though I haven't tested this myself, it should be possible to run armel binaries on an armhf system (this should be true for both Ubuntu and Debian). For completeness, there were also the 'arm' and 'armeb' architectures, but they are no longer supported and don't appear to be used anymore. We will use the armhf architecture, in particular Debian Wheezy armhf.

Unlike x86/PC systems, ARM systems typically don't have a BIOS. This means nothing initializes the hardware, reads the first sector of the disk, or finds where to execute code from. Instead, Qemu allows you to start a system using a "-kernel" option. It loads the given binary into memory at 0x6001000 and uses a kernel calling convention to pass the commandline and the location of initrd [#IO ARM Challenge]. In practice this means we'll always have to start Qemu using a -kernel and -initrd option.

With that background we're ready to install Debian Wheezy armhf using Qemu. First create an empty directory for the virtual environment, create a virtual harddisk for Qemu, and download the appropriate initrd and vmlinux from the Debian FTP:

qemu-img create -f raw hda.img 3G
wget http://ftp.debian.org/debian/dists/wheezy/main/installer-armhf/current/images/vexpress/netboot/initrd.gz
wget http://ftp.debian.org/debian/dists/wheezy/main/installer-armhf/current/images/vexpress/netboot/vmlinuz-3.2.0-4-vexpress

Note that vexpress stands for the system we will emulate, namely a Versatile Express board. We will emulate it with a cortex-a9 processor, so we pass vexpress-a9 as an argument to Qemu. Now start the installer under Qemu as follows:

qemu-system-arm -M vexpress-a9 -kernel vmlinuz-3.2.0-4-vexpress -initrd initrd.gz -append "root=/dev/mmcblk0" -drive if=sd,cache=unsafe,file=hda.img

Follow the instructions. A graphical environment is not required. After the installation we have to extract the installed initrd image, otherwise (if you use the downloaded initrd) the installer will again boot. To mount the virtual filesystem we first have to find the proper offset. Execute "file hda.img". The output should include something like "startsector 2048". The offset is now 512*2048. In case your offset is different than 2048 you will have to update the numbers in the next commands:

mkdir mountdir

mount -o loop,offset=$((512*2048)) hda.img mountdir/

cp mountdir/initrd.img-3.2.0-4-vexpress .

umount mountdir/

And now you can boot your virtual ARM environment with:

qemu-system-arm -M vexpress-a9 -kernel vmlinuz-3.2.0-4-vexpress -initrd initrd.img-3.2.0-4-vexpress -append "root=/dev/mmcblk0p2" -drive if=sd,cache=unsafe,file=hda.img

You should now have a working Qemu environment.

One warning: Qemu is still an emulator. When testing rare edge cases of exotic ARM instructions (which do not occur much in practice) there's always a chance the emulator is wrong.

Network Interface in Qemu

If the program you want to reverse engineer is a network service, it is convenient to be able to connect to it from your host system. This allows you to use tools on your host system to connect and exploit the binary (while debugging it in the guest system). This is especially convenient if Qemu is slow.

To enable network support we must specify a network device (NIC) that will be emulated. And we must specify the network backend that Qemu should use to interact with the emulated NIC. By default Qemu emulates an Intel e1000 PCI card with a user-mode network stack. User mode networking is great for allowing access to network resources, including the Internet. By default, however, it acts as a firewall and does not permit any incoming traffic. It also doesn't support protocols other than TCP and UDP. For example, ping won't work [WikiBooks].

The first option to allow incoming traffic is port redirection. This allows you to redirect a port on the host OS to a port on the guest OS. For debugging a simple service this is sufficient. As an example, redirecting TCP port 6666 on the host to port 8080 on the guest is done as follows:

qemu-system-arm -M vexpress-a9 -kernel vmlinuz-3.2.0-4-vexpress -initrd initrd.img-3.2.0-4-vexpress -append "root=/dev/mmcblk0p2" -drive if=sd,cache=unsafe,file=hda.img -redir tcp:6666::8080

You can now run a service on port 8080 in the Qemu guest, and connect to it from the host using port 6666. Multiple ports can be redirected. Port redirection can also be used to share folders between the guest and the host [WikiBooks]. The downside is that you cannot dynamically add new redirects once Qemu is running. Additionally, only TCP and UDP are supported.

The second option is to create a TAP interface [Tun/Tap Intro, QemuWiki]. It offers very good performance and can be configured to create virtually any type of network topology. This creates a new virtual interface on your host system, and you can use this interface to communicate with the guest. To start Qemu using a TAP interface as network backend, use the following command:

qemu-system-arm -M vexpress-a9 -kernel vmlinuz-3.2.0-4-vexpress -initrd initrd.img-3.2.0-4-vexpress -append "root=/dev/mmcblk0p2" -drive if=sd,cache=unsafe,file=hda.img -net nic -net tap,ifname=qtap0

Unfortunately you will have to manually configure IP addresses (e.g. setup a DHCP server, configure DNS, configure IP forwarding/NAT, etc). I will not go into detail here. If you don't need an internet connection and only need basic communication between the host and guest, execute the following commands in the guest:

ip addr add 172.20.0.2/24 dev eth0
route add default gw 172.20.0.1 eth0

I prefer port redirection because it's easier to configure and is usually sufficient to get the job done.

Introduction to ARM Assembly

Simplified, an ARM processor can be in two modes: ARM mode and Thumb mode. When the CPU is in ARM mode it executes 32 bit ARM instructions, and when it is in Thumb mode it executes mixed 16- and 32-bit Thumb instructions. The purpose of Thumb mode is to improve code density. How to switch between ARM and Thumb mode will be explained shortly.

In both ARM and Thumb mode we have access to 16 32-bit registers called r0..r15. Special registers are:

r12: intra procedure register (ip)
r13: stack pointer (sp)
r14: link register (lr)
r15: program counter (pc)

Register r12 is sometimes used as an Intra Procedure (ip) call scratch register, meaning it holds intermediate values when a subroutine is being called. So if you see the ip register in a disassembly listing, it does not stand for instruction pointer. Finally, subroutines must preserve the contents of r4..r11 and the stack pointer r13. A few short examples:

sub sp, #8 ; sp -= 8 // allocate stack memory
add r7, sp, #0 ; r7 = sp + 0
add r7, #0 ; r7 += 0
str r0, [r7, #4] ; *(r7 + 4) = r0 // dereference ptr
str r0, [r7, #0] ; *(r7 + 0) = r0
nop ; No Operation
movw r0, #33848 ; r0[0:16] = 0x8438 // move to bottom
movt r0, #1 ; r0[16:32] = 0x1 // move to top
; Now r0 contains 0x18438

Here r7 is used as a frame pointer to access local variables or function arguments. Also note that certain instructions (like add) can have three arguments (unlike the common two arguments in x86). In order to move a 32bit word in a register we need two instructions: one to set the bottom half of the register, and one to set the upper half of the register. This is a consequence of the RISC architecture of ARM: we have simple instructions that execute quickly, but must be more verbose to get stuff done.

What we called jumps in x86 lingo are called branches in ARM. For example "jmp offset" on x86 becomes "b addr" in ARM. Function calls from x86 are also called branches, in particular "call func" from x86 becomes "bl addr" in ARM. Here BL stands for Branch and Link, meaning the program counter r15 is set to addr and the link register r14 is set to the return address. This highlights an important difference with x86: in ARM the return address is not pushed on the stack by the caller. Instead, it it always saved in the link register (lr). The callee is responsible for saving the return address (generally the callee will save it on the stack).

The instructions bx, blx, and bjx can be used to switch processor mode from ARM to Thumb (or reverse). There are two use cases: the target is a label or the target is a register [ARM]. In the case the target is a label, the instruction is of the form "blx addr", and the mode is always switched. In case the target is specified using a register, e.g. "blx lr", the mode will be set according to the least significant bit (LSB) of the register. Note that all ARM and Thumb instructions must be 16-bit (two bytes) aligned, thus the least significant bit (LSB) of function addresses is in fact always zero. An even address means the target code is an ARM instruction, an uneven address means it is Thumb code. For example, when the register contains the value 0x8001 the processor will switch to Thumb mode and start executing at 0x8000. The same is true when returning from a function: if the return address is even it will switch to ARM, if it is uneven it will switch to Thumb mode. Returning can be done using "pop {pc}" or "bx lr". Both instructions will set the processor context accordingly. Note that branch instructions without an "X" in them do not change processor mode.

One practical consequence of ARM vs. Thumb addresses is that we often see references to "symbol+1", indicating that the symbol is referring to a function in Thumb mode. Additionally we have to use correct addresses in our exploit: Your debugger might say a function starts at address 0x8000, but if it is a function using Thumb instructions we must actually use the address 0x8001 (depending on context). Finally we note that we can inspect the current mode in gdb by using the "i r cpsr" command. Bit 0x20 will be set if we are in Thumb mode, i.e. Thumb mode is set if (cpsr & 0x20) != 0.

Let's end with the second half of the previous example:

blx 0x82ec ; branch to 0x82ec (and switch mode)
bl 0x8398 ; branch to 0x8298 (don't switch mode)
add r12, r12, #8, 20 ; r12 = 0x8000
; 0x8000 = RotateRight(8, 2*20) [ref]
mov sp, r7 ; set stack pointer to original value
pop {r7, pc} ; restore r7 and return to caller

The first instruction branches to address 0x82ec and switches the current processor mode, with the return address saved in the lr register. In case we were executing Thumb code this return address will be uneven, otherwise it will be even. The second instruction branches to 0x8398 without switching mode (there is no x in the branch instruction). Hence the function will be executed in the mode the processor is currently in. We also have an interesting add instruction: it adds RotateRight(8, 2*20) to r12 (which is equal to 0x8000). For more background on how this constant is encoded read the ARM documentation.

If you want to understand ARM in more detail, a fun way to learn it is by making an ARM intro challenge. Essentially that means displaying cool graphics using only ARM code. The full background on how to get started is on #IO STS. Perhaps luckily, a detailed understanding is not required to reverse engineer a program and follow this guide.

Reversing the Trafman Challenge

As an example we will reverse engineer the Trafman challenge from rwthCTF (for the binary to work properly you need to create a directory called "db" next to the binary). We open the binary in IDA and see that it failed to disassemble the first instructions:

This is because start is located at the uneven address 0x880D. In other words the binary starts executing in Thumb mode at address 0x880C, but IDA thinks it starts executing at 0x880D. We fix this by selecting address 0x880C, pressing ALT+G, and changing the value to 1 (this tells IDA that Thumb code is located at this address). Now again select address 0x880C and press C to directly convert the section to code.

We see a call which initializes libc. The pointer to the main function is stored in r0. Hence main() is located at address 0x87E0 and is Thumb code. So go to address 0x87E0, press ALT+G, set value to 1, and repeat until all instructions are in Thumb. Now select all the instructions belonging to the main function and press P to create a function of this code block. We have to manually select all the instructions in the function because otherwise IDA fails to create the function for us. Rename the function to main.

With this starting point you can analyze what the binary is doing. Also remember to interact as much as possible with the application before looking at assembler code. In fact, I found one vulnerability by simply entering unusual inputs.

The binary gets its input and output using stdin and stdout, respectively. During the CTF a new executable was spawned for every new connection and its input/output was forwarded. On startup the binary asks for a username. Our first goal is to find a valid username.

We want to locate the code responsible for checking the username. To do this we find where the error message is referenced. So, open the Strings subview and double click on the "ERR Invalid User" string. Now select the name of the string (aErrInvalidUser) and press X to find all references to the string. Only one function is referencing this string, open it. After a little bit of reverse engineering we get the following:

The ASCII string "traffic_operator" is loaded and compared to the string entered by the user. Based on the result of this comparison we either get the error message or are successfully logged in. That solves our first problem: the username is traffic_operator.

Once logged in the user has three options:

Get command: the user enters a file name of exactly 40 alphanumeric characters. If this file was previously added (see next point) its content is loaded and displayed to the user.
Execute command: the user enters a file name of at least 40 alphanumeric characters. It creates a file with as name the 40 first characters and allows to user to write arbitrary content to it.
Exit.

After testing these functions we find a buffer overflow when getting a command (i.e. when reading a file). To trigger the overflow we write a large file using the second option, and then read it using the first option. In my virtual ARM environment I used the following commands to test and trigger the vulnerability:

ARM host: nc -c ./trafman -l -p 8000
Guest OS: perl -e 'print "traffic_operator\n2\n"."a"x40 ."\n"."A"x400 ."\n1\n"."a"x40 ."\n"' | nc localhost 8000

We will analyze the crash in more detail in the next section. First, we reverse engineer the option menu. It's easy to locate this code because it's contained in the only function that main() calls. So open main and double click on the only non-library function being called. The text-based interface shows only three options. However, when looking at the code, we see there is a hidden option. When entering 23 the address of printf is displayed:

Notice that r0 is set to 0x8FC4 which is a pointer to the ASCII string "> %p". Register r7 is initialized in the beginning of the function to 0x12010. This address contains the dynamically loaded address of printf. Hence, it will tell us where libc is located in memory.

Getting Control of the Program Counter

We know what to do to crash the program. But we don't yet know what is causing the crash, which code is responsible, and whether it's exploitable. To answer these questions we run the binary in a debugger. Executing the following commands allows us to connect to the service from our host OS yet still debug the trafman binary itself:

gbc nc
set follow-fork-mode child
r -c ./trafman -l -p 8000

We now trigger the crash with:

perl -e 'print "traffic_operator\n2\n"."a"x40 ."\n"."A"x400 ."\n1\n"."a"x40 ."\n"' | nc localhost 8000

This results in a segmentation fault because it's trying to execute code at 0x41414141. Executing "x/20x $sp" in gdb to dump the stack shows that it's likely a stackoverflow vulnerability. With this alone we know enough to write a functional exploit. But since we're learning ARM we will take a look at the code that is responsible for the buffer overflow. We open the function that is executed when the user selects option 1 in the menu and reverse it.

What happens here is that each byte in the file gets read using fgetc until we are at the end of the file. In every loop the new byte is saved to an array on the stack, which will eventually overflow. The function returns to the caller using a "pop {pc}" instruction. An interesting instruction here is "adds r0, #1". It adds one to the register r0, and updates the status (a.k.a. condition) flags according to the resulting value. In particular, if the result is zero (meaning r0 was -1), the zero flag will be set. Using the BNE instruction we test for the zero flag. If it is set we exit the loop. Note that more instructions can set the status flag this way, and they can be recognized using the S suffix.

So we indeed have a classic buffer overflow (without a stack canary). The return address is saved on the stack and popped on the function exit. Our first step in exploiting the binary is to find out where we have to put the return address in our file (remember that the content of the file is written to the stack and parts of it will overflow the saved return address). To quickly accomplish this we use the pattern_create tool from metasploit. We again start the binary in Qemu as follows (I won't repeat this again):

gbc nc
set follow-fork-mode child
r -c ./trafman -l -p 8000

On our host we execute the following commands:

PATRN=$(/usr/share/metasploit-framework/tools/pattern_create.rb 400)
perl -e 'print "traffic_operator\n2\n"."a"x40 ."\n"."'$PATRN'" ."\n1\n"."a"x40 ."\n"' | nc localhost 8000

We let pattern_create.rb generate a unique pattern of 400 characters. As an example, a unique pattern of 20 characters would be "Aa0Aa1Aa2Aa3Aa4Aa5Aa". If we now know which bytes in this pattern get loaded in the program counter, we can let pattern_offset.rb calculate the offset where the return address is located in our file. In gdb we see it segfaults with the program counter at 0x41326a40. So we find the offset with the following command:

/usr/share/metasploit-framework/tools/pattern_offset.rb 41326a40

This returns 276. So we have to put the return address 276 bytes into the payload. As a quick test, execute the following command:

perl -e 'print "traffic_operator\n2\n"."a"x40 ."\n"."A"x276 ."ABCD\n1\n"."a"x40 ."\n"' | nc localhost 8000

This will segfault with the program counter at "ABCD". We now need to point the program counter to code that will get us a shell. Unfortunately the binary uses both ASLR and NX, meaning it uses random addresses, and we can't execute data on the stack and/or heap.

Defeating ASLR and NX

Our goal is a return-to-libc attack. In particular we will attempt to execute system("/bin/sh"). Defeating ASLR is trivial due to the hidden menu option which prints the address of printf. Based on the address of printf we can locate the addresses of other variables and functions within the libc library. To get the address of printf and system for a specific run we execute "p printf" and "p system" in gdb. This gives:

system: 0x76f2aa38
printf: 0x76f347d4

There is one catch: gdb says these functions are located at an even address. However, the functions are actually Thumb code! So make sure that the correct processor mode is used when executing these functions (if necessary make the addresses uneven). Anyway, the hidden menu option returns an uneven address, hence to get the address of system we take the address of printf and substract 0x9D9C from it.

So we can call system, but still need to pass an argument to it. There are two steps to accomplish this, first we need to find out where in memory the string "/bin/sh" is located, then we need to find a way to put a pointer to this string into register r0 (recall that r0 contains the first argument of a function).

Finding a "/bin/sh" string it easy. Internally the system() function uses it as well, so it's located in the libc library. To quick find it execute the following in gdb:

find &system, +99999999, "/bin/sh"

In my particular instance this returned 0x76fc6528. So to dynamically get the address of "/bin/sh" we take the address of printf and add 0x91D53 to it (again remember that the hidden option returns the uneven address). To load this pointer into r0 with NX enabled we will use return-oriented-programming (ROP).

We will use ROP gadgets located in the libc library, because that's the only library we know the addresses of. Within the library we have to find gadgets which set r0 and let us return to the system function. To do this I transferred libc.so.6 to the host and executed:

arm-linux-objdump -d libc.so.6 | grep "pop.*r0.*pc"

To install the arm-linux-* tools see IO SmashTheStack. There are likely better tools to find usable gadgets, but this quick and dirty method got the job done. Important is to search for both Thumb and ARM gadgets! Forcing Thumb mode can be done using -Mforce-thumb. The most interesting gadget found was:

5a7bc: pop {r0, r4, pc}

Remark that this is ARM instruction. In objdump printf is located at 0x387d4. I found this by executing:

arm-linux-objdump -d libc.so.6 | grep "printf>:"

So to dynamically get the location of the gadget we take the location of printf and add 0x21FE8 to it. Good, we now have everything in place to trigger the overflow and make it execute system("/bin/sh"). Combining all findings results in the following exploit, where I wrote the nclib myself and its functionality should be self-explanatory. When executing the exploit it looks like this:

DONE. FINALLY! :D

Mathy Vanhoef

Pages

Sunday, 15 December 2013

Reversing and Exploiting ARM Binaries: rwthCTF Trafman