Mathy Vanhoef: 2012

Monday 5 November 2012

Common Pitfalls When Writing Exploits

When you're exploiting software (legally hopefully ;) there are some common problems you might encounter. In this post I'm going to focus on three specific problems. I'll assume you're already familiar with basic buffer overflows and have tried to write one before. Oh and even if you were successful in writing the exploit, maybe you encountered some annoyances that are addressed in this posts. Let's go!

My exploit only works under gdb?

A common question people ask is why their exploit works when running the target program in gdb, but why no longer works when the program is started normally. There's actually another variation of this question: people wonder why they didn't obtain elevated privileges when executing the exploit under gdb. I'll first explain the elevated privileges problem and then we'll address the original problem.

No elevated privileges

When you are exploiting a suid program (e.g., for local privilege escalation) your exploit may work under gdb, yet you don't obtain any new privileges. First and for all, a "suid program" is a program that a normal user can execute, but runs under root privileges (to be precise it runs as the user that owns the program). Such programs are marked with an "s" suid bit. For example, the passwd utility is a suid program:

root@bt:~# ls -l /usr/bin/passwd
-rwsr-xr-x 1 root root 37140 2011-02-14 17:11 /usr/bin/passwd

This makes perfect sense, as passwd has to be able to update /etc/passwd and /etc/shadow and this requires root privileges. As a side note this means that if we can exploit passwd we can elevate our privileges to those of root. To get back to our original problem: if we exploit a suid program under gdb we don't obtain elevated privileges. What's happening? Before we answer this question, one should first realize that this is actually wanted behavior! Otherwise we could simply open the suid binary in gdb and overwrite the current code using

set *(unsigned int)(address) = value

This way one could directly inject shellcode without exploiting anything. So being able to debug a suid binary as a less privileged user shouldn't be possible. Yet you seem to be debugging the suid binary anyway?! Well, when launching the targeted suid program using gdb no elevated privileges will be granted to the program. You can then debug the program, though exploiting it won't result in elevated privileges (since it was not given elevated privileges).

Different stack addresses

Another problem is that the stack addresses of variables, fields, pointers, etc. will change when the targeted program is debugged using gdb. Let's use the following program to investigate these stack differences:

When directly executing the program I got the values "env=0xbfffddbc arg=0xbfffddb4 esp=0xbfffdcfc" but when running it under gdb I got "env=0xbfffdd8c arg=0xbfffdd84 esp=0xbfffdccc". We notice that all the addresses have changed! Why did this happen? Well there's a reason the program also prints the environment variables :) Looking at the output we can see that our program was given different environment variables under gdb. These environment variables are saved on the stack. And if different environment variables are given the space required to save them will also be different. Because of these different space requirements, the stack addresses of variables saved on the stack will change. Looking at the stack layout in the simplified illustration below we see that this influence nearly all stack addresses under in a program:

lower addresses

esp

[argv]

[envp]

higher addresses

Remember that the stack grows towards lower addresses (in the illustration above it grows upwards).

One way to solve this is using the command "env -i ./program". This runs the program using an empty environment. However, when launching gdb using "env -i gdb ./program" and running the program, we notice that gdb still added some environment variables. Damn you gdb! One possible way to deal with this is to include these variables when directly executing the program using something like

env -i COLUMNS=97 PWD=/root LINES=29 SHLVL=0 /root/a.out

Note that gdb uses the full path to start the program, and that this path is given to the program in argv[0]. So we must also use the full path when directly running the program (since the arguments are also saved on the stack). Although the addresses are now the same, this is annoying to do manually all the time. Our approach also breaks a few bash-specific tricks because the SHELL variable is cleared (can be fixed by setting SHELL=/bin/bash). An easier solution is to use this script written by hellman. Directly running the program or debugging now becomes:

./r.sh ./a.out
./r.sh gdb ./a.out

Both runs will have the same stack addresses. Perfect!

Padding in structures, stack, etc.

This is really more of a remark. When given the source code of a program you know the general layout of structures and function stacks. However, you cannot predict actual offsets (i.e., the precise location of fields). The reason is that most compiles will add padding. This is done so that fields are 2-byte or 4-byte aligned (or anything else that your compiler deems appropriate). The introduced padding can be seemingly random. So while you can use the source code to quickly detect vulnerabilities, you should still disassemble the compiled binary to calculate the offsets.

A common question is then "why does my debugger add X amount of padding bytes?" A frequent answer would be for performance, which is hardware/processor dependent. The answer can change for different versions of the compiler as well. There's just no general answer here. Also, another thing the compiler can do is change the order of variables on the stack, say placing a function pointer before a buffer even though it's not declared like that in the source code. (This way overflowing the buffer won't affect the function pointer, and used in combination with stack canaries this is used to decrease the potential impact of buffer overflows.)

Placing a Suid Script or Suid Shell as Backdoor

Alright. Say you've successfully exploited a vulnerability. Then it can be convenient to create a backdoor to easily obtain elevated privileges at a later point in time. Two seemingly easy strategies would be to either create a suid script or copy the /bin/sh executable and making it suid. Unfortunately the first strategy is not possible on Linux, and the second strategy needs some special attention. The first strategy fails because even if the script file is marked with suid, the kernel doesn't grant elevated privileges when starting scripts. Let's confirm this behavior in detail by inspecting the linux kernel. Essentially we need to learn how the kernel starts script files. Remember that scripts are treated as real executables and can be started using something like

execve("/myscripts/somescript.sh", argv, envp);

where we assume somescript.sh starts with "#!" as usual.

So let's assume that a script is being started using the execve system call. This function is implemented in the do_execve_common function in the linux kernel. Essentially it checks for errors, fills in a so-called binary structure parameter, and then calls search_binary_handler [1]. The real work is being done in this last function, which consists of scanning a list of registered binary formats until a match is found, and then calling that handler. Scripts are detected by checking if the file starts with "#!". The handler for scripts is located in binfmt_script.c in the function load_script. In this handler you don't explicitly see something like "don't grant suid to script files". In fact you see no mention of the suid bit at all. But that's the point, suid is never granted in the script handler. On the other hand, if we look at the handler for ELF linux executable, we notice that suid is explicitly set using SET_UID and SET_GUID. [2] The reasons scripts are not run as suid is because it's too easy to write insecure suid scripts.

Now to address the second problem. First, on my machine /bin/sh is a symlink to /bin/bash, so the remaining discussion will be specific to bash. Anyway, as mentioned copying /bin/sh to something like hiddenshell and making it suid can be problematic: You'll notice that starting your copy called hiddenshell won't grant you a suid shell. This is because bash automatically drops its privileges when it's being run as suid (another security mechanism to prevent executing scripts as suid). Looking at the source code of bash confirms this:

We see one interesting global variable in the if test: privileged_mode. Looking further into the code one learns that this flag is set when the "-p" parameter is given. So starting your suid shell using the "-p" parameter won't drop privileges! That solves our problem. To create a backdoor we copy /bin/sh and make it suid. The backdoor shell must then be started with the "-p" parameter.

But even though we solved the original problem, a new question arose. That new question is: Why does calling "system(command)" work in a suid binary. That is, when a normal suid binary calls the system function, that supplied command is also executed as being a suid program. Remember that system(command) will fork the process and then use execve to run "/bin/sh -c command". If bash always drops privileges, the command shouldn't be executed as suid! Let's first look at the code of the system() function in the glibc library:

What's different from our situation? It may seem pretty silly, but the difference is the name of the executable. Yes, try renaming hiddenshell to sh and then execute it again. Now you will get a suid shell even without supplying the "-p" parameter. Apparently my bash installation doesn't drop privileges when it's started using the "sh" command, probably to preserve backwards compatibility. Interestingly this is not behavior defined in the original source code of bash. No, it's a patch added by several linux distributions. See for example the changelog of bash for ubuntu (search for "drop suid"). It's implemented by updating the if test to:

if (running_setuid && privileged_mode == 0 && act_like_sh == 0)
disable_priv_mode ();

There, that concludes our very detailed discussion of suid scripts and suid shells!

[1] Playing with binary formats, marzo, 1998
[2] I'm actually not happy with this explanation at all. Unfortunately I don't understand the linux kernel well enough to give a really decent explanation. If you know more about this please comment!

Sunday 30 September 2012

Compat-Wireless Injection Patch for Aircrack-ng

Update: Compat-Drivers Patch

28 may 2013: My previous patch for compat-drivers was incomplete. The new patch for compat-drivers works for both the 3.8 and 3.9 versions. It will make monitor mode, injecting, changes channels, and fragmentation working properly. Before continuing make sure you have the linux header installed. If not, execute:

apt-get install linux-headers-$(uname -r)

Once the headers are installed, the following commands will download the latest compat-drivers version and apply the updated patch:

Quickstart: Compat-Wireless Patch

To get monitor mode, injecting, changing channels, and fragmentation properly working I recommend downloading the latest stable compat-wireless 3.6 package and applying my patch. This can all be accomplished by executing these commands and then rebooting:

The issue where you aren't able to change channels should be fixed for all drivers, and the ability to inject QoS headers is also fixed for all drivers. An issue specific to the RTL8187 driver used by AWUS036H and AWUS036NH that prevents a successful fragmentation attack is also fixed by my patch.

Background Behind the Patch

Normally when working with the aircrack-ng tool suite or other wireless network tools under linux, it's recommended you use the latest compat-wireless package to get access to the latest drivers. Generally the thought is: the newer the drivers, the better they work. Unfortunately this isn't always true.

First it's good to have a basic understanding of what compat-wireless and compat-drivers offers. Essentially you can think of compat-wireless (in fact, you should) as a sized-down version of the kernel tree, one that contains only the sources of the wireless drivers and the wireless stack. Therefore, you can apply any wireless-related patches to it and recompile them without having to recompile the whole kernel [quoted aircrack-ng docs]. So really it's a backport of the latest drivers so they can be used in older kernels. To make your life more difficult [1] the compat-wireless project has recently been renamed to compat-drivers, and now again seems to be renamed to "backports".

My own problems started when I was working on a few vulnerabilities I found in WPA-TKIP (I found a few bugs that prevented from packets being injected properly). To fix these I first looked at some of the existing patches available from the aircrack-ng directory. But no luck there, and nothing posted on the forum helped either. After messing around I managed to patch the driver myself. But what wasn't working? Well there were three main bugs I encountered:

Unable to change the channel of the monitor interface with the error message "SET failed on device mon0 ; Device or resource busy.".
When injecting packets the Quality of Service (QoS) header was being overwritten by the driver.
Injecting fragments was not working properly. Only the first fragment was being transmitted.

I played around with various configurations and version to see if some of them didn't have these problems. Unfortunately with everything I had problems. In particular I tried the following three things:

Backtrack 5 R3: Changing channels worked, but the Quality of Service (QoS) header was overwritten, and when using fragmentation only the first fragment was transmitted.
Compat-wireless 3.6 stable: All three problems were present (can't change channel, QoS overwritten, fragmentation not working).
Latest snapshot of compat-drivers: All three problems were present.

At one point I also tried using Backtrack 4 with an older release of compat-wireless. But that one also had bugs. Bugs, fucking bugs everywhere. I gave up finding a decent configuration and decided to patch the drivers myself.

Changing Channel

I fixed this bug by commenting out two lines in the function cfg80211_set_monitor_channel of ./net/wireless/chan.c file:

//if (!cfg80211_has_monitors_only(rdev))
// return -EBUSY;

It appears we couldn't change the channel when "normal" virtual interfaces are also using the device. Looking at the commit history I found the specific commit mentioning this: "having .set_monitor_channel work with non-monitor interfaces running would make interface combinations accounting ambiguous". So the new drivers prevent you from changing the channel if the device is also being used "normally". Practically this means that (if you don't apply my patch) you need to disable them by executing "ifconfig wlanX down" until you only have monitor interfaces over.

However disabling them all the time is annoying, and not many people know this! That's why I decided to remove this check in my patch. Most of the time if you're playing with monitor mode you're not using the device in a normal mode anyway, so this shouldn't be a problem. For the compat-drivers the file ./net/mac80211/cfg.c in function ieee80211_set_monitor_channel also needs to be changed:

} else /*if (local->open_count == local->monitors)*/ {

This again disables the check that only monitor interfaces are allowed to be present. I also found a post on the aircrack-ng wiki explaining how to install the latest compat-wireless and compat-drivers packages. That post discusses an older problem and its solution. So if you tried that one and it failed, try my patch again.

Sending Raw QoS Header

The QoS header is modified in the function ieee80211_set_qos_hdr of ./net/mac80211/wme.c and is called from ieee80211_xmit in ./net/mac80211/tx.c. We simply have to prevent this call from happening in monitor mode.

// Don't overwrite QoS header in monitor mode
if (likely(info->control.vif->type != NL80211_IFTYPE_MONITOR)) {
ieee80211_set_qos_hdr(sdata, skb);
}

This kills the bug. As a side node the "likely" macro is used for branch optimization by the compiler.

Patching Fragmentation

This one turned out to be specific to some devices. My patch is for the AWUS036H and AWUS036NH using the RTL8187 driver. The problem is that it will only transmit the first fragment. I did a simple test to further isolate the issue by instructing to driver to send the following frames (from a userland program):

Send the first fragment
Send an ARP request packet
Send the second fragment, which is the last one

It turned out the device actually transmits the ARP request packet first, and only then sends the first fragment! So the hypothesis was that it's first waiting for ALL the fragments before it begins sending them. Furthermore it would only send the next fragment once the previous one has been acknowledged (which isn't detected in monitor mode, hence only the first fragment is transmitted).

Luckily this can easily be fixed by removing the RTL818X_TX_DESC_FLAG_MOREFRAG flag that is being passed to the device (firmware). It will then immediately transmit the fragment. So the patch is at ./drivers/net/wireless/rtl818x/rtl8187/dev.c in the function rtl8187_tx:

// When this flag is set the firmware waits until ALL fragments have
// reached the USB device. Then it sends the first fragments and waits
// for ACK's. Of course in monitor mode it won't receive these ACK's.
if (ieee80211_has_morefrags(tx_hdr->frame_control))
{
// If info->control.vif is NULL it's mostly likely in monitor mode
if (info->control.vif != NULL && info->control.vif->type != NL80211_IFTYPE_MONITOR) {
flags |= RTL818X_TX_DESC_FLAG_MOREFRAG;
}
}

And hurray, that fixed the last bug =)

[1] It's annoying because most tutorials will reference older compat-wireless releases. Also finding the proper links on the new compat-drivers website is annoying.

Monday 20 August 2012

Secrets of Reverse Engineering: Flaws in Cryptex

The book Reversing: Secrets of Reverse Engineering has an interesting exercise. In the chapter on Deciphering File Formats the author created a command-line file encryption tool. It's called Cryptex and allows you to encrypt one or more files. Although it's a relatively simple tool it was claimed that:

"If you use long and unpredictable passwords such as j8&1`#:#mAkQ)d* and keep those passwords safe, Cryptex would actually provide a fairly high level of security." -- page 200

It's unsure what the author meant with "fairly high level of security". But when ignoring brute force attacks it's either secure or it's not. And in this case Cryptex is not secure. To be fair it wasn't the point of the author to make a truly secure implementation. The point was to have an interesting file format to analyze. But funny enough Cryptex does precisely what the author warned about:

"Perhaps (and this is more common than you would think) the program you are auditing incorrectly uses a strong industry-standard encryption algorithm in a way that compromises the security of the encrypted files." -- page 202

Indeed more common than you would think.

Let's first give a simplified overview on how Cryptex stores the encrypted files. All files are combined and saved in a single .crx file. The content of the file always starts with "CrYpTeX9" which acts as a signature to verify it's a file created by Cryptex. The remaining content of the .crx archive is divided into sectors. Each sector is 4096 bytes long. The first sector following the "CrYpTeX9" signature contains a list of all the encrypted files in the archive and their location in the .crx file. Finally the .crx archive contains all the encrypted files. Each file starts in a new sector and large files are spread out over multiple sectors.

The problem lies in how Cryptex encrypts its archive. It first derives a key from the password using a SHA1 hash and passes it to the Triple DES block cipher. So far so good. But then each sector is encrypted independently with the same key. Cryptex does this by resetting the state of the Triple DES cipher after encrypting a sector. Among other things this means that if certain sectors repeat we will also notice this repetition in the encrypted archive.

Also troublesome is when a small modification is made to a file which is then encrypted to a new Cryptex archive. To see this I created a file called short1.txt with as content only asterisks:

**************************************************
**************************************************
**************************************************
**************************************************
**************************************************
**************************************************
**************************************************
**************************************************
**************************************************

And another file called short2.txt with only one modification:

**************************************************
**************************************************
**************************************************
**************************************************
**************************************************
**************************************************
**************************************************
*******0******************************************
**************************************************

Encrypting short1.txt to short1.ctx and short2.txt to short2.txt gives the following:

We can see that both encrypted archives contain identical parts! The identical parts start at the beginning of the sector where files are saved. This allows someone to see where a file has been modified without knowing the password of the archive. Clearly the encrypted archive leaks information.

In the screenshot above we can see that the first 23 * 16 = 368 bytes are identical and the first difference is at 369 bytes from the start of the sector. For our text files each line is saved in 52 bytes (50 characters plus two bytes for the newline and carriage return). This means that the first different character is actually at position 7 * 52 + 7 = 371. Why don't these two positions match? The answer isn't too difficult: it's because 3DES is a block cipher and always encrypts blocks of 8 bytes at once. And the block containing the modified character starts at byte 369.

You might still wonder why the remaining blocks are also different. After all, both files have a sequence of asterisks at the end of the file. The reason is because 3DES is used in Cipher-Block Chaining (CBC) mode [2]. Essentially this means that previous processed blocks influence how the current block is encrypted. So once there is a difference between both files, it will influence all the blocks after it.

To conclude you should never reuse a key. Unfortunately that's exactly what Cryptex is doing: it incorrectly uses a strong industry-standard encryption algorithm in a way that compromises the security of the encrypted files.

Saturday 28 July 2012

WhatsApp Follow Up: Unauthenticated Upload

A bit more than two months ago I wrote a rather large post on the lack of security in WhatsApp. The conclusion of that post was that WhatsApp is insecure but they're working on it. Personally I'd never use it to send serious/secret/sensitive messages.

But not all the security vulnerabilities were explained in that post! There was one more, one that might be very severe. I also contact WhatsApp about this vulnerability and they said it would take some time to fix the issue. Considering that was more than two months ago they've had enough time to fix it. After explaining the problem we'll check if it's still present in the current version of WhatsApp.

The Problem

When using WhatsApp it's possible to send attachments to your contacts. The files you send to each other are saved on the server of WhatsApp so the recipient can download them at all times. Uploading is done by sending the following POST request over HTTPS:

Notice that no login details are required. For the example shown the file got uploaded to

https://mms303.whatsapp.net/d11/27/17/3/a/3a.html

which includes the original file name. In fact you can open the above file and see the html file. This means even though files get uploaded with a Content-Type of application/octet-stream they're still being treated as an ordinary HTML file once uploaded. This of course makes you wonder whats happens when sharing php files using WhatsApp. I tried uploading the same file as shown in the screenshot but now I named it 3a.php. The upload was successful and the file was saved at

https://mms303.whatsapp.net/d4/27/17/3/a/3a.php

but as you'll notice opening .php files is blocked with a 403 error message. Furthermore filenames such as index.php and .htaccess are blocked. So some protection seems to be included to avoid the user from uploading malicious files. Unfortunately I can't further test their server-side security since if I did that, I would be attacking their server and breaking the law.

So at first sight malicious files can't be uploaded. However only very minimal tests are possible without having permission of WhatsApp to test it in detail. But the fact is that it's not designed with security in mind.

Current Situation

After starting my Android emulator again (also after two months) and opening WhatsApp I was greeted with the message that my current version of WhatsApp was out of date. In fact it was so old that it simply couldn't connect to the WhatsApp servers. This seemed good. Maybe they also changed the upload process and it's now all authenticated and secure.

Unfortunately I got my hopes up too early - the bug wasn't fixed. The method outlined above still works and anyone can upload files. Considering this issue was reported more than two months ago I have decided to make it public in the hopes it will get fixed sooner.

WhatsApp could give every uploaded file a random filename. All downloaded files should be treated with a Content Type of application/octet-stream, which is currently not being done since the .html file could displayed in the browser. And of course only authentication users should be able to upload files!

Conclusion

As I've said before: watch you when using WhatsApp. Don't use it for any serious or important messages. Don't blindly trust incoming messages.

Monday 11 June 2012

MySql Authentication Bypass Explained

Yesterday a severe vulnerability in MySql was published (CVE-2012-2122). It allows an attacker to login under any username simply by trying to login around 300 times. No special tricks required. The line below shows how you can test for this vulnerability:

for i in `seq 1 1000`; do mysql -u root --password=bad -h <remote host> ; done

Where <remote host> should be replaced by the server you want to test. If it's vulnerable there's a high probability you will successfully connect to the MySql instance.

The flaw was located in sql/password.c in the function check_scramble:

Can you see what's wrong with it? To give a hint, the vulnerability is in the last line of the function, where the call to memcmp is made.

Before explaining the bug, first some background. When a user connects to MariaDB/MySQL database, a token (SHA1 hash over the given password and a random scramble string) is calculated and compared with the expected value [source]. So in essence the above code checks if the supplied password is correct. The comparison of the two hash values is done in the last line using a call to memcmp.

Reading the manual page of memcmp we see that it returns zero when both hashes are equal. More precisely

The memcmp(s1, s2, n) function returns an integer less than, equal to, or greater than zero if the first n bytes of s1 is found, respectively, to be less than, to match, or be greater than the first n bytes of s2.

The problem is that memcmp can return any integer (depending on the input value). Although most implementations of memcmp only return -1, 0, or 1 this is not required in the specification of memcmp. Now what happens when our implementation of memcmp returns a different number? Let's find out by assuming that it returned the number 0x200. Since this value is not equal to zero the two hashes are not equal, hence the passwords were also not equal.

Unfortunately the integer 0x200 is being cast to a my_bool type, which in turn is typedef'ed as a char. Because a char is smaller than an integer the number has to be truncated. In practice only the last byte of 0x200 (the return value of memcmp) will survive. And this last byte is 0x00, so simply zero.

We now get that even though memcmp returned a value different than zero (say 0x200) the function check_scramble does return zero! As we can see in the description of the function it should only return zero when the password is correct... which here is clearly not the case, hence the security vulnerability.

Apparent "randomness" in return values of memcmp

The question is now when, and why, memcmp would return a number different than -1, 0 or 1. The answer lies in how memcmp compares the input buffers. To compare if two bytes are equal it subtracts them. Only when the result is zero are the two bytes equal. If not zero, one could simply return the result of the subtraction, as this would match the behaviour specified in the manual page. The range of values memcmp would then return lies between -128 and 127. But this doesn't include our example number 0x200!

To speed up the comparison memcmp will subtract multiple bytes at once (when possible). Let's assume it will subtract 4 bytes at once. Then the result of memcmp will be within the range -0x80000000 and 0x7FFFFFFF. This range does include our example value of 0x200.

The apparent randomness comes from the fact that the protocol uses random strings. So each time a random input is given to the memcmp function. The security vulnerability only occurs when the last byte of the integer returned by memcmp is zero. Hence the probability of the vulnerability occurring is 1/256.

The Fix

The vulnerability was fixed by defining the macro

#define test(a) ((a) ? 1 : 0)

and putting it around the memcmp call.

Tuesday 22 May 2012

WhatsApp Considered Insecure (2012)

The views expressed here reflect the views of the author alone, and do not necessarily reflect the views of any of their organizations.

Summary

Since this post gets many hits let's start right away with the conclusion: I consider WhatsApp to be insecure. Personally I'd never use it to send serious/secret/sensitive messages. And you should never blindly trust incoming messages!

Introduction

Quick links: WhatsApp Security Advisory MVSA-1 and MVSA-2.

During my internship at a Ernst & Young I created a methodology to test the security of mobile applications. After I finished it, and after completing my internship successfully, I decided to take a look at WhatsApp and apply the methodology I created on my own. Several new vulnerabilities were found, including a very severe one that even affected people not using WhatsApp. But before going into detail let's first investigate the security history of WhatsApp.

WhatsApp Security History

Over its lifetime WhatsApp has gotten some bad security attention. One of the first vulnerabilities was found in May 2011 where an authentication flaw was found making it possible to register any phone number. This video demonstrates the flaw. As will be explain in this post, there were (are?) still flaws in the phone number verification process.

Around the same time it was found that WhatsApp doesn't encrypt the messages you send and receive, meaning that if you use an unsecure wireless network people can sniff your WhatsApp messages. A few days ago a script was released demonstrating how an attacker can abuse this flaw to sniff all the messages you send and receive [Article]. At the time of writing this post this vulnerability has been fixed for some clients but not for all (notably it's fixed for Android).

In September 2011 two new vulnerabilities were found in the registration process of WhatsApp. The first was found by Andreas Kurtz and the second by SEC Consult. The one found by Kurtz is a symptom of a fundamentally flawed registration process. In essence during the registration process the servers of WhatsApp fully trust anything the client says. However everything happening on the phone (= client) can be manipulated by someone with sufficient technical skills. In security it's well known to never trust anything the client is telling. Unfortunately that's exactly what WhatsApp is doing. Again this flaw in the registration process has not been adequately fixed.

The second vulnerability found by SEC Consult also allows an attacker to register any phone number he or she wants (by brute forcing the challenge PIN number). WhatsApp has implemented a basic fix, but as SEC Consult already mentions it provides insufficient protection. An attacker targeting a group of, say, 100 people, will on average still successfully compromise at least one individual.

During their research SEC Consult also found another vulnerability which makes it possible to change the status of any WhatsApp user. They reported this issue to WhatsApp but after waiting 3 months WhatsApp still didn't fix the issue, so they decided to make the information public [Source]. As a result someone created the website whatsappstatus.net where anyone could enter someone's phone number and the new status message of the user [Article, Blogpost, Script]. This removed the small technical knowledge needed to execute this attack. Fortunately this has now been fixed and whatsappstatus.net is no longer online.

To provide some more background the research paper "Evaluating the Security of Smartphone Messaging Applications" is also interesting. It shows that nearly all mobile messaging applications have some security vulnerabilities. So although WhatsApp will be subject of this post, remember that other mobile applications might also contain several vulnerabilities. If you want to have some fun yourself I suggest checking out the applications mentioned in that paper ;)

Anyway all this shows that WhatsApp had some severe vulnerabilities in the past. The phone number registration flaws are a symptom of a fundamentally flawed design. Their late response to the work of SEC Consult also isn't good. Another big negative is that they never mention security updates on their website. As a security researcher you don't know whether or not a vulnerability has been fixed. And as a user of WhatsApp you are never warned of potential problems!

Man, this section turned out longer than expected (actually this whole blog post is a lot longer than expected :). Time to move on the real stuff!

Authentication Design Flaw(s)

When researching WhatsApp I started by investigating the registration process of WhatsApp. And this was also where the first severe vulnerability was found. To understand this vulnerability we'll first have to understand the registration process. It begins with the user entering his or her phone number, after which WhatsApp will issue the following HTTPS request:

https://r.whatsapp.net:443/v1/exist.php?cc=<cc>&in=<phone>&udid=<id>&sim=<sim>

Here <cc> and <phone> stand for the country code and phone number entered by the user, <id> for the hardware identifier of the device, and <sim> for the MSISDN of the device (the real phone number of the phone). The sim parameter is not used and can be dropped. Essentially this request checks if the phone number has already been registered on the device. If it hasn't been registered the reply will be as follows

<?xml version="1.0" encoding="UTF-8"?>
<exist><response status="fail" result="incorrect"/></exist>

This mechanism is interesting in its own right. But for now we'll ignore this request and come back to it later (see section "What's the Password?"). So let's say the phone number hasn't yet been registered on the device. WhatsApp continues by making a second HTTPS request:

https://r.whatsapp.net:443/v1/code.php?cc=<cc>&to=<cc><phone>&in=<phone>&lg=en&lc=US&mnc=260&mcc=310&imsi=<imsi>&method=self&reason=&token=<token>&sim=<sim>

An example value of <token> is 9fe2a4f90b4acff715d1daf84428bddd. You'd think this token is going to be used in the next stages somehow. But it's not. WhatsApp has made several of these strange implementation decisions, so don't be surprised when more strange behaviour is going to be discussed. Also note the parameter method which specifies the authentication method used. WhatsApp offers several authentication methods in an attempt to assure one of them will always work. The "self" method is used first. Anyway the default response to this request is of the form

<?xml version="1.0" encoding="UTF-8"?>
<code><response status="success-attached" result="613"/></code>

The success-attached parameter is interesting and it's corresponding value, here 613, will be of importance during the next stages of the registration process. When the response is received WhatsApp continues by asking the user if he or she really wants to register the given phone number.

When the user clicks OK WhatsApp will attempt to send a special SMS message to the phone itself. The code that achieves this is:

As you can see the SMS message "WhatsApp <code> WhatsApp internal use - safe to discard" is sent to a random port of the phone number the user is trying to register. An example of the included <code> is ROGEuMirJNfCqXnuFdSUrwGYYfeA8G36. So essentially we're sending an SMS message to ourselves. After sending the message WhatsApp will monitor incoming messages. Unfortunately I'm using the Android emulator and can't fully emulate the required behavior yet (intercepting outgoing SMS messages on the Android emulator, and sending data SMS messages, is a good subject for a new blog post). My guess is that if WhatsApp detects that an SMS message is received from the phone number the user is trying to register, and has the exact same <code> as previously transmitted, it will complete the registration.

Now after getting this far we can do some very targeted google searches to see if other researchers are also working on this. And indeed! We find a Spanish blog post of Jose Selvi where he has done similar research. His post confirms our guess and even has a small video demonstrating the process.

So when WhatsApp receives the SMS it sent itself, it continues by registering the phone number. This contains a huge design flaw. The client (so your phone) is fully trusted! Basically the client does some stuff, and then tells the WhatsApp servers "hey register this number". But like we mentioned the client can't be trusted. And this has been demonstrated by modifying the WhatsApp client so an attacker can register any phone number he or she wants [Source].

Selvi has reported this vulnerability to WhatsApp and apparently they "fixed" it. But how can it be fixed if they're still using the method and still trusting the client?! I contacted WhatsApp myself and asked them about this. They answered that the "self" authentication is still in use, and that "the described method as published in the url has been resolved". Of course it's not because the exact method described by Selvi no longer works that it's safe! As mentioned on the blog:

"No obstante, no se ha realizado una auditoría profunda de la aplicación más allá de la prueba de concepto que se realizó como demostración de la charla, por lo que es posible que otras vulnerabilidades similares, o de otro tipo, puedan existir."

Roughly translated he's saying that no real audit of WhatsApp has been made and that it's possible vulnerabilities still exists (thanks google translate). In a comment on his blog it was even mentioned that more vulnerabilities were found. However these are not published so we have no idea how they work exactly.

After all this it's no longer our task to demonstrate these vulnerabilities to convince WhatsApp that they should improve their security. It's a waste of time. It's now WhatsApp responsibility to prove their registration process is secure. Not the other way around. I consider it completely broken and insecure until proven otherwise. I contacted WhatsApp and explained that the client can't be trusted to prove the phone number actually belonged to the user. As a reply I got the message "understood", so let's hope they'll actually remove this inherently flawed authentication method in the future.

Server Sent Verification SMS

In case the "self" authentication method fails a second method is used. It begin by requesting the HTTPS page

https://r.whatsapp.net:443/v1/code.php?cc=<cc>&to=<cc><phone>&in=<phone>&lg=en&lc=US&mnc=260&mcc=310&imsi=<imsi>&method=sms&reason=self-send-timeout&token=<token>&sim=<sim>

Note that the method parameter is now set to sms. When the WhatsApp server receives this request it will send a text message containing "WhatsApp code <pin>" to the phone number specified in the HTTP request. Here <pin> is a 3 digit number. WhatsApp will attempt to detect this incoming message. If the phone number entered by the user is indeed his/her number the text message will arrive. When the message is received WhatsApp will prove this to the server by sending the challenge <pin> number using the following request

https://r.whatsapp.net:443/v1/register.php?cc=<cc>&in=<phone>&udid=<udid>&code=<pin>

If the pin is correct the phone number will now to tied to the given hardware id (udid). Essentially the phone number will be the username and the hardware id will be the password. This is another mistake: the hardware identifier (i.e. the password) can be read by other applications! More on that later though.

Remember the number included in the success-attached response earlier? Go look again, it's right above the "We will be verifying the phone number XXX" screenshot. In our case the number was 613. Well guess what, this is the number WhatsApp (used to) include in the challenge SMS message! Hence you don't have to own the phone number to register it. Simply write down the number in success-attached and request the register.php page with this pin number. The vulnerability has been demonstrated on a test phone number given by WhatsApp and worked.

This vulnerability has been reported to WhatsApp and has been fixed as explained in my security advisory. Important is that this affected people not using WhatsApp too! An attacker could register your phone number and send messages in your name to all your friends who use WhatsApp. He or she would also receive all messages send by people using WhatsApp.

Sent PIN by Voice Call

Finally it's possible to let WhatsApp send the PIN by a voice call. This is basically the same as the server sent verification SMS, except that the user now has to answer the voice call and manually enter the pin number.

Responsible Vulnerability Disclosure

After finding the registration vulnerability (and a few others) I decided to contact WhatsApp. The plan was to get these vulnerabilities fixed before making them public. This is something where WhatsApp did do well: they responded timely to all my emails and remarks. The security advisory includes a timeline of messages sent. Based on this I had good hope to get most of these issues resolved.

After contacting them they requested a demonstration of the phone number registration vulnerability on a test phone number. So I registered the test phone number using the technique we just covered. As a response they said an updated client for Android was available which solved the registration vulnerability. This was strange since the vulnerability can only be fixed server side! And indeed the vulnerability was not fixed.. So I investigated what can be done to fix the vulnerability. Apparently you can't simply leave out the success-attached parameter in the response since the client requires this parameter to be present. Hence I suggested to simply include a random PIN number for the success-attached parameter. They agreed. And so this vulnerability was fixed :)

Another vulnerability I discovered was that the Android client didn't verify the Common Name of the HTTPS certificate. Remember that all request made when registering the phone are done over a secure HTTPS connection. Unfortunately this vulnerability allows an attacker to intercept all traffic when he or she is performing a man-in-the-middle attack. Any valid certificate would be accepted by the Android client. This vulnerability was also reported to WhatsApp and has been fixed in the latest Android Client.

Finally a less severe vulnerability was also reported and patched. It was a full path disclosure in a PHP warning. The functions producing the warning appeared not to be further exploitable, although this hasn't been explicitly tested. A redacted screenshot of the warning can be seen here.

I also asked them when encryption will be implemented for sending and receiving messages. Apparently they already submitted an Android version using an encrypted channel on May 7. An update for the Windows Phone was submitted to the marketplace on May 11. They can't say when the other versions will be patched because each version has its own software stack and development cycle. Although I cannot test the versions currently in the marketplaces, the Android version available on the WhatsApp website indeed uses encryption to send its messages. The actual strength of the encryption system has not been investigated.

Brute Forcing the PIN

Currently WhatsApp uses a basic brute force protection which allows the user to guess the PIN number only 10 times. After more than 10 attempts the phone number is blocked and one must contact WhatsApp in order to register the number. However the pin number has only 10^3 = 1000 possible values. As already mentioned by the security advisory of SEC Consult this doesn't provide sufficient protection.

The problem is that it's still feasible to attack a group of users. For example when attacking a group of 30 users the probably that all brute force attempts fail is (990/1000)^30 = 76%. In turn this means the probability of at least one successful brute force attempt among the 30 users is 24%. This is too high. The PIN number should be a lot larger! In a response to this WhatsApp stated that they are actively working on brute force protections. So, although the current brute force protection is a lot better than nothing, hopefully it will be improved in the future.

Auto Update Vulnerabilities

This one only applies to the Android version when directly downloaded from the website of WhatsApp. Versions downloaded from the marketplace use a different updating mechanism. Anyway, during startup it checks the following URL to see if an update is available:

www.whatsapp.com:80/android/9/WhatsApp.version

This page returns the version number of the latest available client for Android, for example 2.7.7435. If this version number is higher than the installed version WhatsApp will first request the checksum of the latest .apk file of the Android file by loading the URL

www.whatsapp.com:80/android/9/WhatsApp.cksum

An example response is 4273431032. It then continues by requesting the new .apk file

www.whatsapp.com:80/android/9/WhatsApp.apk

The problem is that all this is done over an unsecured HTTP connection. And unfortunately WhatsApp doesn't verify the downloaded .apk using a digital signature or similar. Thus an attacker can intercept these requests and force the user into downloading a malicious .apk file. So that's the first vulnerability: an attacker can intercept the HTTP traffic and injecting his own application. When further investigating the auto update functionality we find more. The downloaded .apk file is saved to

/mnt/sdcard/WhatsApp/WhatsApp.apk

and accompanied with the empty file called WhatsApp.upgrade in the same directory. The next time WhatsApp starts it will look for these files, and if both are present display a dialog asking if the user wants to install the update:

Unfortunately this can be abused by a 3rd party application. A 3rd party application can write the files WhatsApp.apk and WhatsApp.upgradate to the directory /mnt/sdcard/WhatsApp, where WhatsApp.apk can be a malicious application. This will trigger the update dialog of WhatsApp as shown above. The user will them be prompted to install the malicious application from the context of WhatsApp. So luckily the user still has to approve the installation. But because he or she thinks the program is an update of WhatsApp it's more likely that the user will agree to install the application, especially if the name would be similar to WhatsApp and uses the same logo. This is not something you want!

What's the Password?

Time to return to the very first HTTPS request sent when registering a phone number. It was

https://r.whatsapp.net:443/v1/exist.php?cc=<cc>&in=<phone>&udid=<id>&sim=<sim>

Where the sim parameter can be dropped. What's interesting is that if the number was previously registered on the phone this request will return

<?xml version="1.0" encoding="UTF-8"?>
<exist><response status="ok" result="<cc><phone>"/></exist>

and WhatsApp will start and log you in. No further information is needed! From this behavior we can derive that the phone number can be seen as the username and the udid as the password. The udid is a "Unique Device IDentifier" of the phone (a hardware id).

There's one problem with this design, namely that every application on your phone can read this unique device identifier (given the appropriate permissions). For example on Android an application having the permission android.permission.READ_PHONE_STATE can read the udid. Let's phrase this differently: another application can read your WhatsApp password and impersonate you. WhatsApp has responded that they are "working on unique 160-bite [sic] random passwords, but is going to take some time".

Conclusion

Let's recall the research paper "Evaluating the Security of Smartphone Messaging Applications" where it was shown that nearly all mobile messaging applications contained security vulnerabilities. In this post we specifically targeted WhatsApp. We found that it contained numerous vulnerabilities, and I personally consider WhatsApp an insecure application that I wouldn't use. On the bright side WhatsApp is taking these security issues more seriously and appears to be in the processing fixing all mentioned vulnerabilities. Only time will tell if they actually manage to make it a secure application or not. It's also an open question how insecure the other messaging applications really are.

In general mobile developers need to learn more about security!

Monday 27 February 2012

TEDxUHasselt Salon: How you could be hacked

TEDxUHasselt is an independently organized TED event by three of my fellow students. In order to extend the local community around it, and attract passionate people, they recently organized TEDxUHasselt Salon. As they state: "TEDxUHasselt Salon is an informal event aimed at building a local community of passionate people". Basically anyone with an interesting idea or passion can register and give a talk of 10 minutes.

I decided to give a talk as well, with as subject something related to computer security. To keep it interesting and relevant for the audience I had chosen to work out a realistic example of how a cyber criminal could attack TEDxUHasselt attendees. For the reader who's familiar with security tools it was essentially a small demo of metasploit and meterpreter, combined with a realistically forged email designed to manipulate the target into clicking a deceptive (spoofed) URL.

The presentation started with a few examples of how cyber criminals are trying to steal money. Then we layout the groundwork for a fictive attack on the TEDxUHasselt attendees. Arguably the most interesting part was a video showing a demonstration of the victim reading the spoofed email and following the link, which results in the victim being hacked. Then we show everything a hacker can do with your computer, including downloading sensitive files, logging all the keys the victim presses, and even recording the microphone of the victim!

The slides of the presentation can be seen here:

TEDxUHasselt Salon: How you can be hacked

The actual execution of the fictive attack has been recorded and you can see it in the youtube video below. As you'll notice it isn't hard to launch the attack and to retrieve sensitive information once you have control of the victim's computer. Note that the video immediately starts at the part that is relevant for all users, the first part of the video consists of the hacker configuring his tools to launch the attack.

The presentation itself was a big success and everyone seemed to like it! People are interested in security, you just have to make it understandable and relevant for "the common man". A thank you goes out to Rutger, Niels, and Bob for organizing TEDxUHasselt Salon!

Wednesday 1 February 2012

Foundations of Privacy

Privacy is a difficult concept and there are many sides to the privacy issues we are facing in our digital age. For my master thesis I studied privacy in databases, where the goal is to find a mathematical definition of privacy. But in this post I won't focus too much on the math behind it all, instead I'll go over some interesting observations I have made during my work, and explain some of the basic concepts. To get started we'll look at some privacy fiasco's that occurred in the past.

1. AOL Search Fiasco

We begin with the AOL search fiasco where AOL released around 20 million search queries from 65,000 AOL users. In an attempt to protect privacy, the username of each individual has been changed to a random ID. However, different searches made by the same person still have the same ID. Shortly after the release the person with ID 441779 was identified as Thelma Arnold. By combining all the searches she made, it became possible to find out her real identity. This already shows that simply removing unique identifiers such as a person's name, address or social security number do not provide privacy, and more generally that anonymity does not imply privacy (more on this later). No real care was given to anonymizing the search results. So this is not a true example that removing identifying attributes (eg., name, address, etc.) fails to protect privacy, as no real care was given to anonymizing the dataset. This can be deduced from the observation that the person who was responsible for releasing the data was quickly fired.

2. The HMO Records

In this attack the medical records of the governor of Massachusetts were revealed. The attack begins with the following observation: One can uniquely identify 63% of the U.S. population with only knowing the gender, ZIP code and date of birth of an individual. We now turn our attention to two different datasets. The first one is the voter registration list for Cambridge Massachusetts and includes information of each voter. A bit simplified the dataset can be represented using the following relation:

VoterRegistration(ZipCode, BirthDate, Gender, Name, Address, DateLastVoted)

The second dataset is the Massachusetts Group Insurance Commissions medical encounter data, containing medical information on state employees (and thus also contained the medical records of the governor of Massachusetts). A small part of this medical data, without all identifying information such as name, address and social security number removed, were released. It can be represented using the following relation:

MedicalData(VisitDate, Diagnosis, Procedure, ZipCode, BirthDate, Gender)

As we can see both datasets can be linked to each other by combining gender, ZIP code and date of birth! So even though the names were removed in the medical dataset, researchers were still able to link the data back to individuals. As a result the medical records of the governor of Massachusetts were found.

3. Kaggle

Finally we come to an example that shows it's possible to find correspondences between social networks merely based on the structure of underlying friendship relations. Even more interesting is that the art of "de-anonymizing" a dataset was used to win a competition. Kaggle organized a competition where participants were given a dataset of people and their underlying relations, which we will call a social graph (each node is represented using a random ID and no information such as username, name, location, etc. were included). An example of a social graph is shown in the figure below:

The circles/letters represent people and a line is drawn between them if they are friends. However not all relationships were given to the participants. In fact, the goal of the competition was to determine whether certain given relationships, which were not present in the original social graph given to the contestants, were fake or real. Of course if we'd knew who the individuals behind all the nodes are, we could just look up if two nodes are really friends or not! So if we manage to de-anonymize the social graph given by Kaggle we can game the competition: Instead of making a machine learning algorithm to predict the answer we simply look them up.

It was found out the social graph was created by crawling Flickr. So the researchers made their own crawl of Flickr and based on the structure alone created a mapping between their own crawl and the Kaggle social graph. In other words they now know which individuals correspond to the supposedly anonymous nodes, and thus they identified the individuals using only the structure of his or her social graph.

The Difficulty of Absolute Privacy Protection

It should be obvious by now: Assuring privacy is hard. We can't simply remove attributes such as name, address, social security number, etc. from a dataset. The reason is that seemingly innocent information such as gender, ZIP code, and birthdate can still be used to uniquely identify an individual. Clearly the need for a rigorous privacy definition is needed. Researchers have proposed theoretical models such a k-anonymity, but it turned out to have problems so L-diversity was suggested. I turn weakness were found in L-diversity, resulting in a new definition called T-closeness. But still there are problems even for T-closeness. So in practice assuring privacy is hard, and even finding a robust mathematical definition appears to be a challenging task. What can be done?!

Before answering that question we're going to argument that the situation, at least from a theoretical point of view, is even worse than one might already think. There is in fact theoretical evidence suggesting that absolute privacy protection is impossible. This proof was heavily based on an earlier attempt at the proof. Now of course not releasing your information does provide absolute privacy. But what they have proven is that if a database is sufficiently useful there is always a piece of external information that, combined with the output of the database, violates the privacy of an individual. An example explains this best. Assume that learning the salary of an individual is considered a privacy violation. Further assume that we know the salary of Alice is 200 EUR higher than the average salary of a Belgian citizen (this is the external information). So we don't know her exact salary. Let's say we now receive access to a database containing the salary of every Belgian citizen. From this database we can calculate the average salary of a Belgian citizen, say 2500 EUR. Combining this with the external information teaches us that the salary of Alice is 2700 EUR! Et voila, the privacy of Alice has been violated. All due to gaining access to a database from which we merely learned an average of a particular value.

So it seems it's very difficult (impossible) to assure that there will be no privacy violation whatsoever. What can we do? The answer is simple. We will reduce the probability of a privacy violation as much as possible.

Differential Privacy

One of the promising definitions in the privacy research community is called differential privacy. It provides relative privacy prevention, meaning that the probability of a possible privacy violating occurring can be greatly reduced. Important is that when using differential privacy the dataset is not handed out to the public. Instead users can pose questions (queries) to the dataset (eg., over the internet), and answers will be given in such a way to reduce the probability of a privacy violation. In practice this is done by adding a small amount of random noise to the answer. Note that there is always a tension between accuracy of the data, and the privacy guarantees provided by the data release. The higher the privacy guarantees, the lower the accuracy.

Differential privacy assures that, when you provide your personal information to the database, the answers to queries will not differ significantly. In other words handing over your information should not be visible in the calculated answers. This results in a more natural notion of privacy: When giving your information your privacy will on be very minimally reduced (remember that we consider absolute privacy protection impossible).

More formally, when you have two databases, one with your personal information (D1) and one without your personal information (D2). Them the probability (Pr) that an answer mechanism (K) returns a specific answer (S) should be nearly identical (multiplication by exp(epsilon)) for both databases. Mathematically this becomes

The parameter epsilon defines how much the probabilities are allowed to vary. A small epsilon such as 0.01 means the probabilities should be almost identical (within multiplicative factor 1.01). However for a larger epsilon such as 1 the probabilities can differ by a larger amount (they must now be within multiplicative factor 2.71). The reason we use the exp(epsilon) instead of just epsilon is because manipulating formulas that use exp(epsilon) is a lot more straightforward.

We can now design algorithms that answer queries while assuring differential privacy. In short we can now prove we assure a certain definition of privacy! There are still some difficulties in implementing differential privacy in practice. There first one is that it's not clear what a good value for epsilon would be. The second problem is that you cannot ask an unlimited amount of questions (queries) under differential privacy. So in practice a reliable mechanism must still be designed to ensure only a limited amount of queries are answered.

Another downside is that the actual dataset is not released. For researchers being able to visually inspect the dataset and first "play around with it" can be important to gain a better understanding of it. Therefore the problem of releasing an anonymous dataset, while minimizing the probability of a privacy violation, is still an important topic.

Anonymity vs. Privacy

Another important observation is that anonymity does not imply privacy. Take for example k-anonymity. It states that an individual must be indistinguishable to at least k-1 other individuals. Below we give an example of a 4-anonymous database. Based on the non-sensitive attributes, which are used to identify an individual, we notice there are always at least 4 rows having the same values. Hence each individual is indistinguishable with at least 3 other individuals.

If you know a person having Zip code 1303 and an age of 34 is present in the database, he or she can correspond to any of the four last rows. So the anonymity of that individual is preserved. However all four rows specify that he or she has cancer! Hence we learn our targeted individual has cancer. Anonymity is preserved while privacy was violated.

The Ethics of Using Leaked Datasets

Another observation made during my master thesis is that it's hard for academic researchers to obtain dataset to test their algorithms or hypothesis. Even worse is that most are forced to create their own datasets by crawling public profiles on sites such as Twitter, LiveJournal and Flickr. This means their test data only contains public profiles, creating a potentially significant bias in their datasets. Another problem is that researchers won't make these dataset publically available, probably out of fear of getting sued. And that is not an irrational fear, demonstrated by the fact that Pete Warden was sued by facebook for crawling publicly available data. Because these datasets are hard to obtain, peer review becomes difficult. If another, independent, researcher doesn't have access to the data he or she will have a very hard time trying to reproduce experiments and results.

But more interestingly is that there are public dataset available! It's just that no one seems to dare to use them. As mentioned the AOL dataset was publicly released, but researchers are hesitant to use it. Another dataset, called the Facebook100 data, is also a very interesting case. Researchers were given social graphs of 100 American institutions in anonymized form (private attributes such as name, address, etc. were removed). Amazingly the dataset contains all friendship relations between individuals present in the dataset, independent of their privacy settings on facebook. As we've seen an unmodified social graph can be vulnerable to re-identification attacks (see the Kaggle case). A week after the release facebook requested the dataset to be taken down. Deleting data on the internet is hard however, and copies of it are still floating around. Nevertheless, because facebook requested to dataset not to be used, researchers seem to be hesitant to use this dataset.

"Privacy issues are preventing a leap forward in study of human behavior by preventing the collection and public dissemination of high-quality datasets. Are academics overly-sensitive to privacy issues?"

It's clear why researchers aren't touching these datasets. Either they include too much personal information, the data provider has requested to take the dataset down, or out of fear of getting sued. The question is if this behavior really benifits the user, the one who we're trying to protect. The answer is NO. Let's look at all the players in this game:

Researchers don't conduct research on the "leaked" datasets. This hinders scientific progress.
Users are less aware of vulnerabilities as researchers can't demonstrate them.
Companies benefit as they don't get bad press about leaked data & possible privacy issues.
Malicious individuals benefit since vulnerabilities remain unknown, which in turns morivates users to continue sharing information publicly. They are also not limited by moral objections and will use the leaked data if valuable.

Currently only the companies and malicious individuals benefit from this behavior. Scientists and users are actually in a bad position by not allowing/doing research on public/leaked datasets. A utopian view would be that researches conduct analysis on the datasets and make advancements in their field. At the same time users can be warned about potential vulnerabilities caused by the leaked data before hackers will abuse them.

Fear Based Privacy

A mistake to watch out for, at least in my opinion, would be what I call fear based privacy. Yes, you have to remain diligent to make sure the government and companies don't collect too much information. And yes, you should be careful about the information you provide to the public! But one must first carefully consider the arguments for, and against, making information public. A knee-jerk reaction saying that any information you share might potentially be abused by a malicious individual is not a valid argument. It's an argument purely based on fear: "Even if we don't know yet how a hacker could abuse certain information, who knows he or she might find a method in the future, so it's better not to share anything."

Now don't get me wrong here. Currently I believe that an absurd amount of information is being collected, especially on the internet, and that this is not a good thing. But a good balance must be found between the benefit of sharing information, and the potential risks for sharing said information. Compare it to driving a car. It involves the risk of possibly getting in an accident. However, the benefit that a car provides outweighs its risks, and instead of completely avoiding cars we attempt to reduce the probability and severity of accidents.

Interesting Links

33 Bits of Entropy: Privacy related blog by the researcher Arvind Narayanan who is behind the Kaggle de-anonymization. He also has a twitter acount.
Database privacy at Microsoft Research.
28c3 talk by Conrad Lee titled Privacy Invasion or Innovative Science?
A Firm Foundation for Private Data Analysis: A good paper covering the applications of differential privacy.

Special thanks goes to Jan Van den Bussche of University Hasselt (Belgium) for helping me during my master thesis.

Tuesday 31 January 2012

Memory Hacking: Anyone can do it!

More than four years ago I wrote a small tutorial on memory hacking. Even someone new to programming and computers is able to create simple "memory based hacks". Depending on the program/game you are targeting you can use it to change your score in a game, increase your ammunition, teleport yourself to another coordinate, etc. This was the first thing that really got me interested in computer security and reverse engineering, so I'm reposting the tutorial on this blog.

If you ever wondered how aimbots or unlimited ammunition hacks are made, then this post is a good foundation to learning how they work. It will only be a small introduction that can be followed by anyone! The goal is to show it's indeed easy and to motivate you to try it yourself on a few programs ;) In this post we will attempt to freeze the timer of Minesweeper.

Background

Internally a computer works only with numbers, so every single thing on your computer is represented by a number. The smallest type of number one can directly access is called a byte. It can store the numbers between (and including) -128 and 127. We can then group 2 bytes togheter and can represent every number between -32768 and 32767. With 4 bytes we get -2147483648 and 2147483647. We can continue this with 8 bytes and so on.

It is the programmer that gives meaning to these numbers. For example, we can say that numbers 0 to 25 stand for each letter in the alphabet. In a different situation we can say that 0 to 11 stands for each month in the year. A number can stand for the amunition a player has, the coordinates of a player, the ID of a weapon he is holding, and so on. We can see that the “meaning” of these numbers indeed depends on where and how they are used.

Every byte has a so called "address". This address is again a number, and we use this number to access the byte. For example, say we have 2 bytes and want to add them together. Assume the first byte is saved at address 2345 and the second one at address 5345. We can then tell the computer to add the bytes at address 2345 and 5345 together (and optionally save this result at another address). Addresses are commonly written in hexadecimal notation.

For a more detailed explanation on how numbers are stored and represented on computers you can read “The Art of Assembly“.

Freezing the timer

It’s now our job to find where the number that represents the timer is saved. Once we know it’s location we can simply overwrite it with a new value and thus change the timer in Minesweeper. To find the address we will use the tool “Memory Hacking Software” (MHS).

The first thing we need to guess is how the timer is saved. Since the timer already is a number this is trivial (the number of the timer is saved directly without any conversion). We only need to determine the size of the number. Since a byte is not large enough to save the biggest possible value of the timer (999), we will guess the programmer of minesweeper used (at least) 2 bytes to save the timer.

Start Minesweeper. Now launch MHS.

Go to File -> Open Process, select Minesweeper and click on Open

Then do Search -> Data-Type Search. Select Short as Data Type (Short is the same as 2 bytes) and Exact Value as evaluation type.

Since we haven’t started the game in minesweeper yet, the timer is currently zero. In "Value to Find" type 0. Now click OK.

It will say how many addresses (in the minesweeper process) had the value 0. There will probably be a lot of them! I had 1497378 results, and one of these (probably) is the timer.

Filtering the Results

We know that one of these address is the timer, however there are too many results and practically this list is still useless. What we need to do is shrink the list. And this will be done by doing a “sub search” on our previous results. In this case we can start playing minesweeper so the timer will start. We now know that the timer has increased, so we will search for an “increased value” in our current result list and thus shrink the list.

Go to Search -> Sub Search so we can further “filter” our results of the previous search. We know the timer has increased so we select Increased as Search Type.

I got 46 results. Still too much. I again do a sub search and again search for an increased number. Now I only get 3 results! Continue this until you only have a few results left. Once you have a small list, it should be easy to spot the timer by observing the Current Value field. This will always be equal to the timer in minesweeper. In my case the timer is saved at address 0100579C (this address can be different for you).

Double click on the address in the “Found Addresses” list. It will be added to the “main address list”. Double click on the address in the main address list. We will now lock the value of the timer to zero. We do this by checking Locked and entering an Exact Value of zero.

And there you go, you froze the timer. Because of the way minesweeper was made it will actually display a time of 1 instead of 0, but nevertheless it’s frozen.

Conclusion

This was small and basic introduction to memory hacking. You can try this method on other programs (eg., on number of bullets left, current health, high score, etc). However you will notice that it doesn’t always work. You won’t be able to easily find the address or the address could change each time you play the game. To solve these problems more advanced techniques must be used.

An old article I read 6+ years ago on more advanced tricks was titled Dynamic Memory Allocation and Code Injection: DMA to static address. He still used SoftIce in those tutorial, but that program is now dead. Instead use OllyDbg, IDA Pro, or similar.

Pages