Suramya's Blog : Welcome to my crazy life…

August 23, 2018

Identifying Programmers by their Coding Style

Filed under: Computer Security,Computer Software,Tech Related — Suramya @ 8:42 PM

There is an interesting development in the field of identifying people by what they write. As some of you may already know, researchers have for a while now been able to identify who wrote a particular text by analyzing things like word choice, sentence structure, syntax and punctuation, using a technique called stylometry. Until now, though, it was limited to natural languages and not artificial ones like programming languages.

Now there is new research by Rachel Greenstadt and Aylin Caliskan, professors of computer science at Drexel University and George Washington University respectively, that shows that code, like other forms of writing, is not anonymous. They used machine learning algorithms to de-anonymize coders, and the really cool part is that they can do this with a reasonable level of confidence even with code decompiled from binaries. So you don’t need access to the original source code to identify who wrote it (assuming that we have code samples from them in the training DB).

Here’s a simple explanation of how the researchers used machine learning to uncover who authored a piece of code. First, the algorithm they designed identifies all the features found in a selection of code samples. That’s a lot of different characteristics. Think of every aspect that exists in natural language: there are the words you choose, the way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to only the ones that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50.
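To make the "narrowing down" step a bit more concrete, here is a minimal sketch of feature selection in Python. It ranks features by information gain, which is one common way to measure how well a feature separates authors; I am not claiming this is exactly what the researchers did. The feature names, the authors "alice" and "bob", and the toy samples are all made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of author labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(samples, labels, feature):
    """How much knowing whether `feature` is present reduces uncertainty about the author."""
    with_f = [lab for s, lab in zip(samples, labels) if feature in s]
    without_f = [lab for s, lab in zip(samples, labels) if feature not in s]
    remainder = 0.0
    for subset in (with_f, without_f):
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def select_features(samples, labels, k):
    """Return the k features that best distinguish the authors."""
    all_features = sorted(set().union(*samples))
    ranked = sorted(
        all_features,
        key=lambda f: information_gain(samples, labels, f),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): each sample is the set of features seen in one file.
samples = [
    {"for_loop", "ternary", "printf"},
    {"for_loop", "ternary", "cout"},
    {"for_loop", "while_loop", "printf"},
    {"for_loop", "while_loop", "cout"},
]
labels = ["alice", "alice", "bob", "bob"]

print(select_features(samples, labels, 2))
```

In this toy data everybody uses `for_loop`, so it tells us nothing about who wrote the file and gets a gain of zero, while `ternary` and `while_loop` each point to a single author and end up at the top of the list. The real work is the same idea at a much larger scale.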

The researchers don’t rely on low-level features, like how code was formatted. Instead, they create “abstract syntax trees,” which reflect code’s underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone’s sentence structure, instead of whether they indent each line in a paragraph.
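If you want to see why a syntax tree ignores formatting, here is a small sketch using Python’s built-in `ast` module. This is only an illustration of the idea: the researchers built their own tooling, and the helper `ast_node_counts` and the three sample snippets below are my own made-up examples.

```python
import ast
from collections import Counter


def ast_node_counts(source):
    """Count how often each kind of syntax tree node appears in the source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


# Same logic, very different formatting.
tidy = (
    "def total(xs):\n"
    "    result = 0\n"
    "    for x in xs:\n"
    "        result += x\n"
    "    return result\n"
)
messy = (
    "def total( xs ):\n"
    "\n"
    "        result=0\n"
    "        for x in xs: result+=x\n"
    "        return result\n"
)
# Same result, but a structurally different way of writing it.
builtin = "def total(xs):\n    return sum(xs)\n"

print(ast_node_counts(tidy) == ast_node_counts(messy))
print(ast_node_counts(tidy) == ast_node_counts(builtin))
```

The first comparison is true because spacing, blank lines and indentation width never make it into the tree. The second is false because choosing an explicit loop over a call to `sum` changes the structure, and that kind of choice is what a tree-based feature can pick up on.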

This is both really cool and a bit scary because suddenly we have the ability to identify who wrote a particular piece of code. This removes, or at least reduces, the ability of people to release code/software anonymously. This is a good thing when we look at a piece of malware or a virus, because now we can find out who wrote it, making it easier to prosecute cyber criminals.

However, the flip side is that we can now also identify people who write code to secure networks, bypass restrictive regime firewalls, create privacy applications, etc. There are a lot of people who contribute to open source software but don’t want to be identified for various reasons. For example, if a programmer in China created software that allows a user to bypass the Great Firewall of China, they would definitely not want the Chinese government to be able to identify them, for obvious reasons. Similarly, there are folks who have written software that they do not want associated with their real name, and this would make it more difficult for them to stay anonymous.

But this is not the end of the world; there are ways around this, such as using software to scramble the code. I don’t think many such systems exist right now, or if they do they are at a nascent stage. If this research is broadly applied to start identifying coders, then the effort to write such scramblers would take high priority, and lots of very smart people would start focusing their efforts on defeating the detectors.

Well this is all for now. Will write more later.

– Suramya

Original source: Schneier’s Blog
