Using NLP for information gathering

The Wikipedia page for Natural Language Processing (not the Derren Brown stuff) describes it as “a subfield of artificial intelligence and computational linguistics.” So why am I discussing this on the BlueHat blog? If, like me, you sucked at linguistics in school, you might think NLP has no place in IT security. However, the more I play with NLP, the more excited I get about the applications it could have in information gathering — and thus in IT security.

For the past 18 months or so I have been pouring my heart, my mind and my wallet into Maltego, a framework for information collection. Maltego consists of ‘entities’ and ‘transforms’, and allows you to convert (or transform) one type of entity into another. As a simple example, think of a DNS Name entity and an IP Address entity, where a Transform would simply resolve the DNS Name to an IP Address, or the reverse, resolve the IP Address to a DNS Name. It turns out that information is a lot more connected than I originally thought, and as of now we’ve created more than 100 transforms on about 15 entity types. To take the DNS/IP address example further: an IP address is located within a network, which has ‘whois’ information (and geo-location information), which leads to names, e-mail addresses and phone numbers, which lead to persons, domains (from the e-mail addresses) and geo-locations, which in turn lead to other entities, and so on and so forth – a never-ending mesh of loosely related chunks of information that can be visualized as a graph of inter-connected nodes.
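The entity/transform idea can be sketched in a few lines of Python. Everything here is illustrative, not the real Maltego API: the `Entity` class, the transform function and the stubbed DNS table are all my own stand-ins, and a real transform would do a live lookup rather than consult a dictionary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    kind: str    # e.g. "dns-name", "ip-address", "email"
    value: str

# Stubbed "resolver" standing in for a live DNS lookup (hypothetical data).
FAKE_DNS = {"www.example.com": "192.0.2.10"}

def to_ip(entity):
    """Transform: DNS Name -> IP Address (stubbed, no network access)."""
    if entity.kind != "dns-name":
        return []
    ip = FAKE_DNS.get(entity.value)
    return [Entity("ip-address", ip)] if ip else []

graph = to_ip(Entity("dns-name", "www.example.com"))
print(graph)  # [Entity(kind='ip-address', value='192.0.2.10')]
```

Chaining such functions — each one mapping one entity type to a list of others — is what grows the graph into the mesh described above.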

Anyhow – back to NLP. I spent last week fiddling with NLP transforms for Maltego, specifically building Information Extraction into Maltego. Information Extraction is a specific application of NLP, and if you integrate it into a framework where automation and correlation are possible, it becomes very interesting. For the first time I get the feeling that I can actually discover information that’s ‘hidden in plain sight’. At the moment the stuff is still unstable and eats CPUs and memory as fast as a pre-lunch snack, but we’re working on making it stable and production-ready.

Information Extraction is used to extract ‘entities’ from text. As an example, it can look at a web page and extract all the person names from the page. It can also extract the organization names and the location names. In fact, it can extract anything that can be programmatically described from text. Given time, it will even be able to extract the ‘facts’ from a page. This means that if you have four web pages that basically say the same thing, NLP will be able to connect them. And for the first time, I have the ability to give ‘human-like’ functionality to a transform.
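The interface is simple to picture: text goes in, typed entities come out. Here is a deliberately crude sketch — a regex that grabs runs of capitalized words as ‘person names’. A real Information Extraction engine uses trained statistical models rather than a pattern like this, but the shape of the operation is the same.

```python
import re

def extract_person_names(text):
    """Very crude stand-in for a real NER engine: grab runs of two to four
    capitalized words. Real IE systems use statistical models, not a regex,
    but the interface is identical: text in, entities out."""
    return re.findall(r"\b(?:[A-Z][a-z]+ ){1,3}[A-Z][a-z]+\b", text)

# Hypothetical page text for illustration.
page = "The talk was given by Ryan Naraine and hosted by Andrew Cushman."
print(extract_person_names(page))  # ['Ryan Naraine', 'Andrew Cushman']
```

The regex happily produces false positives (any capitalized phrase qualifies), which is exactly the noise problem discussed further down.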

Again – why is this interesting for security people?

Let’s look at a very concrete example. Suppose you are doing a footprint for a multi-national company. We all know these guys could have thousands of domains. Let’s assume that you have some method of collecting the odd 500 domains which you think might be associated with the organization. These domains are in multiple countries, each registered at a different registrar, and each registrar has its own way of formatting ‘whois’ information. So even if you manage to collect all the ‘whois’ records, you have no way of normalizing the information so that you can easily grep for the target organization. Using NLP you don’t need to worry about any of that: you simply tell your Information Extractor to pull organization names from the collected data. Without NLP, you either need to write a parser for every format or you need to eyeball every ‘whois’ result.
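The point can be sketched as follows. The three ‘whois’ records and the extraction pattern below are both hypothetical — in practice the extractor would be a trained organization-name model, not a regex — but notice that the same extractor runs unchanged over every format, with no per-registrar parser.

```python
from collections import Counter
import re

# Three hypothetical whois records, each formatted differently by its registrar.
records = [
    "Registrant: Example Corp\nCountry: US",
    "org-name:    Example Corp",
    "Titulaire : Example Corp (FR)",
]

def extract_orgs(text):
    # Stand-in for an NLP organization extractor: a naive pattern for
    # "Word Corp/Inc/Ltd". The point: one extractor, any registrar format.
    return re.findall(r"\b[A-Z][A-Za-z]+ (?:Corp|Inc|Ltd)\b", text)

counts = Counter(org for rec in records for org in extract_orgs(rec))
print(counts.most_common(1))  # [('Example Corp', 3)]
```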

Not convinced? Sure – perhaps you’ve never experienced the joys of doing a footprint. So let’s look at something else. Let’s say you want to know who the key figures associated with a specific phrase are. Your phrase could be something like “High ranking diplomats” or even a company name. You feed the phrase to your favourite search engine and get a list of URLs where the phrase appears (in the first example you’ll want to restrict the results to a specific domain or country). Next, you feed all the text on the resulting web pages to your Information Extraction engine, which neatly ‘parses’ the text into person names. And voila – within minutes you have a list of names. If you use something like Maltego to do this, you can also get an idea of who the most prominent (or most vocal) person is, as some names will be mentioned on more than one page.
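The ranking step of that pipeline might look like this. The page texts are hypothetical and the name extractor is a toy regex standing in for a real NLP engine; the idea is simply to count how many distinct pages mention each name.

```python
from collections import Counter
import re

# Hypothetical text from three pages returned by a search-engine query.
pages = [
    "Keynote by Ryan Naraine, with remarks from Andrew Cushman.",
    "Andrew Cushman opened the conference.",
    "Interview with Ryan Naraine. Andrew Cushman also attended.",
]

def extract_person_names(text):
    # Naive stand-in for an NLP person-name extractor; one set per page
    # so each page counts a name at most once.
    return set(re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text))

# Names mentioned on the most pages rise to the top.
mentions = Counter(name for p in pages for name in extract_person_names(p))
print(mentions.most_common())  # [('Andrew Cushman', 3), ('Ryan Naraine', 2)]
```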

So let’s try it – the proof of the pudding etc. Let’s run the process on a phrase where we can verify the results. Using the phrase “BlueHat conference” as a starting point, we end up with a graph that looks like this:

From the graph we quickly see all the usual suspects. Of course the program cannot give any kind of context to the results. For instance, it won’t tell you that Ryan Naraine is a reporter who covers IT security, or that Andrew Cushman organises the conference. The graph is also littered with false positives, but these are a mere annoyance: they disappear as white noise once you look at the frequency of the extracted entities.
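The white-noise filtering amounts to a frequency threshold. With made-up counts (real people recur across pages, junk tends to appear once), dropping one-off entities is enough:

```python
from collections import Counter

# Hypothetical extraction counts across many pages. "Blue Screen" and
# "New York" are typical false positives for a person-name extractor.
extracted = Counter({"Andrew Cushman": 7, "Ryan Naraine": 5,
                     "Blue Screen": 1, "New York": 1})

# Entities seen on more than one page survive; one-off noise is dropped.
key_figures = [name for name, n in extracted.items() if n > 1]
print(key_figures)  # ['Andrew Cushman', 'Ryan Naraine']
```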

You may say “but I knew these people were key figures at BlueHat – what’s the big deal?” Well – consider that you can enter ANY phrase into the system and minutes later know who the key players are. When we combine this kind of functionality with other open source information gathering techniques, it becomes a bit frightening. Consider a process that subsequently looks up these people’s e-mail addresses and starts sending custom-crafted malware (although the list above would hardly be good candidates), or perhaps automatically resolves social network memberships and, where possible, creates fake identities. Bottom line – if you want to attack individuals connected to a certain ‘phrase’ in an automated fashion, you first need to know who they are.

Real life, practical hacking is really all about collecting information. The people who have been doing this for a while will tell you that the exploit is really only 5% of the entire exercise. And of course you don’t always need the evil bit set: mining the Internet for information is very useful for law enforcement, intelligence agencies, or anyone who simply wants to gain insight into a subject.

In the past you could have hoped to fool e-mail address harvesters by writing your e-mail address as roelof at paterva dot com. With NLP you can run… but you can’t hide.
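Even without full NLP, a harvester only needs to normalize the common ‘at’/‘dot’ spellings before extracting addresses. This sketch hard-codes two such patterns; an NLP-based harvester would generalize to obfuscations nobody bothered to enumerate.

```python
import re

def deobfuscate_email(text):
    """Normalize common 'at'/'dot' obfuscations before harvesting.
    Illustrative sketch: an NLP-based harvester would recognize these
    patterns rather than hard-code them."""
    text = re.sub(r"\s+at\s+", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+dot\s+", ".", text, flags=re.IGNORECASE)
    return text

print(deobfuscate_email("roelof at paterva dot com"))  # roelof@paterva.com
```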

R.o.e.l.o.f  T.e.m.m.i.n.g.h

PS: NLP implementations are hairy beasts that breathe fire, with sharp teeth, complex eyes and many legs. I think of them as big black boxes filled with magic — but as long as I can get a slice of that magic, I am quite happy.