Monday 12 August 2013

My own fears around PRISM and the misuse of Big Data

So a friend of mine and a venerable debater on the circuit posted a very good blog post on why the PRISM scandal didn't particularly bother him from a privacy point of view (go read it at http://7min15sec.blogspot.nl/2013/08/why-i-welcome-nsa-reading-my-facebook.html). However, there are some issues with the NSA's program that I feel need to be looked at with a little more scrutiny, and they tie in with the growth of statistical analysis of 'Big Data' (the huge datasets that are gathered about everything and anything in the modern digital world).

My problem with the PRISM program is not the collection of the data. It's totally legitimate in my mind that a security service can get access to specific bits of data about the people it needs to investigate. I also totally buy the relativistic nature of privacy. The problem I have is in what gets done with the data once it's collected. Having a big bag of data is not in itself a problem; it's the potential for misuse that bothers me. And not even misuse from a pokey-nosey point of view. It's the potential misuse of statistical analysis to pick out people who are behaving in a manner that a model says looks like a terrorist.

As an engineer working on reliability problems I have a reasonable understanding of statistical analysis and the differences between data-driven and physical-model-based approaches to identifying problems. In this case the detection of terrorists is analogous to a reliability problem. Let's say you have a machine that you have hundreds of hours of data on, and you know the precise physical mechanisms that cause it to break down. From that data you can identify incipient problems in the machine, and thanks to your knowledge of the ways it breaks on a physical level you can easily inspect the device and figure out the issue. If you don't have a good understanding of the physics of failure you can't do that detailed inspection, but the data can still tell you something is funky. However, that data could equally be telling you shit thanks to an oddball statistical quirk or a flaw in the model. Without a detailed knowledge of how the data connects to the real-world breakdown of the machine you can only take guesses as to what is wrong. The result is a powerful but sometimes unreliable method of detecting failure, one that can tell you something is up when it isn't.
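To make that concrete, here's a minimal sketch (in Python, with entirely made-up numbers) of a purely data-driven detector: it flags anything that strays far enough from the historical readings, with no physical model of why the machine fails. Run it and a perfectly healthy machine still trips the alarm now and again, which is exactly the kind of false positive I'm talking about.

```python
import random

random.seed(42)

# Purely data-driven detector: flag any reading more than 3 standard
# deviations from the historical mean. There is no physical model of
# *why* the machine fails -- just the statistics of past readings.
history = [random.gauss(100.0, 5.0) for _ in range(1000)]  # healthy-machine data
mean = sum(history) / len(history)
std = (sum((x - mean) ** 2 for x in history) / len(history)) ** 0.5

def looks_faulty(reading, threshold=3.0):
    """Return True if the reading is a statistical outlier."""
    return abs(reading - mean) > threshold * std

# A perfectly healthy machine still produces the occasional outlier:
# with roughly 0.3% of readings beyond 3 sigma, a sensor sampled once
# a minute "detects" a fault purely by chance every few days.
healthy_readings = [random.gauss(100.0, 5.0) for _ in range(10000)]
false_alarms = sum(looks_faulty(r) for r in healthy_readings)
print(f"False alarms from a healthy machine: {false_alarms} out of 10000 readings")
```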

Where is this engineering analogy going? Well, let's look at this as a terrorism-detection problem. You have a large amount of data and a corroborating hard intelligence report on a suspect. The hard intelligence report represents something similar to that physical model of failure. It's your grounding in the real world outside of abstract models and big data. That grounding tells you 'okay, something isn't right here and we know EXACTLY how this is going down because we have a detailed understanding of the processes behind it'. Conversely, if you only have data then you can still recognise behavioural, communication and movement patterns that look like recorded behavioural patterns of terrorists. The thing is, you don't have that evidence-based grounding in what is actually happening in the real world. Of course in the US and areas where the US has influence you can quite easily get that physical grounding by sending a bloke to see if the person in question is building a bomb or shagging the neighbor's missus. But in areas where that might be impossible there might be a temptation to go straight to the direct-action phase. To 'fix the problem before it becomes a fault', to carry the analogy through. We know the US uses metadata to target its drone strikes. The temptation might be to use these models to identify targets for those types of strikes too. Or to tip off a 'friendly' government with a less than excellent human rights record to go find out what this guy knows. Or alternatively to end up with a much-needed CIA agent running around Kandahar chasing ghosts. None of those are particularly useful or appealing results in the fight against organized (or disorganized) terrorism.

Nate Silver writes about this type of failure fairly eloquently in 'The Signal and the Noise', warning of the dangers of over-reliance on data. It's easy to see 'signal' when there's only noisy data displaying a statistically improbable but possible pattern. The fact that you have this vast repository of data means you can construct complex models to find your terrorists, but without good human intelligence those models might have no grounding in the real world. For example, in the 1960s and 70s governments around the world started to discover the uses of satellite data in assessing the capabilities of their Cold War enemies. They could gather huge amounts of data and analyze it from space without having to risk lives or pay for expensive foreign intelligence officers to collect the information. This led to an over-reliance on satellite rather than good old human intelligence. It came to a head when the Soviet Union's space reconnaissance and signals intelligence detected the build-up to NATO's Able Archer 83 exercise. Without adequate human intelligence inside Washington and Whitehall (a rare failure for the much-feared KGB) the Soviets were left with their data and their models (which reached absurd levels of paranoia as they counted the lights on inside the MOD). They never asked their intelligence agents on the ground, preferring to interpret the data centrally from Moscow. The result was that the USSR dramatically misinterpreted the Able Archer 83 exercise as a prelude to a first strike by NATO. Their model got it spectacularly wrong, but by simply consulting their agents and getting a read on the situation through a low-level understanding of the thinking inside the NATO establishment they could have caught the mistake in their data-driven reasoning early. And it was only after the fact, after cooler heads prevailed in the Kremlin, that human intelligence (Oleg Gordievsky, for instance) told the West how close the Politburo had been to pressing the Big Red Button.

This is an extreme example of the problem, but other failures in making low-level connections due to reliance on remotely collected data can be seen in the spectacular failures of intelligence on Al Qaeda and associated groups leading up to 9/11. To their credit, the West's intelligence establishments have since stepped up their human intelligence game to face the very human threat from suicide bombers and masked gunmen. These are cells that can't usually be found on satellite, can't be tracked by SIGINT and can't have their codes broken by a modern-day Turing. The only way to crack these cells is by good application of human intelligence. But then comes the ability to track the ways that terrorists DO communicate. The ability to look through social networking histories, track cellphone data and sniff out traces of communications through the internet is a great boon to intelligence agencies. It can be used to confirm suspicions and find elusive terrorist leaders. It can be used to correlate intelligence from various sources and to read the private communications of those who would carry out atrocities. But this can't be the only thing agencies look at. A slide into reliance on this data could mirror the dangers of relying on satellite and SIGINT data in the 1980s. Only this time there is, perversely, more scope for error. The massively increased load of data and the much larger pool of 'suspects' make the task of sorting the signal from the noise that much harder, especially when dealing with signals from clandestine behaviour that could be anything from a drugs deal to a terrorist plot to someone having an illicit affair. And who is to say that the next threat (likely to come from cyberspace itself) will be vulnerable to the intelligence-gathering techniques of PRISM and its ilk? As terrorists learned to hide from satellites, maybe the next threat to peace and security will learn to evade the trawls of social networks and the monitoring of cell phones. There's a common factor in any intelligence operation: somewhere behind the data is a human with an idea, and you need a human to read that intent and act upon it. That will always be true.
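To put a rough number on that signal-to-noise problem, here's a back-of-the-envelope base-rate calculation. The population, threat and accuracy figures are entirely illustrative, not drawn from any real program; the point is how quickly innocent false positives swamp the real hits, no matter how good the model looks on paper.

```python
# Back-of-the-envelope base-rate arithmetic (all figures illustrative).
population = 300_000_000      # people whose data gets swept up
actual_threats = 3_000        # genuine plotters hidden in that population
true_positive_rate = 0.99     # the model flags 99% of real threats
false_positive_rate = 0.001   # and wrongly flags 0.1% of innocent people

flagged_threats = actual_threats * true_positive_rate
flagged_innocents = (population - actual_threats) * false_positive_rate

precision = flagged_threats / (flagged_threats + flagged_innocents)
print(f"People flagged: {flagged_threats + flagged_innocents:,.0f}")
print(f"Chance a flagged person is a real threat: {precision:.1%}")
# Even with these optimistic accuracy figures, roughly 99 out of every
# 100 flagged people are innocent -- and every one of them needs a
# human to follow up.
```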

The intelligence community is definitely interested in this technological advance (see http://www.stanford.edu/group/mmds/slides2012/s-fahey.pdf, and there's more out there like it). Nations are already interested in how big data can help with their security problems. However, I can't help but feel that at best, and with the best oversight possible, you're going to get a system that puts out a lot of false positives and wastes a lot of security service time, and at worst one that poses a serious threat to the life and liberty of innocents, while putting the intelligence agencies into a position of over-reliance on a technology that might not be capable of dealing with future threats.