7 January 2006

NSA data mining—what patterns to look for (I)

So, what data mining could one do with the data the NSA has collected from telecom companies? Obviously, it is still unclear what is being collected, so this is quite speculative, which is a little different from my normal role as a cautious academic. My hope is that this speculation, in the end, will yield some productive discourse about this important subject. I also want to make clear that I am not endorsing (or condemning) such data mining for now. Later I will discuss some of the privacy and policy issues. For now, I just want to do a thought experiment on how one might analyze these data in a fashion that might detect terrorist activity.

My assumption here is that the objective is to identify candidate nodes (individuals) for surveillance.

I am going to start with what I consider a less expansive scenario. In this particular scenario, one is starting out with some phone numbers and e-mails that are designated as “high risk” (e.g., from other intelligence). A simple analysis would snowball outwards from these high risk nodes to their contacts, and to their contacts’ contacts, etc. As one snowballs outwards, one will likely find overlaps, where some nodes are members of multiple circles. In the simplest analysis, the more circles that a node is a member of (and the closer to the center of those circles), the higher risk they should be considered.
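A minimal sketch of the snowball-and-overlap idea, assuming a contact graph stored as adjacency sets. The node names, the two-hop limit, and the `1 / hops` closeness weight are illustrative assumptions, not anything known about an actual program.

```python
from collections import deque

def snowball(graph, seed, max_hops):
    """Breadth-first expansion from one seed; returns {node: hops from seed}."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the outer edge of the circle
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def overlap_scores(graph, seeds, max_hops=2):
    """For each non-seed node: (number of circles it falls in, summed closeness)."""
    scores = {}
    for seed in seeds:
        for node, hops in snowball(graph, seed, max_hops).items():
            if node in seeds:
                continue  # seeds are already flagged; this also skips hops == 0
            circles, closeness = scores.get(node, (0, 0.0))
            scores[node] = (circles + 1, closeness + 1.0 / hops)
    return scores

# Hypothetical toy contact graph (undirected, stored as adjacency sets).
graph = {
    "seed_a": {"x", "y"},
    "seed_b": {"y", "z"},
    "x": {"seed_a", "w"},
    "y": {"seed_a", "seed_b"},
    "z": {"seed_b", "w"},
    "w": {"x", "z"},
}
scores = overlap_scores(graph, {"seed_a", "seed_b"})
# Rank by number of circles first, then by closeness to the centers.
ranked = sorted(scores, key=lambda n: scores[n], reverse=True)
```

In this toy graph, `y` sits one hop from both seeds and ranks first; `w` is in both circles but at their edge, so it ranks second.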

Obviously, the analysis should get substantially hairier than that, because of the nature of the sampling from the network. For example, I am guessing that the identifications of high risk nodes are not independent events. Imagine that an Al Qaeda cell is identified and its members apprehended in Jordan, and their computers, address books (or equivalents) acquired. One would then snowball outwards from these contacts. However, to find overlap among the contacts of these cell members presumably conveys different information than if one found overlap among the contacts of different cells from different countries (presumably the latter would be more significant).

One could devise a weighting system that depends on the number of paths that go through a particular node, other information about nodes, etc., to develop a ranking of who should be watched. These weights could be validated by fitting them to part of the network data, and then examining whether the technique was effective at identifying those nodes that you knew were already “high risk.”
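One way to make the weighting-and-validation idea concrete is sketched below. It assumes a truncated, Katz-style walk count as the weight (my choice for illustration; the post does not specify a formula) and validates by hiding one known high-risk node and checking where it lands in the ranking. The graph, the `decay` value, and the function names are all hypothetical.

```python
def walk_scores(graph, seeds, max_len=3, decay=0.5):
    """Score nodes by the number of walks from any seed, discounted by length."""
    scores = {}
    current = {s: 1.0 for s in seeds}  # walk counts of length 0
    for step in range(1, max_len + 1):
        nxt = {}
        for node, count in current.items():
            for neighbor in graph.get(node, ()):
                nxt[neighbor] = nxt.get(neighbor, 0.0) + count
        for node, count in nxt.items():
            scores[node] = scores.get(node, 0.0) + (decay ** step) * count
        current = nxt
    return scores

def holdout_rank(graph, known, held_out, **kwargs):
    """Hide one known node, score from the rest, return the hidden node's rank."""
    seeds = set(known) - {held_out}
    scores = walk_scores(graph, seeds, **kwargs)
    candidates = [n for n in graph if n not in seeds]
    ranked = sorted(candidates, key=lambda n: scores.get(n, 0.0), reverse=True)
    return ranked.index(held_out) + 1  # 1 means ranked first

# Hypothetical toy network: k1, k2, k3 are the nodes already known to be high risk.
graph = {
    "k1": {"a", "k3"},
    "k2": {"b", "k3"},
    "k3": {"k1", "k2", "c"},
    "a": {"k1"},
    "b": {"k2"},
    "c": {"k3", "d"},
    "d": {"c"},
}
rank_of_k3 = holdout_rank(graph, {"k1", "k2", "k3"}, "k3")
```

If the held-out nodes consistently rank near the top, the weights are at least recovering what is already known; repeating this over every known node and over different `decay` values would be the "fitting" step.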

Ideally, one would use communication data going back in time as far as possible; thus, while telecom companies are sharing data, you would want them to go back as far as possible. This would also be useful in case you wanted to do sequence and timing analysis: it’s not just who you call, but when you call (say, after some event), or that you called Anne after Joe called you.
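The two timing patterns mentioned here (calls placed shortly after an event, and "you called Anne after Joe called you") can be sketched as follows. The call records, the record layout `(caller, callee, timestamp)`, and the window lengths are made-up assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical call records: (caller, callee, time the call was placed).
calls = [
    ("joe", "you", datetime(2006, 1, 1, 9, 0)),
    ("you", "anne", datetime(2006, 1, 1, 9, 5)),
    ("you", "bob", datetime(2006, 1, 1, 15, 0)),
]

def calls_after_event(calls, event_time, window):
    """Calls placed within `window` after some event of interest."""
    return [c for c in calls if event_time <= c[2] <= event_time + window]

def relay_chains(calls, window):
    """Find A -> B followed by B -> C within `window` (C different from A)."""
    chains = []
    for a, b, t1 in calls:
        for b2, c, t2 in calls:
            if b2 == b and c != a and timedelta(0) < t2 - t1 <= window:
                chains.append((a, b, c))
    return chains

chains = relay_chains(calls, timedelta(minutes=10))
burst = calls_after_event(calls, datetime(2006, 1, 1, 8, 55), timedelta(minutes=15))
```

The nested loop is quadratic in the number of calls, which is fine for a sketch; real call volumes would need the records indexed by caller and sorted by time.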

Obviously, there are lots of difficult issues re sampling. Further, one would hypothesize that any terrorist worth their salt would be careful about recording contact information, and, more generally, their use of electronic communication. And I would guess that most of the people that terrorists communicate with are non-terrorists, and their contacts, in turn, are even less likely to be terrorists, so the vast majority of people caught in this net are going to be non-terrorists. So, to mix metaphors, one may have removed from the haystack proportionally more hay than needles, but you are still left with a very large haystack with just a few needles.
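The haystack point is easy to see with arithmetic. The numbers below are purely hypothetical, chosen only to make the proportions visible; nothing here is an estimate of real populations or of any filter's real accuracy.

```python
def thin(count, keep_one_in):
    """Keep one in every `keep_one_in` items (integer arithmetic)."""
    return count // keep_one_in

# Hypothetical numbers, for illustration only.
needles, hay = 100, 100_000_000

kept_needles = thin(needles, 2)   # assume the filter misses half of the needles
kept_hay = thin(hay, 1000)        # and discards 99.9% of the hay

hay_per_needle_before = hay // needles
hay_per_needle_after = kept_hay // kept_needles
```

Under these assumptions the filter improves the ratio from 1,000,000 pieces of hay per needle to 2,000 per needle: a 500-fold improvement that still leaves almost everyone in the net a non-terrorist.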

Once one has identified some risky nodes, the next step would be to monitor actual communications. Presumably, the NSA has finite capacity to have humans listen to conversations, and thus the key management question is how to allocate this scarce resource. The first level of monitoring would therefore simply be recording of conversations. Presumably, this is fairly cheap to do, so, putting civil liberties concerns aside, one would adopt a pretty low risk threshold for recording. This would allow going back in time for human monitoring if an individual were subsequently identified as high risk. A second level, if it is technically possible (at some level it surely is), would be to apply voice recognition to those recordings, where the content of conversations would adjust the evaluated risk level of those nodes. Further, such voice recognition could pick out candidate snippets of conversations for human monitoring.

Such “snippet-based” monitoring, I think, would explain why the FISA court process was circumvented, since it might result in the brief, human-based monitoring of a very large number of people (conceivably exceeding the number of warrants approved by the FISA court in its history very quickly), and in the computerized monitoring of a still larger number of people. That is, the oversight process specified by FISA would be unable to cope with the sheer volume of requests. Further, the basis of monitoring these snippets is probably weaker than what has traditionally been brought before the FISA court. It would also explain why some defenders of the policy (who presumably know more than has been publicly released) have stated that having a computer monitor your conversation was not a privacy intrusion (thus suggesting that a major component of the program did involve computerized monitoring).
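The allocation logic in this scenario amounts to a two-tier triage: record everything above a low threshold, and spend the scarce human attention on the top of the list. A minimal sketch, with hypothetical risk scores, threshold, and capacity:

```python
import heapq

def triage(risk, record_threshold, human_capacity):
    """Return (nodes to record, nodes to send for human review).

    risk: {node: estimated risk score}; all values here are hypothetical.
    Recording is cheap, so it uses a low threshold; human review is scarce,
    so only the `human_capacity` highest-risk recorded nodes are selected.
    """
    recorded = {n for n, r in risk.items() if r >= record_threshold}
    reviewed = heapq.nlargest(human_capacity, recorded, key=lambda n: risk[n])
    return recorded, reviewed

# Hypothetical scores, e.g. from a network ranking adjusted by call content.
risk = {"n1": 0.9, "n2": 0.4, "n3": 0.15, "n4": 0.05}
recorded, reviewed = triage(risk, record_threshold=0.1, human_capacity=1)
```

Because recordings are kept for everyone above the low threshold, a node whose score rises later can still be reviewed retroactively, which is the point of the first tier.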

This is the less expansive scenario that I have come up with (although how expansive it is depends on a number of parameters, such as how many steps out one goes from the initial sample and what the threshold for monitoring is, so the actual number of people who are in some fashion caught in the net might be anywhere from thousands to millions). This is a pretty rudimentary analysis, as compared to how one would actually do it, but I think it has the essential ingredients. My next entry will consider a more expansive scenario.

Posted by David Lazer at January 7, 2006 12:53 PM

Comments

David,

Good overview on possible NSA data mining... I wrote something similar originally back in 2002 with 9/11 still fresh in our minds.

Just starting with a kernel of 2 known players in the network you can snowball sample quite a bit...

http://www.orgnet.com/tnet.html

Posted by: Valdis at January 7, 2006 3:00 PM