9 January 2006

NSA data mining—what patterns to look for: expansive scenario (II)

A more expansive scenario would be that the NSA collects all phone log data from US sources as well as non-US calls that pass through US switches, plus locational information from cell phones where available (and e-mail traffic, etc.).

The expansive scenario offers significant security and logistical advantages to the NSA. The security advantage is that, under the more limited scenario, the NSA would have to share critical security information with the telecoms by asking them for information about only certain individuals. That delimited information is terribly sensitive intelligence: by telling the telecoms whom it wants to monitor, the NSA is essentially telling them whom the government has received intelligence about.

The logistical advantage is that, as the NSA finds out about potentially risky individuals, it can avoid the hurdles of making requests of the telecoms and instead access the information instantly as needed.

Would such a massive data set be useful? Probably. Certainly, the locational information would be very helpful, since one would be able to evaluate the physical proximity of people. Further, some of the patterns one would look for would involve the locations of individuals making and receiving calls: a set of calls from a high-risk source to different numbers in Washington, DC, might be indicative of a potential event there.

One could also refine the techniques to identify members of a loosely connected set of people by testing them on known sets of people. There has been a lot of work recently on identifying groups of people from network data, e.g., by Ken Frank, Mark Newman, and Bernardo Huberman. I’m not sure how their algorithms would scale up to such a massive data set. I suspect that you could produce some algorithm that could do something similar (if not as well) for a dataset of hundreds of millions, although perhaps I am wrong.

This problem is significantly different, since you would be starting with somewhat more information (e.g., that a handful of people belong to a particular group) and would not really want to produce a list of all groups in the data set. Further, you would know not just who talks with whom, but also when they talked and through what medium.
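
As a rough illustration of what "more than who talks with whom" might look like as data, here is a minimal Python sketch. The record fields, the medium labels, and the choice of summaries (counts by medium and by hour of day) are all assumptions made for the example, not a description of any actual data format.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Contact:
    """One communication event (hypothetical schema)."""
    source: str
    target: str
    when: datetime
    medium: str  # assumed labels, e.g. "phone" or "email"


def edge_features(contacts):
    """Summarize contacts per unordered pair of people.

    For each pair, count events by medium and by hour of day.
    """
    features = defaultdict(lambda: {"by_medium": Counter(), "by_hour": Counter()})
    for c in contacts:
        pair = tuple(sorted((c.source, c.target)))
        features[pair]["by_medium"][c.medium] += 1
        features[pair]["by_hour"][c.when.hour] += 1
    return dict(features)


log = [
    Contact("a", "b", datetime(2006, 1, 9, 20, 2), "phone"),
    Contact("b", "a", datetime(2006, 1, 9, 20, 40), "phone"),
    Contact("a", "b", datetime(2006, 1, 10, 9, 15), "email"),
]
summary = edge_features(log)
```

A ties-only network would collapse these three events into a single a-b link; keeping the timing and medium is what would give a group-finding algorithm more to work with.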

So it would be necessary to produce a new, and much more sophisticated, algorithm, testing it on a variety of groups where you could validate the results. For example, one could test it on social network analysts: start with a handful of people who you know do social network analysis, and produce an algorithm that does a reasonable job of finding other social network analysts, adjusting the parameters of the algorithm to fit. Repeat this with a variety of groups until you have produced a reasonably robust algorithm.
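
To make the test-and-adjust loop concrete, here is a deliberately simple Python sketch. The scoring rule (the fraction of a person's contacts who are known members), the single threshold parameter, and the toy call list are all assumptions chosen for illustration; a real algorithm would need to be far more sophisticated.

```python
from collections import defaultdict


def build_graph(calls):
    """Build an undirected contact graph from (caller, callee) pairs."""
    graph = defaultdict(set)
    for a, b in calls:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def expand(graph, seeds, threshold):
    """Return non-seeds whose share of contacts that are seeds is >= threshold."""
    seeds = set(seeds)
    found = set()
    for person, contacts in graph.items():
        if person in seeds:
            continue
        if len(contacts & seeds) / len(contacts) >= threshold:
            found.add(person)
    return found


def tune_threshold(graph, known_group, seeds, thresholds):
    """Pick the threshold that best recovers held-out known members (by F1)."""
    held_out = set(known_group) - set(seeds)
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        found = expand(graph, seeds, t)
        hits = len(found & held_out)
        precision = hits / len(found) if found else 0.0
        recall = hits / len(held_out) if held_out else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Toy data: a, b, c form the known group; a and b are the starting handful.
calls = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
graph = build_graph(calls)
best = tune_threshold(graph, {"a", "b", "c"}, {"a", "b"}, [0.0, 0.5, 1.0])
```

Repeating `tune_threshold` over several validated groups and checking whether one setting works across all of them is the "reasonably robust" test described above.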

It is hard to say how effective this would be without doing it, and it seems rather unlikely that any group you could validate against would have communication patterns like those of terrorists. Of course, you might be willing to settle for lots of false positives if you are looking for a terrorist, e.g., 100 or 1000 false positives to find one true positive.
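
The false-positive trade-off is easy to put in numbers. The short Python sketch below computes only what the ratios above imply; the function names are mine.

```python
def precision(true_positives, false_positives):
    """Share of flagged people who are actually in the target group."""
    return true_positives / (true_positives + false_positives)


def flagged_per_hit(false_positives_per_hit):
    """People who must be looked at to find one true positive."""
    return false_positives_per_hit + 1


for fp in (100, 1000):
    print(f"{fp} false positives per hit: "
          f"precision {precision(1, fp):.4f}, "
          f"{flagged_per_hit(fp)} people flagged per true positive")
```

At 100 false positives per hit, a bit under 1% of flagged people are real targets; at 1000, a bit under 0.1%.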

Of course, such data would have a lot of potential for "collateral" usage, which I will turn to next.

See web pages of Ken Frank, Mark Newman, and Bernardo Huberman.

Posted by David Lazer at January 9, 2006 8:02 PM

Comments

In the wake of the latest revelations of NSA data collection, I was wondering whether, beyond the well-publicized debate over its legality and civil liberties issues, it makes practical sense to do so. That is, does the NSA have the capability to pull out from the vast volume of data collected the information it seeks? And is it worth the cost in money, manpower, and time? Are there other, perhaps more practical alternatives (such as good old-fashioned human spies)?

Posted by: Tom LeCompte at May 12, 2006 12:50 PM
