
9 January 2006

NSA data mining—what patterns to look for: expansive scenario (II)

A more expansive scenario would be that the NSA collects all phone log data from US sources, as well as non-US calls that pass through US switches, plus locational information from cell phones where available (and e-mail traffic, etc.).

The expansive scenario offers significant security and logistical advantages to the NSA. The security advantage is that, under the more limited scenario, the NSA would have to share critical security information with the telecoms by asking them for information about only certain individuals. That delimited information is terribly sensitive intelligence: by telling the telecoms whom it wants to monitor, the NSA is essentially telling them whom the government has received intelligence about.

The logistical advantage is that, as the NSA finds out about potentially risky individuals, it can avoid the hurdles of making requests of the telecoms; it could simply access the information instantly as needed.

Would such a massive data set be useful? Probably. Certainly, the locational information would be very helpful, since one would be able to evaluate the physical proximity of people. Further, some of the patterns one would look for would involve the locations of the individuals making and receiving calls: a set of calls from a high-risk source to different numbers in Washington, DC, might be indicative of a potential event there.

One could also refine the techniques to identify members of a loosely connected set of people by testing them on known sets of people. There has been a lot of work recently on identifying groups of people from network data, e.g., by Ken Frank, Mark Newman, and Bernardo Huberman. I’m not sure how their algorithms would scale up to such a massive data set. I suspect that you could produce some algorithm that could do something similar (if not as well) for a dataset of hundreds of millions, although perhaps I am wrong.

This problem is significantly different, since you would be starting with somewhat more information (e.g., that a handful of people belong to a particular group) and would not really want to produce a list of all groups in the data set. Further, you would know not just who talks with whom, but also when they talked and through what medium.

So it would be necessary to produce a new, and much more sophisticated, algorithm, and to test it on a variety of groups where you could validate the results. For example, one could test it on social network analysts: start with a handful of people who you know do social network analysis, and produce an algorithm that does a reasonable job of finding other social network analysts, adjusting the parameters of the algorithm to fit. Repeat this with a variety of groups until you have produced a reasonably robust algorithm.
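To make the "start with a handful of known members and tune the parameters" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the scoring rule (the share of a person's contacts who are seeds), the function names, and the toy call log are all made up, and this is not the approach of Frank, Newman, or Huberman. It also ignores timing and medium, which a real algorithm would want to use.

```python
from collections import defaultdict

def build_graph(call_log):
    """Turn (caller, callee) records into an undirected adjacency map."""
    graph = defaultdict(set)
    for caller, callee in call_log:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph

def score_candidates(graph, seeds):
    """Score each non-seed by the fraction of its contacts that are seeds.
    (Hypothetical scoring rule, chosen only because it is simple.)"""
    seeds = set(seeds)
    scores = {}
    for person, contacts in graph.items():
        if person in seeds or not contacts:
            continue
        scores[person] = len(contacts & seeds) / len(contacts)
    return scores

def expand_group(graph, seeds, threshold):
    """Return the non-seeds whose score meets the threshold."""
    scores = score_candidates(graph, seeds)
    return {p for p, s in scores.items() if s >= threshold}

def tune_threshold(graph, seeds, known_members, thresholds):
    """Pick the threshold with the best F1 against a group you can validate."""
    truth = set(known_members) - set(seeds)
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        found = expand_group(graph, seeds, t)
        true_pos = len(found & truth)
        if true_pos == 0:
            f1 = 0.0
        else:
            precision = true_pos / len(found)
            recall = true_pos / len(truth)
            f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1

# Toy, made-up call log: a, b, c, d form the "known" group; e and f do not.
calls = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("b", "d")]
graph = build_graph(calls)
found = expand_group(graph, {"a", "b"}, 0.5)
best = tune_threshold(graph, {"a", "b"}, {"a", "b", "c", "d"}, [0.3, 0.5, 0.9])
```

Repeating `tune_threshold` over several validated groups, and keeping a parameter setting that does acceptably on all of them, would be the toy analogue of the "repeat this with a variety of groups" step.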

It is hard to say how effective this would be without doing it. And it seems rather unlikely that any groups you could validate on would have communication patterns like those of terrorists. Of course, you might be willing to settle for lots of false positives if you are looking for a terrorist, e.g., 100 or 1000 false positives to find one true positive.
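The false positive trade-off is just arithmetic, and a rough sketch shows how quickly it gets to those ratios. The function names and the population, target, and error-rate figures below are hypothetical, picked only to illustrate the calculation; they are not estimates of anything real.

```python
def precision(true_pos, false_pos):
    """Share of flagged people who are actually in the target group."""
    return true_pos / (true_pos + false_pos)

def false_positives_per_hit(population, targets, sensitivity, false_positive_rate):
    """Expected number of wrongly flagged people for each correctly flagged one."""
    expected_hits = targets * sensitivity
    expected_false = (population - targets) * false_positive_rate
    return expected_false / expected_hits

# 100 and 1000 false positives per true positive, as in the text above.
p_100 = precision(1, 100)    # 1/101, a bit under 1%
p_1000 = precision(1, 1000)  # 1/1001, a bit under 0.1%

# Hypothetical figures: 1,000,000 people, 10 real targets, an algorithm that
# catches half of them and wrongly flags 0.1% of everyone else.
ratio = false_positives_per_hit(1_000_000, 10, 0.5, 0.001)
```

With those made-up inputs, `ratio` comes out at roughly 200 false positives per true positive, so even a seemingly small error rate lands in the range mentioned above once the population is large and the targets are rare.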

Of course, such data would have a lot of potential for "collateral" usage, which I will turn to next.

See the web pages of Ken Frank, Mark Newman, and Bernardo Huberman.

Posted by David Lazer at January 9, 2006 8:02 PM