27 September 2007
Not to take anything away from David Lazer's presentation today at the Applied Stats workshop, but the star of his talk was the data. The crowd favorite appeared to be a dataset of all cell phone transactions over a several-week period for 7,000,000 subscribers somewhere in Europe (he wouldn't say where). David and his colleagues have built a graph of interpersonal connections based on the call data, and are trying to answer questions like, "How many degrees of separation are there between two randomly selected people in the network?" (Answer: 13.) But to me an even more compelling question came up in the Q&A session: where do you get data like this?
David's answer was basically that you need to know the right people; it sounded as if he or one of his colleagues knew key executives at the phone company who were able to provide the call records. Lee Fleming offered that grad students might find their way to data like this by getting to know scholars like David who have access to it. (How many degrees of separation are there between you and your dream dataset?)
But the importance of knowing cell phone execs would be the wrong takeaway from David's talk, which after all was basically about how we are all awash in data these days. Yes, to get data on cell phone calls you may need to have friends at the phone company, and yes, to get information on where a group of MIT students spends every hour of the day over a few weeks you will have to launch your own experiment (as described in David's talk today), but for those of us with fewer connections and smaller research budgets there is still an enormous amount of data out there to collect, much of it from the web. I've actually spent a fair amount of time in the past year learning how to collect data from the web, and I look forward to blogging here about web scraping and other data collection approaches in the next few months. But right now I'm going to go check whether David left any tracking devices in my bag.
Posted by Andy Eggers at September 27, 2007 12:44 AM
Is there any way people such as me, in a faraway land, can get a look at the presentation? It sounds very interesting!
Posted by: Dibyo at September 27, 2007 6:11 AM
3 cheers for PERL.
Posted by: patrick at September 27, 2007 9:47 AM
@Dibyo -- I'll check to see if there are slides available and post another comment if they are.
@Patrick: People do cool things with Perl (including you!). For someone starting out I would recommend Ruby or Python because I prefer the syntax. But choice of scripting language is probably best left to another post/flamewar, I suppose.
Posted by: Andy at September 27, 2007 10:47 AM
David's slideshow can now be found on the Gov 3009 course page. (You have to scroll down a little.)
Posted by: Andy at September 27, 2007 3:17 PM
Or the better question would be, "Where do you get data like this, legally?"
Posted by: Okinawa at September 29, 2007 9:48 PM
Believe it or not, it is quite easy to search Google and find what you want in people's database backup files if you know how to search effectively.
I'll give you a hint: they have the SQL extension and you can search for them by their version number and find lists of databases wide open for download.
Posted by: Myrtle Beach Condo Rentals at October 4, 2007 5:01 PM
The other option is to work with a PhD student who is employed by the company -- that would be me ;)
I'm in a joint program between a university and a private company that is big enough to have a reasonable R&D department and desperate enough to fund theoretical research. Because many European universities are desperate for cash, programs like this have been developing recently, and they are generally beneficial, provided the company is knowledgeable enough to value science for what it is. Complex networks (the only field that would justify a call-level database) are very interesting for companies, to the point that we are barred from certain operations not by the Information authority but by the Competition authority. We can't really cry foul at business regulation getting in the way of science: if the data were processed the way we would have needed, it would certainly be used by other people in the company.
This data looks like the one I'm working on, actually, and even from within the company, there was a lot of social engineering. No copy sitting on-line that I know of, though. Maybe we could try to swap--have to talk to my boss first.
Posted by: Bertil at October 5, 2007 6:41 AM
Wow... 7,000,000 cell phone records, that's a lot of data to compile.
Posted by: guna at October 7, 2007 8:51 AM