27 September 2007
I presented some of my work on "computational social science" at the Applied Stats workshop yesterday. One of the questions that came up was what the best tools are for dealing with unusual and massive data sets. Clearly, part of the answer is that nothing is truly "off the shelf," so you have to write a lot of code from scratch. But the other part is that some flexible platforms and tools are vastly better than others, and I would be interested in comments on what you find useful for, say, datasets with millions of observations, or for pulling text and link structure off of the web.
Posted by David Lazer at September 27, 2007 10:25 AM
I enjoyed your presentation quite a bit, and thanks for linking to my post on the Social Science Stats blog.
I can't speak to the best frameworks for doing computation with large datasets, but I have some experience with collecting text off the web and also with storage. For collecting data off of the web, people generally use a scripting language like Perl, Python, or Ruby (my favorite), each of which has libraries for 1) navigating websites programmatically and 2) parsing the HTML content. For data storage, I use an open-source database system called MySQL or its lighter-weight cousin SQLite.
Ironically, these are the same tools used to create and maintain many of the websites from which you might collect data. It's always better if you can just get the database files directly rather than doing a bunch of clever scripting to replicate them from scratch, but this isn't always possible.
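As a rough illustration of the parse-then-store workflow described above, here is a minimal Python sketch that uses only the standard library (`html.parser` and `sqlite3`). The table layout, the `store_page` helper, and the sample page are assumptions made up for this example, and the HTML is an inline string rather than something fetched from a real site.

```python
import sqlite3
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects text fragments and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag that has one.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep non-empty text fragments (this naive version would also
        # keep the contents of <script> and <style> elements).
        if data.strip():
            self.text_parts.append(data.strip())


def store_page(conn, url, html):
    """Hypothetical helper: parse one page and save its text and links."""
    parser = LinkAndTextParser()
    parser.feed(html)
    conn.execute(
        "INSERT INTO pages (url, body) VALUES (?, ?)",
        (url, " ".join(parser.text_parts)),
    )
    conn.executemany(
        "INSERT INTO links (source, target) VALUES (?, ?)",
        [(url, target) for target in parser.links],
    )
    conn.commit()


# An in-memory SQLite database; pass a file path instead to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE links (source TEXT, target TEXT)")

# Made-up page standing in for downloaded HTML.
SAMPLE_HTML = (
    '<html><body><p>Hello <a href="http://example.org/a">first</a> '
    'and <a href="/b">second</a>.</p></body></html>'
)
store_page(conn, "http://example.org/", SAMPLE_HTML)
link_rows = conn.execute("SELECT target FROM links ORDER BY rowid").fetchall()
```

Keeping the link structure in its own table means the page text and the network of links can be queried separately later on.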
Posted by: Andy Eggers at September 30, 2007 12:00 PM
Hello David et al. I'm currently dealing with a location-tracking dataset obtained from an ultra-wideband (UWB) system that was deployed in an office building and used by 51 people for 6 weeks. I've got thousands of xyz observations (with 4 updates per second), and I'm using MATLAB to query the multiple Excel spreadsheets I've got, extract some meaningful information, and map it with the help of the GIS software MapInfo. By the way, I'm trying to measure the duration and location of interactions in the office.
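One simple way to turn position fixes like these into interaction durations is to count, for each pair of people, the samples at which they were within some distance of each other. The Python sketch below is not the MATLAB workflow described in the comment. The 1.5 m radius, the record layout `(time, person, x, y, z)`, and the sample data are all assumptions chosen for illustration; only the 4-updates-per-second rate comes from the comment.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

SAMPLE_PERIOD_S = 0.25  # 4 updates per second, as stated in the comment
RADIUS_M = 1.5          # assumed proximity threshold, not from the source


def interaction_seconds(observations, radius=RADIUS_M, period=SAMPLE_PERIOD_S):
    """Sum, per pair of people, the time spent within `radius` of each other."""
    # Group position fixes by timestamp.
    by_time = defaultdict(list)
    for t, person, x, y, z in observations:
        by_time[t].append((person, (x, y, z)))

    totals = defaultdict(float)
    for people in by_time.values():
        # Compare every pair of people observed at the same instant.
        for (p1, pos1), (p2, pos2) in combinations(people, 2):
            if dist(pos1, pos2) <= radius:
                totals[tuple(sorted((p1, p2)))] += period
    return dict(totals)


# Made-up observations: A and B are close for two samples, C is far away.
observations = [
    (0.00, "A", 0.0, 0.0, 0.0), (0.00, "B", 1.0, 0.0, 0.0), (0.00, "C", 9.0, 9.0, 0.0),
    (0.25, "A", 0.0, 0.0, 0.0), (0.25, "B", 1.0, 0.5, 0.0), (0.25, "C", 9.0, 9.0, 0.0),
    (0.50, "A", 0.0, 0.0, 0.0), (0.50, "B", 5.0, 0.0, 0.0), (0.50, "C", 9.0, 9.0, 0.0),
]
totals = interaction_seconds(observations)
```

Comparing every pair at every timestamp is quadratic in the number of people present, which is manageable for 51 people. A real analysis would probably also require a minimum contact duration and tolerate missing fixes.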
Posted by: Irene Lopez de Vallejo at October 1, 2007 9:23 AM
I would break up the need for computational tools in data analysis into four different types of tasks: (1) data collection, which at this point means scraping data from the web and from more or less structured text files; (2) data manipulation, which means transforming the data into a data structure in your favorite computational system in such a way that you can "compute" on it (i.e., execute abstract operations on it); (3) data visualization, which means summarizing data in plots, graphs, etc. to get a qualitative impression of your data; and (4) data analysis proper, which means applying mathematical and statistical tools to your data.
Programming languages and software packages out there cover different subsets of these tasks, but it seems obvious to me that it is a good thing if all of them can be performed within a single environment where one task flows naturally into the next. Integrating these four types of tasks is what lies at the heart of the computable data initiative that I helped develop in Mathematica. This is something that could potentially be done in other programming languages and environments (such as Python and R). But I can definitely say that a huge amount of work goes into making all these pieces fit together :-)
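To make the four-task split above concrete, here is a toy Python sketch in which each task is one function and the output of one feeds the next. It is only an illustration of the idea, not how Mathematica's computable data works. The function names and the tiny dataset are invented for the example.

```python
import csv
import io
import statistics
from collections import Counter

# Made-up "more or less structured" text standing in for collected data.
RAW = "name,visits\nalice,3\nbob,5\ncarol,3\ndave,7\n"


def collect(raw_text):
    """(1) Collection: read structured text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def manipulate(rows):
    """(2) Manipulation: convert rows into a typed name -> count mapping."""
    return {row["name"]: int(row["visits"]) for row in rows}


def visualize(values):
    """(3) Visualization: a crude text histogram of value frequencies."""
    counts = Counter(values)
    return "\n".join(f"{v:>3} | {'#' * counts[v]}" for v in sorted(counts))


def analyze(values):
    """(4) Analysis: basic summary statistics."""
    return {"mean": statistics.mean(values), "median": statistics.median(values)}


visits = manipulate(collect(RAW))
chart = visualize(visits.values())
summary = analyze(list(visits.values()))
print(chart)
print(summary)
```

Because every stage consumes ordinary Python objects produced by the previous one, swapping in a real scraper or a plotting library changes one function without touching the others.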
Posted by: Fred Meinberg at October 1, 2007 9:27 PM