14 February 2006
I’m beginning a collaboration with British Telecom to analyze their massive call network dataset. This is a dynamic, directed network that contains ~250 million nodes (i.e., distinct phone numbers), with ~2000-5000 edges (i.e., calls) generated each second. The phone numbers are of course one-way hashed, so it is impossible to link a node’s identifier to an actual phone number. However, we do have information about the country and region to which each node belongs (i.e., country code / area code). While the dataset does not include every call to and from the UK, it is estimated to cover approximately 80% of landline calls and 30% of mobile calls.
So my question to the complex systems / social network community is this: what are some questions we should attempt to ask of this dataset? Possible examples include calculating the strength of a particular region’s relationships with other regions and countries, analyzing the dynamics involved in “call cascades”, inferring the average size of an individual’s hierarchical social groups (from close friend to possible acquaintance), etc.

While many metrics may be impossible to calculate for a network of this magnitude, simple sampling can yield interesting results. For example, the plot above shows the duration of outgoing calls from 100,000 randomly sampled nodes, in 6-month intervals from October 1995 to March 1998. There is clearly an increasing number of very long calls (over 10^4.2 seconds, roughly 4.4 hours), which could be a good indicator of the uptake of dial-up internet in the UK during this period.
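For readers who want to try this kind of sampling themselves, here is a minimal sketch in Python. It is not the code used to produce the plot. The record layout `(caller_id, start_datetime, duration_seconds)` and the function names are assumptions made for illustration; only the 10^4.2-second threshold and the October 1995 start date come from the post.

```python
import random
from collections import defaultdict
from datetime import datetime

# Threshold taken from the post: 10^4.2 s is about 15,849 s (roughly 4.4 hours).
LONG_CALL_SECONDS = 10 ** 4.2


def half_year_bucket(ts):
    """Index of the 6-month window containing ts, counting from October 1995."""
    months_since_start = (ts.year - 1995) * 12 + (ts.month - 10)
    return months_since_start // 6


def long_call_share(calls, sample_size, seed=0):
    """Share of very long outgoing calls per 6-month window, for sampled callers.

    calls: iterable of (caller_id, start_datetime, duration_seconds).
    This record layout is hypothetical, not the dataset's real schema.
    """
    calls = list(calls)
    callers = sorted({caller for caller, _, _ in calls})
    rng = random.Random(seed)
    sampled = set(rng.sample(callers, min(sample_size, len(callers))))

    total = defaultdict(int)
    long_calls = defaultdict(int)
    for caller, start, duration in calls:
        if caller not in sampled:
            continue
        bucket = half_year_bucket(start)
        total[bucket] += 1
        if duration > LONG_CALL_SECONDS:
            long_calls[bucket] += 1
    return {b: long_calls[b] / total[b] for b in sorted(total)}
```

Sampling callers first and then keeping all of their outgoing calls mirrors the "randomly sampled nodes" idea; sampling individual calls instead would over-represent heavy users.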
Posted by Nathan Eagle at February 14, 2006 9:21 AM
Pretty amazing data set. It would be interesting to aggregate these data at the regional level and to examine how they correlate with a variety of economic and political variables-- everything from social capital measures, to average gdp, to political turnout.
Obviously, one could look at the relationship between distance and communication, the degree of triad closure, reciprocity, etc.
If you had these data over an extended period, one could examine the purely structural components of how networks evolve which have been impossible to date-- e.g., dynamic tendencies toward balance in networks, etc.
Many many more possibilities-- will be interested in hearing what other people have to say.
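As a rough illustration of two of the measures mentioned in this comment, the sketch below computes reciprocity and triad closure from a list of directed call edges. The definitions chosen here (share of edges with a reverse edge, and closed two-paths over all two-paths on the undirected projection) are common conventions, assumed for the example rather than taken from the comment; the function names are hypothetical.

```python
from itertools import combinations


def reciprocity(edges):
    """Share of directed edges (u, v) whose reverse edge (v, u) also exists."""
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


def triad_closure(edges):
    """Closed two-paths divided by all two-paths, ignoring edge direction."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # skip self-calls
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    two_paths = 0
    closed = 0
    for nbrs in neighbours.values():
        # Every pair of neighbours of a node forms a two-path through it.
        for a, b in combinations(nbrs, 2):
            two_paths += 1
            if b in neighbours[a]:
                closed += 1
    return closed / two_paths if two_paths else 0.0
```

On a network of this size the pairwise loop would only be practical on sampled subgraphs, since the number of two-paths grows with the square of a node's degree.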
Posted by: David Lazer at February 14, 2006 2:44 PM
Great data set. Some questions and ideas
- When are calls made, and to where? (Pricing)
- Household size (number of members)
- Let's say Person X lived in city X for a period of time. Now when X moves to city Z it would be interesting to see how the old city X network collapses (if it does) and a new one evolves.
- Also differentiate the data into private, gov (public admin), gov (politicians) and business entities
- Which country has the most connections out of the UK (i.e. implications would be a stronger cultural understanding, support, etc...)
Posted by: Alexander at February 14, 2006 3:02 PM
I've analysed call data for a mobile operator in Australia. There are some fascinating applications, and also some critical limitations. One which I found was that the network represented <30% of all mobile subscribers in Australia, and of course had no information on fixed phones. Your data set is more comprehensive, and certainly from a fixed line perspective pretty much represents the full national network.
I guess questions will be in large part dictated by whether you have a commercial objective or purely sociological research objective - either obviously opens tremendous scope. For example, on the latter, even egocentric measures of connectedness (even simple degree, especially re reciprocity) could be used as indicators of social capital. At a regional level these could be correlated with economic indicators of poverty, etc (NB sociology isn't my forte!)
At a commercial level, I found it very helpful to also tie to other customer demographic data stored in the CRM warehouse. Some interesting insights can be gained about "network value" of customers, compared to the more traditional individual measures of customer value. For example, valuing customers based on a weighted value of their n-degree networks, or centrality, etc. There are limitations, given the size of the network, to how much full network data you can obtain (eg for centrality measures), and to processing grunt. However, even running automated snowball sampling can help generate egocentric networks pretty quickly and give some interesting results. Even so, you tend to run up against the edges of the BT network reasonably quickly, and won't have access to off-net subscribers (but you can infer connectivity if you look at off-nets who connect to already connected BT subscribers). Sorry I digress - some of the methodological issues get pretty interesting.
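To make the snowball sampling and "network value" ideas above a little more concrete, here is a minimal sketch. It assumes the call graph is available as an adjacency mapping, and the geometric `decay` weighting is a hypothetical choice for illustration, not the valuation actually used in the work described.

```python
from collections import deque


def snowball_sample(adjacency, seed, max_depth):
    """Breadth-first snowball from seed; returns {node: hops from seed}."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the requested radius
        for nbr in adjacency.get(node, ()):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return depth


def network_value(adjacency, seed, values, max_depth=2, decay=0.5):
    """Hypothetical weighted value of a customer's n-degree network.

    Each reachable node contributes values[node] * decay ** hops.
    Nodes with no known value (e.g. off-net subscribers) contribute 0.
    """
    reached = snowball_sample(adjacency, seed, max_depth)
    return sum(values.get(n, 0.0) * decay ** d for n, d in reached.items())
```

For example, with `adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"]}`, calling `snowball_sample(adjacency, "a", 2)` reaches `a`, `b` and `c` but stops before `d`.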
On the commercial side there are word of mouth issues related to churn prediction (H0: don't lose central subscribers or you may lose their subscribers) and new service diffusion (eg my work was on social influence predictors of MMS adoption -- interesting if you have call data streams in addition to voice). There may be some interesting questions around network relationships among households (fixed) and personal (mobile), which could give BT insights relevant to service bundling packages and tariffs.
There is obviously much of interest to be explored - best of luck with it!
Posted by: Michael Foulds at June 13, 2006 9:14 AM