Authors' Committee

Chair:

Matt Blackwell (Gov)

Members:

Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
Andy Eggers (Gov)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship

May 23, 2009

Distribution of Swine Flu Cases by Weekday

How would you expect swine flu cases to be distributed by weekday? More specifically, would you expect more cases on weekdays or on weekends? My first reaction is that there will be more cases when there are more social gatherings.

Following this logic, one reason to expect more cases on weekdays is that the susceptible population has more contact with the infected population on weekdays, through school, work, etc. In addition, people are more likely to travel on weekends, which means more contact with infected people while traveling, but because it takes around two days for the virus to show its effects, those cases will not be identified until a couple of days later. Could a weekday excess also be due to the fact that fewer clinical services are provided on weekends and that people are less likely to visit clinics on weekends?

Here is an old graph I made according to the swine flu updates (4/26/2009 - 05/21/2009) published on WHO's website. To be more accurate, I drew a new graph using the number of confirmed new cases rather than the cumulative number of confirmed cases.

As the reporting times for confirmed new cases vary (some at 18:00, others at 6:00, etc.), I kept only the records between 05/01 and 05/21 whose reporting time is 6:00 and redrew the graph. Weekdays are redefined as well. For example, Thursday 6:00 to Friday 6:00 is defined as Thursday. Can you still see any salient patterns, such as a difference between weekdays and weekends? Why is Friday so spiky this time?
SwineFlu3.jpeg
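The weekday relabeling described above can be sketched in a few lines. This is only an illustration: the function name, the input layout (pairs of report time and cumulative count) and the numbers are hypothetical, not the WHO data or the code behind the graph.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def new_cases_by_weekday(reports):
    """reports: (report_time, cumulative_confirmed) pairs, sorted by time.

    Keeps only the 06:00 reports, turns cumulative counts into new cases,
    and labels each 06:00-to-06:00 window by the day on which it starts.
    """
    morning = [(t, c) for t, c in reports if (t.hour, t.minute) == (6, 0)]
    totals = defaultdict(int)
    for (t0, c0), (t1, c1) in zip(morning, morning[1:]):
        if t1 - t0 != timedelta(days=1):
            continue  # a gap: the increase cannot be pinned to a single day
        totals[DAY_NAMES[t0.weekday()]] += c1 - c0
    return dict(totals)

# Hypothetical numbers for illustration only.
reports = [
    (datetime(2009, 5, 7, 6, 0), 100),   # Thursday 06:00
    (datetime(2009, 5, 7, 18, 0), 120),  # dropped: not a 06:00 report
    (datetime(2009, 5, 8, 6, 0), 150),   # Friday 06:00
    (datetime(2009, 5, 9, 6, 0), 155),   # Saturday 06:00
]
print(new_cases_by_weekday(reports))
```

Skipping windows that are not exactly 24 hours long is a design choice made here so that no increase is credited to the wrong day; other treatments of missing reports are possible.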

Posted by Weihua An at 12:38 AM

April 4, 2009

Can Nonrandomized Experiments Yield Accurate Answers?

Here is some recent progress (recent to me, at least) on causal inference. William R. Shadish, M. H. Clark, and Peter M. Steiner published a paper in JASA (December 1, 2008, 103(484): 1334-1344) based on "a randomized experiment comparing random and nonrandom assignments". Basically, "In the randomized experiment, participants were randomly assigned to mathematics or vocabulary training; in the nonrandomized experiment, participants chose their training." As the authors acknowledged, the randomized and nonrandomized experiments unsurprisingly provided different estimates of the training effects, very likely because of selection bias caused by math phobia. The key finding is that statistical adjustments, including propensity score stratification, weighting, and covariance adjustment, can reduce estimation bias by about 58-96%.

Here is a link to the PPT of the paper. The comments on the paper are also very insightful.

Posted by Weihua An at 10:31 PM

March 7, 2009

How to Take Log of Zero Income

I encountered a problem when using a lognormal distribution to model the income distribution. Namely, a number of people in my dataset report zero income, maybe due to unemployment, and I am wondering how to take the log of the zero incomes. I notice that some researchers just drop the observations with zero income, while others assign them a small amount of income so that the logarithm can be taken legitimately. Obviously, we can try both ways to see how the results hold up. But I am wondering if there are experts on this topic who can clarify the pros and cons of these and other approaches to treating zero incomes.

A related question is which model you think fits the income distribution best: a lognormal, a power distribution, a mixture of a normal and a point mass at zero, and so on.
I look forward to your thoughts on these questions.
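To make the options concrete, here is a minimal sketch of the two treatments mentioned above (dropping zeros, adding a small amount before taking logs), plus a simple two-part summary that keeps the share of zeros separate from a lognormal fit to the positive incomes. The function name, the `shift` parameter and the sample incomes are all hypothetical.

```python
import math
from statistics import mean, pstdev

def log_income_options(incomes, shift=1.0):
    """Return three ways of handling zero incomes before a log transform."""
    positives = [y for y in incomes if y > 0]

    # Option 1: drop the zero-income observations.
    dropped = [math.log(y) for y in positives]

    # Option 2: add a small amount to every income (the choice of `shift`
    # is arbitrary and can change the results).
    shifted = [math.log(y + shift) for y in incomes]

    # Option 3: two-part summary, a point mass at zero plus a lognormal
    # fitted to the positive incomes only.
    two_part = {
        "p_zero": 1.0 - len(positives) / len(incomes),
        "mu": mean(dropped),
        "sigma": pstdev(dropped),
    }
    return dropped, shifted, two_part

incomes = [0, 0, 100, 1000]  # hypothetical data
dropped, shifted, two_part = log_income_options(incomes)
print(len(dropped), shifted[0], two_part["p_zero"])
```

Comparing estimates across several values of `shift` is one quick way to see how sensitive a model is to the second option.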

Lastly, here is an interesting animation of the income distribution in the USA.

Posted by Weihua An at 6:07 PM

February 21, 2009

My Basketball Friend

I met one of my friends on the basketball court. This is selection: I chose him as a friend because he plays good basketball and is an avid player. We have been friends for almost three years. When either of us wants to play, most of the time we call each other and meet on the court. I think that without knowing him I would still play basketball, but not as often. So we influence each other. Sometimes we eat Vietnamese noodles together at Le's right after a game. Contextual factors matter, but it is he who makes me eat noodles more often than I would by myself. Our friendship probably has some impact on both of our weights and may make them change more synchronously. Similarly, if you are a runner, you will surely like running with your friends and may run more because you have a runner as a friend. So the empirical question is whether you indeed play more basketball when you gain a friend who likes playing basketball, and run more when you gain a runner friend. It is also possible that because you play more or run more, you eat more, which offsets the weight loss from the extra exercise.

Given only observational data, it is hard to disentangle the effects of selection, induction and contextual factors on weight changes. We would have to assign you friends (roommates) randomly and check whether you and your friends gain or lose weight together, possibly because you two play more basketball, run more, eat similar things, have similar lifestyles, share similar standards about what constitutes a normal weight, etc.

It is interesting to see that the effects of friendship seem to be directional, or asymmetric. Only people you regard as friends can induce you to lose weight. You cannot induce a person who does not regard you as a friend to lose weight, even if you regard him as your friend. This is a kind of opinion-leader effect.

The directionality of friendship effects also counters the challenge from the contextual factors hypothesis, because if contextual factors mattered, you would expect friends' weight changes to correlate without directionality. Also, if they mattered, you would expect your neighbors' weight changes to synchronize with yours, and the weight of a friend who lives hundreds of miles away should not correlate with yours. But neither expectation is corroborated by the data.

Hence selection should be the largest concern in this case. Now the questions are whether using changes in weight or in obesity status will remove the selection effect and how we could control for it better.

One of my friends told me two weeks ago that he did not buy the points in "The Spread of Obesity in a Large Social Network over 32 Years" until he read the real paper. I confessed, "Same here." Read the real paper, not the popular press. But you are absolutely not obligated to buy the points. Here is more reading.

K.P. Smith and N.A. Christakis, "Social Networks and Health," Annual Review of Sociology 34: 405-429 (August 2008)

Journal of Health Economics, Volume 27, Issue 5, September 2008

Ethan Cohen-Cole, Jason M. Fletcher, "Is obesity contagious? Social networks vs. environmental factors in the obesity epidemic", Pages 1382-1387.

Justin G. Trogdon, James Nonnemaker, Joanne Pais, "Peer effects in adolescent overweight", Pages 1388-1399.

J.H. Fowler, N.A. Christakis, "Estimating peer effects on health in social networks: A response to Cohen-Cole and Fletcher; and Trogdon, Nonnemaker, and Pais", Pages 1400-1405.

P.S. My friend and I have successfully induced several of our friends who did not originally play basketball to play more. But hopefully they will gain some weight rather than lose it, so that we can all play stronger and better.

Posted by Weihua An at 9:01 AM

February 15, 2009

Bayesian Propensity Score Matching

Many people have realized that the conventional propensity score matching (PSM) method does not take into account the uncertainty in estimating propensity scores. In other words, PSM assumes that each observation has a single fixed propensity score. In contrast, Bayesian methods can generate a sample of propensity scores for any observation, either by monitoring the posterior distributions of the estimated propensity scores directly or by predicting propensity scores from the posterior samples of the parameters of the propensity score model.

Matching on the propensity scores thus obtained, we should expect to get a distribution of estimated treatment effects. This also provides us with an estimate of the standard error of the treatment effect. The Bayesian S.E. should be larger than the S.E. based on the PSM estimate, as it takes more uncertainty into account. This conjecture is indeed confirmed by a recent paper by Lawrence C. McCandless, Paul Gustafson and Peter C. Austin, "Bayesian propensity score analysis for observational data", which appears in Statistics in Medicine (2009; 28:94-112). The authors show that the Bayesian 95% credible interval for the treatment effect is 10% wider than the conventional propensity score C.I.

It seems that we should expect Bayesian propensity score matching (BPSM) to perform better than PSM in cases where there is a lot of uncertainty in estimating the propensity scores. Before running any simulations, however, the question is: what are the sources of uncertainty in estimating propensity scores? From my point of view, there is at least one source: omitted variables. I do not think BPSM can do any better than PSM in solving this issue. But maybe BPSM can model the error terms and so provide better estimates of the propensity scores? The above authors argue that when the association between treatment and covariates is weak (i.e., when the betas are smaller), the uncertainty in estimating propensity scores is higher. Weak association means a smaller R-squared or a larger AIC, etc. Is this equivalent to larger bias due to omitted variables?

Another type of uncertainty related to BPSM, but not to propensity scores, is the uncertainty due to the matching procedure. This is avoidable or negligible. More radically, we can just abandon matching and resort to a linear regression model to predict the outcomes. Or we can neglect the bias from the matching procedure, because when we only care about the ATT and there is a sufficient number of control cases, the bias is negligible, according to Abadie and Imbens (2006), "Large Sample Properties of Matching Estimators for Average Treatment Effects," Econometrica 74 (1): 235-267.

Of course, the logit model for the propensity scores could be wrong as well. But this can be manipulated in the simulations. Now my question is: how should we do the simulations to evaluate the performance of BPSM vs. that of conventional PSM?
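One possible starting point for such a simulation is sketched below. It is not the procedure of McCandless, Gustafson and Austin: as a stand-in for a real posterior sample it uses a normal approximation around the logit MLE (roughly what a flat prior would give in large samples), and the data-generating process, sample size, coefficients and function names are all assumptions made up for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, t, iters=25):
    """Newton-Raphson MLE for a logit model; returns (beta_hat, approx. covariance)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        info = X.T @ (X * (p * (1.0 - p))[:, None])   # observed information
        beta = beta + np.linalg.solve(info, X.T @ (t - p))
    p = expit(X @ beta)
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return beta, (cov + cov.T) / 2.0                  # symmetrize for the sampler

def att_by_matching(ps, t, y):
    """ATT from 1-nearest-neighbour matching on the score, with replacement."""
    treated, control = np.flatnonzero(t == 1), np.flatnonzero(t == 0)
    gaps = np.abs(ps[treated][:, None] - ps[control][None, :])
    return float(np.mean(y[treated] - y[control][gaps.argmin(axis=1)]))

def simulate(n=400, true_att=2.0, draws=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))                       # two observed confounders
    X = np.column_stack([np.ones(n), x])
    t = rng.binomial(1, expit(0.8 * x[:, 0] + 0.4 * x[:, 1]))
    y = true_att * t + x[:, 0] + x[:, 1] + rng.normal(size=n)

    beta_hat, cov = fit_logit(X, t)
    psm = att_by_matching(expit(X @ beta_hat), t, y)  # one fixed score per unit

    # Stand-in for posterior draws: normal approximation around the MLE.
    betas = rng.multivariate_normal(beta_hat, cov, size=draws)
    bpsm = np.array([att_by_matching(expit(X @ b), t, y) for b in betas])
    return psm, float(bpsm.mean()), float(bpsm.std(ddof=1))

if __name__ == "__main__":
    psm, bpsm_mean, bpsm_sd = simulate()
    print(psm, bpsm_mean, bpsm_sd)
```

The spread of the estimates across draws isolates only the part of the uncertainty that comes from the propensity score model. A full comparison would repeat `simulate` over many seeds, vary the strength of the treatment-covariate association, and compare bias and interval coverage for the two approaches.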

Posted by Weihua An at 12:06 AM

January 30, 2009

Workshop on Random Network Models

Professor Joseph Blitzstein will offer a course this spring, "Statistics 340. Random Network Models". Those who are interested in the booming network industry should definitely come have a look. The course will be reading- and discussion-based and involves no exams. It will meet regularly from 1 to 2:30 on Fridays in Science Center 706. I took a probability course from Joe and found him a hilarious, encouraging and patient teacher. Some kids from 110 posted clips of his teaching on YouTube. You will like the musical interludes and the Markov ball game.

http://www.youtube.com/watch?v=iAwS7vzvLnY

http://www.youtube.com/watch?v=TQvVLhWOiis

If you want to watch Joe's performance live, better come today.

Posted by Weihua An at 10:48 AM

December 5, 2008

Empirical Implications of Theoretical Models

From my point of view, an applied quantitative social science study is usually a process containing three parts. The first part is about theoretical/formal modeling (with either explicit or implicit assumptions), the second about deriving empirical implications from the model and the last about applying (or inventing in some cases) appropriate statistical methods to collect evidence and evaluate the derived empirical implications.

Professor Lieberson and his coauthor, in a recent article that I point to below, call this entire process implication analysis, whereas I used to think of implication analysis as only the second part of the process, something like comparative static and dynamic analysis. But given that more and more of us are interested in producing work that integrates the above process, it seems natural to give a name to this integrated approach, as distinct from formal analysis and empirical/statistical analysis.

Certainly, the integrated approach increases the complexity of research, as there are many things that can go wrong between theory and data. A symposium on implication analysis in the latest Sociological Methodology (Volume 38 Issue 1, December 2008), which starts with Stanley Lieberson and Joel Horwich's paper, "Implication Analysis: A Pragmatic Proposal for Linking Theory and Data in the Social Sciences," and is followed by five response papers, tries to address some of these issues, including specification of testable hypotheses, assessment of data quality, validation of estimates in different contexts, dealing with inconsistent evidence, etc.

Just FYI, Washington University's Weidenbaum Center and Department of Political Science will sponsor a new summer institute on Empirical Implications of Theoretical Models in politics in 2009.

Here is the institute's website. http://wc.wustl.edu/eitm.html

Posted by Weihua An at 2:28 PM

November 21, 2008

Estimation of the Stereotyped Ordered Regression Model

While reading Xiaogang Wu and Donald Treiman's paper "Inequality and Equality under Chinese Socialism: The Hukou System and Intergenerational Occupational Mobility" in the American Journal of Sociology (2007, 113: 415-445), I was directed to a technical paper by John Hendrickx (2000) describing how to use "mclgen" and "mclest" in Stata to estimate the Stereotyped Ordered Regression Model (SOR) in social mobility studies.

SOR is similar to conventional ordinal logit models, but with a scaling metric that scales the effects of the independent variables so that the effect of an independent variable varies by the value of the dependent variable. In addition, SOR does not assume a strict ordering among the values of the dependent variable, which is perfect for studying occupational mobility, as occupations can be ordered but not strictly. Another desirable property of SOR is that it specifies an inheritance parameter measuring intergenerational occupational immobility, i.e., the extent to which father and son have the same occupation.

These features make SOR appear to outperform ordinal Logit models in social mobility studies.
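To make the scaling idea concrete, one common way to write a stereotype-type ordered regression follows Anderson's stereotype model. The notation below (the symbols and the identifying constraints) is illustrative shorthand and an assumption, not copied from Hendrickx's paper, and the inheritance parameter is left out for brevity:

```latex
\log \frac{P(Y = j \mid x)}{P(Y = 1 \mid x)} = \alpha_j + \phi_j \, \beta^{\top} x,
\qquad j = 2, \dots, J, \qquad \phi_1 = 0, \; \phi_J = 1 .
```

Each covariate has a single coefficient in \beta, and the scale values \phi_j stretch or shrink that effect across outcome categories. Because the \phi_j are estimated rather than fixed in advance, the ordering of the categories is not imposed.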

Click here to consult Hendrickx' paper for more details of the SOR model and the syntax of using "mclgen" and "mclest" in Stata.

Posted by Weihua An at 9:40 PM

November 10, 2008

Conditions under Which Observational Studies Produce Comparable Causal Estimates

In the latest issue of the Journal of Policy Analysis and Management, Thomas Cook, William Shadish and Vivian Wong wrote a paper proposing three conditions under which experimental and observational estimates are comparable, based on their review of 12 recent within-study comparison studies. It is a little confusing, at least to me at first glance, to use "conditions" rather than "designs" here, as what the authors are really arguing is that under three different types of research design, estimates from observational studies are comparable to experimental causal estimates. More specifically, they suggest that:

1) regression discontinuity (RD) estimator produces similar effect estimates to experimental ones;

2) when intact group matching is used to minimize pre-test differences in at least outcome measures between the experiment and comparison populations, estimates from observational studies are trustworthy; and

3) when selection process into treatment is completely or plausibly known and could be properly measured, statistical procedures like propensity score matching can provide unbiased estimates.

As you can see, these three claims are based on selected published or to-be-published studies. But publication bias may lead the authors to overstate these claims: observational studies with estimates comparable to experimental ones may be more likely to be published than those without comparable estimates, so how confidently we can rely on these claims to evaluate the comparability of estimates from observational studies remains ambiguous. In addition, this issue relates to the standards we use to judge comparability. If the standards are fuzzy, our judgment will be fuzzy to some extent as well. But overall, I appreciate the authors' enormous effort in tracing the recent literature on this topic, and the resulting paper is full of wisdom.

As I was finishing this post, I realized that this paper was actually presented at our applied statistics workshop last October. But here comes the official version of the paper. And I think this is a very important topic that is worth a revisit.

Source:
Thomas Cook, William Shadish and Vivian Wong. 2008. "Three Conditions under Which Experiments and Observational Studies Produce Comparable Causal Estimates: Findings from Within-Study Comparisons", Journal of Policy Analysis and Management, Vol. 27, No 4, 724-750.

A previous paper distributed at the applied statistics workshop:
http://www.iq.harvard.edu/blog/sss/archives/2007/10/tom_cook_on_whe.shtml

Posted by Weihua An at 11:23 AM

October 25, 2008

A General Inequality Parameter

There is an interesting paper by Guillermina Jasso and Samuel Kotz in Sociological Methods & Research in which they analyze the mathematical connections between two kinds of inequality: inequality between persons and inequality between subgroups. They show that a general inequality parameter (a shape parameter c of a two-parameter continuous univariate distribution), or a deep structure of inequality, governs both types of inequality. More concretely, they demonstrate that convenient measures of personal inequality, like the Gini coefficient, Atkinson's measure, Theil's MLD and Pearson's coefficient of variation, as well as measures of inequality between subgroups, are nothing but functions of this general inequality parameter c. The c parameter, according to the authors, also governs the shape of the Lorenz curve, a conventional graphical tool for expressing inequality.

Given the unitary operation of this inequality parameter, the authors conclude that there is a monotonic connection between personal inequality and between-group inequality: as personal inequality increases, so does between-group inequality. This conclusion is somewhat surprising and even contradicts our intuition that it is very plausible, if not usual, for personal inequality to change due to within-group transfers while between-group inequality stays the same. The authors admit that their conclusion holds only under a certain set of conditions. For example, the derived relation between the two types of inequality assumes a two-parameter distribution and non-intersecting Lorenz curves. You may consult the full article for more technical details if interested.
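As a small illustration of inequality measures being functions of a single shape parameter, the sketch below uses the Pareto distribution with shape c and minimum 1, for which the Gini coefficient is 1/(2c - 1) and Theil's MLD is ln(c/(c - 1)) - 1/c. The Pareto is used here only as a familiar two-parameter example; whether this matches the paper's own treatment is an assumption, and the function names are hypothetical.

```python
import math

def gini(values):
    """Sample Gini coefficient from the sorted-rank formula."""
    xs = sorted(values)
    n = len(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * sum(xs)) - (n + 1.0) / n

def theil_mld(values):
    """Theil's mean logarithmic deviation: mean of log(mean / x)."""
    mu = sum(values) / len(values)
    return sum(math.log(mu / x) for x in values) / len(values)

def pareto_quantiles(c, n=100_000):
    """Deterministic 'sample': Pareto(shape c, minimum 1) quantiles on a grid."""
    return [(1.0 - (i + 0.5) / n) ** (-1.0 / c) for i in range(n)]

def pareto_gini(c):
    return 1.0 / (2.0 * c - 1.0)

def pareto_mld(c):
    return math.log(c / (c - 1.0)) - 1.0 / c

for c in (3.0, 5.0):
    xs = pareto_quantiles(c)
    print(c, gini(xs), pareto_gini(c), theil_mld(xs), pareto_mld(c))
```

Both measures fall as c rises, so in this one-parameter family they necessarily move together, which is the flavor of the monotonic connection described above.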

Source:
Jasso, Guillermina and Samuel Kotz. 2008. "Two Types of Inequality: Inequality Between Persons and Inequality Between Subgroups." Sociological Methods & Research 37: 31-74.

click here to get a working paper version of that from IDEAS

Posted by Weihua An at 2:40 PM

October 10, 2008

Ten Most Wanted: Culprits of the Collapse

Hiding in the ivory tower, I did not feel any impact of this financial collapse until several friends in Los Angeles told me that they had been laid off and were looking for new jobs. In contrast, another friend who owned a real estate appraisal firm said that his business had actually improved, because more people came to have their homes re-appraised. I am not sure if there is a direct causal effect of the financial collapse on my friends' unemployment. But a more interesting question, which arises very often these days, is what caused the financial collapse, or rather who should be held responsible for it. Mortgage lenders, greedy Wall Street investment banks, the government with its loose regulations, home buyers, or someone else? As quantitative social scientists, are we able to answer this kind of question, and how well? Before building any models, whether formal or empirical, let's see what Anderson Cooper has got for us.

TEN Most Wanted: Culprits of the Collapse
Over the last couple of weeks we've heard politicians tell us that now is not the time to point fingers and blame people for the financial crisis. I remember them saying that in the days after Hurricane Katrina as well. The truth is that's what politicians always say. They mean that now is the time to fix the problem, but once the world's attention moves on, the time for hold people accountable never seems to arrive. Politicians point fingers at members of the opposite party, but no one ever seems to take real responsibility.

So who is to blame for this financial fiasco? That's the question we've begun investigating. We've put together a list of the Ten Most Wanted: Culprits of the Collapse. This week and next week, every night, we will be adding a name to the list and telling you what they have done, and how much it's costing you. It's a rogues gallery of Wall Street executives, politicians, and government officials who did not do their jobs. It's time you know their names, their faces, it's time they be asked to account for their actions. (Excerpt from AC360)

Think about the models while enjoying the videos!

http://ac360.blogs.cnn.com/category/culprits-of-the-collapse/

Posted by Weihua An at 6:56 PM

September 26, 2008

Recommend a Book for Probability Theory

For those of you who want to do some exercises or solve typical problems in probability theory and random processes, I strongly recommend a book by Geoffrey Grimmett and David Stirzaker, One Thousand Exercises in Probability. As the authors say in the preface, there are over three thousand problems in the book, since many exercises include several parts. Personally, I find this book very useful, partly because all exercises come with solutions, which makes it much more readable than many of its counterparts, and partly because I have noticed that some faculty here tend to take exercises from it and put them in class assignments and exams. (Am I the first person here to notice this?) So I recommend this book to you, and hopefully it will help you deepen your understanding of those daunting proofs in probability theory and random processes. With more luck, you may learn how to get used to them in von Neumann's sense.


In mathematics you don't understand things, you just get used to them.

John von Neumann

Posted by Weihua An at 7:59 PM

May 22, 2008

Nicholas and James are Featured in the NYT again

Professor Nicholas Christakis and Professor James Fowler's study on social networks and smoking cessation, which is going to appear in the New England Journal of Medicine this Thursday, is featured in the New York Times. Congratulations to them!

Their basic findings are that smokers are likely to quit in groups (As Nicholas said, "Whole constellations are blinking off at once.") and that the remaining smokers tend to be socially marginalized.

One interesting question I have about their study is this: if friends tend to quit smoking together, will this partly contribute to the simultaneous weight gains among friends, a result Nicholas and James found last year using the same dataset? In other words, I totally accept that social ties have important effects on individuals' wellbeing, but if you study one wellbeing outcome and do not control for the "contaminating" effects of other outcomes, the estimate of the social network effect on the former outcome could be biased. For example, the weight gains among friends, from this point of view, could partly result from their quitting smoking at the same time. Of course, if smokers make up only a very small fraction of the participants in the studied sample and their weight changes are not too extreme, the bias in the estimate should not pose a serious problem.

See the following link for a glimpse of their study.

Study Finds Big Social Factor in Quitting Smoking
http://www.nytimes.com/2008/05/22/science/22smoke.html?partner=rssnyt&emc=rss

Sorry for the duplicate if you have noticed this news.

Posted by Weihua An at 12:01 PM

May 8, 2008

Some Random Notes about the International Network Meeting

Last week we had an International Meeting on Methodology for Empirical Research on Social Interactions, Social Networks, and Health here at the IQ, thanks to the organization by Professor Charles Manski and Professor Nicholas Christakis. Some people told me that the second day of the meeting was much more "dynamic and interactive" than the first day and, based on what I saw, I believe it was true. I saw at least three cliques of speakers form spontaneously on site along disciplinary lines: statisticians, economists, and sociologists and political scientists. There were even sub-cliques and backfires! Fortunately, nobody was severely wounded. But anyway, it was a great intellectual exchange between disciplines. Below are some brief notes I took on the second day of the meeting, particularly in the last 20 minutes, when speakers talked about the future directions of network analysis in the social sciences. Sorry that I forgot to jot down exactly who said what, and that I also squeezed some of my personal thoughts into the notes. I take full responsibility for all errors in the notes.

1. Need to combine game theory with social network analysis, particularly evolutionary game theory (and transaction costs theory).

2. Need to further develop social network analysis based on (random) graph theory, topology and random matrix theory.

3. Network studies tend to focus on network structure and topology as dependent variables, while the social sciences are more concerned with how network positions and features affect node-level problems. To put it simply, network studies tend to start from nodes and end at the network, while the social sciences take more of a top-down approach.

4. In either case, however, it is crucial to understand the data/tie generating mechanism. In particular, keep in mind that the formation of ties can go two ways: influence and selection. For example, smokers can become friends either because a person is influenced by his/her smoking friend to start smoking or because they are both smokers and then become friends. As another example, a highly educated person is usually less likely to be nominated by others as a best friend. This could be either because the highly educated person is less trustworthy or incapable of maintaining friendship ties, or because he/she is more independent and less willing to associate with others. Longitudinal data may help solve the influence vs. selection issue.

5. Network analysis assumes that the probability of forming a tie is the same for any pair of nodes. So start with a meaningful set of nodes to build the network, so that each node has roughly the same probability of forming ties with any other.

6. How will severing an existing tie or forming a new tie affect the structure of a social network? How can ties bring more ties and lead to a polarized network? Nonlinear generating processes and dynamics in a network can lead to dramatic differences in network structure from tiny changes at the node level. How does network size affect network structure? (Think about the differences among a monopolistic market, an oligopolistic market and a perfectly competitive market.)

7. How to define homophily between friends? One dimension vs. multiple dimensions? Suppose it is one dimension; there are still two approaches: 1) do a test of means between the tie senders and the tie receivers; 2) use the ratio of the number of ties whose connected nodes are in the same group that you defined (e.g., age +/- 5) to the total number of ties as an alternative measure. What else?

8. Need to think about how to incorporate network analysis into the traditional regression framework. We can either include network properties in regression models to study how networks affect person- or clique-level phenomena, or use regressions to evaluate how network properties are determined by socioeconomic variables.

9. How to deal with the dependence structure among node-level variables, since the errors are not i.i.d.? Is it enough to just use a correlation matrix to weight the standard errors and get robust SEs?

10. Need to combine network software with traditional statistical software. statnet is getting there. But for Stata users, canned programs are needed to generate network data inside Stata.
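The second homophily measure in note 7 (the share of ties whose two endpoints fall in the same group, such as ages within 5 years of each other) is easy to compute. The sketch below is only an illustration; the function name, the data layout and the example network are hypothetical.

```python
def same_group_tie_share(ties, age, window=5):
    """Share of ties whose sender and receiver are within `window` years of age."""
    if not ties:
        raise ValueError("need at least one tie")
    same = sum(1 for sender, receiver in ties
               if abs(age[sender] - age[receiver]) <= window)
    return same / len(ties)

# Hypothetical directed ties (sender, receiver) and ages.
ties = [("a", "b"), ("b", "c"), ("c", "a")]
age = {"a": 20, "b": 23, "c": 40}
print(same_group_tie_share(ties, age))  # only the a-b tie is within 5 years
```

To judge whether a given share indicates homophily, it would need to be compared with the share expected if ties formed at random among the same nodes.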

Lastly, for those of you who are interested in causal analysis, read Patrick Doreian (2001), "Causality in Social Network Analysis" (Sociological Methods and Research 30: 81-114) and see if you can improve upon his study.

Posted by Weihua An at 10:46 AM

April 24, 2008

FAQs about Statistical Interactions

I am writing a short essay about the connection and distinction between indirect effects and interaction effects for a methods class, and I found the following website very helpful in clarifying some of the FAQs on that subject. The website is maintained by Professor Regina Branton of the Department of Political Science at Rice University.

http://www.ruf.rice.edu/~branton/interaction/faqshome.htm

Also check out the mediation item at Wikipedia and its great references.

http://en.wikipedia.org/wiki/Mediation_(statistics)

Posted by Weihua An at 11:35 AM

April 10, 2008

How Are Network Graphs Generated?

When Professor Nicholas Christakis came by to give a talk on social networks and health two weeks ago, a commentator expressed concern about the sparseness of the information contained in network graphs (not specifically regarding Nicholas’ research, which I believe was well done). I share that concern. So afterwards I did a preliminary search of the literature on the visualization of network data and found several interesting pieces that may help clarify (or even exacerbate) part of the concern some of us have about network graphs.

The first is the lecture notes Professor Peter V. Marsden wrote about visualization of network graphs in soc275. Here I just want to highlight a few points in his notes. (Words in quotes are taken from Professor Marsden’s lecture notes.)

1) Network graphs can be “referenced to known geographical/spatial/social locations of points”.

2) Aesthetic criteria are used to generate network graphs, for example, to minimize crossing lines, to make lines shorter, … and “[to] construct plot such that close vertices are connected, positively connected, strongly connected, or connected via short geodesics”.

3) “Location of points reflects ‘social distances’”. … “Spatial configuration differs depending on what 'distance-generating mechanism' is assumed and built in to one’s data.”

4) Commonly used algorithms for generating network graphs include factor analysis, multidimensional scaling (MDS), and spring embedders.
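To make point 4 a little more concrete, here is a minimal sketch of a spring embedder in Python. It is only a toy force-directed layout, not the algorithm from Professor Marsden’s notes or from any particular package: the function name spring_layout, the force formulas, and the cooling schedule are all assumptions chosen for illustration. Every pair of nodes repels, linked pairs attract, and the positions are nudged until the picture settles.

```python
import math
import random


def spring_layout(nodes, edges, iterations=200, seed=0):
    """Toy force-directed layout (hypothetical helper, for illustration only).

    nodes: list of node labels; edges: list of (u, v) pairs.
    Returns a dict mapping each node to an [x, y] position.
    """
    rng = random.Random(seed)  # arbitrary seed, fixed for reproducibility
    pos = {v: [rng.random(), rng.random()] for v in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # assumed "ideal" spacing between nodes
    for step in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsion: every pair of nodes pushes each other apart.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force
        # Attraction: linked nodes pull each other together.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
        # Move each node along its net force, capped by a shrinking step size.
        temp = 0.1 * (1 - step / iterations)
        for v in nodes:
            length = math.hypot(disp[v][0], disp[v][1]) or 1e-9
            move = min(length, temp)
            pos[v][0] += disp[v][0] / length * move
            pos[v][1] += disp[v][1] / length * move
    return pos


if __name__ == "__main__":
    # Two tightly knit groups of four, joined by a single tie (d, e).
    nodes = list("abcdefgh")
    edges = [(u, v) for grp in ("abcd", "efgh")
             for i, u in enumerate(grp) for v in grp[i + 1:]]
    edges.append(("d", "e"))
    for node, (px, py) in spring_layout(nodes, edges).items():
        print(f"{node}: ({px:.2f}, {py:.2f})")
```

Changing the force formulas or the value of k changes the picture, which is one way to see that the layout reflects the distance-generating mechanism built into the algorithm rather than the data alone.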

So the configuration of a network graph seems to depend largely on the researcher’s theoretical interests, and it can change according to the network measures (the number of clusters within the network, overall network connectedness, etc.) that the researcher is most interested in. In other words, before generating any network graph, researchers have to be clear about what theoretical themes they aim to present through it and then select the corresponding network measures and generating algorithms. For those of you who want to follow up on this topic, there are several pieces recommended by Professor Marsden in his lecture notes that I think are good starting references. See below for more details.


1. Bartholomew, David J., Fiona Steele, Irini Moustaki, and Jane I. Galbraith. 2002. The Analysis and Interpretation of Multivariate Data for Social Scientists. London: Chapman and Hall/CRC. Chapters 3 and 4.

2. Freeman, Linton C. 2005. “Graphic Techniques for Exploring Social Network Data.” Chapter 12 in Carrington, Peter J., John Scott, and Stanley Wasserman. 2005. Models and Methods in Social Network Analysis. New York: Cambridge University Press.

3. Freeman, Linton C. 2000. “Visualizing Social Networks.” Journal of Social Structure 1. (Electronically available at http://www.cmu.edu/joss/content/articles/volindex.html)

Posted by Weihua An at 11:51 AM

March 27, 2008

How Did 0.05 Come to Rule?

Recently I read an article by Erin Leahey about how statistical significance testing, the 0.05 cut-off value and the three-star system became legitimized and dominant in mainstream sociology. According to Erin, one star stands for p<=.05, two stars for p<=.01 and three stars for p<=.001. My own impression is that the cut-off values in common use are more like .10, .05 and .01, respectively. Anyway, Erin attributed the first usage of the .05 significance level to R. A. Fisher’s 1935 book, The Design of Experiments. Erin noticed that forms of significance testing other than the .05 test were already very popular in the 1930s, when close to 40 percent of articles published in ASR and AJS applied one form or another of significance testing. Based on the articles she sampled from ASR and AJS, Erin showed that the popularity of statistical significance testing and of the 0.05 cut-off value roughly took an “S” shape: usage first rose from the 1930s to 1950, declined afterwards until 1970, and has revived since then. Currently, around 80 percent of articles published in ASR and AJS employ both practices. The three-star system emerged in the 1950s but became popular only after 1970. Now slightly more than 40 percent of the articles published in these two top sociology journals use it.

So what accounts for the diffusion of such practices? Erin offered several arguments. For example, she argued that institutional factors such as investment in research and computing, graduate training, the institution’s academic status, and journal editors’ individual preferences could be among the most important factors in the diffusion of these practices. Interestingly, she found that graduating from Harvard had a significant negative “effect” on adopting these statistical practices. :-)

Of course, like almost all research, Erin’s study has some minor drawbacks. For example, her sample is drawn only from the top two sociology journals, so the generalizability of her findings could be limited. But overall, it is a fun read. And if you are interested in a more historical account of how statistical practices were introduced to and became legitimized in the social sciences in general, Camic and Xie (1994) is a very good start.

Sources:
Leahey, Erin. 2005. “Alphas and Asterisks: The Development of Statistical Significance Testing Standards in Sociology.” Social Forces 84: 1-24.
Camic, Charles, and Yu Xie. 1994. “The Statistical Turn in American Social Science: Columbia University, 1890-1915.” American Sociological Review 59:773-805.

Posted by Weihua An at 11:57 AM

March 20, 2008

Correlation of Ratios or Difference Scores Having Common Terms

Yesterday I went to Professor Stanley Lieberson’s class, Issues in the Interpretation of Empirical Evidence. We discussed a paper by Stan and Glenn Fuguitt titled Correlation of Ratios or Difference Scores Having Common Terms. The basic argument of the paper is that although ratios and difference scores are often used as dependent variables in traditional regression analysis, if some independent variables share a common term with those dependent variables, the estimated coefficients could be severely biased by the spurious correlation that the common term brings about (whether it is in the denominator or the numerator). For example, if the dependent variable takes the form X/Z while the independent variable is something like Y/Z, Z, or Z/X, the estimated coefficient relating the two could become statistically significant simply because of the shared term, even when X, Y and Z are unrelated to one another.

For some concrete examples, criminologists often use the crime rate (adjusted by city population size) as the dependent variable while at the same time using city population size as an independent variable; organizational researchers are interested in the relationship between the relative size of an organization’s administration and the absolute size of the organization; and economists often regress GDP per capita on such variables as the population growth rate and/or even population size. According to Stan and Fuguitt’s research, all of the above examples are prone to spurious coefficients, since the dependent variable and the independent variable include common terms. In their paper, they traced this finding back to a paper written by Karl Pearson in 1897, in which Pearson showed rigorously where the spurious correlation comes from and gave an approximate formula for computing correlations of ratios.
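For reference, Pearson’s approximation for two ratios with a common denominator is usually stated as follows. The notation is my own assumption, not taken from the Lieberson and Fuguitt paper: r denotes the correlation between the raw variables, and V denotes a coefficient of variation (standard deviation divided by mean).

```latex
r_{\frac{X}{Z},\frac{Y}{Z}} \approx
\frac{r_{XY} V_X V_Y - r_{XZ} V_X V_Z - r_{YZ} V_Y V_Z + V_Z^2}
     {\sqrt{V_X^2 + V_Z^2 - 2\, r_{XZ} V_X V_Z}\;
      \sqrt{V_Y^2 + V_Z^2 - 2\, r_{YZ} V_Y V_Z}}

% Special case: X, Y, Z mutually uncorrelated (r_{XY} = r_{XZ} = r_{YZ} = 0)
r_{\frac{X}{Z},\frac{Y}{Z}} \approx
\frac{V_Z^2}{\sqrt{(V_X^2 + V_Z^2)(V_Y^2 + V_Z^2)}}

% If in addition V_X = V_Y = V_Z, the right-hand side equals 1/2.
```

So even when the three raw variables are completely uncorrelated, the two ratios are expected to correlate positively, and with equal coefficients of variation the approximation gives 0.5. Keep in mind that this is a first-order approximation that works best when the coefficients of variation are small, so it should be read as a rough guide rather than an exact value.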

We were asked to do an experiment to demonstrate this spurious correlation. We generated three sets of random integers (namely X, Y and Z) ranging from 1 to 99, computed the pairwise correlation matrix among them and found no significant correlation between any pair of variables. But we did find a significant correlation between Y/X and X, and when we regressed Y/X on X, the coefficient was significant too. So through manipulations like division or subtraction, we artificially built a significant correlation between two random integers that were originally uncorrelated.

Why not try the following in Stata to see if the above claims are overstated or not?

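* Start from an empty dataset with 50 observations
clear
set obs 50

* Three independent random integers from 1 to 99
* (runiform() is the current name for the older uniform())
gen x = int(99*runiform()+1)
gen y = int(99*runiform()+1)
gen z = int(99*runiform()+1)

* Baseline: pairwise correlations among the raw variables
pwcorr x y z, sig

* Y/X against X (common term X)
gen ydx = y/x
pwcorr x ydx, sig
reg ydx x

* X/Z against Y/Z (common denominator Z)
gen xdz = x/z
gen ydz = y/z
pwcorr xdz ydz, sig
reg xdz ydz

* X/Z against Z/Y (Z is a denominator in one ratio, a numerator in the other)
gen zdy = z/y
pwcorr xdz zdy, sig
reg xdz zdy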
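A single draw of 50 observations can be noisy, so it may help to repeat the experiment many times and look at the average correlation for each pair. Below is a rough sketch of that in Python, using only the standard library, for readers who do not have Stata at hand. The function names pearson_r and simulate, the number of replications, and the seed are my own choices for illustration and are not part of the class exercise.

```python
import math
import random


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(reps=2000, n=50, seed=2008):
    """Average sample correlation over `reps` independent draws of size `n`."""
    rng = random.Random(seed)  # arbitrary seed, fixed for reproducibility
    totals = {"x,y": 0.0, "y/x,x": 0.0, "x/z,y/z": 0.0, "x/z,z/y": 0.0}
    for _ in range(reps):
        # Independent random integers from 1 to 99, as in the exercise.
        x = [rng.randint(1, 99) for _ in range(n)]
        y = [rng.randint(1, 99) for _ in range(n)]
        z = [rng.randint(1, 99) for _ in range(n)]
        # Ratios that share a term with another variable.
        ydx = [yi / xi for xi, yi in zip(x, y)]
        xdz = [xi / zi for xi, zi in zip(x, z)]
        ydz = [yi / zi for yi, zi in zip(y, z)]
        zdy = [zi / yi for zi, yi in zip(z, y)]
        totals["x,y"] += pearson_r(x, y)
        totals["y/x,x"] += pearson_r(ydx, x)
        totals["x/z,y/z"] += pearson_r(xdz, ydz)
        totals["x/z,z/y"] += pearson_r(xdz, zdy)
    return {pair: total / reps for pair, total in totals.items()}


if __name__ == "__main__":
    for pair, mean_r in simulate().items():
        print(f"mean r({pair}) = {mean_r:+.3f}")
```

The mean for the raw pair x,y should sit close to zero. How far each of the ratio pairs moves away from zero, and in which direction, is then something you can check for yourself rather than take on faith.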

Are you convinced by now? If not, please go read the source paper below (or just write back and say what is wrong with Stan and Fuguitt’s argument). If yes, the question now becomes what we should do about the spurious correlation. Shall we just use the original forms of the variables? Shall we re-specify the Solow model? But what if our research interest is in the ratio or the difference itself? … …


Source:
Stanley Lieberson and Glenn Fuguitt, 1974. Correlation of Ratios or Difference Scores Having Common Terms, in Sociological Methodology (1973-1974), edited by Herbert Costner, San Francisco: Jossey-Bass Publishers.

Posted by Weihua An at 11:17 AM