May 2008
Sun Mon Tue Wed Thu Fri Sat
        1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31

Authors' Committee

Chair:

Andy Eggers (Gov)

Members:

Weihua An (Soc)
Kevin Bartz (Stats)
Sebastian Bauhoff (HealthPol)
John Graves (HealthPol)
Justin Grimmer (Gov)
Jens Hainmueller (Gov)
Mike Kellermann (Gov)
Ellie Powell (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Kevin Quinn, Jamie Robins, Don Rubin, Chris Winship

Recent Comments

Recent Entries

Categories

Blogroll

Brad DeLong
Cognitive Daily
Complexity & Social Networks
Developing Intelligence
EconLog
The Education Wonks
Empirical Legal Studies
Free Exchange
Freakonomics
Health Care Economist
Junk Charts
Language Log
Law & Econ Prof Blog
Machine Learning (Theory)
Marginal Revolution
Mixing Memory
Mystery Pollster
New Economist
Political Arithmetik
Political Science Methods
Pure Pedantry
Science & Law Blog
Simon Jackman
Social Science++
Statistical modeling, causal inference, and social science

Archives

Notification

Powered by
Movable Type 3.34


« End-of-Year Hiatus | Main | Experts and Trials IV: Why? »

3 January 2006

Sampling naturalistic data: how much is enough?

Amy Perfors

An issue inherent in studying language acquisition is the sheer difficulty of acquiring enough accurate naturalistic data. In particular, since many questions hinge on what language input kids hear - and what language mistakes and capabilities kids show - it's important to have an accurate way of measuring both of these things. Unfortunately, short of following a child around all day with a tape recorder (which people have done!), it's hard to get enough data to have an accurate record of low-frequency items and productions; it's also hard to know what would be enough. Typically, researchers will record a child for a few hours at a time for a few weeks and then hope that this represents a good "sample" of their linguistic knowledge.

A paper by Caroline Rowland at the University of Liverpool, presented at the BUCLD conference in early November, attempts to assess the reliability of this sort of naturalistic data by comparing it to diary data. Diary data is obtained by having the caregiver write down every single utterance produced by the child over a period of time; as you can imagine, this is difficult to persuade someone to do! There are clear drawbacks to diary data, of course, not least of which is that as the child speaks more and more it becomes less and less accurate. But because it has a much better likelihood of incorporating low-frequency utterances, it provides a good baseline comparison in that respect to naturalistic, tape-recorded data.

What Rowland and her coauthor found is perfectly in line with what is known about statistical sampling. As the subsets of tape-recorded conversations got smaller, estimates of low-frequency terms became increasingly unreliable, and single segments less than three hours were nearly completely useless (as they said in the talk, they were "rubbish." Oh how I love British English!) It is also more accurate to use, say, four one-hour chunks from different conversations rather than one four-hour segment, as the former avoids "burstiness effects" that come from conversations and settings predisposing to certain topics.

Though this result isn't a surprise from a statistical sampling point of view, it is nice for the field to have some estimates of how little is "too little" (though of course how little depends somewhat on what you are looking for). And the paper highlights important methodological issues for those of us who can't trail after small children with our notebooks 24 hours a day.

Posted by Amy Perfors at January 3, 2006 2:19 AM

Comments

Notification

Enter e-mail address to receive notification of new comments to this entry

Post a comment




Remember Me?

(you may use HTML tags for style)