By Elizabeth Gehrman, Special to the Harvard News Office
September 15, 2006
The whiteboard that covers hundreds of feet of the curved hallway at IQSS is not always covered with equations - but lately, it usually is. And most of them are in the haphazard hand of James M. Robins, M.D., a faculty associate at IQSS and a professor of epidemiology and biostatistics at the School of Public Health. "I'm not the most organized person in the world," says Robins, his chair rolling over a splash of papers that spill out of his briefcase and onto the floor of his office. "So the equations usually sit there for a while before I type them into my computer."
The disorder seems to be a metaphor for the path he has taken in life - circuitous but ultimately inevitable, and an object lesson to anyone who has ever been told they couldn't do something. It began when Robins was a junior resident at an occupational-health clinic he and a friend had started at the Yale-New Haven Medical Center. There he saw patients, helped investigate factories for violations, and researched workers' comp cases. "It's hard to figure out whether someone's disease was caused by their exposure at work," he says, "so we started having to read papers filled with statistics and p-values," which help determine whether a particular finding could be due to chance.
Having taken "more abstract stuff" as a Harvard undergrad, he says, "I didn't know what any of these things were. I got interested in how statisticians and epidemiologists decide what causes what and what all of these things mean." He took some statistics courses at Yale's School of Public Health, but found his natural bent for mathematics suddenly challenged by the "cookbook" method of teaching that predominated. "The people weren't really mathematical," he says, "and as somebody who was, I had no idea what they were talking about. I made worse grades on the exams than people who'd never really had math. Whenever you asked why, or asked on what theorem do you base this, they'd have no explanation. So for a while I thought maybe I'd lost my mind."
Hardly. Robins soon discovered a book by UCLA econometrician Ed Leamer. "He said, 'Yes, what they're doing is nonsense.' And he explained the nonsense using subjective probabilities. That is, he told you under what beliefs about the world what they're doing would make sense, and under what other beliefs it would not make sense. This is called Bayesian statistics. It was the first thing that clarified all of this to me." Robins began applying Leamer's ideas to epidemiological concepts, teaching himself the foundations and principles of statistical inference along the way.
"My first research paper used the mathematical foundations of statistics to rigorously justify many of the conventions and assumptions of epidemiology," he says. "I submitted it to a big statistics journal in 1983. They rejected it and I was not to publish in any fancy statistics journal on this or any other topic for 10 years. I knew that I knew some things they didn't, but I didn't have a mentor and didn't know the language. I had paper after paper rejected." His work did eventually come to light, in an "obscure" engineering journal and in epidemiological journals, where he quickly established a reputation as a leading methodologist.
"Epidemiology is a very strange field," he says. "Almost every textbook was called 'Intro to...' because no one understood what to do about data on real exposures that vary over time." As an example, he explains the so-called healthy worker survivor effect. It seems logical that the people with the highest exposure to a harmful chemical would have more disease; but vexing "intermediate variables" create a built-in bias that skews the results of any study: People who start to get sick are likely to leave work - at which point they begin to get less exposure. But data on why someone left is not recorded for analysis, so it's hard to determine which workers left because of illness and which left for other reasons.
"If we could decide how much chemical you were exposed to by flipping a coin, then bias wouldn't be a big deal," he says, "because it would be a randomized experiment and your exposure wouldn't be correlated with how sick you were before being exposed." Randomized experiments work, in other words, because chance alone decides who is exposed, so a person's prior health cannot confound the comparison - and the real world can be a confounding place.
"It turned out to be a very hard problem," Robins says. "And basically, in various ways, I spent the next 20 years thinking about it. One of the things I figured out was a statistical trick that, when data on the relevant confounders had been collected, it turned the observational data we have into data we would have seen if we had done the study randomly." This statistical legerdemain, he says, creates a new data set in which some patients are copied more than once, depending on their probability of getting the treatment they actually did get based on doctors' patterns; the patients whose treatment is out of the ordinary are copied more times, creating a "pseudo-population" that is essentially the same as a randomized cohort.
This is not the only statistical innovation Robins came up with over the years, but, he says, "It's easiest to explain, believe me."
The model has caught on in statistical circles as high as the FDA - and Robins, meantime, has "gone off in a completely new direction" that is so complicated he presumes it will occupy him for the rest of his life.
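The reweighting trick Robins describes can be illustrated with a small sketch. The article does not name the technique; the code below assumes it corresponds to what is usually called inverse probability of treatment weighting, and it is not Robins's own implementation. The cohort, its counts, and every function name are hypothetical, chosen so that doctors treat sick patients far more often than healthy ones. Each person is weighted by one over the probability of receiving the treatment they actually got, so patients with unusual treatment count for more "copies" in the pseudo-population.

```python
from collections import defaultdict

# Hypothetical toy cohort: (sick_at_baseline, treated, bad_outcome, head_count).
# Sick patients are treated 80% of the time, healthy patients only 20%.
COHORT = [
    (0, 0, 1, 16), (0, 0, 0, 64),  # healthy, untreated
    (0, 1, 1, 2),  (0, 1, 0, 18),  # healthy, treated
    (1, 0, 1, 12), (1, 0, 0, 8),   # sick, untreated
    (1, 1, 1, 40), (1, 1, 0, 40),  # sick, treated
]


def crude_risk(cohort, treated):
    """Risk of a bad outcome in one treatment group, ignoring baseline health."""
    n = sum(c for _, a, _, c in cohort if a == treated)
    events = sum(c for _, a, y, c in cohort if a == treated and y == 1)
    return events / n


def treatment_probabilities(cohort):
    """P(treated = a | sick = l), estimated as a plain frequency per stratum."""
    stratum_total = defaultdict(int)
    cell_total = defaultdict(int)
    for l, a, _, c in cohort:
        stratum_total[l] += c
        cell_total[(l, a)] += c
    return {(l, a): n / stratum_total[l] for (l, a), n in cell_total.items()}


def weighted_risk(cohort, treated):
    """Risk in the pseudo-population built by inverse probability weighting."""
    prob = treatment_probabilities(cohort)
    total = events = 0.0
    for l, a, y, c in cohort:
        if a != treated:
            continue
        w = c / prob[(l, a)]  # each person stands in for 1 / prob copies
        total += w
        events += w * y
    return events / total


crude_diff = crude_risk(COHORT, 1) - crude_risk(COHORT, 0)
weighted_diff = weighted_risk(COHORT, 1) - weighted_risk(COHORT, 0)
print(f"crude risk difference:    {crude_diff:+.2f}")
print(f"weighted risk difference: {weighted_diff:+.2f}")
```

In this made-up cohort the treatment lowers risk by 10 percentage points within both the healthy and the sick group, yet the crude comparison makes it look harmful (about +0.14) because the treated are mostly the sick. The weighted comparison comes out at about -0.10. The sketch only works because baseline sickness, the one confounder, was recorded, which is the condition Robins attaches to the method.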
"Statistics is sort of divided between nonparametric statistics on the one hand and parametric and semiparametric on the other," he says. "Nonparametric statistics make no a priori assumptions about the shape of a curve, and parametric and semiparametric statistics assume the curve can be described by a simple mathematical function such as a straight line or a parabola. The ways people think about and analyze data are very different, depending on whether they take a nonparametric or more parametric approach. I had a feeling there should be one unified story for everything. So one Christmas I decided to try to figure out how to unify statistics based on higher-order influence functions, which are higher-order U-statistics."
Okay.
"That means -" he hesitates, aware of course that he's getting beyond most people's capacity for understanding. "Most statistics you can, to a good approximation, write as a sum of terms, each term depending on a single person's data. That's a first-order U-statistic. If each term depends on several people's data, that's a higher-order U-statistic." He pauses to see if it's sinking in, then tries a different tack. "Let's say you and I are in a study cohort. If the first term is a function of my and your data, the second of your and that person's data, et cetera, that's a second-order U-statistic. What order you need on a given problem depends on prior knowledge. In practice it's hard to go beyond the fourth order. It's easy in theory."
Easy for some, as Robins demonstrates one day when he brings LingLing Li and Eric Tchetgen, his graduate assistants, into the hallway to walk through an epiphany he had the night before. "It uses the twicing kernel with leave-one-out, right?" he begins, taking up a neon blue marker and slashing at the board. "Which makes the actual statistic a second-order U-statistic, right?" Li and Tchetgen nod knowingly as passersby eavesdrop bemusedly and Robins launches into a detailed explanation that includes concepts like regularity conditions, unrestricted marginals, and attainable rates of efficiency.
Still, for all his facility at the whiteboard, he says, "I'm a doctor. I'm not really a mathematician. I don't spend a lot of time doing epsilons and deltas. I have very good intuition about what things are, but don't always have the proofs." For the technical mathematical details, he engaged his friend Aad van der Vaart, a professor of stochastics at Vrije Universiteit in Amsterdam. "Aad does beautiful mathematics, so we're doing this together."
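Robins's description of first- and second-order U-statistics can be made concrete with two textbook cases. The sketch below is only an illustration of the general definition, not of his higher-order influence functions; the helper `u_statistic` and the sample measurements are hypothetical. A first-order U-statistic averages a function of one person's data at a time (here, the sample mean), while a second-order one averages a function of every pair of people (here, half the squared difference, which gives the usual unbiased sample variance).

```python
import statistics
from itertools import combinations


def u_statistic(data, kernel, order):
    """Average a symmetric kernel over every unordered group of `order` distinct people."""
    groups = list(combinations(data, order))
    return sum(kernel(*group) for group in groups) / len(groups)


measurements = [2.0, 4.0, 4.0, 6.0, 9.0]  # hypothetical data, one value per person

# First order: each term depends on a single person's data -> the sample mean.
mean = u_statistic(measurements, lambda x: x, order=1)

# Second order: each term depends on a pair of people -> the sample variance.
variance = u_statistic(measurements, lambda x, y: (x - y) ** 2 / 2, order=2)

print(mean, statistics.mean(measurements))
print(variance, statistics.variance(measurements))
```

The number of groups grows quickly with the order (five people give 10 pairs but 1,000 people give nearly half a million), which gives one rough sense of why going beyond a few orders is harder in practice than in theory.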
The ultimate purpose of the unified theory, he says, is to allow for more accurate estimates and more honest estimates of uncertainty. "At some point I suddenly thought I knew which papers and ideas had the kernel," he continues, "and in what direction I had to go. Once I realized what that was, I realized what I would have to do is incredibly daunting. Usually I'll have one 'aha' experience and the main idea's done, but this has hundreds of layers. It's like an onion; we go to the next layer and think we're done, and there's another layer inside it. It's, like, endless."
Whether the theory ends up transforming modern statistics is still unclear. "These days, with the computer era," he says, "we have a data explosion. Bigger data sets, more variables, more subjects. But no fancy statistical analysis is better than the quality of the data. Garbage in, garbage out, as they say. So whether the data is good enough to need this level of improvement, only time will tell."