May 31, 2008
I'm grateful for the strong response to my original query about quality, free PDF annotation software for Linux. In general, the suggestions seem to fall into a few categories.
-Windows-based editors, adaptable through emulators: PDF-XChange, Foxit (free version), PrimoPDF
-Linux editors with non-portable annotations: Okular, which has hidden XML files for its annotations (Skim, for OS X, has the same scheme)
-early, incomplete solutions that will eventually be good: GNU's PDF project, Xournal
-early, incomplete solutions that aren't user-friendly: PDFedit, Cabaret Stage
-early solutions that are still in progress: Evince
Of all of these options, I like Okular the best, mainly because integrating its XML-saved annotations into the PDF is but one plugin away (which might already exist, for all I know), and it's theoretically portable to Windows by installing Qt 4 binaries. Using an emulator like Wine is a big enough hassle that I've avoided it, for the same reason I don't use Cygwin on Windows systems.
So we're close to a (more) universal free editing environment. But I'm still not a fan of doing all my work on a screen, and also not willing to print. So I'm trying a middle road.
I bought an iLiad e-paper reader this past week, and so far I'm impressed with how it handles (though its price tag, $600 for the model I bought, definitely isn't for everyone, and was almost not for me). The screen is easily readable, the battery lasts, and I can zoom in and rotate documents to get a half-page display with larger text. More importantly, the device runs Linux, and iRex has made a point of using open source software as much as possible, in contrast to Amazon and the Kindle (which is half the size, can't read PDFs and can't edit books).
However, as the project is still in its relative infancy, there are a few functions it has yet to incorporate that I really would like, and they're the same ones I want in a computer-based annotator: highlighting multiple-column text, for example, so that I can extract passages I want later at the push of a button. And like Okular, the annotations made on the iLiad are saved in a companion XML file rather than the original PDF, but the company offers a free program to do the merging.
I'm going to continue to explore what the iLiad can do as far as editing, but it's definitely reassuring that everyone who's seen me use it has oohed and aahed at it.
To sum up, I've now got a free platform for reading, editing and annotating PDFs on a Linux machine, and an auxiliary paper-free method for reading them later which is admittedly not free. And I have more needs as well, but I can at least see them being met soon. What else do people want in paperless work we haven't covered yet?
P.S. If the people from iRex are reading this and want me to shill for them for real, they can let me know directly.
Posted by Andrew C. Thomas at 11:05 PM
May 26, 2008
I'm a Linux user in need of a quality PDF reader with basic annotation tools, and I need it to be available for free. Think I'm asking for too much?
We're at a point where the level of content available online dwarfs our ability to print it all onto paper for examination and notation. As academics, we're expected to sort through volumes of other people's work in order to verify that our own is original, as well as comment, annotate, and on occasion make corrections or forward-references to later works.
But despite a boom in computational power and information bandwidth, the software to do this without resorting to printed or copied matter isn't accessible to most students without paying through the nose. Full software suites like Adobe Acrobat aren't necessary for the kind of work academics need to do. There are a few functions that are essential to the task, currently available in commercial software:
-Adding and reading notes, whether free-floating or attached to highlighted text
-The ability to select and copy multi-column text (none of the free ones seem to be able to get this one right)
-Pop-up previews for links: when LaTeX creates a link to a footnote or citation, hovering over the displayed link should bring up a pop-up box displaying the information.
I'm a man with big ideas but no time, and more importantly, no budget, to motivate and drive the development and use of a free PDF reader with mild annotation capabilities. I can't resort to the for-pay software available from the school website because I'm running Linux, and I shouldn't have to go to a virtual machine or another computer to do this kind of annotation. Likewise, others shouldn't have to spend hundreds for software where they only need a few simple functions.
I suppose the issue is that everyone has their own toys they want included in a PDF editor, which is why the commercial package makes sense. But as academics, wouldn't we be happy with "the basics plus"?
Posted by Andrew C. Thomas at 6:34 PM
April 22, 2008
Andy Gelman posted this forwarded item regarding an apparent fallacy with averages and the misunderstanding of uncertainty. Essentially, it boils down to this reversal:
a) 100 students take a class, and 50 pass.
b) Given that next time, 50 students pass the (identical) class, how many students, on average, were enrolled?
The "fallacy" is in assuming that the expected number of original enrollees is 100, when it must necessarily be greater than 100 due to uncertainty in the estimated probability of passing the class. The article points out that it's ignorance of the prior distribution of passing students that's at fault for the "fallacy" - I argue that it's the prior distribution of one student passing a test that's the cause of the paradox.
Break the problem in two:
a) 100 students take a class, and 50 pass.
Assume for the moment that a student passes or fails the class independent of their peers (which is a reasonable assumption for the initial problem, dealing with the failure rate of vehicles.) Let's assume the standard noninformative prior case, that "half a student" passes and "half a student" fails (the Jeffreys prior) and that students are basically identical. Then the posterior distribution of the probability of passing the test is equivalent to a Beta(50.5,50.5) distribution.
b) Given 50 students passed, on average how many enrolled?
The number of students enrolled in the class for each one who passed is then 1/p - but the mean of 1/p (in this case, 100/49.5, or about 2.02) is necessarily greater than 1/(the mean of p), which is 2. So the expected class size must be greater under these assumptions: roughly 50 x 2.02, or about 101 students enrolled.
The original authors, however, make a profound overestimation of the average of starting students, choosing a "posterior" distribution that yields a class size of 150. To get an expectation this big with this prior information, we would observe a posterior of Beta(2.0,2.0) - or, 1.5 students passing and 1.5 failing! Putting this in perspective, the most likely way I can see this happening is that students pooled their talents and produced 3 distinct final papers: one good, one bad, and one just good enough to get the professor to flip a coin.
It does, however, seem to explain why Harvard classrooms always seem to overflow chaotically at the beginning of each term.
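For anyone who wants to check the two figures above, here is a minimal sketch. It assumes only the standard result that if p follows a Beta(a, b) distribution with a > 1, the mean of 1/p is (a + b - 1)/(a - 1); the function name is my own label and not from the original article.

```python
def mean_inverse_beta(a, b):
    """E[1/p] for p ~ Beta(a, b); the mean is finite only when a > 1."""
    if a <= 1:
        raise ValueError("E[1/p] is infinite unless a > 1")
    return (a + b - 1) / (a - 1)

passed = 50

# Jeffreys prior Beta(0.5, 0.5) updated with 50 passes and 50 failures.
per_pass = mean_inverse_beta(50.5, 50.5)
print(round(per_pass, 4), round(passed * per_pass, 1))   # 2.0202 101.0

# The Beta(2, 2) posterior needed to reach an expected class size of 150.
per_pass_wide = mean_inverse_beta(2.0, 2.0)
print(per_pass_wide, passed * per_pass_wide)             # 3.0 150.0
```

The gap between 2.02 and 2 is Jensen's inequality at work, and it only becomes large when the posterior on p is very diffuse.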
P.S. The original authors call this the "backwards reasoning fallacy", even though Google says the name is better applied to startling schoolchildren deterministically rather than failing them stochastically. Resolving the namespace collision here, does this problem go by another name, or shall we go via Stigler and call it Gelman's paradox?
-----------------------------------
Update: We recently received this comment from the work's original author, as the comment system failed to post it. I've attached it verbatim. -AT, 8-12-08
I am the author of the original article and a colleague of mine alerted me to your posting on Andy Gellman's blog. You said (about my article):
"An interesting problem with an awful delivery."
You also said:
"I'd normally agree that someone's selling something with this, but the fact that the page was cosponsored by a university makes me wonder about their grossly exaggerated result."
For a start it would not have been too difficult for you to have found out who I was since my name is very clearly stated at the bottom of the article, and the web site provides full information about me. So it would have been nice for you to raise the concerns you have about the article with me directly rather than through the use of insulting comments on a third party web site.
As to the substance of your criticisms, you seem to have misunderstood the particular problem and context and have produced a different model, that does not address the very real example that we had to deal with. You say that
"The original authors ... make a profound overestimation of the average of starting students, choosing a "posterior" distribution that yields a class size of 150."
This is not what I did at all. I made it clear that the crucial assumption was the prior average class size. To illustrate the problem I chose an example in which the prior average was deliberately high, 180. The fact that this gives a posterior average class size of about 153 when the 50 passes is observed is exactly the point I wanted to emphasize. Your comment about us making a "profound overestimation" is quite simply nonsense. Part of the fallacy was to assume that the class size of 100 in the specific example was in any way representative of the average class size.
I suggest you read the article again and pay particular attention to the (real) vehicle example at the end. The model that I produced EXACTLY represented the real data.
You should also be aware that the aim of my probability puzzles/fallacies web page is to raise awareness of probability (and in particular Bayesian reasoning) to as broad an audience as possible. While I am pleased if other professional statisticians read it, it is not they who are the target. This means having to use a language and presentation style that does not fit with the traditional academic approach.
In fact, one thing I have discovered over the years is that too many academic statisticians tend to speak only to other like-minded academic statisticians. The result is that in practice (i.e. in the real world) potentially powerful arguments have been 'lost' or simply ignored due to the failure to present them in a way in which lay people can understand. I have seen this problem extensively first hand in work as an expert witness. For example, in a recent medical negligence case the core dispute was solved by a very straightforward Bayesian argument. However, this had been presented to the defence lawyers and expert physicians in the traditional formulaic way. Neither the lawyers nor the physicians could understand the argument, and the QC was adamant that he could not present it in court. We were brought in to check the validity of the Bayesian results and to provide a user-friendly explanation that would enable the lawyers and doctors to understand it sufficiently well to present it in court. The statisticians simply did not realise that what is simple to them may be incomprehensible to others, and that there are much better (visual) ways to present these arguments. We used a decision tree and all the parties understood it immediately because it was couched in term of real number of patients rather than abstract probabilities. Had we not been involved the (valid) Bayesian argument would simply have never been used.
Norman Fenton
Professor of Computer Science
Head of RADAR (Risk Assessment and Decision Analysis Research)
Computer Science Department
Queen Mary (University of London)
London E1 4NS.
Email: norman@dcs.qmul.ac.uk
www.dcs.qmul.ac.uk/research/radar/
www.dcs.qmul.ac.uk/~norman/
Tel: 020 7882 7860
CEO
Agena Ltd
www.agena.co.uk
London Office:
32-33 Hatton Garden
London EC1N 8DL
Tel: +44 (0) 20 7404 9722
Fax: +44 (0) 20 7404 9723
Posted by Andrew C. Thomas at 10:33 AM
March 18, 2007
It's been in the news that a three-way tie happened on Jeopardy on Friday night. From the AP article:
The show contacted a mathematician who calculated the odds of such a three-way tie happening — one in 25 million.
I have to believe that the mathematician contacted didn't have all the facts (and the AP rushed to meet deadline), because once you're in Final Jeopardy there's little randomness about it. It's all down to game theory.
Suppose we first estimate the odds that all three players are tied at the end of Double Jeopardy. The total dollar value shared by all three is around $30000, or about $10000 each. Since questions have dollar values which are multiples of $200, we could reasonably assume that there are 100 dollar values, between 0 and 20000, where each player can end up. So the odds of a tie at this stage should be no more than one in a million - and this is a very conservative guess, since I assume that the probabilities are all equal (whereas they would likely have a central mode around 10000.)
Preserving a three-way tie through a Final Jeopardy question would then require that all three players bet the same amount, and I think the odds are considerably less than 1 in 20 that they'd all bet the farm no matter the category.
But it shouldn't even get that far. The scenario on Friday night had two players tied behind the leader who didn't have a runaway. So we have somewhere around 1 in 20,000 odds that this would happen (the factor of two because the third player could be ahead or behind the tied players.)
The runners-up would both be highly likely to bet everything in order to get past the leader. And the leader, in this case, placed a tying bet for great strategic reasons - getting one more day against known opposition rather than taking the chance of a new superstar appearing the next day - as well as a true demonstration of giving away someone else's money to appear magnanimous.
Even if the leader only had a 10% chance of making that call, and given that the other two players were pressured to bet high, that's still 1 in 200,000 - over 100 times more likely with a fairly conservative estimation process.
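The last step of that back-of-envelope estimate can be written out exactly. This is only a sketch of the arithmetic: every number below is one of the rough guesses quoted in this post or the figure reported by the AP, and the variable names are mine.

```python
from fractions import Fraction

# Figures taken from the post; none are estimated from data.
p_two_tied_behind_leader = Fraction(1, 20_000)  # the post's rough guess
p_leader_bets_for_tie = Fraction(1, 10)         # "only had a 10% chance"
quoted_odds = Fraction(1, 25_000_000)           # the figure in the AP article

p_three_way_tie = p_two_tied_behind_leader * p_leader_bets_for_tie
print(p_three_way_tie)                 # 1/200000
print(p_three_way_tie / quoted_odds)   # 125
```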
Posted by Andrew C. Thomas at 11:14 PM
February 1, 2007
There have been an awful lot of stories lately about the world's oldest person dying; in fact, it seems to have happened about three times in the last month or so. Then again, being the world's oldest person is a dubious honour to be sure, since the winner isn't likely to hold the title for very long and likely isn't even aware of their status. (Full disclosure: my great-grandmother was a centenarian but likely never knew my name.)
These stories have been bouncing in my mind lately and I'm trying to figure out why. I can think of a few scientifically relevant explanations:
1) The life expectancy of a centenarian is on the order of a year, and three successive deaths in a month is a rare event; conditioned on the first one, assuming independence and exponential life span (a reasonable assumption for the tail end), the probability of the next two events coming within a month is roughly 0.0033. And this happened to be the month for it.
2) The events aren't at all rare, and the centenarian death rate is actually dramatically higher, but it's a slow news month, and the stories themselves are floating to the top of the pile.
3) Online news services like Reuters and CNN have dedicated spaces for more 'entertaining' and 'bizarre' news stories, meaning that no matter how much news there is, people are seeing these stories.
4) Guinness sales are down, despite the "brilliant!" advertising campaign, and the World Record people are seeking out these changing events for the sake of their own discreet advertising.
5) I read this in The Onion and the satire hit me point blank, meaning I'm selecting and remembering the stories more often when they appear.
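The 0.0033 figure in explanation 1 can be reproduced with a short sketch. It assumes, as stated there, independent exponential lifetimes with a mean of one year, with the second title-holder's clock starting when the first one dies, and it takes "a month" to be 1/12 of a year. The function name is my own.

```python
import math

def prob_two_deaths_within(window_years, mean_life_years=1.0):
    """P(T1 + T2 <= window) for independent exponential lifetimes T1, T2."""
    x = window_years / mean_life_years
    # The sum of two i.i.d. exponentials is Erlang(2); this is its CDF.
    return 1.0 - math.exp(-x) * (1.0 + x)

print(round(prob_two_deaths_within(1 / 12), 4))  # 0.0033
```

A shorter mean lifetime raises this probability quickly, which is why the actual centenarian death rate matters so much here.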
I'm thinking it's Number 5, but I'd be curious to know if anyone knows the mean centenarian death rate and whether this was a rare occurrence or not.
Posted by Andrew C. Thomas at 9:56 AM
January 9, 2007
Courtesy of Aleks at Columbia, who brought this to my attention:
A very interesting collection of visualizations for projects, proposals and presentations. The periodic table arrangement itself is not at all useful, but the depth and organization sure are.
Posted by Andrew C. Thomas at 2:36 PM
December 11, 2006
From a forthcoming paper on legislative redistricting commissions: Iowa has used the same scheme for the past three redistricting cycles. A commission draws three maps, and the legislature selects one of those.
The attached seats-votes graphs are for the 2000 and 2004 state house elections, before and after the 2001 cycle. As we can see, responsiveness (the slope of the curve) is high and remains high afterwards, suggesting that the fraction of contested seats is high, and justifying Iowa's reputation as a model for redistricting.
However, the curve is definitively below 50% at the median vote, meaning that an equal vote will almost always split the seats unevenly. (In this case, the Republican party gains the advantage.) This suggests that redistricting is less effective in this case.
Given Iowa's reputation as a well-run redistricter, one wonders how much it is deserved. It's also fair to wonder what would happen if this system were applied to another state where voting was racially polarized.
Posted by Andrew C. Thomas at 1:40 PM
September 25, 2006
In the next few weeks, the number of articles posted to this site is set to increase, partly because school's back in session, and partly because we've recruited some new authors for the committee. This is a good thing in general. However, I know I work best on a deadline, so it happens that I tend to post when the flow is slower, and less when a lot of articles are being posted by the other authors.
To bring this back to the realm of science: Am I taking the position of an economic free rider (or "freeloader", if you prefer), if I tend to post less frequently than other authors, or is someone in my position merely acting as a balancing actor, keeping stability?
As for the "art", I doubt that this observation is opera-worthy, but it does tend to happen a lot in social situations I've seen. It certainly happened in an early episode of Seinfeld, where George wanted to split a cab but not have to pay for it because they "were going that way anyway".
Posted by Andrew C. Thomas at 2:00 PM
September 19, 2006
Andrew Fernandes, a fellow Canadian expat and PhD student at NC State, responded to my earlier request for advice on exploring a Dirichlet-type simplex.
Among other places, the idea is presented in the Wikipedia entry for Simplex. He suggests perturbing the cumulative sums, then putting the perturbed sums back in order to draw a time-reversible proposal. This has the advantage of not sending too many parameters below zero - a maximum of one - as opposed to an equal perturbation of each parameter, and not pinning a high-valued parameter in place with a standard Dirichlet proposal.
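Here is a minimal sketch of how I read that suggestion, not Andrew's own code. The Gaussian noise, the `scale` value and the choice to reject outright any proposal whose cut points leave [0, 1] are all my assumptions; the function name is hypothetical.

```python
import random

def propose_on_simplex(p, scale=0.02, rng=random):
    """Perturb the interior cumulative sums of p, re-sort, and difference.

    Returns a new point on the simplex, or None if a perturbed cut point
    leaves [0, 1] (treated here as an automatic rejection).
    """
    cuts = []
    total = 0.0
    for value in p[:-1]:          # the last cumulative sum is fixed at 1
        total += value
        cuts.append(total + rng.gauss(0.0, scale))
    cuts.sort()                   # restore ordering after the perturbation
    if cuts and (cuts[0] < 0.0 or cuts[-1] > 1.0):
        return None
    edges = [0.0] + cuts + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

random.seed(1)
current = [0.5, 0.3, 0.15, 0.05]
print(propose_on_simplex(current))
```

Because the noise is symmetric and the sort is applied in both directions, the proposal density should be the same going forwards and backwards, so a plain Metropolis acceptance ratio ought to suffice under these assumptions.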
Posted by Andrew C. Thomas at 11:32 PM
August 7, 2006
I've spent quite a bit of time in the last few weeks - probably too much - thinking about the term 'regression' and its use in statistics, and why I find it so dislikeable. I sincerely doubt any campaign I try to start will have any real effect, so let me lay down the reasons why I feel we as scientists should refer to linear modelling as just such, and not as 'regression'.
One reason is that the word only has a tenuous connection to the actual algorithm - the other is that it far too often implies a causal relationship where none exists.
As the story goes, Francis Galton took a group of tall men and measured the height of their sons, and found that on average, the sons as a group were shorter than their fathers. Drawing on similar work he had done with pea plants, he described this phenomenon as "regression to the mean," recognizing that the sample of fathers was nonrandom. A "regression coefficient" then described the estimated parameter which, when multiplied by the predictor, would produce the mean value.
I can only surmise that "determining regression coefficients through minimizing the least squares difference" was too verbose for Galton and his buddies, and "regression analysis" stuck. Now we have lawyerese terms like "multiple regression analysis," which really should read "multiple parameter regression analysis" since we're only running one algorithm, but we appear stuck with it.
So what's the big deal? Nomenclature isn't an easy business, and two extra syllables in "linear model" might slow things down. But aside from my gripe with using "regress" as a transitive verb (the Latin student in me cringing), even the most generous interpretation of the word's root, and the experiments that revealed it, leads to trouble.
"Regression" literally means "the act of going back." If we accept this definition in this context, we have to have something to which we can return. Clearly, this implies discovering the mean - but chronologically, it can only mean discovering the cause, that which came before.
Linear modelling makes no explicit assumptions about cause and effect, a major source of headache in our discipline, but the word itself, consciously or otherwise, binds us to this fact.
The remedy to this is not simple; after all, I'm talking about trying to break the correlation-is-causation fallacy through words, which is both a difficult task and the sort of behaviour that will keep people from sitting with you at lunch. But we can improve things slowly and subtly in this fashion:
1) If you are confident that your analysis will unveil a causal relationship, say so. Call it "regression-to-cause", or "causal linear model", or something like that.
2) If you're not so sure, call it a (generalized) linear model, or a lin-mod, or a least-squares, or another term that does not necessarily imply causation. Resist the temptation to fall back to the word "regression" until a long time has passed.
This doesn't have to be a completely nerve-wracking exercise; just use a strike-through when necessary, to show that the term ~~regression~~ 'linear model' is better suited to describe what we're trying to build here.
Posted by Andrew C. Thomas at 11:30 PM
July 1, 2006
A letter I wrote in reaction to the Texas decision made it into today's New York Times. It even has a nice little plug for IQSS at the bottom.
Posted by Andrew C. Thomas at 2:52 PM
June 28, 2006
The noted Texas redistricting case, known politically for its role involving Tom DeLay and academically for the amici curiae brief filed by Gary King, Andrew Gelman, Jonathan Katz and Bernard Grofman, was ruled on by the Supreme Court today. In short, the party-based gerrymandering was not a problem - nor was the fact that it was done off the traditional calendar - but the composition of districts involving the dilution of Hispanic voters was. The court has ordered that those irregular districts be redrawn. (Note: only the composition of District 23 was considered to be in violation of the Voting Rights Act, but you obviously can't redraw one district without affecting another.)
The nature of this ruling should surprise no one involved in Jim Greiner's Quantitative Social Science and Expert Witnesses class.
Posted by Andrew C. Thomas at 11:23 AM
May 25, 2006
Drew Thomas
A problem I've had come up again and again is the ability to explore a space bound by a Dirichlet prior with a Metropolis-type algorithm. I've yet to find a satisfactory answer and I'm hoping someone else will have some insight.
The research question I have deals with allocating patients to hospitals, considering the effect of the number of beds - one example of the "supply-induced demand" question. (The analysis is being done under Prof. Erol Pekoz, who's visiting Harvard Stats this year.) Conjugate priors for this problem have eluded me, and so the quantity of interest, the probability that a patient will be sent to a particular hospital for inpatient care, is being inferred through a Metropolis algorithm.
Here's the thing: there are at most 64 different hospitals to which a patient can be assigned. Even after assuming that a hospital which has not yet received a patient from a particular area never will, the number of hospitals remains extreme.
One suggested proposal has been a Dirichlet distribution with parameters equal to the current values, times a constant. That way the expected value of the proposal will be the same as the last draw. However, when the constant is too low, the smallest dimensions will have parameter values less than 1, which leads to trouble, as those values will tend to zero; when it's too high, the biggest parameters don't move at all, and the effect of moving some of their mass is lost.
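For concreteness, a rough sketch of that proposal follows. It draws a Dirichlet by normalizing independent gamma variates, which is a standard construction; the function name, the example probabilities and the two constants are made up for illustration.

```python
import random

def dirichlet_proposal(current, concentration, rng=random):
    """Draw from Dirichlet(concentration * current) via normalized gammas."""
    draws = [rng.gammavariate(concentration * p, 1.0) for p in current]
    total = sum(draws)
    return [d / total for d in draws]

random.seed(0)
current = [0.6, 0.3, 0.09, 0.01]   # hypothetical assignment probabilities
for c in (10, 1000):
    print(c, [round(x, 3) for x in dirichlet_proposal(current, c)])
```

With the small constant the 0.01 component gets a parameter of 0.1 and its draws pile up near zero, while with the large one the 0.6 component barely moves. Note also that this proposal is not symmetric, so the acceptance ratio needs the usual Hastings correction.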
I've considered implementing a parallel-tempering method but I'd like to keep it cleaner. Does anyone have a better method that's reasonably quick to run, rather than monkeying with each parameter individually?
Posted by Andrew C. Thomas at 6:00 AM
May 18, 2006
Drew Thomas
Harvard School of Public Health doctoral candidate Janet Rosenbaum has been in the news lately, following the publication of her study of virginity pledges in the American Journal of Public Health, as well as her recent IQSS seminar. (Full disclosure: Janet is a friend of mine. I'll address her as Ms. Rosenbaum for this entry.) Since it's certainly a hot topic, it's no surprise how much attention her findings have received; first, the big news agencies picked it up, then the blogosphere took their shift - mainly over the "controversy" resulting from the study. (See pandagon.net for an example.)
But I think the more relevant part of the whole debate is the point Ms. Rosenbaum was trying to make about surveys and self-reporting: we use these data to make broad, sweeping conclusions on social phenomena, and while they are the best we have, they aren't up to the best standard we could achieve.
Posted by Andrew C. Thomas at 6:04 AM
May 10, 2006
From Wikipedia's entry on the t-test:
The t-statistic was invented by William Sealy Gosset for cheaply monitoring the quality of beer brews. "Student" was his pen name. Gosset was statistician for Guinness brewery in Dublin, Ireland, hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge for applying biochemistry and statistics to Guinness's industrial processes. Gosset published the t-test in Biometrika in 1908, but was forced to use a pen name by his employer who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was unknown not only to fellow statisticians but to his employer - the company insisted on the pseudonym so that it could turn a blind eye to the breach of its rules. Today, it is more generally applied to the confidence that can be placed in judgements made from small samples.
I like the way they think.
Posted by Andrew C. Thomas at 6:00 AM
April 19, 2006
Drew Thomas
It seems that the difficulty in learning languages isn't always restricted to spoken words. A recent article in the New York Times ("Searching For Dummies", March 26 - here's a link, though it's for pay now) quotes an Israeli study which demonstrates the ineptitude of graduate students in making specific Internet searches in 2002.
Now, I know a lot has happened in the world of search engines in the last 4 years, and I admit my bias in being an MIT undergrad at the time meant that I was waist-deep in Google and its way of sorting information. See if you can't do any of these challenges now, with no time limit:
"A picture of the Mona Lisa; the complete text of either "Robinson Crusoe" or "David Copperfield"; and a recipe for apple pie accompanied by a photograph."
What's the trick to this kind of searching? Unless you have an excellent, selective and disambiguating search engine, knowing search grammar and context is essential.
For example, getting the text of David Copperfield is now a three-hop, one search process: search for it on Google, and select the Wikipedia entry, which has been cleanly separated from the magician and includes not one but
So the technology has gotten better. But the illusion of control remains; I find it more difficult to find other disambiguations that Wikipedia hasn't considered. Moreover, for any meaningful searches, such as for relevant papers in particular areas where I don't know the nomenclature, this feeling of power is challenged.
This is a skill that permeates all levels of society, from kindergarten on up, but there's a definite lack of appreciation for it. To learn it like a language, early on and with constant practice, seems to be the solution; to learn the context, grammar and syntax of the search (and research), and to appreciate that we're trying to communicate our intentions using all the tools we have available; by blaming them, we all typify poor carpenters.
Posted by Andrew C. Thomas at 6:00 AM
February 23, 2006
Drew Thomas
First, apologies for my delay in posting to the blog. I've spent most of the last two months involved in the Canadian federal election as a candidate in my home riding. That I lost wasn't unexpected, nor was winning necessarily my goal. I wanted to talk about ideas that weren't being brought up by other candidates. First and foremost on the list was how an election shapes the debate - and why electoral reform is necessary to allow more ideas into the public forum.
While it's clear to me that, first and foremost, Canadians value our right to vote, how that valuation takes place depends directly on what a vote means. As in many party systems, there are two main interpretations for what a vote represents: a belief in the best candidate for the local job, and a belief in the best national party to lead the country. Quite often these two goals do not coincide.
In addition, "tactical" voting, in which a second-choice candidate is chosen merely to block a (much) less desirable candidate, reflects neither of these qualifications.
These problems, among others, anchor my belief that electoral reform is a must for Canada, as well as any multiparty democracy using single member districts and First Past the Post. But band-aid solutions, like the addition of proportionally allocated at-large seats to a FPTP single-member district scheme, would do little to explore the issue. The question before electoral reform revolves not around which of the two focuses - the candidate or the party - is most important to the voters, but rather whether the public can truly express their will through a system that encourages dishonest voting.
So here is my first quantitative question: How does one measure the "strategic effect" on vote counts alone? Survey data is commonly taken, but in comparison to the Ecological Inference problem, drawing this tactical inference from the data themselves would be a huge step towards determining how to reduce it - and what level we could consider acceptable.
Posted by Andrew C. Thomas at 6:00 AM
January 11, 2006
Drew Thomas
Spatial statistics is beginning to gain popularity as a methodological tool in the natural and social sciences. At Harvard, Prof. Rima Izem is leading the way towards the use of these techniques across many disciplines. This semester, Prof. Izem debuted her Spatial Statistics seminar, which met Wednesday afternoons in the Statistics Department.
Of those topics discussed in the seminar, lattice data analysis proves to be invaluable to the analysis of well-defined electoral districts. The principle of lattice data is that our land area can be divided into mutually exclusive, complete and contiguous divisions; interactions between the divisions can then be analyzed through various covariance methods.
A full understanding of spatial interaction may prove to be valuable to electoral analysis. Determining the interdependence of districts through means other than traditional covariates may suggest the presence of a true "neighbor effect." How one determines the covariance of districts may prove to be more art than science, but the depth of work yet to be done in this field should give many opportunities for meaningful investigation.
Posted by Andrew C. Thomas at 1:41 AM
January 5, 2006
Drew Thomas
My home country is in chaos - of a sort. With the dissolution of Parliament on November 29, Canada is heading into a federal election.
As a multiparty parliamentary democracy, predicting political outcomes in Canada isn't simply a matter of reading a thermometer. Of course, it isn't even that simple in a two-party system, but it gets me thinking about prediction methods.
I've been working with Gary on JudgeIt, a program used to evaluate voting districts for a variety of conditions, designed for a two-party system. With an emphasis on Bayesian simulation, its methods make use of uniform partisan swing -- a shift in the percentage of voters moving from one party to the other, and in the same proportion in each district -- to determine the likely outcomes given a set of covariates and a history of behaviour in the particular system.
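As a rough illustration of uniform partisan swing (and not JudgeIt's own code), the Python sketch below shifts every district's two-party vote share by the same number of percentage points and recounts the seats. The vote shares, the swing size and the function names are hypothetical, and reading "uniform" as an additive shift is my assumption.

```python
def apply_uniform_swing(shares, swing):
    """Shift each district's two-party vote share by the same amount,
    clipping the result to the interval [0, 1]."""
    return [min(1.0, max(0.0, s + swing)) for s in shares]


def seats_won(shares):
    """Count districts where the party holds more than half the vote."""
    return sum(1 for s in shares if s > 0.5)


# Hypothetical vote shares for one party in five districts.
baseline = [0.42, 0.48, 0.51, 0.55, 0.63]

# A two-point swing away from the party in every district.
swung = apply_uniform_swing(baseline, -0.02)

print(seats_won(baseline), seats_won(swung))
```

With these made-up numbers the marginal third district changes hands, which is the kind of outcome a swing-based prediction is meant to surface.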
What caught my attention was a series of election prediction websites, making use only of previous election information, which allow users to input what they expect to be either the vote shares or the swings in support. This by itself is mathematically unremarkable, but may keep political junkies up for hours.
The real question of interest remains: by what process can a system predict who will gain whose votes in a shift in support? In most Canadian ridings (districts), seats are contested by three parties: from left to right, the socialist New Democrats, the incumbent Liberals and the opposition Conservatives. For the most part, votes lost by an outer party would naturally flow to the Liberals. In this election, however, a scandal which led to the election call may prove to cost the Liberals a good deal of support.
Since geography -- and hence, demography -- dictates much of the Canadian political climate, I have no doubt that the appropriate covariates are out there, waiting to be measured and/or analyzed. In the meantime, I'm keeping my head away from election speculation and looking to see if this problem has already been solved. Anyone out there have any suggestions?
Posted by Andrew C. Thomas at 3:57 AM
November 28, 2005
Drew Thomas
An MIT tag team of Prof. Josh Tenenbaum and his graduate student Charles Kemp presented their research to the IQSS Research Workshop on Wednesday, October 19. The overarching topic of Prof. Tenenbaum's research is machine learning; one major aspect of this is their method of categorizing the structure of the field to be learned.
For example, it has made sense for hundreds of years that forms of life could be taxonomically identified according to a tree structure so as to compare the closeness of two species, and it also makes some sense to rank them on an ordered scale by some other characteristic (one example presented was how jaw strength could be used to generalize to total strength.) The presenters then showed how Bayesian inference could be used to determine what organizational structures are best suited to which systems, based on a set of covariates corresponding to certain observable features, which could then be used to make other comparisons that might not be as evident, such as immune system behaviour.
What confused me for much of the time was their insistence that they could use the data to decide on a prior distribution, an idea that set some alarms off. I have been under the strongest of directives from professors to keep the prior distribution limited to prior knowledge. My current understanding is that the following method is used:
1. Choose a family to examine, such as the tree, ring or clique structure (all of which, notably, can be learned by kindergarteners rather quickly.)
2. Conduct an analysis where the prior distribution is an equal likelihood of structure corresponding to all possible formations of this type.
3. Repeat this with the other relevant families. Those analyses with the most favorable results would then correspond to the most likely structure.
4. Conduct further research on the system with the knowledge that one structure family is superior for this description.
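The steps above, as I understand them, can be sketched in a few lines of Python. This is a toy under stated assumptions rather than the presenters' actual method: each family's evidence is the likelihood averaged over its candidate structures (the equal-likelihood prior of step 2), the families themselves are treated as equally likely beforehand, and the log-likelihood numbers are invented for illustration.

```python
import math


def log_mean_exp(log_values):
    """Log of the average of exp(log_values), computed stably.
    Averaging corresponds to a uniform prior over the structures."""
    m = max(log_values)
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total) - math.log(len(log_values))


def family_posteriors(log_liks_by_family):
    """Posterior probability of each family, assuming the families
    are equally likely a priori (an assumption of this sketch)."""
    log_evidence = {f: log_mean_exp(v) for f, v in log_liks_by_family.items()}
    m = max(log_evidence.values())
    weights = {f: math.exp(le - m) for f, le in log_evidence.items()}
    total = sum(weights.values())
    return {f: w / total for f, w in weights.items()}


# Hypothetical log-likelihoods of the data under each candidate structure.
candidates = {
    "tree": [-10.0, -12.0, -11.0],
    "ring": [-14.0, -15.0],
    "clique": [-13.0, -13.5, -16.0],
}
posterior = family_posteriors(candidates)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

In a real problem the number of structures within a family is far too large to enumerate, so the average would have to be approximated, by search or by sampling.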
While I'm not as comfortable with their use of a data-driven prior distribution as they'd like, it seems that the authors are sensitive enough to this concern to keep actual structures separate, and to use the data only to confirm their heuristic interpretations of the structures at hand, which sets me more at ease.
Now, the key to this research is that this is a model for human learning - and wouldn't you know it, we're better at it than computers. But I'm still very encouraged at the direction in which this is heading, and am looking forward to later reports from the Tenenbaum group.
Posted by Andrew C. Thomas at 2:23 AM
November 14, 2005
Drew Thomas
Last year during Prof. Rima Izem's Spatial Statistics course, I started to wonder about different analytical techniques for comparing lattice data (say voting results, epidemiological information, or the prevalence of basketball courts) on a map with distinct spatial units such as counties.
A set of techniques had been demonstrated to determine spatial autocorrelation through the use of a fixed-value neighbour matrix, with one parameter determining the strength of the autocorrelation. The use of the fixed neighbour matrix perturbed me somewhat, since the practice of geostatistics uses a tool called the empirical variogram - a functional estimate of variance between sample sites, fitted through a regression and based on taking each possible pair of points and computing the squared difference between their values - which might give a more reasonable estimate of autocorrelation than a simpler model.
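For readers who haven't met the empirical variogram, here is a minimal Python sketch of the usual binned estimator: for every pair of sites, take the squared difference in values, group the pairs by separation distance, and report half the average within each group. The site coordinates, values, bin width and function name are hypothetical, and the later step of fitting a smooth curve to these estimates is left out.

```python
import math
from collections import defaultdict


def empirical_variogram(points, values, bin_width):
    """Return {bin index: (bin midpoint, semivariance, pair count)},
    where semivariance is half the mean squared difference in values
    over all pairs whose separation falls in that distance bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            b = int(d // bin_width)
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    return {
        b: ((b + 0.5) * bin_width, sums[b] / (2 * counts[b]), counts[b])
        for b in sorted(counts)
    }


# Hypothetical sites along a line, e.g. county centroids.
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
observed = [1.0, 2.0, 4.0, 7.0]

gamma = empirical_variogram(sites, observed, bin_width=1.0)
for b, (midpoint, semivariance, pairs) in gamma.items():
    print(b, midpoint, semivariance, pairs)
```

For lattice data such as counties, the distance between units has to be defined somehow before this applies; centroid-to-centroid distance is one option.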
As it turned out, this same question was asked by Prof. Melanie Wall from the Biostatistics Department at the University of Minnesota about a year before I got around to it. In her paper "A close look at the spatial structure implied by the CAR and SAR models" (J. Stat. Planning and Inference, v121, no.2), Prof. Wall tests the idea of using a variogram approach to model spatial structure on SAT data against more common lattice models. And what do you know - the variogram approach holds up to scrutiny. In some cases it outperforms the lattice model, such as in the extreme case of Tennessee and Missouri, which have a bizarrely low correlation due to the fact that each state has eight neighbours.
As well as feeling relief that this difficulty with the model wasn't just in my imagination, I'm glad to see that this type of inference crosses so many borders.
Posted by Andrew C. Thomas at 3:37 AM