26 April 2006
Sebastian Bauhoff
A group at the Indiana School of Informatics has developed software to detect whether a document is "human written and authentic or not." The idea was inspired by the successful attempt of MIT students in 2004 to place a computer-generated document at a conference (see here). Their program collated random fragments of computer science speak into a short paper that was accepted at a major conference without revision. (That program is online and you can generate your own paper, though unfortunately it only writes computer science articles.)
The new tool lets users paste in pieces of text and then assesses whether the content is likely to be authentic or just gibberish. The program tries to identify human-style writing, which is characterized by certain repetition patterns, and apparently does rather well. It is not clear whether this works well for social-science-type articles. The first paragraphs of a recent health economics article (to remain unnamed) only have a 35.5% chance of being authentic. Hmm...
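The tool's actual method is not described here, so the following is only a rough sketch of what a "repetition pattern" statistic could look like, not the Indiana group's algorithm. The function names `word_entropy` and `repeat_fraction` are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word-frequency distribution of text.

    Illustrative only: the real detector's features are not known.
    """
    words = text.lower().split()  # assumption: naive whitespace tokenization
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeat_fraction(text: str) -> float:
    """Fraction of word tokens that repeat a word already seen in text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug"
    print(word_entropy(sample), repeat_fraction(sample))
```

A text of n words can have a word entropy of at most log2(n) bits, so very short inputs leave a statistic like this little room to vary. A real classifier would presumably combine many such features and calibrate them on known human and machine texts.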
So is this just a joke or useful programming? The authors say it could be used to tell whether a website is authentic or bogus, or to identify different types of texts (articles vs. blogs, for example). I wonder what the algorithms behind such technology are, and whether this will lead to an arms race between fakers and detectors. If a detector can recognize a human-written text, could the faking software use it too?
If further tweaked, could this have an application in the social sciences? Maybe we could use the faking software to search existing papers, collate them smartly and use that to identify patterns and get new ideas? Maybe everyone should run their papers through detector software before submitting them to a journal or presenting at a workshop? And students, watch out! No more random collating at 3am to meet the next-day deadline!
PS: this blog entry has been classified as "inauthentic with a 26.3% chance of being an authentic text"...
Posted by Sebastian Bauhoff at April 26, 2006 2:41 PM
In developing this line of work, we were actually struck by the ability of search engines to find these meaningless documents while giving no indication of whether they were meaningful. (Along the same lines, attention has been paid to the problem of malicious placement of bogus information in Wikipedia.) What's uncomfortable is realizing that search results are affected by this bogus data. And with the growing tacit agreement that the order of the list is somehow related to a document's worth or meaning (e.g., citations, popularity), there seems to be a significant problem here. There are also technical problems faced by groups whose main vocation is working on and through the web; most of the significant problems are computer-mediated, i.e., automation.
We thought of taking a first strike by seeing whether we could distinguish bogus scientific papers generated by a computer from human-authored texts. We were successful in this respect. We actually devised several protocols for generating these bogus scientific texts. The application in its current form is not meant to discern human-authored bogus texts from authentic ones. We have begun looking into this now.
We have been asked, interestingly, to consider some of the social problems you've posed here. And we've already been made aware of an "arms race," as it were. A good deal of spam uses this kind of automated letter writing to fool classifiers, but as many of you have likely noticed, the sentences don't make sense. I'm happy to discuss any questions or problems we might find interesting. It's unclear why this procedure works; we are trying to formalize this structure as well. Lastly, the application performs poorly on short texts, likely because the entropy is too low.
Best,
Memo Dalkilic (School of Informatics/Center for Genomics and Bioinformatics [Indiana University])
Posted by: Mehmet Dalkilic at May 3, 2006 8:38 PM