Words by the Millions, Sorted by Software - NYTimes.com

abernard102@gmail.com 2012-08-20


“It just keeps growing: the vast electronic archive of books, journals and scholarly literature stored on the Web. But scientists are aiming to keep up with this trove of collective knowledge by devising computer-based tools to winnow and quantify it. David M. Blei of Princeton University is among those who are teaching computers to sift through the digital pages of books and articles and categorize the contents by subject, even when that subject isn’t stated explicitly. For decades, of course, librarians and many others have labeled books and documents with keywords. ‘But human categorization can only go so far,’ said Dr. Blei, an associate professor in computer science. ‘We don’t have the human power to read and tag all this information.’
To cope with the information explosion, Dr. Blei and other researchers write algorithms so that computers can sift through millions of works and find their common themes by sorting related words into categories. It’s a field called probabilistic topic modeling.
Other research tools identify shifts in language over time that could signal important cultural, scientific or historical changes. At Harvard, Erez Lieberman Aiden and Jean-Baptiste Michel, who jointly lead a group there called the Cultural Observatory, will soon inaugurate a browser that searches for such language changes in a large online repository of scientific papers known as arXiv (pronounced like ‘archive’). Users will be able to type in one or two words at the site, called Bookworm-arXiv, and immediately see a graph showing the ups and downs of the phrase’s use in the archive, Dr. Michel said. (A test version is at arxiv.culturomics.org.) Users can then click on the graph and drill down to read the original papers in which the terms appear, tracing ideas back toward their roots, or to spots where scientific ideas spread from one field to another. ...
Bookworm-arXiv will burrow its way through data stored in roughly 743,000 papers that have been uploaded by scientists, said Paul Ginsparg, founder of arXiv. Authors typically send their papers to arXiv as ‘preprints’ or unpublished manuscripts before the works appear in journals. Most of the research is in physics, mathematics, computer science, statistics and the quantitative parts of biology and finance, said Dr. Ginsparg, a professor of physics and information science at Cornell.
The Bookworm-arXiv interface is the latest in a series of tools developed by the Cultural Observatory. Late in 2010, in collaboration with Google, the lab released the Google n-gram viewer, which lets people search for a phrase of up to five words in Google’s database of scanned books and see the frequency of the words over time in a graph, Dr. Aiden said.
The n-gram viewer is a powerful tool, said Dr. Grafton at Princeton. For example, he said, it could trace the disappearance of the names of scientists and artists who were censored by the Nazis in Germany. But the n-gram viewer, however useful, has a disadvantage. It does not let users click through to the original documents, because many books included in the Google database are under copyright. Readers who use the arXiv interface will be able to click through to the original text. ‘The papers are not behind a paywall,’ Dr. Ginsparg said...
Dr. Aiden said the worlds of Google’s scanned books and arXiv’s papers were just the beginning for the observatory. ‘We plan on moving on soon to newspapers, blogs, tweets and other aspects of the historical record,’ he said...”




08/16/2012, 06:08

From feeds:

Open Access Tracking Project (OATP) » abernard102@gmail.com


oa.new oa.licensing oa.comment oa.copyright oa.cs oa.libraries oa.google oa.arxiv oa.princeton.u oa.social_media oa.twitter oa.books oa.tools oa.ch oa.gratis oa.harvard.u oa.preprints oa.modeling oa.newspapers oa.digital_humanities oa.cornell.u oa.blogs oa.tags oa.bookworm oa.culturomics oa.n-gram_viewer oa.stem oa.libre oa.versions oa.humanities oa.ssh



Date tagged:

08/20/2012, 18:46

Date published:

03/25/2012, 14:20