Opening up linguistic data at the American National Corpus

Connotea Imports 2012-07-31

Summary:

"The American National Corpus (ANC) project is creating a collection of texts produced by native speakers of American English since 1990. Its goal is to provide at least 100 million words of contemporary language data covering a broad and representative range of genres, including but not limited to fiction, non-fiction, technical writing, newspaper, spoken transcripts of various verbal communications, as well as new genres (blogs, tweets, etc.). The project, which began in 1998, was originally motivated by three major groups: linguists, who use corpus data to study language use and change; dictionary publishers, who use large corpora to identify new vocabulary and provide examples; and computational linguists, who need very large corpora to develop robust language models—that is, to extract statistics concerning patterns of lexical, syntactic, and semantic usage—that drive natural language understanding applications such as machine translation and information search and retrieval (à la Google)...."

Link:

http://blog.okfn.org/2011/01/15/opening-up-linguistic-data-at-the-american-national-corpus/

Updated:

01/29/2011, 13:02

From feeds:

Open Access Tracking Project (OATP) » Connotea Imports

Tags:

oa.new oa.data oa.licensing oa.linguistics oa.libre oa.ssh

Authors:

petersuber

Date tagged:

07/31/2012, 14:48

Date published:

01/16/2011, 16:38