CALPER Corpus Portal | General Corpora

A monolingual corpus: British English

The British National Corpus (BNC) was compiled by a consortium of British publishers and academic institutions, including Oxford University Computing Services, Lancaster University’s Centre for Computer Research on the English Language, and the British Library. Compiled in the late 1980s and early 1990s, it is a 100-million-word corpus of modern British English, consisting of 90% written texts (informative prose and ‘imaginative’ texts) and 10% spoken texts (speeches, meetings, lectures, and some casual conversation). The complete corpus is available on CD-ROM for research purposes. A smaller subset, the BNC Sampler, is also available on CD-ROM; it consists of one million words of both spoken and written texts. The Sampler can be purchased over the internet and comes complete with various software packages.

A monolingual corpus: American English

The BNC influenced the creation of other large monolingual corpora, such as the American National Corpus (ANC). The ANC covers American English and contains a core corpus of at least 100 million words, comparable across genres to the BNC.

A monolingual international corpus: varieties of English

The International Corpus of English (ICE) will ultimately be a collection of 1,000,000-word corpora from countries or regions where English is spoken as a first language. Each corpus consists of a written and a spoken component. Some components of the ICE corpus, such as the subcorpora of Philippine and Singapore English, are available free of charge, either by download or on CD-ROM.

A monolingual corpus: Spanish

To see an example of a general corpus not in English, go to the website for the Corpus del Español. This is an excellent corpus of Spanish, created by Prof. Mark Davies of Brigham Young University. You can search the corpus online.

An international multilingual corpus

European Corpus Initiative Multilingual Corpus I (ECI/MCI). The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus (ECI/MCI), to be made available in digital form for scientific research at as low a cost as possible. The corpus has been available on CD-ROM since 1994 and is distributed by ELSNET. It contains written texts in languages such as Spanish, German, French, Chinese and Albanian. A complete list of contents is available through the website.