Language | Name of Corpus | Size | Author/Institution | Info |
Arabic | Khaleej-2004 corpus | 3 million words | | |
| Watan-2004 corpus | 20000 articles | | |
Basque | XX Century Basque language corpora | | | |
British Sign Language | British Sign Language Corpus Project | 249 Deaf people filmed in 8 cities across the United Kingdom | led by staff at the Deafness Cognition and Language Research Centre (DCAL) at University College London, with researchers from Bangor University (Wales), Heriot-Watt University (Scotland), Queen's University Belfast (Northern Ireland) and the University of Bristol (England) | a collection of video clips showing deaf people using BSL |
Catalan | Corpus del català contemporani | | | a corpus of contemporary colloquial Catalan |
Croatian | Croatian National Corpus (HNK) | 216.8 million tokens | the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb | |
Czech | The Prague Dependency Treebank | 1.8 million words | | texts drawn from the Czech National Corpus (see section 2.4), annotated morphologically and syntactically |
| Český národní korpus (ČNK) | | | the Czech National Corpus |
Danish | Korpus 90 for Danish | 32 million tokens | Society for Danish Language and Literature | |
| Korpus 2000 for Danish | 30 million tokens | Society for Danish Language and Literature | |
| Korpus 2010 for Danish | 45 million tokens | Society for Danish Language and Literature | |
Dutch | The Institute for Dutch Lexicology (INL) - The Words Corpus 1996 | 38 million words | | |
English | British National Corpus (BYU-BNC) | 100 million words | originally created by Oxford University Press | written; texts from 1980s through 1993 |
| Michigan Corpus of Academic Spoken English (MICASE) | approximately 1.8 million words | University of Michigan | transcripts |
| Corpus of Contemporary American English (COCA) | 520 million words | Mark Davies | written; texts of various genres; created from 1990-2015 |
| The Brown Corpus | 1 million words | W. Nelson Francis and Henry Kučera at Brown University | written; text of edited English prose printed in the U.S. |
| The Open American National Corpus (Second Release) | 22 million words | Nancy Ide, Keith Suderman, Vassar College | written and spoken |
| The Griffith Corpus of Spoken Australian English (GCSAusE) | 32,134 words | Griffith University | Australian English, a collection of forty audio recordings and transcriptions of spoken interaction |
| Anthology Reference Corpus | 49,348,397 words | the Association for Computational Linguistics (ACL) | 10,291 research papers in computational linguistics |
| Cambridge English Corpus | multi-billion words | Cambridge University Press | written, spoken and learner texts |
| CLiC Dickens project | 3,835,807 words | University of Nottingham, University of Birmingham | literary texts |
| The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English | 106,210 words | Susan Pintzuk, Eric Haeberli, Ans van Kemenade, Willem Koopman, and Frank Beths | a selection of texts from the Old English Section of the Helsinki Corpus of English Texts, annotated to facilitate searches on lexical items and syntactic structure |
| The Cambridge and Nottingham Corpus of Discourse in English (CANCODE) | five million words | Mike McCarthy, Ronald Carter | corpus of spoken interaction |
| The edited Polytechnic of Wales (POW) corpus | 65,000 words | O’Donoghue, Tim | spoken English, an individual interview with the same "friendly" adult for each child, in which the child's favourite games or TV programmes were discussed; 120 children |
| The Lancaster-Oslo/Bergen Corpus (LOB Corpus) | 1 million words | Geoffrey Leech, Stig Johansson, Knut Hofland, Roger Garside | British English, 500 texts of c. 2,000 words, distributed across 15 text categories, 9 informative and 6 imaginative |
| The Louvain Corpus of Native English Essays (LOCNESS) | 324,304 words | the Centre for English Corpus Linguistics (CECL), Université catholique de Louvain, Belgium | British pupils’ A level essays: 60,209 words; British university students’ essays: 95,695 words; American university students’ essays: 168,400 words |
| The Manually Annotated Sub-Corpus (MASC) | 500,000 words | Nancy Ide, Keith Suderman | contemporary American English written and spoken data drawn from the Open American National Corpus (OANC) |
| The Penn Parsed Corpora of Historical English | | PPCME2: Kroch, Anthony, and Ann Taylor; PPCEME: Kroch, Anthony, Beatrice Santorini, and Lauren Delfs; PPCMBE2: Kroch, Anthony, Beatrice Santorini, and Ariel Diertani | the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2) |
| The Saarbrücken Corpus of Spoken English (SCoSE) | | the Department of English at Saarland University | Spoken English |
| The Wellington Corpus of Spoken New Zealand English (WSC) | one million words | Janet Holmes, Bernadette Vine | Spoken New Zealand English |
| The Wellington Corpus of Written New Zealand English | 1 million words | the Department of Linguistics at Victoria University of Wellington | written New Zealand English collected from writings published in the years 1986 to 1990 |
| The York-Helsinki Parsed Corpus of Old English Poetry | 71,490 words of Old English text samples | Susan Pintzuk, Leendert Plug | a selection of poetic texts from the Old English section of the Helsinki corpus of English texts, annotated to facilitate searches on syntactic structure and lexical items |
| The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) | 1.5 million words | Taylor, Ann (ed.); Warner, Anthony (ed.); Pintzuk, Susan (ed.); Beths, Frank (ed.) | syntactically-annotated corpus of Old English prose texts |
| The Intonational Variation in English (IViE) corpus | 36 hours of speech data | Phonetics Laboratory, University of Oxford; Department of Linguistics, University of Cambridge | recordings of nine urban dialects of English spoken in the British Isles |
| A Corpus of Late Modern English Prose | 100,000 words | Denison, David | a collection of five 20,000-word block samples of Late Modern English |
| A Representative Corpus of Historical English Registers | 1.7 million words | a consortium of participants at fourteen universities in seven countries | a multi-genre corpus of British and American English covering the period 1600-1999 |
| AustLit | 4,234,314 words | University of Queensland | Australian English, select samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s |
| Australian component of the International Corpus of English (ICE-AUS) | 1,055,919 words | Macquarie University | Australian English, transcribed spoken and written Australian English from 1992-1995 |
| Australian Radio Talkback (ART) | 251,677 words | Pam Peters | Australian English, transcribed recordings of samples of national, regional and commercial Australian talkback radio from 2004 to 2006 |
| Braided Channels Research Collection | 363,670 words | Trish FitzSimons | Australian English, 70 hours of oral history interviews with women from Australia's Channel Country, together with archival film, transcripts, photos and music |
| British Academic Written English Corpus (BAWE) | 8,336,262 tokens | the universities of Warwick, Reading and Oxford Brookes | just under 3,000 good-standard student assignments |
| British National Corpus | 100 million words | BNC Consortium | British English, both spoken and written from the late twentieth century; written texts (90%) and transcripts of speech (10%) |
| Corpus of Early English Correspondence Sampler (CEECS) | 450,085 words | Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki | covers the years 1418-1680 and consists of 1,147 letters written by 194 writers |
| Early English Books Online -TCP | 48,339 books | the Text Creation Partnership, ProQuest and more than 150 libraries | consists of the works represented in the English Short Title Catalogue I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement. Together these trace the history of English thought from the first book printed in English in 1475 through to 1700 |
| Eighteenth Century Collections Online-TCP | over 180,000 titles (200,000 volumes) | the Text Creation Partnership, Gale | includes every significant English-language and foreign-language title printed in the United Kingdom during the 18th century, along with thousands of important works from the Americas |
| Evans Early American Imprint Collection-TCP | 40,000 titles | the Text Creation Partnership, NewsBank/Readex Co., and the American Antiquarian Society | contains the full text of all known existing books, pamphlets, and broadsides printed in the United States (or British American colonies prior to Independence) from 1639 through 1819, some 72,000 titles. |
| FLOB | 1 million words | Christian Mair, Albert Ludwigs-Universität Freiburg | British English |
| Frown | 1 million words | Christian Mair, Albert Ludwigs-Universität Freiburg | American English |
| Korpus of Early Modern Playtexts in English (KEMPE) | 10.7 million tokens | Lene B. Petersen and Marcus X. Dahl (University of Bristol, UK) in association with the VISL project (University of Southern Denmark) | grammatically annotated with token based tags at the morphological/PoS and syntactic levels |
| La Trobe Corpus of Spoken Australian English (LTCSAusE) | 49,133 words | La Trobe University | Australian English, audio recordings (one recording per conversation) and transcripts |
| Michigan Early Modern English Materials | 36,000 modal verb entries | Richard W. Bailey, Jay L. Robinson, James W. Downer, and Patricia V. Lehman | consist of citations collected for the modal verbs and certain other English words for the Early Modern English Dictionary |
| Mitchell and Delbridge collection | 16 items | University of Sydney | Australian English, audio recordings of spoken wordlists and monologues |
| Monash Corpus of English (MCE) | 95,584 words | Monash University | Australian English, audio recordings and transcripts of spoken interviews |
| Parsed Corpus of Early English Correspondence (PCEEC) | 2,159,132 words | University of Helsinki and University of York | letters |
| RCPCE profession-specific corpora | 8 corpora | the Research Centre for Professional Communication in English of the Hong Kong Polytechnic University | Hong Kong Corpus of Spoken English |
| The Academic Corpus | 3.5 million words | Victoria University of Wellington | 414 academic texts from a variety of subject areas |
| The Aix-MARSEC database | 5 hours of speech data | Cyril Auran, Caroline Bouzon and Céline De Looze (Savoirs, textes et langage); Daniel Hirst (Laboratoire parole et langage) | spoken British English, 5 hours of BBC recordings together with annotations at several linguistic levels |
| The Australian Corpus of English/The Macquarie corpus | 757,024 words | Macquarie University | Australian English, published texts taken from 15 different categories of nonfiction and fiction |
| The Bergen Corpus of London Teenage Language | a million words | the University of Bergen | speech of teenagers |
| The British Academic Spoken English (BASE) | 1,644,942 tokens | Hilary Nesi, Paul Thompson | British Academic Spoken English, 160 lectures and 39 seminars recorded in a variety of university departments |
| The British component of the International Corpus of English (ICE-GB) | one million words | Gerald Nelson at the Chinese University of Hong Kong | the British component of the International Corpus of English (ICE) |
| The CHRISTINE corpus | approximately 80,500 words | Geoffrey Sampson | spoken data |
| The Corpus of English Dialogues 1560–1760 (CED) | 1.3 million words | the Arts and Humanities Data Service (AHDS), University of Oxford | dialogues from 1560 to 1760 |
| The Corpus of Middle English Prose and Verse | sixty-two texts | The Humanities Text Initiative | Middle English texts |
| The Corpus of Oz Early English (COOEE) | 1,545,163 words | Clemens Fritz | texts written in Australia, New Zealand or Norfolk Island, or by native Australians on travels, between 1788 and 1900 |
| The Corpus of Professional English | 17 million words | Shogakukan Corpus Network | English academic journal texts in science, engineering, technology and other fields |
| The Corpus of Spoken, Professional American-English (CSPA) | two million words of speech | Michael Barlow | a selection of existing transcripts of interactions in professional settings |
| The Dictionary of Old English Corpus in Electronic Form | over three million words | Centre for Medieval Studies, University of Toronto | a complete record of surviving Old English except |
| The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose | 300,000 words | David Denison in collaboration with Linda van Bergen (from 1998) and Joana Soliva (formerly Proud) (from 1999) | about 300,000 words of local English letters on practical subjects, dated 1761-90 |
| The Helsinki Corpus of English Texts | 1.5 million words | Matti Rissanen | covers a thousand years of English texts, from the eighth to the beginning of the eighteenth century |
| The Innsbruck Computer Archive of Machine-Readable English Texts (ICAMET) | 7.8 million words | Markus, Manfred | three subsections, namely the Prose Corpus 1100-1500 (a full-text database), the Letter Corpus 1386-1688 (containing 254 complete letters from different sources, arranged diachronically), and the Prose Varia Corpus (a mixture of tagged, normalized, translated and otherwise manipulated or synopsized texts) |
| The International Corpus of English (ICE) | long-term aim: twenty one million words | twenty-three research teams around the world | All ICE corpora contain 500 texts of approximately 2,000 words each, sampled from a wide range of spoken (60%) and written (40%) genres |
| The Lampeter Corpus of Early Modern English Tracts | 1.1 million words | the English Department at Helsinki University and the Department of Linguistics & Modern English Language at Lancaster University | a collection of texts on various subject matter published between 1640 and 1740 |
| The London-Lund Corpus of Spoken English (LLC) | 500,000 words | Jan Svartvik, Lund University | Spoken British English |
| The LUCY corpus | 165,000 words | Geoffrey Sampson | present-day British written English |
| The Machine Readable Spoken English Corpus | | the School of Linguistics at Reading University | |
| The Penn Treebank (PTB) | over 4.5 million words | Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor | selected 2,499 stories from a three year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation |
| The Santa Barbara Corpus of Spoken American English (SBCSAE) | approximately 249,000 words | the Linguistics Department of the University of California, Santa Barbara | recordings of naturally occurring spoken interaction from all over the United States |
| The SUSANNE Corpus (an acronym for “surface and underlying structural analysis of natural English”) | a 130,000-word sub-sample | Geoffrey Sampson | a subset of the Brown Corpus of American English |
| The Switchboard Corpus (SWB) | three million words (over 240 hours of recordings) | John J Godfrey, Edward Holliman | Spoken American English, approximately 2,400 telephone conversations between unacquainted adults |
| The Zurich English Newspaper Corpus | 1.6 million words | University of Zurich | English newspapers published between 1661 and 1791 |
| The Reading Academic Text (RAT) corpus | a million words | University of Reading | a collection of academic texts, written by academic staff or students at the University of Reading |
French | The Project for American and French Research on the Treasury of the French Language (ARTFL) | 150 million words | a cooperative project by the Centre National de la Recherche Scientifique and the University of Chicago. | |
| Un corpus d’entretiens spontanés | 95 conversations/speakers | Kate Beeching, University of the West of England | |
German | COSMAS II (Corpus Search, Management and Analysis System) | two billion words, of which over 1.1 billion are publicly available free of charge | | |
| The core corpus of the 20th century | 100 million tokens | Berlin-Brandenburg Academy of Sciences and Humanities (BBAW), Berlin | |
| The Berliner Zeitung corpus | 252 million tokens | | the complete set of articles published online between January 1994 and December 2005 |
| Wortschatz-Portal | | | |
| LIMAS | | | |
| The Hamburg Dependency Treebank | part A, 101,999 sentences; part B, 104,795 sentences; part C, 55,027 sentences | Wolfgang Menzel | the largest dependency treebank available; consists of dependency annotations, based on sentences sourced from the German news site heise.de, from articles published between 1996 and 2001. |
| Korpus Südtirol | | | text corpus of South Tyrolean German |
Hungarian | The Hungarian National Corpus (HNC) | 187.6 million words | Department of Corpus Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences, Hungarian Language Offices | 5 subcorpora: Hungary, Slovakia, Subcarpathia, Transylvania, Vojvodina; 5 text genres: press, literature, science, official, personal |
| Hungarian Webcorpus | 1.48 billion words unfiltered | | The largest Hungarian language corpus, available in its entirety under a permissive Open Content license. |
Icelandic | The Málrómur corpus | about 120,000 voice samples from 592 individuals | Sigrún Helgadóttir | an open source corpus of Icelandic voice samples |
| The Tagged Icelandic Corpus (MÍM) | 25 million tokens | Sigrún Helgadóttir | contemporary Icelandic texts |
Irish | Tobar na Gaedhilge (‘The source of Irish’) | 2.5 million words | Ciarán Ó Duibhín | |
| Corpus Náisiúnta na Gaeilge / The National Corpus of Irish | 8 million words | the Royal Irish Academy | |
Italian | CORIS/CODIS - corpus of written Italian | 130 million words | R. Rossini Favretti | CORIS is a corpus of written Italian; CODIS is a further corpus aimed at specialist needs that lets users select the subcorpora relevant to a specific research project and set the size of each subcorpus |
| Banca dati dell'italiano parlato (BADIP) | 490,000 words | a group of linguists under the direction of Tullio De Mauro, in collaboration with IBM Italy | spoken Italian |
| Link to a list of Italian corpora | | by Institute of Cognitive Sciences and Technologies | |
Korean | Korean National Corpus | goal: 200 million eojuls | | |
Mandarin Chinese | Sinica Treebank | 361,834 words | Academia Sinica | Mandarin Chinese as used in Taiwan, extracted from the Sinica Corpus |
| Chinese Treebank 9.0 | 3,247,331 characters (hanzi or foreign) | Nianwen Xue, Xiuhong Zhang, Zixin Jiang, Martha Palmer, Fei Xia, Fu-Dong Chiou, Meiyu Chang | annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech |
| The Lancaster Corpus of Mandarin Chinese | 1 million words | McEnery, A.M. (ed.); Xiao, Richard (ed.) | |
| Academia Sinica Balanced Corpus (ASBC) of Modern Chinese/Sinica Corpus | 10 million words | Academia Sinica | Mandarin Chinese as used in Taiwan, texts published from 1981 to 2007 |
| The Modern Chinese Language Corpus (MCLC)/ Sinica Corpus | 11,245,330 word tokens | Academia Sinica | |
| CCL (Center for Chinese Linguistics PKU) Corpus | 783,463,175 tokens | Center for Chinese Linguistics PKU | |
| cncorpus | 19,455,328 tokens | State Language Commission | |
| The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) | | Dr. Richard Xiao (UCREL of Lancaster University) and Professor Hongyin Tao (University of California Los Angeles) | a corpus of spoken Mandarin Chinese. The corpus is composed of 1,002,151 words of dialogues and monologues, both spontaneous and scripted, in 73,976 sentences and 49,670 utterance units (paragraphs) |
Modern Greek | The Hellenic National Corpus | 34 million words | The Institute for Language and Speech Processing | written texts |
Persian | Uppsala Persian Corpus (UPC) | 2,704,028 tokens | Mojgan Seraji | a modified version of the Bijankhan corpus (Bijankhan, 2004) with additional sentence segmentation and consistent tokenization |
Polish | The Polish National Corpus | 1.5 billion words | Institute of Computer Science at the Polish Academy of Sciences (coordinator), Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź | on the PELCRA (Polish and English Language Corpora for Research and Application) project |
Portuguese | The CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) | 180 million words | | corpus of newspaper text from the daily Portuguese newspaper Público |
| Linguateca | | the Portuguese Ministry of Science and Technology | 96 corpora |
| O Corpus do Português | two subcorpora: 45 million words and 1 billion words | Mark Davies, BYU | two parts: the (original, smaller) corpus that allows you to look at historical changes and genre-based variation; the (new, much larger) corpus that you can use to look at dialectal variation (and have 50x as much data for Modern Portuguese) |
| Tycho Brahe Parsed Corpus of Historical Portuguese | 76 texts (3,303,196 words) | Galves, Charlotte, and Pablo Faria | texts written in Portuguese by authors born between 1380 and 1881 |
Russian | Russian National Corpus | 300 million words | Institute of Russian language, Russian Academy of Sciences | 6 subcorpora: The Deeply Annotated corpus, The Parallel Corpora, The Dialectal corpus, The Poetry corpus, The Educational corpus, The Corpus of Spoken Russian |
| The Helsinki Annotated Corpus of Russian Texts (HANCO) | 100,000 running words | the Department of Slavonic and Baltic Languages and Literatures at the University of Helsinki | extracted from a modern Russian magazine |
| Open Corpus of the Russian language | | | |
| Stories about dreams and other corpora of spoken language | four subcorpora range from 5,000 to 14,000 | | spontaneous informal spoken discourse |
| Corpus of Standard Written Russian | | | |
| Computer corpus of texts | | | retrieved from newspapers of the late 20th century |
Tatar | Corpus of Written Tatar | 116 million word occurrences | | |
Turkish | Turkish National Corpus | 50 million words | | samples of textual data across a wide variety of genres covering a period of 20 years (1990-2009); both written data and transcription from spoken data |
Scottish English | The Scottish Corpus of Texts and Speech (SCOTS) | 4.6 million words | John Corbett | Written and Spoken, with audio recordings to accompany many of the spoken texts |
| The Corpus of Modern Scottish Writing (CMSW) | 5.5 million words | John Corbett, Jeremy Smith | written and printed texts from the period 1700-1945 |
Slovak | Corpus of Slovak Wikipédia and Necyklopédia | 42,615,597 tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | written |
| Corpus of Spoken Slovak | 5.72 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | spoken |
| Corpus of Copywrighting Texts on the Web | 1,648,229 tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| Corpus of Economic Texts | 165 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| Corpus of Legal Texts | 146 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| Corpus of Religious Texts | 66 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| Corpus of Social Science Texts | 38,616,514 tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| The Slovak National Corpus | 1,250 million tokens publicly available | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| Slovak Terminology Database | 6000 terms | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
Spanish | Corpus de Referencia del Español Actual | 133 million | Real Academia Española | written (90%) and spoken (10%) |
| COLA (Corpus Oral de Lenguaje Adolescente Resource) | 751,168 tokens | University of Bergen | a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile |
| the Corpus del Español | two subcorpora: 100 million words and 2 billion words | Mark Davies, BYU | two parts: the (original, smaller) corpus that allows you to look at historical changes and genre-based variation; the (new, much larger) corpus that you can use to look at dialectal variation (and have 100x as much data for Modern Spanish). |
Swedish | The Bank of Swedish | about 12.5 billion tokens | | a linguistic reference databank at the University of Gothenburg |
Welsh | The CEG (Cronfa Electroneg o Gymraeg) corpus | one million words | Ellis, O'Dochartaigh & Hicks of the Welsh IT Unit and the School of Psychology, University of Wales, Bangor | modern (mainly post 1970) Welsh prose writing |
Turkish | TS Wikipedia | approximately 1.6 million | | processed Turkish Wikipedia pages |
Multilingual | European Corpus Initiative Multilingual Corpus I (ECI/MCI) | 98 million words | European Corpus Initiative (ECI) | 46 subcorpora in 27 (mainly European) languages |
| MULTEXT JOC Corpus | 5 million words | the European Community | English, French, German, Italian and Spanish |
| MULTEXT-East "1984" annotated corpus 4.0 | 100,000 English words and translations in 9 languages | Erjavec, Tomaž; Barbu, Ana-Maria; Derzhanski, Ivan; Dimitrova, Ludmila; Garabík, Radovan; Ide, Nancy; Kaalep, Heiki-Jaan; Kotsyba, Natalia; Krstev, Cvetana; Oravecz, Csaba; Petkevič, Vladimír; Priest-Dorman, Greg; QasemiZadeh, Behrang; Radziszewski, Adam; Simov, Kiril; Tufiş, Dan; Zdravkova, Katerina | The Multext-East parallel corpus consists of the English original of George Orwell's novel '1984' together with its translations into the nine project languages: Bulgarian, Czech, Estonian, Hungarian, Lithuanian, Romanian, Russian, Serbian, and Slovene. |
| Multilingual Corpora for Cooperation (MLCC) | approximately 10.2 million words | LTG, Edinburgh and ISSCO with coordination by CNR, Pisa | two main components: the Polylingual Document Collection, and a Multilingual Parallel Corpus consisting of translated data in nine European languages |
| The Child Language Data Exchange System (CHILDES) | 180 million characters (ca. 20 million words) | | a system for sharing and studying conversational interactions |
| The CLUVI (Linguistic Corpus of the University of Vigo) parallel corpus | 25 million words | SLI (Computational Linguistics Group of the University of Vigo) | main components are the TECTRA Corpus of English-Galician literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of Galician-Spanish legal texts, the UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation texts, the LOGALIZA Corpus of English-Galician software localization, and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information |
| The EMILLE Corpus | monolingual corpora 92,799,000 words; parallel corpus 200,000 words of text in English and its accompanying translations | Lancaster University, UK, and the Central Institute of Indian Languages (CIIL), Mysore, India | three components: monolingual, parallel and annotated corpora |
| The Oslo Multilingual Corpus (OMC) | 2.6 million words | University of Oslo, the University of Bergen | original texts and translations from several languages: Norwegian, English, French, German, Dutch, Portuguese, Swedish and Finnish |
| The Czech National Corpus | 26 corpora, of which the SYN series reaches 2,232 million | Institute of the Czech National Corpus (ICNC), Faculty of Arts, Charles University in Prague | written corpora, spoken corpora, parallel corpus, specialized corpora |
| CELEX2 | | | English, German, Dutch |
| European Parliament Proceedings Parallel Corpus 1996-2011 | | Philipp Koehn | |
| Corpora Collection | 425,703,278 tokens | | |
| The RuN-Euro corpus | 8,763,402 words | the RuN project (2008-2010) at the University of Oslo | a parallel corpus originally consisting of Norwegian and Russian texts; other European languages are currently being added |
Bilingual Parallel | Hong Kong Parallel Text | approximately 59 million English words and 49 million Chinese words (or 98 million Chinese characters) | Xiaoyi Ma | three sub-corpora, namely Hong Kong Hansards, Hong Kong Laws and Hong Kong News |
| Parallelum Slovaco-Latinum Corpus | 32,000 sentences in Slovak and 29,000 sentences in Latin | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovaco-Latin |
| Slovak-Bulgarian Parallel Corpus | 163 million tokens, 78 million in the Slovak half, 85 million in the Bulgarian one | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Bulgarian |
| Slovak-Czech Parallel Corpus | 418.5 million tokens, 209.2 million in the Slovak half, 209.3 million in the Czech half | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Czech |
| Slovak-English parallel corpora | 556 million tokens, 261 million tokens in the Slovak half, 295 million tokens in the English one | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| Slovak-French parallel corpora | 441.5 million tokens, 213.3 million in the Slovak part and 228.2 million in the French part | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-French |
| Slovak-German Parallel Corpus | 446.2 million tokens (219.8 million tokens in the Slovak half, 226.4 million tokens in the German half) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-German |
| Slovak-Hungarian Parallel Corpus | 99 million tokens (51 million in the Slovak half, 48 million in the Hungarian half) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Hungarian |
| Slovak-Russian parallel corpora | 8.45 million tokens, 4.2 million in the Slovak part and 4.25 million in the Russian part | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Russian |
| TED English Chinese parallel corpus of speeches | 6,187,849 English words and Chinese characters | Jiajin Xu | |
| The Babel Chinese-English Parallel Corpus | 20 million Chinese characters and 10 million English words | the Institute of Computational Linguistics of Beijing University | written, 327 English articles and their translations into Mandarin Chinese |
| The Canadian Hansard Corpus - USC version | 1.3 million pairs of aligned text chunks, 2 million words in English and French each | Ulrich Germann | spoken and written texts in English and French from the Canadian Parliament |
| The English-Norwegian Parallel Corpus (ENPC) | 2.6 million words | Stig Johansson, Knut Hofland | original texts and their translations (English to Norwegian and Norwegian to English) and includes both fiction and non-fiction |
| The English-Swedish Parallel Corpus (ESPC) | 2.8 million words | the Departments of English at the Universities of Lund and Gothenburg | original texts and their translations (English to Swedish and Swedish to English); both fictional and non-fictional texts are included |
| The IJS-ELAN Slovene-English Parallel Corpus (IJS-ELAN) | one million words | the Dept. of Knowledge Technologies, Jožef Stefan Institute | 15 parallel Slovene-English/English-Slovene texts |
Others | Romance Phonetics Database | | | an on-line research and teaching tool containing tagged sound samples (both individual words and passages) illustrative of various segmental and prosodic aspects of Romance phonetics and phonology |
| links to non-English language corpora | | Stanford University | |
| corpora list | | The Humboldt University of Berlin | |