General

| Language | Name of Corpus | Size | Author/Institution | Info |
| --- | --- | --- | --- | --- |
| Arabic | Khaleej-2004 corpus | 3 million words | | |
| | Watan-2004 corpus | 20,000 articles | | |
| Basque | XX Century Basque language corpora | | | |
| British Sign Language | British Sign Language Corpus Project | 249 Deaf people filmed in 8 cities across the United Kingdom | led by staff at the Deafness Cognition and Language Research Centre (DCAL) at University College London, with researchers from Bangor University (Wales), Heriot-Watt University (Scotland), Queen's University Belfast (Northern Ireland) and the University of Bristol (England) | a collection of video clips showing deaf people using BSL |
| Catalan | Corpus del català contemporani | | | a corpus of contemporary colloquial Catalan |
| Croatian | Croatian National Corpus (HNK) | 216.8 million tokens | the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb | |
| Czech | The Prague Dependency Treebank | 1.8 million words | | texts drawn from the Czech National Corpus (see section 2.4), annotated morphologically and syntactically |
| | Český národní korpus (ČNK) | | | the Czech national corpus |
| Danish | Korpus 90 for Danish | 32 million tokens | Society for Danish Language and Literature | |
| | Korpus 2000 for Danish | 30 million tokens | Society for Danish Language and Literature | |
| | Korpus 2010 for Danish | 45 million tokens | Society for Danish Language and Literature | |
| Dutch | The Institute for Dutch Lexicology (INL) - The Words Corpus 1996 | 38 million | | |
| English | British National Corpus (BYU-BNC) | 100 million words | originally created by Oxford University Press | written; texts from the 1980s through 1993 |
| | Michigan Corpus of Academic Spoken English (MICASE) | about 1.8 million words | University of Michigan | transcripts |
| | Corpus of Contemporary American English (COCA) | 520 million words | Mark Davies | written; texts of various genres; created from 1990-2015 |
| | The Brown Corpus | 1 million words | W. Nelson Francis and Henry Kučera at Brown University | written; edited English prose printed in the U.S. |
| | The Open American National Corpus (Second Release) | 22 million words | Nancy Ide, Keith Suderman, Vassar College | written and spoken |
| | The Griffith Corpus of Spoken Australian English (GCSAusE) | 32,134 words | Griffith University | Australian English; a collection of forty audio recordings and transcriptions of spoken interaction |
| | Anthology Reference Corpus | 49,348,397 words | the Association for Computational Linguistics (ACL) | 10,291 research papers in computational linguistics |
| | Cambridge English Corpus | multi-billion words | Cambridge University Press | written, spoken and learner texts |
| | CLiC Dickens project | 3,835,807 words | University of Nottingham, University of Birmingham | literary texts |
| | The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English | 106,210 words | Susan Pintzuk, Eric Haeberli, Ans van Kemenade, Willem Koopman, and Frank Beths | a selection of texts from the Old English Section of the Helsinki Corpus of English Texts, annotated to facilitate searches on lexical items and syntactic structure |
| | The Cambridge and Nottingham Corpus of Discourse in English (CANCODE) | five million words | Mike McCarthy, Ronald Carter | corpus of spoken interaction |
| | The edited Polytechnic of Wales (POW) corpus | 65,000 words | O’Donoghue, Tim | spoken English; an individual interview with the same "friendly" adult for each child, in which the child's favourite games or TV programmes were discussed; 120 children |
| | The Lancaster-Oslo/Bergen Corpus (LOB Corpus) | 1 million words | Geoffrey Leech, Stig Johansson, Knut Hofland, Roger Garside | British English; 500 texts of c. 2,000 words, distributed across 15 text categories, 9 informative and 6 imaginative |
| | The Louvain Corpus of Native English Essays (LOCNESS) | 324,304 words in total | the Centre for English Corpus Linguistics (CECL), Université catholique de Louvain, Belgium | British pupils’ A-level essays: 60,209 words; British university students’ essays: 95,695 words; American university students’ essays: 168,400 words |
| | The Manually Annotated Sub-Corpus (MASC) | 500,000 words | Nancy Ide, Keith Suderman | contemporary American English written and spoken data drawn from the Open American National Corpus (OANC) |
| | The Penn Parsed Corpora of Historical English | | PPCME2: Kroch, Anthony, and Ann Taylor; PPCEME: Kroch, Anthony, Beatrice Santorini, and Lauren Delfs; PPCMBE2: Kroch, Anthony, Beatrice Santorini, and Ariel Diertani | the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2) |
| | The Saarbrücken Corpus of Spoken English (SCoSE) | | the Department of English at Saarland University | spoken English |
| | The Wellington Corpus of Spoken New Zealand English (WSC) | one million words | Janet Holmes, Bernadette Vine | spoken New Zealand English |
| | The Wellington Corpus of Written New Zealand English | 1 million words | the Department of Linguistics at Victoria University of Wellington | written New Zealand English collected from writings published in the years 1986 to 1990 |
| | The York-Helsinki Parsed Corpus of Old English Poetry | 71,490 words of Old English text samples | Susan Pintzuk, Leendert Plug | a selection of poetic texts from the Old English section of the Helsinki Corpus of English Texts, annotated to facilitate searches on syntactic structure and lexical items |
| | The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) | 1.5 million words | Taylor, Ann (ed.); Warner, Anthony (ed.); Pintzuk, Susan (ed.); Beths, Frank (ed.) | syntactically annotated corpus of Old English prose texts |
| | The Intonational Variation in English (IViE) corpus | 36 hours of speech data | Phonetics Laboratory, University of Oxford; Department of Linguistics, University of Cambridge | recordings of nine urban dialects of English spoken in the British Isles |
| | A Corpus of Late Modern English Prose | 100,000 words | Denison, David | a collection of five 20,000-word block samples of Late Modern English |
| | A Representative Corpus of Historical English Registers | 1.7 million words | a consortium of participants at fourteen universities in seven countries | a multi-genre corpus of British and American English covering the period 1600-1999 |
| | AustLit | 4,234,314 words | University of Queensland | Australian English; select samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s |
| | Australian component of the International Corpus of English (ICE-AUS) | 1,055,919 words | Macquarie University | Australian English; transcribed spoken and written Australian English from 1992-1995 |
| | Australian Radio Talkback (ART) | 251,677 words | Pam Peters | Australian English; transcribed recordings of samples of national, regional and commercial Australian talkback radio from 2004 to 2006 |
| | Braided Channels Research Collection | 363,670 words | Trish FitzSimons | Australian English; 70 hours of oral history interviews with women from Australia's Channel Country, together with archival film, transcripts, photos and music |
| | British Academic Written English Corpus (BAWE) | 8,336,262 tokens | the universities of Warwick, Reading and Oxford Brookes | just under 3,000 good-standard student assignments |
| | British National Corpus | 100 million words | BNC Consortium | British English, both spoken and written, from the late twentieth century; written texts (90%) and transcripts of speech (10%) |
| | Corpus of Early English Correspondence Sampler (CEECS) | 450,085 words | Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki | covers the years 1418-1680 and consists of 1,147 letters written by 194 writers |
| | Early English Books Online-TCP | 48,339 books | the Text Creation Partnership, ProQuest and more than 150 libraries | consists of the works represented in the English Short Title Catalogue I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement; together these trace the history of English thought from the first book printed in English in 1475 through to 1700 |
| | Eighteenth Century Collections Online-TCP | over 180,000 titles (200,000 volumes) | the Text Creation Partnership, Gale | includes every significant English-language and foreign-language title printed in the United Kingdom during the 18th century, along with thousands of important works from the Americas |
| | Evans Early American Imprint Collection-TCP | 40,000 titles | the Text Creation Partnership, NewsBank/Readex Co., and the American Antiquarian Society | contains the full text of all known existing books, pamphlets, and broadsides printed in the United States (or British American colonies prior to Independence) from 1639 through 1819, some 72,000 titles |
| | FLOB | 1 million words | Christian Mair, Albert-Ludwigs-Universität Freiburg | British English |
| | Frown | 1 million words | Christian Mair, Albert-Ludwigs-Universität Freiburg | American English |
| | Korpus of Early Modern Playtexts in English (KEMPE) | 10.7 million tokens | Lene B. Petersen and Marcus X. Dahl (University of Bristol, UK) in association with the VISL project (University of Southern Denmark) | grammatically annotated with token-based tags at the morphological/PoS and syntactic levels |
| | La Trobe Corpus of Spoken Australian English (LTCSAusE) | 49,133 words | La Trobe University | Australian English; audio recordings (one recording per conversation) and transcripts |
| | Michigan Early Modern English Materials | 36,000 modal verb entries | Richard W. Bailey, Jay L. Robinson, James W. Downer, and Patricia V. Lehman | consists of citations collected for the modal verbs and certain other English words for the Early Modern English Dictionary |
| | Mitchell and Delbridge collection | 16 items | University of Sydney | Australian English; audio recordings of spoken wordlists and monologues |
| | Monash Corpus of English (MCE) | 95,584 words | Monash University | Australian English; audio recordings and transcripts of spoken interviews |
| | Parsed Corpus of Early English Correspondence (PCEEC) | 2,159,132 | University of Helsinki and University of York | letters |
| | RCPCE profession-specific corpora | 8 corpora | the Research Centre for Professional Communication in English of the Hong Kong Polytechnic University | Hong Kong Corpus of Spoken English |
| | The Academic Corpus | 3.5 million words | Victoria University of Wellington | 414 academic texts from a variety of subject areas |
| | The Aix-MARSEC database | 5 hours of speech data | Cyril Auran (Savoirs, textes et langage), Caroline Bouzon (Savoirs, textes et langage), Céline De Looze (Savoirs, textes et langage), Daniel Hirst (Laboratoire parole et langage) | spoken British English; 5 hours of BBC recordings together with annotations at several linguistic levels |
| | The Australian Corpus of English/The Macquarie corpus | 757,024 words | Macquarie University | Australian English; published texts taken from 15 different categories of nonfiction and fiction |
| | The Bergen Corpus of London Teenage Language | a million words | the University of Bergen | speech of teenagers |
| | The British Academic Spoken English (BASE) | 1,644,942 tokens | Hilary Nesi, Paul Thompson | British academic spoken English; 160 lectures and 39 seminars recorded in a variety of university departments |
| | The British component of the International Corpus of English (ICE-GB) | one million words | Gerald Nelson at the Chinese University of Hong Kong | the British component of the International Corpus of English (ICE) |
| | The CHRISTINE corpus | approximately 80,500 words | Geoffrey Sampson | spoken data |
| | The Corpus of English Dialogues 1560–1760 (CED) | 1.3 million words | the Arts and Humanities Data Service (AHDS), University of Oxford | dialogues from 1560 to 1760 |
| | The Corpus of Middle English Prose and Verse | sixty-two texts | The Humanities Text Initiative | Middle English texts |
| | The Corpus of Oz Early English (COOEE) | 1,545,163 words | Clemens Fritz | texts written in Australia, New Zealand or Norfolk Island, or by native Australians on travels, between 1788 and 1900 |
| | The Corpus of Professional English | 17 million words | Shogakukan Corpus Network | English academic journal texts in science, engineering, technology and other fields |
| | The Corpus of Spoken, Professional American-English (CSPA) | two million words of speech | Michael Barlow | a selection of existing transcripts of interactions in professional settings |
| | The Dictionary of Old English Corpus in Electronic Form | over three million words | Centre for Medieval Studies, University of Toronto | a complete record of surviving Old English except |
| | The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose | 300,000 words | David Denison in collaboration with Linda van Bergen (from 1998) and Joana Soliva (formerly Proud) (from 1999) | about 300,000 words of local English letters on practical subjects, dated 1761-90 |
| | The Helsinki Corpus of English Texts | 1.5 million words | Matti Rissanen | covers a thousand years of English texts, from the eighth to the beginning of the eighteenth century |
| | The Innsbruck Computer Archive of Machine-Readable English Texts (ICAMET) | 7.8 million words | Markus, Manfred | three subsections, namely the Prose Corpus 1100-1500 (a full-text database), the Letter Corpus 1386-1688 (containing 254 complete letters from different sources, arranged diachronically), and the Prose Varia Corpus (a mixture of tagged, normalized, translated and otherwise manipulated or synopsized texts) |
| | The International Corpus of English (ICE) | long-term aim: twenty-one million words | twenty-three research teams around the world | all ICE corpora contain 500 texts of approximately 2,000 words each, sampled from a wide range of spoken (60%) and written (40%) genres |
| | The Lampeter Corpus of Early Modern English Tracts | 1.1 million words | the English Department at Helsinki University and the Department of Linguistics & Modern English Language at Lancaster University | a collection of texts on various subject matter published between 1640 and 1740 |
| | The London-Lund Corpus of Spoken English (LLC) | 500,000 words | Jan Svartvik, Lund University | spoken British English |
| | The LUCY corpus | 165,000 words | Geoffrey Sampson | present-day British written English |
| | The Machine Readable Spoken English Corpus | | the School of Linguistics at Reading University | |
| | The Penn Treebank (PTB) | over 4.5 million words | Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor | 2,499 stories selected from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation |
| | The Santa Barbara Corpus of Spoken American English (SBCSAE) | approximately 249,000 words | the Linguistics Department of the University of California, Santa Barbara | recordings of naturally occurring spoken interaction from all over the United States |
| | The SUSANNE corpus (an acronym for “surface and underlying structural analysis of natural English”) | a 130,000-word sub-sample | Geoffrey Sampson | a subset of the Brown Corpus of American English |
| | The Switchboard Corpus (SWB) | three million words (over 240 hours of recordings) | John J Godfrey, Edward Holliman | spoken American English; approximately 2,400 telephone conversations between unacquainted adults |
| | The Zurich English Newspaper Corpus | 1.6 million words | University of Zurich | English newspapers published between 1661 and 1791 |
| | The Reading Academic Text (RAT) corpus | a million words | University of Reading | a collection of academic texts, written by academic staff or students at the University of Reading |
| French | The Project for American and French Research on the Treasury of the French Language (ARTFL) | 150 million words | a cooperative project by the Centre National de la Recherche Scientifique and the University of Chicago | |
| | Un corpus d’entretiens spontanés | 95 conversations/speakers | Kate Beeching, University of the West of England | |
| German | COSMAS II (Corpus Search, Management and Analysis System) | two billion words, of which over 1.1 billion words are publicly available free of charge | | |
| | The core corpus of the 20th century | 100 million tokens | Berlin-Brandenburg Academy of Sciences and Humanities (BBAW), Berlin | |
| | The Berliner Zeitung corpus | 252 million tokens | | the complete set of articles published online between January 1994 and December 2005 |
| | Wortschatz-Portal | | | |
| | LIMAS | | | |
| | The Hamburg Dependency Treebank | part A, 101,999 sentences; part B, 104,795 sentences; part C, 55,027 sentences | Wolfgang Menzel | the largest dependency treebank available; consists of dependency annotations of sentences sourced from the German news site heise.de, from articles published between 1996 and 2001 |
| | Korpus Südtirol | | | text corpus of South Tyrolean German |
| Hungarian | The Hungarian National Corpus (HNC) | 187.6 million words | Department of Corpus Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences, Hungarian Language Offices | 5 subcorpora: Hungary, Slovakia, Subcarpathia, Transylvania, Vojvodina; 5 text genres: press, literature, science, official, personal |
| | Hungarian Webcorpus | 1.48 billion words unfiltered | | the largest Hungarian-language corpus, available in its entirety under a permissive Open Content license |
| Icelandic | The Malromur corpus | about 120,000 voice samples from 592 individuals | Sigrún Helgadóttir | an open-source corpus of Icelandic voice samples |
| | The Tagged Icelandic Corpus (MÍM) | 25 million tokens | Sigrún Helgadóttir | contemporary Icelandic texts |
| Irish | Tobar na Gaedhilge (‘The source of Irish’) | 2.5 million words | Ciarán Ó Duibhín | |
| | Corpus Náisiúnta na Gaeilge / The National Corpus of Irish | 8 million words | the Royal Irish Academy | |
| Italian | CORIS/CODIS - corpus of written Italian | 130 million words | R. Rossini Favretti | CORIS is a corpus of written Italian; CODIS is a further corpus aimed at specialist needs that allows the selection of the subcorpora pertinent to a specific research project, and also the size of every single subcorpus |
| | Banca dati dell'italiano parlato (BADIP) | 490,000 words | a group of linguists under the direction of Tullio De Mauro, in collaboration with IBM Italy | spoken Italian |
| | List of Italian corpora (link) | | Institute of Cognitive Sciences and Technologies | |
| Korean | Korean National Corpus | goal: 200 million eojeols | | |
| Mandarin Chinese | Sinica Treebank | 361,834 words | Academia Sinica | Mandarin Chinese as used in Taiwan, extracted from the Sinica Corpus |
| | Chinese Treebank 9.0 | 3,247,331 characters (hanzi or foreign) | Nianwen Xue, Xiuhong Zhang, Zixin Jiang, Martha Palmer, Fei Xia, Fu-Dong Chiou, Meiyu Chang | annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech |
| | The Lancaster Corpus of Mandarin Chinese | 1 million words | McEnery, A.M. (ed.); Xiao, Richard (ed.) | |
| | Academia Sinica Balanced Corpus (ASBC) of Modern Chinese/Sinica Corpus | 10 million words | Academia Sinica | Mandarin Chinese as used in Taiwan; texts published from 1981 to 2007 |
| | The Modern Chinese Language Corpus (MCLC)/Sinica Corpus | 11,245,330 word tokens | Academia Sinica | |
| | CCL (Center for Chinese Linguistics PKU) Corpus | 783,463,175 tokens | Center for Chinese Linguistics PKU | |
| | cncorpus | 19,455,328 tokens | State Language Commission | |
| | The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) | 1,002,151 words | Dr. Richard Xiao (UCREL of Lancaster University) and Professor Hongyin Tao (University of California Los Angeles) | a corpus of spoken Mandarin Chinese, composed of dialogues and monologues, both spontaneous and scripted, in 73,976 sentences and 49,670 utterance units (paragraphs) |
| Modern Greek | The Hellenic National Corpus | 34 million words | The Institute for Language and Speech Processing | written texts |
| Persian | Uppsala Persian Corpus (UPC) | 2,704,028 tokens | Mojgan Seraji | a modified version of the Bijankhan corpus (Bijankhan, 2004) with additional sentence segmentation and consistent tokenization |
| Polish | The Polish National Corpus | 1.5 billion words | Institute of Computer Science at the Polish Academy of Sciences (coordinator), Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź | on the PELCRA (Polish and English Language Corpora for Research and Application) project |
| Portuguese | The CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público) | 180 million words | | corpus of newspaper text from the daily Portuguese newspaper Público |
| | Linguateca | | the Portuguese Ministry of Science and Technology | 96 corpora |
| | O Corpus do Português | two subcorpora: 45 million words and 1 billion words | Mark Davies, BYU | two parts: the (original, smaller) corpus for looking at historical changes and genre-based variation; the (new, much larger) corpus for looking at dialectal variation (with 50x as much data for Modern Portuguese) |
| | Tycho Brahe Parsed Corpus of Historical Portuguese | 76 texts (3,303,196 words) are available | Galves, Charlotte, and Pablo Faria | texts written in Portuguese by authors born between 1380 and 1881 |
| Russian | Russian National Corpus | 300 million words | Institute of Russian Language, Russian Academy of Sciences | 6 subcorpora: the Deeply Annotated corpus, the Parallel Corpora, the Dialectal corpus, the Poetry corpus, the Educational corpus, the Corpus of Spoken Russian |
| | The Helsinki Annotated Corpus of Russian Texts (HANCO) | 100,000 running words | the Department of Slavonic and Baltic Languages and Literatures at the University of Helsinki | extracted from a modern Russian magazine |
| | Open Corpus of the Russian language | | | |
| | Stories about dreams and other corpora of spoken language | four subcorpora ranging from 5,000 to 14,000 | | spontaneous informal spoken discourse |
| | Corpus of Standard Written Russian | | | |
| | Computer corpus of texts | | | retrieved from newspapers of the late 20th century |
| Tatar | Corpus of Written Tatar | 116 million word occurrences | | |
| Turkish | Turkish National Corpus | 50 million words | | samples of textual data across a wide variety of genres covering a period of 20 years (1990-2009); both written data and transcriptions of spoken data |
| Scottish English | The Scottish Corpus of Texts and Speech (SCOTS) | 4.6 million words | John Corbett | written and spoken, with audio recordings to accompany many of the spoken texts |
| | The Corpus of Modern Scottish Writing (CMSW) | 5.5 million words | John Corbett, Jeremy Smith | written and printed texts from the period 1700-1945 |
| Slovak | Corpus of Slovak Wikipédia and Necyklopédia | 42 615 597 tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | written |
| | Corpus of Spoken Slovak | 5.72 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | spoken |
| | Corpus of Copywrighting Texts on the Web | 1 648 229 tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| | Corpus of Economic Texts | 165 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| | Corpus of Legal Texts | 146 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| | Corpus of Religious Texts | 66 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| | Corpus of Social Science Texts | 38 616 514 tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| | The Slovak National Corpus | publicly available: 1250 million tokens | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| | Slovak Terminology Database | 6000 terms | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| Spanish | Corpus de Referencia del Español Actual | 133 million | Real Academia Española | written (90%) and spoken (10%) |
| | COLA (Corpus Oral de Lenguaje Adolescente Resource) | 751,168 tokens | University of Bergen | a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile |
| | The Corpus del Español | two subcorpora: 100 million words and 2 billion words | Mark Davies, BYU | two parts: the (original, smaller) corpus for looking at historical changes and genre-based variation; the (new, much larger) corpus for looking at dialectal variation (with 100x as much data for Modern Spanish) |
| Swedish | The Bank of Swedish | about 12.5 billion tokens | | a linguistic reference databank at the University of Gothenburg |
| Welsh | The CEG (Cronfa Electroneg o Gymraeg) corpus | one million words | Ellis, O'Dochartaigh & Hicks of the Welsh IT Unit and the School of Psychology, University of Wales, Bangor | modern (mainly post-1970) Welsh prose writing |
| Turkish | TS Wikipedia | approximately 1.6 million processed Turkish Wikipedia pages | | |
| Multilingual | European Corpus Initiative Multilingual Corpus I (ECI/MCI) | 98 million words | European Corpus Initiative (ECI) | 46 subcorpora in 27 (mainly European) languages |
| | MULTEXT JOC Corpus | 5 million words | the European Community | English, French, German, Italian and Spanish |
| | MULTEXT-East "1984" annotated corpus 4.0 | 100,000 English words and translations in 9 languages | Erjavec, Tomaž; Barbu, Ana-Maria; Derzhanski, Ivan; Dimitrova, Ludmila; Garabík, Radovan; Ide, Nancy; Kaalep, Heiki-Jaan; Kotsyba, Natalia; Krstev, Cvetana; Oravecz, Csaba; Petkevič, Vladimír; Priest-Dorman, Greg; QasemiZadeh, Behrang; Radziszewski, Adam; Simov, Kiril; Tufiş, Dan; Zdravkova, Katerina | the Multext-East parallel corpus consists of the English original of George Orwell's novel '1984' together with its translations into the nine project languages: Bulgarian, Czech, Estonian, Hungarian, Lithuanian, Romanian, Russian, Serbian, and Slovene |
| | Multilingual Corpora for Cooperation (MLCC) | approximately 10.2 million words in total | LTG, Edinburgh and ISSCO, with coordination by CNR, Pisa | two main components: the Polylingual Document Collection, and a Multilingual Parallel Corpus consisting of translated data in nine European languages |
| | The Child Language Data Exchange System (CHILDES) | 180 million characters (ca. 20 million words) | | a system for sharing and studying conversational interactions |
| | The CLUVI (Linguistic Corpus of the University of Vigo) parallel corpus | 25 million words | SLI (Computational Linguistics Group of the University of Vigo) | main components are the TECTRA Corpus of English-Galician literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of Galician-Spanish legal texts, the UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation texts, the LOGALIZA Corpus of English-Galician software localization, and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information |
| | The EMILLE Corpus | monolingual corpora 92,799,000 words; parallel corpus 200,000 words of text in English and its accompanying translations | Lancaster University, UK, and the Central Institute of Indian Languages (CIIL), Mysore, India | three components: monolingual, parallel and annotated corpora |
| | The Oslo Multilingual Corpus (OMC) | 2.6 million words | University of Oslo, the University of Bergen | original texts and translations from several languages: Norwegian, English, French, German, Dutch, Portuguese, Swedish and Finnish |
| | The Czech National Corpus | 26 corpora, in which the SYN series achieves 2232 million | Institute of the Czech National Corpus (ICNC), Faculty of Arts, Charles University in Prague | written corpora, spoken corpora, parallel corpus, specialized corpora |
| | CELEX2 | | | English, German, Dutch |
| | European Parliament Proceedings Parallel Corpus 1996-2011 | | Philipp Koehn | |
| | Corpora Collection | 425,703,278 tokens | | |
| | The RuN-Euro corpus | 8,763,402 words | the RuN project (2008-2010) at the University of Oslo | a parallel corpus originally consisting of Norwegian and Russian texts; other European languages are currently being added |
| Bilingual Parallel | Hong Kong Parallel Text | approximately 59 million English words and 49 million Chinese words (or 98 million Chinese characters) | Xiaoyi Ma | three sub-corpora, namely Hong Kong Hansards, Hong Kong Laws and Hong Kong News |
| | Parallelum Slovaco-Latinum Corpus | 32,000 sentences in Slovak and 29,000 sentences in Latin | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Latin |
| | Slovak-Bulgarian Parallel Corpus | 163 million tokens (78 million in the Slovak half, 85 million in the Bulgarian half) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Bulgarian |
| | Slovak-Czech Parallel Corpus | 418.5 million tokens (209.2 million in the Slovak half, 209.3 million in the Czech half) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Czech |
| | Slovak-English parallel corpora | 556 million tokens (261 million in the Slovak half, 295 million in the English half) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | |
| | Slovak-French parallel corpora | 441.5 million tokens (213.3 million in the Slovak part, 228.2 million in the French part) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-French |
| | Slovak-German Parallel Corpus | 446.2 million tokens (219.8 million in the Slovak half, 226.4 million in the German half) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-German |
| | Slovak-Hungarian Parallel Corpus | 99 million tokens (51 million in the Slovak half, 48 million in the Hungarian half) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Hungarian |
| | Slovak-Russian parallel corpora | 8.45 million tokens (4.2 million in the Slovak part, 4.25 million in the Russian part) | Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences | Slovak-Russian |
| | TED English Chinese parallel corpus of speeches | 6,187,849 English words and Chinese characters | Jiajin Xu | |
| | The Babel Chinese-English Parallel Corpus | 20 million Chinese characters and 10 million English words | the Institute of Computational Linguistics of Beijing University | written; 327 English articles and their translations into Mandarin Chinese |
| | The Canadian Hansard Corpus - USC version | 1.3 million pairs of aligned text chunks, 2 million words in English and French each | Ulrich Germann | spoken and written texts in English and French from the Canadian Parliament |
| | The English-Norwegian Parallel Corpus (ENPC) | 2.6 million words | Stig Johansson, Knut Hofland | original texts and their translations (English to Norwegian and Norwegian to English); includes both fiction and non-fiction |
| | The English-Swedish Parallel Corpus (ESPC) | 2.8 million words | the Departments of English at the Universities of Lund and Gothenburg | original texts and their translations (English to Swedish and Swedish to English); both fictional and non-fictional texts are included |
| | The IJS-ELAN Slovene-English Parallel Corpus (IJS-ELAN) | one million words | the Dept. of Knowledge Technologies, Jožef Stefan Institute | 15 parallel Slovene-English/English-Slovene texts |
| Others | Romance Phonetics Database | | | an on-line research and teaching tool containing tagged sound samples (both individual words and passages) illustrative of various segmental and prosodic aspects of Romance phonetics and phonology |
| | Links to non-English language corpora | | Stanford University | |
| | Corpora list | | The Humboldt University of Berlin | |