CALPER Corpus Portal

Language	Name of Corpus	Size	Author/Institution	Info
Arabic	Khaleej-2004 corpus	3 million words
	Watan-2004 corpus	20,000 articles
Basque	XX Century Basque language corpora
British Sign Language	British Sign Language Corpus Project	249 Deaf people filmed in 8 cities across the United Kingdom	led by staff at the Deafness Cognition and Language Research Centre (DCAL) at University College London, but also included researchers from Bangor University (Wales), Heriot-Watt University (Scotland), Queen's University Belfast (Northern Ireland) and the University of Bristol (England)	a collection of video clips showing deaf people using BSL
Catalan	Corpus del català contemporani			a corpus of contemporary colloquial Catalan
Croatian	Croatian National Corpus (HNK)	216.8 million tokens	the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb
Czech	The Prague Dependency Treebank	1.8 million words		texts drawn from the Czech National Corpus, annotated morphologically and syntactically
	Český národní korpus (ČNK)			the Czech national corpus
Danish	Korpus 90 for Danish	32 million tokens	Society for Danish Language and Literature
	Korpus 2000 for Danish	30 million tokens	Society for Danish Language and Literature
	Korpus 2010 for Danish	45 million tokens	Society for Danish Language and Literature
Dutch	The Institute for Dutch Lexicology (INL) - The Words Corpus 1996	38 Million
English	British National Corpus (BYU-BNC)	100 million words	originally created by Oxford University Press	written; texts from the 1980s through 1993
	Michigan Corpus of Academic Spoken English (MICASE)	approximately 1.8 million words	University of Michigan	transcripts
	Corpus of Contemporary American English (COCA)	520 million words	Mark Davies	written; texts of various genres from 1990 to 2015
	The Brown Corpus	1 million words	W. Nelson Francis and Henry Kučera at Brown University	written; text of edited English prose printed in the U.S.
	The Open American National Corpus (Second Release)	22 million words	Nancy Ide, Keith Suderman, Vassar College	written and spoken
	The Griffith Corpus of Spoken Australian English (GCSAusE)	32,134 words	Griffith University	Australian English, a collection of forty audio recordings and transcriptions of spoken interaction
	Anthology Reference Corpus	49,348,397 words	the Association for Computational Linguistics (ACL)	10,291 research papers in computational linguistics
	Cambridge English Corpus	multi-billion words	Cambridge University Press	written, spoken and learner texts
	CLiC Dickens project	3,835,807 words	University of Nottingham, University of Birmingham	literary texts
	The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English	106,210 words	Susan Pintzuk, Eric Haeberli, Ans van Kemenade, Willem Koopman, and Frank Beths	a selection of texts from the Old English Section of the Helsinki Corpus of English Texts, annotated to facilitate searches on lexical items and syntactic structure
	The Cambridge and Nottingham Corpus of Discourse in English (CANCODE)	five million words	Mike McCarthy, Ronald Carter	corpus of spoken interaction
	The edited Polytechnic of Wales (POW) corpus	65,000 words	O’Donoghue, Tim	spoken English, an individual interview with the same "friendly" adult for each child, in which the child's favourite games or TV programmes were discussed; 120 children
	The Lancaster-Oslo/Bergen Corpus (LOB Corpus)	1 million	Geoffrey Leech, Stig Johansson, Knut Hofland, Roger Garside	British English, 500 texts of c. 2,000 words, distributed across 15 text categories, 9 informative and 6 imaginative
	The Louvain Corpus of Native English Essays (LOCNESS)	324,304 words in total	the Centre for English Corpus Linguistics (CECL), Université catholique de Louvain, Belgium	British pupils’ A level essays: 60,209 words; British university students’ essays: 95,695 words; American university students’ essays: 168,400 words
	The Manually Annotated Sub-Corpus (MASC)	500,000 words	Nancy Ide, Keith Suderman	contemporary American English written and spoken data drawn from the Open American National Corpus (OANC)
	The Penn Parsed Corpora of Historical English		PPCME2: Kroch, Anthony, and Ann Taylor; PPCEME: Kroch, Anthony, Beatrice Santorini, and Lauren Delfs; PPCMBE2: Kroch, Anthony, Beatrice Santorini, and Ariel Diertani	the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2)
	The Saarbrücken Corpus of Spoken English (SCoSE)		the Department of English at Saarland University	Spoken English
	The Wellington Corpus of Spoken New Zealand English (WSC)	one million words	Janet Holmes, Bernadette Vine	Spoken New Zealand English
	The Wellington Corpus of Written New Zealand English	1 million	the Department of Linguistics at Victoria University of Wellington	written New Zealand English collected from writings published in the years 1986 to 1990
	The York-Helsinki Parsed Corpus of Old English Poetry	71,490 words of Old English text samples	Susan Pintzuk, Leendert Plug	a selection of poetic texts from the Old English section of the Helsinki corpus of English texts, annotated to facilitate searches on syntactic structure and lexical items
	The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE)	1.5 million words	Taylor, Ann (ed.); Warner, Anthony (ed.); Pintzuk, Susan (ed.); Beths, Frank (ed.)	syntactically-annotated corpus of Old English prose texts
	The Intonational Variation in English (IviE) corpus	36 hours of speech data	Phonetics Laboratory, University of Oxford; Department of Linguistics, University of Cambridge	recordings of nine urban dialects of English spoken in the British Isles
	A Corpus of Late Modern English Prose	100,000 words	Denison, David	a collection of five 20,000-word block samples of Late Modern English
	A Representative Corpus of Historical English Registers	1.7 million words	a consortium of participants at fourteen universities in seven countries	a multi-genre corpus of British and American English covering the period 1600-1999
	AustLit	4,234,314 words	University of Queensland	Australian English, select samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s
	Australian component of the International Corpus of English (ICE-AUS)	1,055,919 words	Macquarie University	Australian English, transcribed spoken and written Australian English from 1992-1995
	Australian Radio Talkback (ART)	251,677 words	Pam Peters	Australian English, transcribed recordings of samples of national, regional and commercial Australian talkback radio from 2004 to 2006
	Braided Channels Research Collection	363,670 words	Trish FitzSimons	Australian English, 70 hours of oral history interviews with women from Australia's Channel Country, together with archival film, transcripts, photos and music
	British Academic Written English Corpus (BAWE)	8,336,262 tokens	the universities of Warwick, Reading and Oxford Brookes	just under 3000 good-standard student assignments
	British National Corpus	100 million words	BNC Consortium	British English, both spoken and written from the late twentieth century; written texts (90%) and transcripts of speech (10%)
	Corpus of Early English Correspondence Sampler (CEECS)	450,085 words	Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki	covers the years 1418-1680 and consists of 1147 letters written by 194 writers
	Early English Books Online -TCP	48,339 books	the Text Creation Partnership, ProQuest and more than 150 libraries	consists of the works represented in the English Short Title Catalogue I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement. Together these trace the history of English thought from the first book printed in English in 1475 through to 1700
	Eighteenth Century Collections Online-TCP	over 180,000 titles (200,000 volumes)	the Text Creation Partnership, Gale	includes every significant English-language and foreign-language title printed in the United Kingdom during the 18th century, along with thousands of important works from the Americas
	Evans Early American Imprint Collection-TCP	40,000 titles	the Text Creation Partnership, NewsBank/Readex Co., and the American Antiquarian Society	contains the full text of all known existing books, pamphlets, and broadsides printed in the United States (or British American colonies prior to Independence) from 1639 through 1819, some 72,000 titles.
	FLOB	1 million	Christian Mair, Albert Ludwigs-Universität Freiburg	British English
	Frown	1 million	Christian Mair, Albert Ludwigs-Universität Freiburg	American English
	Korpus of Early Modern Playtexts in English (KEMPE)	10.7 million tokens	Lene B. Petersen and Marcus X. Dahl (University of Bristol, UK) in association with the VISL project (University of Southern Denmark)	grammatically annotated with token based tags at the morphological/PoS and syntactic levels
	La Trobe Corpus of Spoken Australian English (LTCSAusE)	49,133 words	La Trobe University	Australian English, audio recordings (one recording per conversation) and transcripts
	Michigan Early Modern English Materials	36,000 modal verb entries	Richard W. Bailey, Jay L. Robinson, James W. Downer, and Patricia V. Lehman	consist of citations collected for the modal verbs and certain other English words for the Early Modern English Dictionary
	Mitchell and Delbridge collection	16 items	University of Sydney	Australian English, audio recordings of spoken wordlists and monologues
	Monash Corpus of English (MCE)	95,584 words	Monash University	Australian English, audio recordings and transcripts of spoken interviews
	Parsed Corpus of Early English Correspondence (PCEEC)	2,159,132 words	University of Helsinki and University of York	letters
	RCPCE profession-specific corpora	8 corpora	the Research Centre for Professional Communication in English of the Hong Kong Polytechnic University	Hong Kong Corpus of Spoken English
	The Academic Corpus	3.5 million words	Victoria University of Wellington	414 academic texts from a variety of subject areas
	The Aix-MARSEC database	5 hours of speech data	Cyril Auran, Savoirs, textes et langage, Caroline Bouzon, Savoirs, textes et langage, Céline De Looze, Savoirs, textes et langage, Daniel Hirst, Laboratoire parole et langage	spoken British English, 5 hours of BBC recordings together with annotations at several linguistic levels
	The Australian Corpus of English/The Macquarie corpus	757,024 words	Macquarie University	Australian English, published texts taken from 15 different categories of nonfiction and fiction
	The Bergen Corpus of London Teenage Language	a million words	the University of Bergen	speech of teenagers
	The British Academic Spoken English (BASE)	1,644,942 tokens	Hilary Nesi, Paul Thompson	British Academic Spoken English, 160 lectures and 39 seminars recorded in a variety of university departments
	The British component of the International Corpus of English (ICE-GB)	one million words	Gerald Nelson at the Chinese University of Hong Kong	the British component of the International Corpus of English (ICE)
	The CHRISTINE corpus	approximately 80,500 words	Geoffrey Sampson	spoken data
	The Corpus of English Dialogues 1560–1760 (CED)	1.3 million words	the Arts and Humanities Data Service (AHDS) , University of Oxford	dialogues from 1560 to 1760
	The Corpus of Middle English Prose and Verse	sixty-two texts	The Humanities Text Initiative	Middle English texts
	The Corpus of Oz Early English (COOEE)	1,545,163 words	Clemens Fritz	texts written in Australia, New Zealand or Norfolk Island, or by native Australians on travels, between 1788 and 1900
	The Corpus of Professional English	17 million words	Shogakukan Corpus Network	English academic journal texts in science, engineering, technology and other fields
	The Corpus of Spoken, Professional American-English (CSPA)	two million words of speech	Michael Barlow	a selection of existing transcripts of interactions in professional settings
	The Dictionary of Old English Corpus in Electronic Form	over three million words	Centre for Medieval Studies, University of Toronto	a complete record of surviving Old English except
	The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose	about 300,000 words	David Denison in collaboration with Linda van Bergen (from 1998) and Joana Soliva (formerly Proud) (from 1999)	local English letters on practical subjects, dated 1761-90
	The Helsinki Corpus of English Texts	1.5 million words	Matti Rissanen	covers a thousand years of English texts, from the eighth to the beginning of the eighteenth century
	The Innsbruck Computer Archive of Machine-Readable English Texts (ICAMET)	7.8 million words	Markus, Manfred	three subsections, namely the Prose Corpus 1100-1500 (a full-text database), the Letter Corpus 1386-1688 (containing 254 complete letters from different sources, arranged diachronically), and the Prose Varia Corpus (a mixture of tagged, normalized, translated and otherwise manipulated or synopsized texts)
	The International Corpus of English (ICE)	long-term aim: twenty one million words	twenty-three research teams around the world	All ICE corpora contain 500 texts of approximately 2,000 words each, sampled from a wide range of spoken (60%) and written (40%) genres
	The Lampeter Corpus of Early Modern English Tracts	1.1 million words	the English Department at Helsinki University and the Department of Linguistics & Modern English Language at Lancaster University	a collection of texts on various subject matter published between 1640 and 1740
	The London-Lund Corpus of Spoken English (LLC)	500,000 words	Jan Svartvik, Lund University	Spoken British English
	The LUCY corpus	165,000 words	Geoffrey Sampson	present-day British written English
	The Machine Readable Spoken English Corpus		the School of Linguistics at Reading University
	The Penn Treebank (PTB)	over 4.5 million words	Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor	selected 2,499 stories from a three year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation
	The Santa Barbara Corpus of Spoken American English (SBCSAE)	approximately 249,000 words	the Linguistics Department of the University of California, Santa Barbara	recordings of naturally occurring spoken interaction from all over the United States
	The SUSANNE Corpus (an acronym for “surface and underlying structural analysis of natural English”)	a 130,000 word sub-sample	Geoffrey Sampson	a subset of the Brown Corpus of American English
	The Switchboard Corpus (SWB)	three million words (over 240 hours of recordings)	John J Godfrey, Edward Holliman	Spoken American English, approximately 2,400 telephone conversations between unacquainted adults
	The Zurich English Newspaper Corpus	1.6 million words	University of Zurich	English newspapers published between 1661 and 1791
	The Reading Academic Text (RAT) corpus	a million words	University of Reading	a collection of academic texts, written by academic staff or students at the University of Reading
French	The Project for American and French Research on the Treasury of the French Language (ARTFL)	150 million words	a cooperative project by the Centre National de la Recherche Scientifique and the University of Chicago.
	Un corpus d’entretiens spontanés	95 conversations/speakers	Kate Beeching, University of the West of England
German	COSMAS II (Corpus Search, Management and Analysis System)	two billion words, of which over 1.1 billion are publicly available free of charge
	The core corpus of the 20th century	100 million tokens	Berlin-Brandenburg Academy of Sciences and Humanities (BBAW), Berlin
	The Berliner Zeitung corpus	252 million tokens		the complete set of articles published online between January 1994 and December 2005
	Wortschatz-Portal
	LIMAS
	The Hamburg Dependency Treebank	part A, 101,999 sentences; part B, 104,795 sentences; part C, 55,027 sentences	Wolfgang Menzel	the largest dependency treebank available; consists of dependency annotations, based on sentences sourced from the German news site heise.de, from articles published between 1996 and 2001.
	Korpus Südtirol			text corpus of South Tyrolean German
Hungarian	The Hungarian National Corpus (HNC)	187.6 million words	Department of Corpus Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences, Hungarian Language Offices	5 subcorpora: Hungary, Slovakia, Subcarpathia, Transylvania, Vojvodina; 5 text genres: press, literature, science, official, personal
	Hungarian Webcorpus	1.48 billion words unfiltered		The largest Hungarian language corpus, available in its entirety under a permissive Open Content license.
Icelandic	The Malromur corpus	About 120,000 voice samples from 592 individuals	Sigrún Helgadóttir	an open source corpus of Icelandic voice samples
	The Tagged Icelandic Corpus (MÍM)	25 million tokens	Sigrún Helgadóttir	contemporary Icelandic texts
Irish	Tobar na Gaedhilge (‘The source of Irish’)	2.5 million words	Ciarán Ó Duibhín
	Corpus Náisiúnta na Gaeilge / The National Corpus of Irish	8 million words	the Royal Irish Academy
Italian	CORIS/CODIS - corpus of written Italian	130 million words	R. Rossini Favretti	CORIS is a corpus of written Italian; CODIS is a further corpus aimed at specialist needs that allows the selection of the subcorpora pertinent to a specific research project, as well as the size of each subcorpus.
	Banca dati dell'italiano parlato (BADIP)	490,000 words	a group of linguists under the direction of Tullio De Mauro, in collaboration with IBM Italy	spoken Italian
	Link to a list of Italian corpora		by Institute of Cognitive Sciences and Technologies
Korean	Korean National Corpus	goal: 200 million eojuls
Mandarin Chinese	Sinica Treebank	361,834 words	Academia Sinica	Mandarin Chinese as used in Taiwan, extracted from the Sinica Corpus
	Chinese Treebank 9.0	3,247,331 characters (hanzi or foreign)	Nianwen Xue, Xiuhong Zhang, Zixin Jiang, Martha Palmer, Fei Xia, Fu-Dong Chiou, Meiyu Chang	annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech
	The Lancaster Corpus of Mandarin Chinese	1 million	McEnery, A.M. (ed.); Xiao, Richard (ed.)
	Academia Sinica Balanced Corpus (ASBC) of Modern Chinese/Sinica Corpus	10 million words	Academia Sinica	Mandarin Chinese as used in Taiwan, texts published from 1981 to 2007
	The Modern Chinese Language Corpus (MCLC)/ Sinica Corpus	11,245,330 word tokens	Academia Sinica
	CCL (Center for Chinese Linguistics PKU) Corpus	783,463,175 tokens	Center for Chinese Linguistics PKU
	cncorpus	19,455,328 tokens	State Language Commission
	The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)		Dr. Richard Xiao (UCREL of Lancaster University) and Professor Hongyin Tao (University of California Los Angeles)	a corpus of spoken Mandarin Chinese. The corpus is composed of 1,002,151 words of dialogues and monologues, both spontaneous and scripted, in 73,976 sentences and 49,670 utterance units (paragraphs)
Modern Greek	The Hellenic National Corpus	34 million words	The Institute for Language and Speech Processing	written texts
Persian	Uppsala Persian Corpus (UPC)	2,704,028 tokens	Mojgan Seraji	a modified version of the Bijankhan corpus (Bijankhan, 2004) with additional sentence segmentation and consistent tokenization
Polish	The Polish National Corpus	1.5 billion words	Institute of Computer Science at the Polish Academy of Sciences (coordinator), Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź	on the PELCRA (Polish and English Language Corpora for Research and Application) project
Portuguese	The CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público)	180 million words		corpus of newspaper text from the daily Portuguese newspaper Público
	Linguateca		the Portuguese Ministry of Science and Technology	96 corpora
	O Corpus do Português	two subcorpora: 45 million words and 1 billion words	Mark Davies, BYU	two different parts: the (original, smaller) corpus that allows you to look at historical changes and genre-based variation; the (new, much larger) corpus that you can use to look at dialectal variation (and have 50x as much data for Modern Portuguese)
	Tycho Brahe Parsed Corpus of Historical Portuguese	76 texts (3,303,196 words) are available	Galves, Charlotte, and Pablo Faria	texts written in Portuguese by authors born between 1380 and 1881
Russian	Russian National Corpus	300 million words	Institute of Russian language, Russian Academy of Sciences	6 subcorpora: The Deeply Annotated corpus, The Parallel Corpora, The Dialectal corpus, The Poetry corpus, The Educational corpus, The Corpus of Spoken Russian
	The Helsinki Annotated Corpus of Russian Texts HANCO	100,000 running words	the Department of Slavonic and Baltic Languages and Literatures at the University of Helsinki	extracted from a modern Russian magazine
	Open Corpus of the Russian language
	Stories about dreams and other corpora of spoken language	four subcorpora, ranging from 5,000 to 14,000		spontaneous informal spoken discourse
	Corpus of Standard Written Russian
	Computer corpus of texts			retrieved from newspapers of the late 20th century
Tatar	Corpus of Written Tatar	116 million word occurrences
Turkish	Turkish National Corpus	50 million words		samples of textual data across a wide variety of genres covering a period of 20 years (1990-2009); both written data and transcription from spoken data
Scottish English	The Scottish Corpus of Texts and Speech (SCOTS)	4.6 million words	John Corbett	Written and Spoken, with audio recordings to accompany many of the spoken texts
	The Corpus of Modern Scottish Writing (CMSW)	5.5 million words	John Corbett, Jeremy Smith	written and printed texts from the period 1700-1945
Slovak	Corpus of Slovak Wikipédia and Necyklopédia	42 615 597 tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	written
	Corpus of Spoken Slovak	5.72 million tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	spoken
	Corpus of Copywriting Texts on the Web	1 648 229 tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
	Corpus of Economic Texts	165 million tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
	Corpus of Legal Texts	146 million tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
	Corpus of Religious Texts	66 million tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
	Corpus of Social Science Texts	38 616 514 tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
	The Slovak National Corpus	publicly available 1250 million tokens	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
	Slovak Terminology Database	6000 terms	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
Spanish	Corpus de Referencia del Español Actual	133 million	Real Academia Española	written (90%) and spoken (10%)
	COLA (Corpus Oral de Lenguaje Adolescente Resource)	751168 tokens	University of Bergen	a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile
	the Corpus del Español	two subcorpora: 100 million words and 2 billion words	Mark Davies, BYU	two parts: the (original, smaller) corpus that allows you to look at historical changes and genre-based variation; the (new, much larger) corpus that you can use to look at dialectal variation (and have 100x as much data for Modern Spanish).
Swedish	The Bank of Swedish	about 12.5 billion tokens		a linguistic reference databank at the University of Gothenburg
Welsh	The CEG (Cronfa Electroneg o Gymraeg) corpus	one million words	Ellis, O'Dochartaigh & Hicks of the Welsh IT Unit and the School of Psychology, University of Wales, Bangor	modern (mainly post 1970) Welsh prose writing
Turkish	TS Wikipedia	approximately 1.6 million		processed Turkish Wikipedia pages
Multilingual	European Corpus Initiative Multilingual Corpus I (ECI/MCI)	98 million words	European Corpus Initiative (ECI)	46 subcorpora in 27 (mainly European) languages
	MULTEXT JOC Corpus	5 million words	the European Community	English, French, German, Italian and Spanish
	MULTEXT-East "1984" annotated corpus 4.0	100,000 English words and translations in 9 languages	Erjavec, Tomaž; Barbu, Ana-Maria; Derzhanski, Ivan; Dimitrova, Ludmila; Garabík, Radovan; Ide, Nancy; Kaalep, Heiki-Jaan; Kotsyba, Natalia; Krstev, Cvetana; Oravecz, Csaba; Petkevič, Vladimír; Priest-Dorman, Greg; QasemiZadeh, Behrang; Radziszewski, Adam; Simov, Kiril; Tufiş, Dan; Zdravkova, Katerina	The Multext-East parallel corpus consists of the English original of George Orwell's novel '1984' together with its translations into the nine project languages: Bulgarian, Czech, Estonian, Hungarian, Lithuanian, Romanian, Russian, Serbian, and Slovene.
	Multilingual Corpora for Cooperation (MLCC)	totaling approximately 10.2 million words	LTG, Edinburgh and ISSCO with coordination by CNR, Pisa	two main components: the Polylingual Document Collection, and a Multilingual Parallel Corpus consisting of translated data in nine European languages
	The Child Language Data Exchange System (CHILDES)	180 million characters (ca. 20 million words)		a system for sharing and studying conversational interactions
	The CLUVI (Linguistic Corpus of the University of Vigo) parallel corpus	25 million words	SLI (Computational Linguistics Group of the University of Vigo)	main components are the TECTRA Corpus of English-Galician literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of Galician-Spanish legal texts, the UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation texts, the LOGALIZA Corpus of English-Galician software localization, and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information
	The EMILLE Corpus	monolingual corpora 92,799,000 words; parallel corpus 200,000 words of text in English and its accompanying translations	Lancaster University, UK, and the Central Institute of Indian Languages (CIIL), Mysore, India	three components: monolingual, parallel and annotated corpora
	The Oslo Multilingual Corpus (OMC)	2.6 million words	University of Oslo, the University of Bergen	original texts and translations from several languages: Norwegian, English, French, German, Dutch, Portuguese, Swedish and Finnish
	The Czech National Corpus	26 corpora, of which the SYN series reaches 2,232 million	Institute of the Czech National Corpus (ICNC), Faculty of Arts, Charles University in Prague	written corpora, spoken corpora, parallel corpus, specialized corpora
	CELEX2			English, German, Dutch
	European Parliament Proceedings Parallel Corpus 1996-2011		Philipp Koehn
	Corpora Collection	425,703,278 tokens
	The RuN-Euro corpus	8,763,402 words	the RuN project (2008-2010) at the University of Oslo	a parallel corpus originally consisting of Norwegian and Russian texts, and other European languages are currently being added
Bilingual Parallel	Hong Kong Parallel Text	approximately 59 million English words and 49 million Chinese words (or 98 million Chinese characters)	Xiaoyi Ma	three sub-corpora, namely Hong Kong Hansards, Hong Kong Laws and Hong Kong News
	Parallelum Slovaco-Latinum Corpus	32000 sentences in Slovak and 29000 sentences in Latin	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	Slovaco-Latin
	Slovak-Bulgarian Parallel Corpus	163 million tokens, 78 million in the Slovak half, 85 million in the Bulgarian one	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	Slovak-Bulgarian
	Slovak-Czech Parallel Corpus	418.5 million tokens, 209.2 million in the Slovak half, 209.3 million in the Czech half	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	Slovak-Czech
	Slovak-English parallel corpora	556 million tokens, 261 million tokens in the Slovak half, 295 million tokens in the English one	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences
	Slovak-French parallel corpora	441.5 million tokens, 213.3 million in the Slovak part and 228.2 million in the French part	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	Slovak-French
	Slovak-German Parallel Corpus	446.2 million tokens (219.8 million tokens in the Slovak half, 226.4 million tokens in the German half)	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	Slovak-German
	Slovak-Hungarian Parallel Corpus	99 million tokens (51 million in the Slovak half, 48 million in the Hungarian half)	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	Slovak-Hungarian
	Slovak-Russian parallel corpora	8.45 million tokens, 4.2 million in the Slovak part and 4.25 million in the Russian part	Department of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences	Slovak-Russian
	TED English Chinese parallel corpus of speeches	6,187,849 English words and Chinese characters	Jiajin Xu
	The Babel Chinese-English Parallel Corpus	20 million Chinese characters and 10 million English words	the Institute of Computational Linguistics of Beijing University	written, 327 English articles and their translations into Mandarin Chinese
	The Canadian Hansard Corpus - USC version	1.3 million pairs of aligned text chunks, 2 million words in English and French each	Ulrich Germann	spoken and written texts in English and French from the Canadian Parliament
	The English-Norwegian Parallel Corpus (ENPC)	2.6 million words	Stig Johansson, Knut Hofland	original texts and their translations (English to Norwegian and Norwegian to English) and includes both fiction and non-fiction
	The English-Swedish Parallel Corpus (ESPC)	2.8 million words	the Departments of English at the Universities of Lund and Gothenburg	original texts and their translations (English to Swedish and Swedish to English); both fictional and non-fictional texts are included
	The IJS-ELAN Slovene-English Parallel Corpus (IJS-ELAN)	one million words	the Dept. of Knowledge Technologies, Jožef Stefan Institute	15 parallel Slovene-English/English-Slovene texts
Others	Romance Phonetics Database			an on-line research and teaching tool containing tagged sound samples (both individual words and passages) illustrative of various segmental and prosodic aspects of Romance phonetics and phonology
	links to non-English language corpora		Stanford University
	corpora list		The Humboldt University of Berlin