CALPER Corpus Portal

What is a corpus?

A corpus (plural: corpora) is a principled collection of samples of natural language use, either written or spoken, usually stored as computer files. A written corpus can be gathered from a number of sources such as news media, literary works, or personal writings. A spoken corpus can be assembled from audio- or video-recorded narratives, interviews, conversations and the like, which are then transcribed into written texts. The size of a corpus can range from a few thousand words to tens of millions. Larger corpora are usually required for big research projects such as writing dictionaries and major grammars, but so-called “mini corpora” consisting of several thousand words can be extremely useful for language teachers. Once a corpus is built, we can use software tools to analyze it and produce word frequency lists, concordances and other useful types of output.

How can corpora be useful for Chinese language teaching?

There are many ways in which language teachers can benefit from language corpora. For example, from corpora you can find out how frequently words (or characters) are used in different discourse contexts by way of frequency lists. Table 1 below compares the frequency lists of a spoken corpus (conversation) and a written one (newspaper reports), and we can easily tell how they differ in character frequency.

Table 1: Character frequencies in natural language

For one thing, while pronouns wo, ni, ta (我, 你, 他) are common in conversations, they do not appear at all in the first 20 most frequent characters in newspaper reports. Although this finding may seem obvious since news reports are normally impersonal, this type of information is not trivial and can even be very useful for textbook design.
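A character frequency list of the kind discussed here can be produced with a few lines of Python. The following is a minimal sketch, not the tool used to build Table 1: the function name `char_frequency` and the sample sentence are made up for illustration, and "character" is taken to mean anything in the basic CJK Unified Ideographs range.

```python
from collections import Counter


def char_frequency(text, top_n=20):
    """Return the top_n most frequent Chinese characters with their counts."""
    # Keep only characters in the basic CJK Unified Ideographs block,
    # so punctuation, digits, Latin letters and spaces are ignored.
    hanzi = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    return Counter(hanzi).most_common(top_n)


# Hypothetical sample text, for illustration only.
sample = "我想你也知道，他不想去。你想去吗？"

for char, count in char_frequency(sample, top_n=3):
    print(char, count)
```

With a real corpus, `sample` would be replaced by the contents of one or more text files, and the resulting list could be inspected for items such as the personal pronouns mentioned above.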

If a textbook is available in electronic form, we can run a frequency list and compare it with those of natural language corpora, as illustrated in Table 2. Here, if we judge only by the occurrence of personal pronouns, we can be fairly sure that this particular textbook focuses more on speaking than on writing. And indeed the two frequency lists look quite similar, apparently differing only in ranking. However, if we look closer, we can find notable differences in the use of certain particles and lexical items, which invite further investigation. For example, the particle jiu 就 is highly frequent in natural conversation but does not appear in the textbook’s list, and the differing frequencies of the demonstratives zhe 这 and na 那 in the two corpora are also interesting.

Table 2: Character frequencies in natural language and a textbook
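The kind of comparison summarized in Table 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `top_chars` and `compare_top_lists` and the two tiny sample texts are hypothetical, and the comparison looks only at which characters appear in each top-n list, not at their exact ranks.

```python
from collections import Counter


def top_chars(text, n):
    """Return the n most frequent Chinese characters in text, most frequent first."""
    hanzi = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    return [ch for ch, _ in Counter(hanzi).most_common(n)]


def compare_top_lists(corpus_a, corpus_b, n=20):
    """Return (only_in_a, only_in_b, shared) for the two top-n character lists."""
    top_a = top_chars(corpus_a, n)
    top_b = top_chars(corpus_b, n)
    only_in_a = [ch for ch in top_a if ch not in top_b]
    only_in_b = [ch for ch in top_b if ch not in top_a]
    shared = [ch for ch in top_a if ch in top_b]
    return only_in_a, only_in_b, shared


# Hypothetical miniature "conversation" and "textbook" texts.
conversation = "我就去，你就来，他就走"
textbook = "我去学校，你来学校，他走了"

only_conv, only_text, shared = compare_top_lists(conversation, textbook, n=3)
print("Only in conversation list:", only_conv)
print("Only in textbook list:", only_text)
print("In both lists:", shared)
```

Characters that show up in only one of the two lists (such as 就 in this toy example) are natural candidates for the closer inspection described above.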

A key instrument for investigating the actual use of a particular linguistic structure or lexical item in real language corpora is a concordance, sometimes referred to as KWIC (Key Word In Context). A concordance lists all the occurrences of a search word (the key word) in a corpus. Typically the key word is centered on a line, with its context displayed on either side. The lines can be sorted according to different criteria (usually by the first word to the right of the key word) to make patterns easier to see, and the size of the context window can be adjusted to the user’s needs. In the following we use three examples to illustrate this useful tool.
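Before turning to the examples, here is a minimal Python sketch of how a KWIC display can be produced. It is not the software used for the concordances on this page. The function names and the sample text are hypothetical, the context window is measured in characters rather than words, and sorting uses Python's default string order (Unicode code points), which is not the same as pinyin or stroke order.

```python
def kwic(text, keyword, window=8):
    """Return a (left, keyword, right) triple for every occurrence of keyword."""
    lines = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - window):start]
        right = text[end:end + window]
        lines.append((left, keyword, right))
        start = text.find(keyword, end)
    return lines


def sort_by_right_context(lines):
    """Sort concordance lines by the text immediately to the right of the key word."""
    return sorted(lines, key=lambda line: line[2])


def format_kwic(lines, window=8):
    """Pad the left context with full-width spaces so the key words line up."""
    return [f"{left.rjust(window, chr(0x3000))} [{kw}] {right}"
            for left, kw, right in lines]


# Hypothetical sample text, for illustration only.
text = "他把一个苹果吃了。我把那本书看完了。她把一些旧衣服送人了。"

for line in format_kwic(sort_by_right_context(kwic(text, "把", window=4)), window=4):
    print(line)
```

Changing `window` adjusts how much context is shown, and replacing the sort key (for example with `line[0][::-1]`) would sort by the left-hand context instead.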

Example 1: ba3 (把)

Many textbooks describe the ba (把) construction as requiring a definite noun as its object (e.g., ta ba naige pingguo chi le 她把那个苹果吃了 ‘She ate the apple’), ignoring the many instances of ba constructions with indefinite objects. Although some researchers have rightly pointed out that indefinite objects may also be used with ba, their examples usually sound quite contrived. This is where a collection of real texts comes in handy. With a concordance of the key word, we can easily find a number of good examples of the construction in question. The following is a sample concordance of 把 from a corpus of news wire texts, sorted by the first word to the right of the key word, which shows how common “ba + indefinite object” is.

Example 2: qishi (其实, ‘actually’)

This next set of concordance lines shows that qishi (其实, ‘actually’) is more often used as a discourse conjunction (located outside the main clause and linking large domains of discourse) than as a constituent conjunction (inside a clause and linking clause-internal elements). We can tell the difference between the two uses simply by looking at the display of the concordance lines. In the discourse conjunction use, for example, there are often punctuation marks before and after the key word, reflecting the independent status of the item in question. By contrast, the constituent conjunction is usually embedded in a clause as several of the following excerpts indicate.
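The punctuation cue described above can be turned into a rough first-pass filter. The Python sketch below is a heuristic only, with hypothetical names and a made-up sample: it labels an occurrence "discourse-like" when it is set off by punctuation (or the start of the text) on the left and by punctuation on the right, and "clause-internal" otherwise. Every labeled line would still need to be checked by a human reader.

```python
# Common Chinese and Western punctuation marks treated as boundaries.
PUNCTUATION = set("，。！？；：、,.!?;:")


def classify_occurrences(text, keyword="其实"):
    """Label each occurrence of keyword using a simple punctuation heuristic."""
    labels = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        before = text[start - 1] if start > 0 else ""
        after = text[end] if end < len(text) else ""
        # The start of the text counts as a boundary on the left.
        left_boundary = before == "" or before in PUNCTUATION
        right_boundary = after in PUNCTUATION
        if left_boundary and right_boundary:
            labels.append("discourse-like")
        else:
            labels.append("clause-internal")
        start = text.find(keyword, end)
    return labels


# Hypothetical sample text, for illustration only.
sample = "其实，我不太想去。他其实很忙。"
print(classify_occurrences(sample))
```

Counting the two labels over a whole corpus gives a quick, approximate picture of how the two uses are distributed before any closer reading of the lines.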

Example 3: kan-kan 看看

The third set of concordance lines reveals that the reduplication of kan (看, ‘see, look’) has a number of uses: directing attention (‘你们看看’), indicating a prolonged intensive action (‘拿到太阳底下再看看’), and indicating a trivial action (‘看看表’), among others.
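Reduplicated forms such as 看看 can be located in raw text with a regular expression. The sketch below is a simple illustration with hypothetical names and sample text. It finds any Chinese character immediately followed by itself, so it will also pick up items that are not verb reduplication (for example 谢谢 or 妈妈) as well as accidental repetitions across word boundaries; the hits need to be read in context.

```python
import re

# One Chinese character immediately followed by the same character (AA pattern).
REDUPLICATION = re.compile(r"([\u4e00-\u9fff])\1")


def find_reduplications(text, char=None):
    """Return all AA sequences in text, optionally limited to one character."""
    matches = [m.group(0) for m in REDUPLICATION.finditer(text)]
    if char is not None:
        matches = [m for m in matches if m[0] == char]
    return matches


# Hypothetical sample text, for illustration only.
sample = "你们看看，拿到太阳底下再看看。我想想，再试试。"

print(find_reduplications(sample))
print(find_reduplications(sample, char="看"))
```

The matches can then be fed into a concordance display so that each occurrence is seen with its surrounding context.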

Because computer programs can search a corpus quickly, we can obtain a large number of examples of real language use in a very short period of time. This in turn frees up valuable time for analyzing the language and preparing teaching materials. Furthermore, as language teachers, we can build a learner corpus from our students’ own language production and use the same corpus-handling skills to uncover typical learner errors. In short, language corpora constitute a great data source for us to explore. They not only benefit teachers and researchers but also motivate learners. In fact, corpora are increasingly being used as learning tools for students. Given that students nowadays tend to be well equipped with computer skills, they should be encouraged to make informed use of corpus resources in conducting their own research and enhancing their learning.

Are there Chinese language corpora currently available to Chinese language teachers?

Yes, there are quite a few Chinese corpora that are freely available on the internet. Here is a list of some of them.

From Mainland China:
► The Beijing Language and Culture University Institute of Language Information Processing has a searchable written Chinese corpus totaling 15 billion Chinese characters. It comprises texts from multiple sources, including newspapers and magazines (2 billion), literature (3 billion), Weibo (3 billion), science and technology (3 billion), comprehensive (1 billion) and ancient Chinese (2 billion). Users can also restrict a search to a single sub-corpus.
URL: http://bcc.blcu.edu.cn/lang/zh
► The Peking University Modern Chinese Corpus is another source.
URL: http://ccl.pku.edu.cn:8080/ccl_corpus
► An online search system for the modern Chinese corpus developed by the Chinese National Commission on Language (国家语委) is available at:
URL: http://www.aihanyu.org/cncorpus/index.aspx

From Taiwan:
► The Academia Sinica has a Web-based Balanced Corpus of Modern Chinese (平衡语料库), consisting of texts mostly from Taiwanese newspapers. This corpus can be searched by part-of-speech information. It is also possible to search for reduplicated forms.
URL: http://www.sinica.edu.tw/ftms-bin/kiwi.sh
► The Academia Sinica also has a Digital Resource Center for Global Chinese Language Teaching and Learning (全球华语文数位教学资源中心). It provides word frequency lists and a web-based collection of reading materials that can be searched with grammatical and semantic information.
URL: http://elearning.ling.sinica.edu.tw/

From other parts of the world:
► The Lancaster Corpus of Mandarin Chinese (LCMC) was constructed by Tony McEnery and Richard Xiao, Lancaster University, UK. LCMC is a balanced corpus of modern written Chinese, consisting of texts from mainland China. The corpus covers genres such as press reportage, press editorials, religious passages, skills texts, trade and hobbies passages, popular lore, biographies and essays, fictional literature, and so forth, and is designed as a Chinese match of the Freiburg-LOB Corpus of British English (FLOB). The corpus can be downloaded in XML format from the following link.
URL: http://ota.ox.ac.uk/desc/2474
► The LIVAC Corpus, or Linguistic Variation in Chinese Speech Communities synchronous corpus, contains texts from representative Chinese newspapers and electronic media of Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. It also provides concordance and frequency analyses. Because this corpus is constantly updated, it is possible to trace the use of expressions over time (within the time span of the corpus itself).
URL: http://www.livac.org/index.php?lang=tc

Multilingual corpora involving Chinese and other languages:
► The Babel English-Chinese Parallel Corpus consists of 327 English articles and their translations in Mandarin Chinese. The corpus contains a total of 544,095 words (253,633 English words and 287,462 Chinese tokens). This corpus can be accessed through the CQPweb page at Beijing Foreign Studies University (BFSU CQPweb). The username and password are both test.
URL: http://bowland-files.lancs.ac.uk/corplang/babel/babel.htm

What corpus tools are available as freeware?

While there are many commercial software programs that can be used to prepare and/or analyze Chinese corpora, a few programs are also available on the Web or as free downloads. Two of them are particularly valuable.
► DimSum Chinese Language Tool, by Erik Peterson, is a very useful Java-based program that can do word segmentation, English annotation, word lists, and Hanzi to Pinyin conversion, among other features. It runs on Windows, MacOS, and Linux systems.
URL: http://www.mandarintools.com/dimsum.html
► AntConc, by Laurence Anthony, is a free program for Windows and Linux systems that can provide concordance, collocation, N-Gram and key word analyses. It works with multilingual texts.
URL: http://www.antlab.sci.waseda.ac.jp/software.html
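To give a sense of what an N-gram analysis involves, the following Python sketch counts character bigrams in a text. It is an independent illustration, not a description of how AntConc or DimSum work internally. The function name and sample text are hypothetical, and the units are single characters rather than segmented words.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count contiguous sequences of n Chinese characters in text."""
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Skip any sequence that contains punctuation or other non-Chinese symbols.
        if all("\u4e00" <= ch <= "\u9fff" for ch in gram):
            counts[gram] += 1
    return counts


# Hypothetical sample text, for illustration only.
sample = "我们去学校，他们也去学校。"

for gram, count in char_ngrams(sample, n=2).most_common(3):
    print(gram, count)
```

Frequent bigrams in unsegmented Chinese text often correspond to two-character words, which is one reason such counts are a useful first look at a new corpus.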

Which other resources are available?

There are numerous websites, books, and articles on “corpus linguistics”, “language corpora”, and “Chinese language and linguistics” available. Here is just a small selection:

Websites:
► Corpus4U.Org is a Web-based discussion forum for Chinese and English corpus linguistics and applications. It is based in mainland China and has over 2500 registered users as of May 2006.
URL: http://www.corpus4u.org/
► Marjorie K.M. Chan’s ChinaLinks has a wealth of information about Chinese language and linguistics.
URL: http://chinalinks.osu.edu
► Tianwei Xie’s Learning Chinese On-line web page provides a variety of links to Web sites that are related to Chinese learning and teaching.
URL: http://www.csulb.edu/~txie/on-line.htm

Books:
► Concordances in the Classroom: A Resource Book for Teachers by Chris Tribble and Glyn Jones (Houston: Athelstan, 1997) has many ideas for teachers with an interest in using electronic texts in the language classroom, even though it is English based.
► Corpus Linguistics by Douglas Biber, Susan Conrad, and Randi Reppen (Cambridge: CUP, 1998) is an introductory text to corpus linguistics.
► Exploring Spoken English by Ronald Carter and Michael McCarthy (Cambridge: CUP, 1997) is a practical guide to natural spoken English drawn from the CANCODE corpus. Although its examples are in English, it gives useful insights into using corpus data in teaching.
► Yuliaoku Yuyanxue (语料库语言学 Corpus Linguistics) by Huang Changning and Li Juanzi (Beijing: Commercial Press, 2002) is another introductory text to corpus linguistics.

Articles:

Carter, Ronald and Michael McCarthy (1995). Grammar and the Spoken Language. Applied Linguistics, 16 (2), 141-158.
Chan, Marjorie K.M. (2002). Concordancers and concordances: Tools for Chinese language teaching and research. Journal of the Chinese Language Teachers Association, 37 (2), 1-58.
Chen, Jing and Hongyin Tao (2004). A usage-based study of preposed verbal quantification structures in Chinese. Journal of Chinese Language and Computing, 14 (2), 125-137. [Special Issue: Corpora, Language Use, and Grammar. Edited by Hongyin Tao.]
McCarthy, Michael and Ronald Carter (2001). Size isn’t everything: Spoken English, corpus and the classroom. TESOL Quarterly, 35, 337-340.
McCarthy, Michael and A. O’Keeffe (2004). Research in the teaching of speaking. Annual Review of Applied Linguistics, 24, 26-43.
McEnery, A., Z. Xiao & Y. Tono (2005). Corpus-based Language Studies: An advanced resource book. London: Routledge.
Ming, Tao & Hongyin Tao (forthcoming). Developing a Chinese Heritage Language Corpus: Issues and a Preliminary Report. University of California, Los Angeles, Asian Languages and Cultures Department.
Sun, Maosong (孙茂松)(1998). “取决”与“来源”小议 (Notes on qujue and laiyuan). 中国语文 (Chinese Language), 6.
Tao, Hongyin (2000). Adverbs of absolute time and assertiveness in vernacular Chinese: A corpus based study. Journal of the Chinese Language Teachers Association, 3, 53-73.
Tao, Hongyin (2001). Emergent grammar and verbs of appearing. Contemporary Research in Modern Chinese, (Japan), 2, 89-100.
Tao, Hongyin (2002). The semantics and pragmatics of relative clause constructions in Mandarin narrative discourse. Contemporary Research in Modern Chinese, (Japan), 4, 47-57.
Tao, Hongyin (2004). Fundamentals in spoken discourse analysis. Yuyan Kexue (Linguistic Sciences), 3, 50-67.
Tao, Hongyin (2005). The gap between natural speech and spoken Chinese teaching material: Discourse perspectives on Chinese pedagogy. Journal of the Chinese Language Teachers Association, 40, 1-24.
Xiao, Zhonghua & Anthony McEnery (2004). Aspect in Mandarin Chinese: A corpus-based study. Amsterdam: John Benjamins.
Xiao, Zhonghua & Anthony McEnery (2006). Collocation, semantic prosody and near synonymy: A cross-linguistic perspective. Applied Linguistics, 27(1), 103-129.
Wang, Lixun (2001). Exploring parallel concordancing in English and Chinese. Language Learning and Technology, 5, 174-184.