Corpus Research
Although a pile of newspaper articles in your desk drawer is certainly a collection of texts, most of today’s applied linguists would not consider it a corpus unless the texts were stored and accessed electronically.
When corpus research first began in the 1950s, however, corpora were of course accessed manually. One of the most famous pre-electronic corpora is the Survey of English Usage (SEU) at University College London. This corpus was founded in 1959 by Randolph Quirk and was assembled mainly to inform grammar research. Since then, corpora have been compiled in an array of languages, from a variety of types of text and for a range of purposes.
Although you can search through printed texts manually (using an ‘ocular scan’), hard copies of texts are not particularly accessible. In addition, such scanning is time-consuming and not as accurate as the mechanical processing of digital text. These days all corpus researchers use collections of texts that are stored and accessed electronically.
So far we have said that a corpus is a collection of texts. It is difficult, however, to be specific about how large this collection should be. One million words is often given as a minimum, but it really depends on what kind of data the corpus contains, and why it has been gathered. You will get a better idea about the sizes and purposes of corpora later in this unit, but, at a basic level, the collection of naturally occurring language data results in two types of corpora: general and specialist.
The data that is stored in a computer can be in its ‘raw’ form, i.e. just the words of the source texts. However, many researchers choose to encode information beyond the texts themselves. Marking up a corpus in this way allows structural features (e.g. titles, authors and subheadings in written corpora; speaker identification and perhaps other contextual information in spoken corpora) to be included. In addition, spoken corpora may also include information about the age and gender of the speakers, their native language, dialect, etc. As well as being marked up, corpora may also be annotated, or tagged. These tags commonly record the grammatical category of the individual words.
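The difference between raw and tagged text can be sketched with a tiny invented example. The sentence, the word_TAG convention, and the tag labels below (DT, NN, VBZ, IN) are illustrative only; real corpora use established tag sets such as CLAWS or the Penn Treebank tags.

```python
# Raw text: just the words of the source.
raw = "The cat sits on the mat"

# The same sentence after part-of-speech tagging, in word_TAG format.
# (Hypothetical example; tag labels here are illustrative.)
tagged = "The_DT cat_NN sits_VBZ on_IN the_DT mat_NN"

def parse_tagged(text):
    """Split a word_TAG string into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]

pairs = parse_tagged(tagged)
print(pairs)  # [('The', 'DT'), ('cat', 'NN'), ('sits', 'VBZ'), ...]

# Once tagged, the corpus can be queried by grammatical category,
# e.g. pulling out all the nouns:
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)  # ['cat', 'mat']
```

It is this kind of annotation that lets researchers ask grammatical questions of a corpus (e.g. "find every noun") rather than searching only for word forms.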
Corpus linguistics now has a great number of followers. There are researchers whose main interest is discovering patterns of language use, and others who are more concerned with using insights gained from corpora to develop language teaching materials. Whatever your interest in corpora, one of the key attractions of this branch of linguistics is that it uses large collections of naturally occurring spoken or written texts. Because these texts are stored electronically, they can be analyzed using computer-based tools. One of the fascinating things that this kind of analysis shows us is how language changes according to context (where/why it is written/spoken and by whom).
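Two of the most common computer-based analyses are frequency lists and concordances (keyword-in-context, or KWIC, displays). The sketch below runs both over a two-sentence toy "corpus" invented for illustration; real tools such as concordancers work on the same principle at a much larger scale.

```python
from collections import Counter

# A toy corpus of invented example texts.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Frequency list: the most basic quantitative view of a corpus.
tokens = [w for text in corpus for w in text.split()]
freq = Counter(tokens)
print(freq.most_common(2))  # [('the', 4), ('cat', 2)]

def kwic(corpus, node, width=2):
    """List each occurrence of `node` with `width` words of context."""
    lines = []
    for text in corpus:
        words = text.split()
        for i, w in enumerate(words):
            if w == node:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                lines.append(f"{left:>12} [{w}] {right}")
    return lines

for line in kwic(corpus, "cat"):
    print(line)
```

Even at this toy scale, the concordance lines show the node word in its surrounding context, which is how corpus linguists spot recurring patterns of use.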
Although corpus research relies heavily on computer technology, not all the analysis that is done is mechanical and quantitative. Much qualitative research is also carried out, and the two approaches complement each other. Despite this technological progress, spoken corpora are still much more labor-intensive to assemble than written ones. Spoken data can be more difficult to collect in the first place (video and audio recording is fraught with challenges), and is time-consuming to transcribe. However, corpus linguists specializing in spoken language would argue that what this hard work reveals about how people really speak is priceless.
Further exploration:
Study section 1 of the online supplement to the first four chapters of McEnery and Wilson’s (1996) Corpus Linguistics. Edinburgh: Edinburgh University Press.
According to the authors, does corpus-based research have a trouble-free history?