Clusters

In the previous unit we looked at how frequency lists are created and analyzed. The lists that we looked at focused on individual words only. However, it is also very useful to generate lists of clusters consisting of two or more words.

We can ask a program like WordSmith Tools to generate frequency lists for recurring clusters of up to eight words, although the longest cluster that is likely to occur with any frequency in a general corpus is the much-maligned six-word cluster ‘at the end of the day’, which you see at Position 2 in the list below.

We can use cluster analysis to find out what the most frequent phrases are in a corpus. Here is a list of the most frequently occurring five-word clusters in a corpus of spoken English. You will notice that not all of these clusters are actually well-formed. Some of them (e.g. ‘you know what I mean’ and ‘all the rest of it’) are meaningful, however, and it is formulaic sequences such as these that contribute to language that is judged to be fluent and well-formed. The insight that such frequency lists give is therefore important, because it can inform teachers about which phrases it would be sensible to teach learners.
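To make the idea of a cluster frequency list concrete, here is a minimal sketch in Python. It is not how WordSmith Tools works internally; the function names `tokenize` and `count_clusters`, the simple tokenization rule and the sample sentence are all assumptions made for illustration. The sketch slides a five-word window across the text and counts how often each sequence recurs.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and treat runs of letters and
    # apostrophes as words. Real corpus tools offer finer control.
    return re.findall(r"[a-z']+", text.lower())


def count_clusters(tokens, size):
    # Slide a window of `size` words across the token list and count
    # every sequence. Sentence boundaries are ignored in this sketch.
    windows = zip(*(tokens[i:] for i in range(size)))
    return Counter(" ".join(window) for window in windows)


# Invented sample text, used only to show the counting.
sample = "you know what i mean and you know what i mean by that"
clusters = count_clusters(tokenize(sample), 5)

# Print the three most frequent five-word clusters with their counts.
for cluster, freq in clusters.most_common(3):
    print(freq, cluster)
```

Most of the sequences counted this way are fragments rather than complete phrases, which is why a raw cluster list always contains items that are not well-formed.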


Creating clusters

Please follow these steps to create a cluster list with WordSmith Tools:

STEP 1: In order to create a cluster list, we first need to load our corpus data into WordSmith Tools. This step is the same as for generating a frequency list (see screenshots below).

 

 

 

STEP 2: One important step in creating clusters in WordSmith Tools is to create and save an index file for the corpus first. This is illustrated in the following two screenshots.

 

STEP 3: Now click compute >> clusters, then choose the size of the clusters you want and the minimum frequency, which filters out the less interesting ones. Click OK, and voilà! You’ll have your first cluster list!

 

 

 

Now create some cluster lists for your corpus. What sorts of clusters are most frequent?
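If you would like to cross-check your results outside WordSmith Tools, the two settings chosen in STEP 3 (cluster size and minimum frequency) can be mimicked with a short Python sketch. The function `cluster_lists`, its parameter names and the sample sentence are hypothetical and chosen for illustration only. Counts may also differ from WordSmith's, because this sketch splits on whitespace and ignores punctuation and sentence boundaries.

```python
from collections import Counter


def cluster_lists(tokens, min_size, max_size, min_freq):
    # Build one frequency list per cluster size, keeping only the
    # clusters that occur at least `min_freq` times.
    results = {}
    for size in range(min_size, max_size + 1):
        counts = Counter(
            " ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)
        )
        results[size] = [
            (cluster, freq)
            for cluster, freq in counts.most_common()
            if freq >= min_freq
        ]
    return results


# Invented sample text; a real corpus would be read from files.
tokens = "at the end of the day we all go home at the end of the day".split()
lists = cluster_lists(tokens, min_size=2, max_size=6, min_freq=2)

for size, entries in lists.items():
    print(size, entries)
```

Raising `min_freq` shortens every list, in the same way that a higher minimum frequency in STEP 3 filters out clusters that occur only once or twice.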
