(20th of July, 2015, Lancaster; co-located with Corpus Linguistics 2015)

Creating extremely large corpora no longer appears to be a challenge. With the constantly growing amount of born-digital text – be it available on the web or only on the servers of publishing companies – and with the increasing number of printed texts digitized by public institutions or technological giants such as Google, we may safely expect the upper limits of text collections to keep rising for years to come. Although some of this was already true 20 years ago, we have the strong impression that the challenge has now shifted to dealing with the large amounts of primary data and the much larger amounts of annotation data.

On the one hand, the new challenges call for new (corpus-)linguistic methodologies that can make use of extremely large corpora, e.g. in order to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variations within and across language domains. Such methodologies may involve, for example, new methods for structuring search results (to cope with false positives) or visualization techniques that facilitate the interpretation of results or the abduction of new hypotheses. On the other hand, some fundamental technical methods and strategies call for re-evaluation. These include, e.g., efficient and sustainable curation of the data, management of collections that span multiple volumes or that are distributed across several centres, innovative corpus architectures that maximise the usefulness of data, and techniques that allow the data to be searched and analysed efficiently.

The third edition of CMLC will accompany Corpus Linguistics 2015 in Lancaster and will be held on the 20th of July 2015. This half-day workshop will bring together leading researchers in the fields of Language Resource creation and Corpus Linguistics, in order to provide a platform for an intensive exchange of expertise, results and ideas.


