CfP: LREC-2014 Satellite Workshop
Challenges in the management of large corpora (CMLC-2)
We live in an age where the well-known maxim that “the only thing better than data is more data” no longer sets unattainable goals. Creating extremely large corpora is no longer a challenge, given the proven methods behind, e.g., the Web-as-Corpus approach or the use of Google's n-gram collection. Indeed, the challenge has now shifted towards dealing with large amounts of primary data and much larger amounts of annotation data. On the one hand, this challenge concerns finding new (corpus-)linguistic methodologies that can make use of such extremely large corpora, e.g. in order to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variations within and across language domains. On the other hand, some fundamental technical methods and strategies are being called into question. These include, e.g., successful curation of data, management of collections that span multiple volumes or are distributed across several centres, methods for cleaning the data of non-linguistic intrusions or duplicates, as well as automatic annotation methods and innovative corpus architectures that maximise the usefulness of data or allow it to be searched and analysed efficiently. Among the new tasks are also collaborative manual annotation and methods to manage it, as well as new challenges to the statistical analysis of such data and metadata.
The second LREC workshop on “Challenges in the management of large corpora” aims at gathering the leading researchers in the fields of Language Resource creation and Corpus Linguistics, in order to foster an intensive exchange of expertise, results and ideas. In line with this LREC’s hot topic, “Big Data”, contributions concerned with national corpora, reference corpora and other very large corpora are particularly welcome.
The half-day workshop will be wrapped up with a discussion of common challenges, ideas for possible solutions and potential cooperation.
We invite submissions dealing with:
The workshop aims at gathering the leading researchers in the fields of Language Resource creation and Corpus Linguistics, in order to foster an intensive exchange of expertise, results and ideas concerning the issues mentioned as "topics of interest" above, primarily the creation, maintenance, extensibility and use of *large* and richly annotated linguistic data sets, well above 1 billion (10^9) tokens and nearing the petabyte range in volume.
We invite extended abstracts for 15- to 20-minute presentations. All abstracts must be submitted via the START Conference Manager.
Please note: When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or that are a new result of their research.
Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments, including evaluation experiments.
Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt
Evelyn Breiteneder, Hanno Biber, Karlheinz Mörth
Workshop homepage: http://corpora.ids-mannheim.de/cmlc.html