Marc Kupietz, Andreas Witt, Piotr Bański, Dan Tufiş, Dan Cristea, Tamás Váradi


Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research

CMLC-5 Workshop at CL 2017, Birmingham, 2017-07-24


  1. Aims and foundations

  2. Previous and current work

  3. Outlook & Summary

1. Aims and foundations

Strong need for comparable corpora

  • contrastive and cross-linguistic research is a rich source of linguistic insights

  • parallel corpora are hardly suitable for finer-grained linguistic research

    • several kinds of biases introduced by translations

  • large and diverse comparable corpora are therefore highly desirable

Available comparable corpora
that include German

  • currently only web-based corpora

    • Aranea – Family of Comparable Gigaword Web Corpora (Benko 2014)

  • in future:

    • International Comparable Corpora (ICC) (Kirk/Čermáková 2017)

Available monolingual corpora

  • several initiatives for national and reference corpora

    • BNC, CNC, CoRoLa, DeReKo, DWDS, HNC, NKJP, …

    • not all in active development

    • but usually with some institutional commitment

  • hosting institutions

    • loosely connected by bilateral contacts

    • some co-operation within CLARIN

      • but subordinated to various other goals and funding necessities

    • some coordination via EFNIL

      • so far mostly unrelated to corpora

Why not join our forces…

  • and build comparable corpora by using these existing resources?

  • idea seems obvious for comparable corpora

  • several national and reference corpora built and maintained anyway

  • creating methodology and techniques for joining them

    • with each hosting centre still responsible for its language

  • much more economical, scalable and sustainable

  • than creating comparable corpora individually from scratch

Problem: IPR and license contracts
… typically tie existing corpora to their host institutions

Solution: Join the existing corpora just virtually

  • by means of a decentralized technical infrastructure

  • that allows defining and using virtual comparable corpora

  • based on existing corpora

  • with each corpus still physically located at its centre

    • ideally benefiting from ongoing and future expansions

If the data cannot be moved…
Loosely following Jim Gray's postulate

…put the computation near the data
Jim Gray (2003)

Translated to EuReCo
… build some infrastructure to use it from where it is.
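
The principle can be illustrated with a minimal Python sketch. The `Centre` class, its `count` method and the sample texts are hypothetical and do not reflect any existing API: each centre evaluates a query on its own texts and returns only aggregate counts, so the licensed texts never leave their host.

```python
from dataclasses import dataclass


@dataclass
class Centre:
    """Hypothetical hosting centre that keeps its texts locally."""
    name: str
    texts: list[str]

    def count(self, word: str) -> dict:
        # Runs at the centre: only aggregate numbers are returned,
        # never the texts themselves.
        tokens = [t for text in self.texts for t in text.lower().split()]
        return {"centre": self.name,
                "hits": tokens.count(word.lower()),
                "tokens": len(tokens)}


def federated_count(centres: list[Centre], word: str) -> list[dict]:
    # The coordinator only ever sees per-centre aggregates.
    return [c.count(word) for c in centres]


centres = [
    Centre("centre_a", ["das Korpus ist gross", "ein Korpus"]),
    Centre("centre_b", ["corpus de referinta"]),
]
results = federated_count(centres, "Korpus")
```

In a real deployment the call to `count` would be a remote request to the centre's own service; the in-process call here only stands in for that step.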

How about CLARIN?
Doesn't CLARIN already provide such an infrastructure?

  • CLARIN federated content search (CLARIN-FCS) not powerful enough for corpus linguistic analysis

    • API approach probably also not sufficiently extensible

  • however, CLARIN provides some foundations to build upon:

    • corpus interoperability (CMDI, encoding standards)

    • virtual collection registry (VCR)

    • PIDs

  • CLARIN extension required

Still needed: corpus query tool / infrastructure

  • that provides some uniform access to physically distributed corpora

  • that enables the user to define virtual comparable corpora based on metadata:

    • e.g.: build the largest possible German-Romanian comparable corpus with equal distribution of text types, topics and publication years between 2000 and 2009

  • and to analyze the data appropriately
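
One way to read the example request above is sketched below in Python, under simplifying assumptions: every text is a metadata record with hypothetical fields `text_type`, `topic` and `year`, and size is counted in texts rather than tokens. For each combination of the three properties, the largest selection that is equally distributed across both corpora takes the smaller of the two counts.

```python
from collections import Counter


def stratum(doc):
    # A stratum is one combination of text type, topic and publication year.
    return (doc["text_type"], doc["topic"], doc["year"])


def balanced_quota(corpus_a, corpus_b, first_year=2000, last_year=2009):
    """Number of texts per stratum that both corpora can contribute equally."""
    def counts(corpus):
        return Counter(stratum(d) for d in corpus
                       if first_year <= d["year"] <= last_year)
    # Counter intersection keeps the minimum count per stratum.
    return counts(corpus_a) & counts(corpus_b)


# Invented sample metadata, for illustration only.
german = [
    {"text_type": "news", "topic": "politics", "year": 2001},
    {"text_type": "news", "topic": "politics", "year": 2001},
    {"text_type": "fiction", "topic": "culture", "year": 2005},
    {"text_type": "news", "topic": "sports", "year": 2012},
]
romanian = [
    {"text_type": "news", "topic": "politics", "year": 2001},
    {"text_type": "fiction", "topic": "culture", "year": 2005},
    {"text_type": "fiction", "topic": "culture", "year": 2005},
    {"text_type": "news", "topic": "sports", "year": 2012},
]
quota = balanced_quota(german, romanian)
```

Here `quota` allows one text per corpus from each of the two shared strata; the 2012 texts fall outside the requested period. Which texts fill each quota (for example by random sampling) is left open.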

Is it really worth it?
Will the benefits outweigh the technical overhead?

  • in this special case we are optimistic

  • the additional functionality required of a corpus query system is modest

  • with KorAP, a corpus query system with the key functionality already exists:

    • support for distributed indexes

    • dynamically definable virtual sub-corpora

Additional benefits of joining forces

  • collaborative development of research software

    • could make it more affordable and sustainable

  • brings the research communities closer together

    • currently very much centered around their philologies

2. Previous and current work

EuReCo-related history

start of KorAP project
EuReCo idea born at CMLC-I in Istanbul
EuReCo project proposal (rejected)
  • with Poland, Romania, Hungary and Germany

start of the DRuKoLA project
  • contrastive studies on German and Romanian

  • pilot study for EuReCo

start of DeutUng
  • integration of the Hungarian National Corpus (HNC)

The current technical basis for EuReCo: KorAP

  • new corpus analysis platform for DeReKo

  • developed at the IDS since 2011

  • designated successor of the current COSMAS II system

    • with ~40,000 registered users

    • designed as early as 1993

KorAP key features for EuReCo

  • KorAP can work with physically distributed indexes

    • ➞ corpus data can be stored at different locations

  • virtual corpora can easily be defined based on metadata properties

  • unlimited maximum corpus size

  • unlimited number of annotation layers

  • support for multiple query languages

    • e.g. the CQP variant/extension Poliqarp

  • sustainable, standing project (2.5 FTE)
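
The notion of a virtual corpus defined by metadata properties can be sketched in Python as follows. This is an illustration only, not KorAP's implementation: the `VirtualCorpus` class, the helper functions and the sample records are hypothetical. The point is that a virtual corpus is a stored condition rather than a copy of texts, so the same definition can be evaluated against indexes held at different locations.

```python
from dataclasses import dataclass
from typing import Callable

Doc = dict


@dataclass(frozen=True)
class VirtualCorpus:
    """Hypothetical virtual corpus: a metadata condition, not a copy of texts."""
    matches: Callable[[Doc], bool]

    def __and__(self, other: "VirtualCorpus") -> "VirtualCorpus":
        # Combine two definitions; a document must satisfy both.
        return VirtualCorpus(lambda d: self.matches(d) and other.matches(d))

    def select(self, index: list[Doc]) -> list[Doc]:
        # Evaluate the definition against one (local) index.
        return [d for d in index if self.matches(d)]


def field_equals(field: str, value) -> VirtualCorpus:
    return VirtualCorpus(lambda d: d.get(field) == value)


def year_between(first: int, last: int) -> VirtualCorpus:
    return VirtualCorpus(lambda d: first <= d.get("year", -1) <= last)


# Two indexes standing in for data stored at different locations.
index_a = [
    {"id": "a1", "text_type": "news", "year": 2003},
    {"id": "a2", "text_type": "fiction", "year": 2003},
]
index_b = [
    {"id": "b1", "text_type": "news", "year": 2008},
    {"id": "b2", "text_type": "news", "year": 2015},
]

news_2000s = field_equals("text_type", "news") & year_between(2000, 2009)
selected_ids = [d["id"] for idx in (index_a, index_b)
                for d in news_2000s.select(idx)]
```

Because the definition is only a condition, texts added to an index later are covered automatically the next time it is evaluated.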


Open Source
BSD licensed


Current State of KorAP

  • publicly available for querying large parts of DeReKo since May 2017

  • some essential functionality still missing:

    • sorting results

    • all frequency related functions

    • corpus inspection

    • special functions for comparable corpora

  • functionality complements that of COSMAS II

  • KorAP will gradually replace COSMAS II within the next five years

The DRuKoLA project

  • original title: Sprachvergleich korpustechnologisch. Deutsch - Rumänisch

  • based on DeReKo and the Reference Corpus of Contemporary Romanian Language (CoRoLa)

  • funded by the Alexander von Humboldt Foundation

  • Research Group Linkage Programme

    • University of Bucharest

    • Institute for the German Language in Mannheim

    • Romanian Academy as associate partner:

      • Institute for Artificial Intelligence Mihai Drăgănescu (RACAI, Bucharest)

      • Institute of Computer Science (IIT, Iaşi)

  • project runtime 2016-2018

DRuKoLA aims

  1. provision of German-Romanian comparable corpora

  2. development of criteria for comparable virtual corpora

  3. exploration of differences between quantitative distributions

  4. comparative corpus-based case studies

  5. development of corpus technology for sharing corpus data as well as technical and research results on a common corpus analysis platform

  6. building a crystallization point for EuReCo

3. Outlook & Summary

Current State of EuReCo / DRuKoLA / DeutUng

  • CoRoLa is available under KorAP

  • the mapping of the DeReKo and CoRoLa topic and text-type taxonomies onto a common intermediate taxonomy will be finalized in September

  • DeutUng kickoff workshop in October in Szeged
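
A mapping onto a common intermediate taxonomy can be pictured with the following Python sketch. All category labels are invented for illustration and are not the actual DeReKo or CoRoLa categories; `harmonise` is a hypothetical helper that also reports labels still lacking a mapping.

```python
# Hypothetical source labels and common categories, invented for illustration.
GERMAN_TO_COMMON = {"Politik": "politics", "Sport": "sports"}
ROMANIAN_TO_COMMON = {"politică": "politics", "sport": "sports"}


def harmonise(labels, mapping):
    """Split source labels into mapped common categories and unmapped leftovers."""
    mapped = {label: mapping[label] for label in labels if label in mapping}
    unmapped = [label for label in labels if label not in mapping]
    return mapped, unmapped


mapped, unmapped = harmonise(["Politik", "Sport", "Wetter"], GERMAN_TO_COMMON)
```

Listing the unmapped labels makes gaps in the mapping visible before virtual comparable corpora are defined on top of it.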

Next steps

  • exploration of differently defined comparable corpora

    • and their effect on quantitative distributions with respect to some case studies

  • implementation of missing KorAP features

  • UI experiments:

    • what other functionality would be useful for working with comparable corpora?

EuReCo tries to

  • build virtual comparable corpora based on existing corpora

  • with each hosting institution still being responsible for its own part

  • linked via a corpus platform that allows users to dynamically define and analyse virtual comparable corpora

  • abstract idea: approach scientific, organizational, legal and economic problems with an infrastructural solution

  • everybody is invited to join

Thank you very much for your attention!


Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):

KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Fundacja Uniwersytetu im. A., 2013. pp. 586-587.

Cosma, Ruxandra/Cristea, Dan/Kupietz, Marc/Tufiş, Dan/Witt, Andreas (2016):

DRuKoLA – Towards Contrastive German-Romanian Research based on Comparable Corpora. In: Bański, Piotr/Barbaresi, Adrien/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Kupietz, Marc/Lüngen, Harald/Witt, Andreas (eds.): 4th Workshop on Challenges in the Management of Large Corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), 2016. pp. 28-32.

Benko, Vladimír (2014):

Aranea: Yet Another Family of (Comparable) Web Corpora. In: Sojka, Petr/Horák, Aleš/Kopeček, Ivan/Pala, Karel (eds.): Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings. LNCS 8655. Springer International Publishing Switzerland, 2014. pp. 257-264. ISBN: 978-3-319-10815-5 (Print), 978-3-319-10816-2 (Online).

Gray, Jim (2003):

Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.

Kirk, John/Čermáková, Anna (2017):

From ICE to ICC: The new International Comparable Corpus. In Bański et al. (eds.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section

Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):

The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010). ELRA. pp. 1848-1854.