Marc Kupietz, Andreas Witt, Piotr Bański, Dan Tufiş, Dan Cristea, Tamás Váradi
CMLC-5 Workshop at CL 2017, Birmingham, 2017-07-24
contrastive and cross-linguistic research is a rich source of linguistic insights
parallel corpora are hardly suitable for finer-grained linguistic research
several kinds of biases introduced by translations
large and diverse comparable corpora would be nice to have
currently only web-based corpora
Aranea – Family of Comparable Gigaword Web Corpora (Benko 2014)
in future:
International Comparable Corpora (ICC) (Kirk/Čermáková 2017)
several initiatives for national and reference corpora
BNC, CNC, CoRoLa, DeReKo, DWDS, HNC, NKJP, …
not all in active development
but usually with some institutional commitment
hosting institutions
loosely connected by bilateral contacts
some co-operation within CLARIN
but subordinated to various other goals and funding necessities
some coordination via EFNIL
so far mostly unrelated to corpora
and build comparable corpora by using these existing resources?
idea seems obvious for comparable corpora
several national and reference corpora built and maintained anyway
creating methodology and techniques for joining them
with each hosting centre still responsible for its language
much more economical, scalable and sustainable
than creating comparable corpora individually from scratch
by some decentralized technical infrastructure
that allows defining and using virtual comparable corpora
based on existing corpora
with each corpus still physically located at its centre
ideally benefiting from ongoing and future expansions
CLARIN federated content search (CLARIN-FCS) not powerful enough for corpus linguistic analysis
API approach probably also not sufficiently extensible
however, CLARIN provides some foundations to build upon:
corpus interoperability (CMDI, encoding standards)
virtual collection registry (VCR)
PIDs
CLARIN extension required
that provides some uniform access to physically distributed corpora
that enables the user to define virtual comparable corpora based on metadata:
e.g.: build the largest possible German-Romanian comparable corpus with equal distribution of text types, topics and publication years between 2000 and 2009
and to analyze the data appropriately
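The example request above can be pictured as a stratified selection over text metadata. The following is a minimal sketch, not a description of any existing CLARIN or KorAP interface: the record fields (`lang`, `text_type`, `topic`, `year`) and the function `balance` are assumptions made for illustration. For every stratum (text type, topic, year) it keeps the same number of texts from both languages, namely the smaller of the two counts, which gives both sides the same distribution. It counts texts rather than tokens to stay short.

```python
from collections import defaultdict

def balance(texts, lang_a, lang_b, years=range(2000, 2010)):
    """Pick equally many texts per (text_type, topic, year) stratum for two languages."""
    strata = defaultdict(lambda: {lang_a: [], lang_b: []})
    for t in texts:
        if t["lang"] in (lang_a, lang_b) and t["year"] in years:
            key = (t["text_type"], t["topic"], t["year"])
            strata[key][t["lang"]].append(t)

    selected = []
    for key in sorted(strata):
        group = strata[key]
        # The smaller side limits how much of this stratum can be used.
        n = min(len(group[lang_a]), len(group[lang_b]))
        selected.extend(group[lang_a][:n])
        selected.extend(group[lang_b][:n])
    return selected

# Toy metadata records (hypothetical, for illustration only).
texts = [
    {"id": "de1", "lang": "de", "text_type": "news", "topic": "politics", "year": 2003},
    {"id": "de2", "lang": "de", "text_type": "news", "topic": "politics", "year": 2003},
    {"id": "ro1", "lang": "ro", "text_type": "news", "topic": "politics", "year": 2003},
    {"id": "ro2", "lang": "ro", "text_type": "fiction", "topic": "culture", "year": 2005},
    {"id": "de3", "lang": "de", "text_type": "news", "topic": "politics", "year": 2012},
]
virtual_corpus = balance(texts, "de", "ro")
print([t["id"] for t in virtual_corpus])  # ['de1', 'ro1']
```

In the toy data only one stratum is populated on both sides, so one German and one Romanian text are selected; the 2012 text falls outside the requested years.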
in this special case we are optimistic
the additional functionality required of corpus query systems is modest
with KorAP there already is a corpus query system with the key functionalities
support for distributed indexes
dynamically definable virtual sub-corpora
collaborative development of research software
could make it more affordable and sustainable
brings the research communities closer together
currently very much centered around their philologies
with Poland, Romania, Hungary and Germany
contrastive studies on German and Romanian
pilot study for EuReCo
integration of the Hungarian National Corpus (HNC)
new corpus analysis platform for DeReKo
developed at the IDS since 2011
designated successor of the current COSMAS II system
with ~40,000 registered users
already designed in 1993
KorAP can work with physically distributed indexes
➞ corpus data can be stored at different locations
virtual corpora can easily be defined based on metadata properties
unlimited maximum corpus size
unlimited number of annotation layers
support for multiple query languages
e.g. the CQP variant/extension Poliqarp
sustainable, standing project (2.5 FTE)
publicly available for querying large parts of DeReKo since May 2017
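The combination of physically distributed indexes and metadata-defined virtual corpora listed above can be illustrated with a small sketch. It does not reproduce KorAP's architecture or API: the class `Site`, the function `query_all` and the data layout are hypothetical. The idea shown is only that the virtual corpus is a metadata condition sent to every site, each site evaluates it against its own local data, and only the results travel back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Site:
    """Stands in for one hosting centre with its own local index (hypothetical)."""
    name: str
    texts: List[dict]  # each text: {"meta": {...}, "tokens": [...]}

    def count(self, word: str, in_corpus: Callable[[dict], bool]) -> Tuple[int, int]:
        """Return (hits, size in tokens) over the texts selected by the metadata condition."""
        hits = size = 0
        for text in self.texts:
            if in_corpus(text["meta"]):
                size += len(text["tokens"])
                hits += text["tokens"].count(word)
        return hits, size

def query_all(sites: List[Site], word: str,
              in_corpus: Callable[[dict], bool]) -> Dict[str, Tuple[int, int]]:
    """Evaluate the same query at every site and collect the per-site results."""
    return {site.name: site.count(word, in_corpus) for site in sites}

# Toy data: two sites, each keeping its own texts.
site_de = Site("site_de", [
    {"meta": {"lang": "de", "year": 2004}, "tokens": ["das", "Haus", "ist", "das", "Haus"]},
    {"meta": {"lang": "de", "year": 1995}, "tokens": ["das", "Haus"]},
])
site_ro = Site("site_ro", [
    {"meta": {"lang": "ro", "year": 2006}, "tokens": ["casa", "este", "casa"]},
])

def in_2000s(meta: dict) -> bool:
    """Virtual corpus definition: texts published between 2000 and 2009."""
    return 2000 <= meta["year"] <= 2009

results = query_all([site_de, site_ro], "Haus", in_2000s)
print(results)  # {'site_de': (2, 5), 'site_ro': (0, 3)}
```

Because each site returns its own corpus size along with the hit count, frequencies can be normalized per site without moving any corpus data.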
some essential functionality still missing:
sorting results
all frequency related functions
corpus inspection
special functions for comparable corpora
functionality complements that of COSMAS II
KorAP will gradually replace COSMAS II within the next 5 years
original title: Sprachvergleich korpustechnologisch. Deutsch - Rumänisch (roughly: "Language comparison by means of corpus technology. German - Romanian")
based on DeReKo and the Reference Corpus of Contemporary Romanian Language (CoRoLa)
funded by the Alexander von Humboldt Foundation
Research Group Linkage Programme
University of Bucharest
Institute for the German Language in Mannheim
Romanian Academy as associate partner:
Institute for Artificial Intelligence Mihai Drăgănescu (RACAI, Bucharest)
Institute of Computer Science (IIT, Iaşi)
project runtime 2016-2018
provision of German-Romanian comparable corpora
development of criteria for comparable virtual corpora
exploration of differences between quantitative distributions
comparative corpus-based case studies
development of corpus technology to share corpus, technical and research results on a common corpus analysis platform
building a crystallization structure for a EuReCo
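One way to make "differences between quantitative distributions" concrete is to compare how two virtual corpora are composed, for example by text type. The sketch below uses the total variation distance between the two category distributions. This is one possible measure chosen for illustration, not a method stated by the project, and the category counts are invented.

```python
def proportions(counts):
    """Turn absolute counts per category into relative shares."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def total_variation(counts_a, counts_b):
    """Half the sum of absolute share differences: 0 = identical, 1 = disjoint."""
    p, q = proportions(counts_a), proportions(counts_b)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Invented text counts per text type for two virtual corpora.
corpus_de = {"news": 60, "fiction": 20, "science": 20}
corpus_ro = {"news": 40, "fiction": 40, "science": 20}

distance = total_variation(corpus_de, corpus_ro)
print(round(distance, 3))
```

For the invented counts the shares differ by 0.2 for news and for fiction and not at all for science, so the distance is approximately 0.2. A value near 0 would indicate that the two virtual corpora are similarly composed with respect to the chosen category.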
mapping of the DeReKo and CoRoLa topic and text type taxonomies to a common intermediate taxonomy will be finalized in September
DeutUng kickoff workshop in October in Szeged
exploration of differently defined comparable corpora
and their effect on quantitative distributions with respect to some case studies
implementation of missing KorAP features
UI experiments:
which other functionalities would be useful for working with comparable corpora?
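The taxonomy mapping named as the first of these steps can be pictured as two lookup tables that translate each corpus's native category labels into a shared intermediate label. The sketch below is purely illustrative: the labels, the table names and the function `to_common` are hypothetical, and the actual DeReKo and CoRoLa taxonomies and their mapping are not reproduced here.

```python
# Hypothetical native topic labels mapped to a hypothetical intermediate taxonomy.
DEREKO_TO_COMMON = {"Politik": "politics", "Sport": "sports", "Kultur": "culture"}
COROLA_TO_COMMON = {"politică": "politics", "sport": "sports", "artă": "culture"}

def to_common(label, mapping, fallback="unclassified"):
    """Translate a native label; unmapped labels are kept apart instead of guessed."""
    return mapping.get(label, fallback)

german_topic = to_common("Politik", DEREKO_TO_COMMON)
romanian_topic = to_common("politică", COROLA_TO_COMMON)
unknown_topic = to_common("Wetter", DEREKO_TO_COMMON)
print(german_topic, romanian_topic, unknown_topic)  # politics politics unclassified
```

Once both corpora report topics in the intermediate taxonomy, a metadata condition such as "topic is politics" selects matching texts on both sides.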
build virtual comparable corpora based on existing corpora
with each hosting institution still being responsible for its own part
linked via a corpus platform that allows users to dynamically define and analyse virtual comparable corpora
abstract idea: approach scientific, organizational, legal and economic problems with an infrastructural solution
everybody is invited to join
Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):
KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Fundacja Uniwersytetu im. A., 2013. pp. 586-587.
Cosma, Ruxandra/Cristea, Dan/Kupietz, Marc/Tufiş, Dan/Witt, Andreas (2016):
DRuKoLA – Towards Contrastive German-Romanian Research based on Comparable Corpora. In: Bański, Piotr/Barbaresi, Adrien/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Kupietz, Marc/Lüngen, Harald/Witt, Andreas: 4th Workshop on Challenges in the Management of Large Corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), 2016. pp. 28-32.
Benko, Vladimír (2014):
Aranea: Yet Another Family of (Comparable) Web Corpora. In Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (Eds.): Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings. LNCS 8655. Springer International Publishing Switzerland, 2014. pp. 257-264. ISBN: 978-3-319-10815-5 (Print), 978-3-319-10816-2 (Online).
Gray, Jim (2003):
Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.
Kirk, John/Čermáková, Anna (2017):
From ICE to ICC: The new International Comparable Corpus. In Bański et al. (eds.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section
Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):
The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). ELRA. pp. 1848-1854.