Marc Kupietz¹ · Peter Leinen² · Nils Diewald¹
¹Leibniz Institute for the German Language (IDS)
²German National Library (DNB)
Workshop on Comparable and Interoperable Corpora
of Academic Texts @CLARIN2024, Barcelona 2024-10-17
established in 1964
aims to serve as an empirical basis for German linguistics
for a very broad range of applications
samples the current use of written German
since ~1956
is continuously expanded (now ~60 billion words)
takes a primordial sample approach (Kupietz et al. 2010)
invites users to define their own sub-corpora
relative sizes don’t matter
but sparsely populated strata can hurt
there are many of these, most notably:
fiction texts
academic texts
our research data is usually affected by third parties' rights
authors, speakers, publishers …
all not part of the scientific community (!)
affected rights:
intellectual property rights
general personality rights
overrides licence contracts (unlike other exceptions)
German transposition allows for:
sharing corpora within research projects
storing them as long as necessary for the purposes of scientific research or the verification of scientific findings
remuneration to collecting societies no longer required
however:
corpus data still can only be shared within a »specifically delimited circle of persons« and not within the whole linguistic community
models like:
I make my corpus available to you and in turn, you make your corpus available to me
we make all our corpora available since the taxpayers financed them
lead to nothing:
my corpora = your corpora = our corpora = our corpora, which the taxpayer paid = {}
use copyright exceptions
wait for expiration of copyright
try to stay below the threshold of originality
(conclude and) use licences
(technical challenges)
relatively high proportion of CC licences for PhD theses
apart from that, licensing is very costly:
individual agreements with each author required
licensing with publishers costly
solve legal problems by infrastructural means
§ 14 DNBG (German National Library Act): Mandatory Digital Deposit
publishers in Germany are required by law to submit copies of their published electronic books to DNB
in addition the DNB must provide access to the books
comparable laws in some other EU countries,
but digital deposit is often not mandatory (see Roudik et al. 2018)
run a corpus query system at the DNB that provides access to the deposited e-books for linguists and related parties
publicly accessible under https://korap.dnb.de/
no login required
restriction: only ~40 words per KWIC
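The ~40-word restriction can be pictured as a budget shared between the keyword and its context. The following is only a minimal sketch of such a capped KWIC window, assuming whitespace-style tokens; the function name `kwic` and the way the budget is split are hypothetical and do not describe how the KorAP instance at the DNB enforces the limit.

```python
def kwic(tokens, match_index, max_words=40):
    """Return (left, keyword, right) with at most max_words tokens in total."""
    if not 0 <= match_index < len(tokens):
        raise IndexError("match_index outside the token list")
    budget = max(max_words - 1, 0)  # tokens available for context
    left_avail = match_index
    right_avail = len(tokens) - match_index - 1
    # Start with roughly half the budget on each side ...
    left_size = min(left_avail, budget // 2)
    right_size = min(right_avail, budget - left_size)
    # ... and give budget the right side could not use back to the left.
    left_size = min(left_avail, budget - right_size)
    left = tokens[match_index - left_size:match_index]
    right = tokens[match_index + 1:match_index + 1 + right_size]
    return left, tokens[match_index], right


tokens = [f"w{i}" for i in range(100)]
left, keyword, right = kwic(tokens, 50)
print(len(left), keyword, len(right))
```

A match near the start or end of a text simply gets more context on the other side, so the total never exceeds the cap.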
in Germany the DNB is a partner in the National Research Data Infrastructure (NFDI) Text+
making the data hosted at the DNB as useful as possible to the scientific community is strongly in the DNB's own interest
no need to convince them
in addition:
very good connection between DNB and the German Publishers and Booksellers Association
top legal expertise available at the DNB, to make things possible that are legally possible
offering users an attractive, expanding and broad range of services for their research
strengthening the role as a partner to the scientific community
reaching new groups of users through expanded services
improving access to data especially for machine analyses
aim: provide a corresponding academic corpus to the fiction corpus
start with 10,000 doctoral dissertations
advantages:
no multi-column layout, article breaks, …
homogeneous
350,000 dissertations collected by the DNB as part of the DissOnline project
planned stratified sampling by:
domain (DDC top level)
year of publication
to arrive at 10,000 dissertations
what domains would make sense?
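One way the planned sampling could look is sketched below. It assumes proportional allocation over (domain, year) strata with at least one document per non-empty stratum; the slides do not fix the allocation scheme, and the record fields `ddc` and `year` are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_sample(records, target, seed=0):
    """Sample about `target` records, proportionally per (domain, year) stratum."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["ddc"], rec["year"])].append(rec)
    total = len(records)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional share, but never drop a non-empty stratum entirely.
        k = max(1, round(target * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Toy population: 300 records in four strata of very different size.
sizes = {("300", 2000): 150, ("300", 2001): 90, ("500", 2000): 50, ("500", 2001): 10}
records = [{"ddc": d, "year": y, "n": i} for (d, y), n in sizes.items() for i in range(n)]
sample = stratified_sample(records, target=30)
print(len(sample))
```

Because of rounding and the one-per-stratum floor, the result only approximates the target size in general; with very many sparse strata a different allocation rule would be needed.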
~100 PhD theses as PDF
year of publication 2000-2020
DDC Top-Level Domain | # |
---|---|
Computer Science | 6 |
Philosophy and psychology | 3 |
Religion | 1 |
Social Sciences | 96 |
Language | 1 |
Science | 1 |
Technology | 11 |
Arts and recreation | 5 |
History and geography | 3 |
Step | user+sys | clock | est. for 10,000 |
---|---|---|---|
PDF to TEI P5 XML (GROBID on CPUs) | 4h17m | 15m26s | |
P5 XML to KorAP index including tokenization, lemmatization, POS tagging with TreeTagger + morphological tagging with MarMot, and Malt dependency annotation | 2h22m | 3m17s | |
Total | | 18m43s | 31h11m40s |
(on a server with 96 CPUs and 1.5 TB RAM)
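The estimate for 10,000 dissertations follows from the measured clock times by linear scaling. The sketch below reproduces that arithmetic, assuming the test set holds exactly 100 theses (the slides say ~100) and that processing time grows linearly with the number of documents.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def fmt(total_seconds):
    """Format seconds in the table's 'XhYYmZZs' style."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{seconds:02d}s"


grobid_clock = to_seconds(minutes=15, seconds=26)   # PDF to TEI P5 XML
indexing_clock = to_seconds(minutes=3, seconds=17)  # P5 XML to KorAP index
total_clock = grobid_clock + indexing_clock

scale = 10_000 // 100  # assumed: 100 test theses scaled up to 10,000
estimate = total_clock * scale

print(fmt(total_clock))  # 0h18m43s
print(fmt(estimate))     # 31h11m40s
```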
currently: battery of CI tests
What are realistic approaches?
if you want to do this regularly
without funding or staff
for CI and unit tests with XSLT
federation level:
support for CLARIN FCS integration using SRU protocol
(no authorization required)
UI & API level:
query languages: Cosmas, Poliqarp+, CQP, CQL, CQL v1.2, FCSQL
client libraries for R (➞ tidyverse) and Python (➞ pandas)
de facto interoperability: more than 50,000 registered users
domain classification: DDC from DNB API
intermediate levels: KorAP XML, CoNLL-U (import & export)
annotations: currently STTS, TIGER – UPOS + UD planned
corpus source level: TEI compliant encoding and metadata
source code level: GNU Make, GROBID, Java, XSLT
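At the federation level, an SRU request is an ordinary HTTP GET with a few standard parameters. The following sketch builds an SRU 1.2 `searchRetrieve` URL; the endpoint `https://fcs.example.org/sru` is a placeholder and not the address of any KorAP instance, and the helper name `sru_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def sru_search_url(endpoint, query, maximum_records=10, start_record=1):
    """Build an SRU 1.2 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    # urlencode percent-encodes the query so quotes and spaces survive transport.
    return f"{endpoint}?{urlencode(params)}"


# Placeholder endpoint (assumption); a real FCS endpoint URL would go here.
url = sru_search_url("https://fcs.example.org/sru", '"Korpus"')
print(url)
```

Since no authorization is required at this level, such a URL can be fetched directly by an aggregator or a script.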
structured around ten main classes covering all domains of knowledge
Class 000 – Computer science, information, and general works
Class 100 – Philosophy and psychology
Class 200 – Religion
Class 300 – Social sciences
Class 400 – Language
Class 500 – Science
Class 600 – Technology
Class 700 – Arts and recreation
Class 800 – Literature
Class 900 – History and geography
each main class is further structured into ten hierarchical divisions
each division having ten sections of increasing specificity
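Because the hierarchy is encoded positionally in the notation, the top-level domain used for stratification can be read off the first digit. A minimal sketch, assuming notations with three digits before an optional decimal point; the function name `ddc_levels` is hypothetical.

```python
DDC_MAIN_CLASSES = {
    "0": "Computer science, information, and general works",
    "1": "Philosophy and psychology",
    "2": "Religion",
    "3": "Social sciences",
    "4": "Language",
    "5": "Science",
    "6": "Technology",
    "7": "Arts and recreation",
    "8": "Literature",
    "9": "History and geography",
}


def ddc_levels(notation):
    """Split a DDC notation such as '530.12' into its three top levels."""
    digits = notation.strip().split(".")[0]
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected three digits before the point: {notation!r}")
    return {
        "main_class": digits[0] + "00",  # first digit: one of ten main classes
        "label": DDC_MAIN_CLASSES[digits[0]],
        "division": digits[:2] + "0",    # second digit: one of ten divisions
        "section": digits,               # third digit: one of ten sections
    }


print(ddc_levels("530.12"))
```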
open initiative established in 2012
re-use existing national and reference corpora
(no corpus data is moved)
define pairwise comparable virtual corpora dynamically
based on metadata properties
integrated corpora:
DeReKo, CoRoLa (Cosma & Kupietz 2019), HNC (Váradi 2002)
WIP: BNRC (Spassova 2023), NKJP, Kielipankki
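The idea of defining comparable virtual corpora from metadata, without moving any data, can be illustrated with a small in-memory filter. This is only a sketch: the metadata fields `text_type` and `year` and the toy documents are hypothetical, and in EuReCo the selection happens inside each corpus's own KorAP instance rather than on local lists.

```python
def virtual_corpus(documents, **constraints):
    """Select documents whose metadata satisfies every constraint."""
    return [
        doc for doc in documents
        if all(doc.get(field) in allowed for field, allowed in constraints.items())
    ]


# Toy metadata standing in for two separately hosted corpora (assumption).
german = [
    {"id": "de-1", "text_type": "dissertation", "year": 2015},
    {"id": "de-2", "text_type": "newspaper", "year": 2015},
    {"id": "de-3", "text_type": "dissertation", "year": 1995},
]
romanian = [
    {"id": "ro-1", "text_type": "dissertation", "year": 2010},
    {"id": "ro-2", "text_type": "fiction", "year": 2012},
]

# The same metadata criteria applied to both sides yield a comparable pair.
criteria = {"text_type": {"dissertation"}, "year": range(2000, 2021)}
comparable_pair = (
    virtual_corpus(german, **criteria),
    virtual_corpus(romanian, **criteria),
)
print([[doc["id"] for doc in side] for side in comparable_pair])
```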
CLARIN project proposal planned for 2025
decided at the EuReCo workshop @ CaC 2023
integrate other academic corpora into EuReCo
maybe following the national library cooperation route
however: digital deposit in many countries not mandatory
also via Conference of European National Librarians (CENL) Dialogue Forum National Libraries as Data
possible starting point:
https://www.clarin.eu/resource-families/corpora-academic-texts
step 1: build and provide access to a 10,000-doctoral-dissertation corpus
approach: infrastructural solution to legal obstacles:
run KorAP instance at German National Library (DNB) to serve all data from there
current state:
~100 test PhD theses available via KorAP in 1st version
next steps:
stratified sampling of 10,000 dissertations
refine metadata extraction & improve evaluation
add other kinds of academic texts
integrate multiple backend results in one KorAP UI
(And many thanks to GROBID 🙂!)
Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):
KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Fundacja Uniwersytetu im. A., pp. 586-587.
Barbu Mititelu, V., Tufiş, D., Irimia, E. (2018):
The Reference Corpus of the Contemporary Romanian Language (CoRoLa), in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan.
Cosma, Ruxandra/Kupietz, Marc (2019):
On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române.
Diewald, Nils/Barbu Mititelu, Verginica/Kupietz, Marc (2019):
The KorAP user interface. Accessing CoRoLa via KorAP. In: Cosma, Ruxandra/Kupietz, Marc (eds.): On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române, pp. 265-277.
Gray, Jim (2003):
Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.
GROBID (2008-2024):
GitHub. https://github.com/kermitt2/grobid
Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020):
Recent developments in the European Reference Corpus EuReCo. In: Granger, Sylviane/Lefer, Marie-Aude (eds.): Translating and Comparing Languages: Corpus-based Insights. Selected Proceedings of the Fifth Using Corpora in Contrastive and Translation Studies Conference. (= Corpora and Language in Use Proceedings 6). Louvain-la-Neuve: Presses universitaires de Louvain, pp. 257–273.
Kupietz, Marc/Diewald, Nils/Margaretha, Eliza (2022):
Building Paths to Corpus Data - A multi-level least effort and maximum return approach. In: Fišer, Darja/Witt, Andreas (eds.): CLARIN. The Infrastructure for Language Resources. De Gruyter.
Kupietz, Marc/Barbaresi, Adrien/Cermakova, Anna/Czachor, Małgorzata/Diewald, Nils/Ebeling, Jarle/Górski, Rafał L./Margaretha, Eliza/Kirk, John/Křen, Michal/Lüngen, Harald/Ebeling, Signe Oksefjell/Ó Meachair, Mícheál/Pisetta, Ines/Uí Dhonnchadha, Elaine/Vogel, Friedemann/Wilm, Rebecca/Xu, Jiajin/Yaddehige, Rameela (2023):
News from the International Comparable Corpus: First launch of ICC written. In: Trawiński, Beata/Kupietz, Marc/Proost, Kristel/Zinken, Jörg (eds.): 10th International Contrastive Linguistics Conference (ICLC-10). Book of Abstracts. Mannheim: IDS-Verlag · Leibniz-Institut für Deutsche Sprache.
Kupietz, Marc/Banski, Piotr/Diewald, Nils/Trawinski, Beata/Witt, Andreas (2024):
EuReCo: Not Building and Yet Using Federated Comparable Corpora for Cross-Linguistic Research. In: Zweigenbaum, Pierre/Rapp, Reinhard/Sharoff, Serge (eds.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024. Torino, Italia: ELRA and ICCL, pp. 94–103. https://aclanthology.org/2024.bucc-1.10.
Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):
The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta/Choukri, Khalid/Maegaard, Bente/Mariani, Joseph/Odijk, Jan/Piperidis, Stelios/Rosner, Mike/Tapias, Daniel (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), pp. 1848-1854.
Roudik, P., Buchanan, K., Ahmad, T., Zhang, L., Isajanyan, N., Boring, N., Gesley, J., Levush, R., Figueroa, D., Umeda, S., Hofverberg, E., Rodriguez-Ferrand, G., Feikert-Ahalt, C., & Law Library of Congress (U.S.) (eds.). (2018):
Digital legal deposit in selected jurisdictions. The Law Library of Congress, Global Legal Research Directorate. https://www.loc.gov/item/2018299330
Spassova, Lora (2023):
Integrating a Large Bulgarian Corpus into the European Reference Corpus EuReCo. Bachelor Thesis. University of Düsseldorf.
Váradi, Tamás (2002):
The Hungarian National Corpus, in: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02). European Language Resources Association (ELRA), Las Palmas, Canary Islands - Spain.