Marc Kupietz¹ · Peter Leinen² · Nils Diewald¹
¹Leibniz-Institute for the German Language (IDS)
²German National Library (DNB)

Towards a Very Large German Academic Corpus

Step 1: Building and Making Available a Corpus of 10,000 Doctoral Dissertations

Workshop on Comparable and Interoperable Corpora
of Academic Texts @CLARIN2024, Barcelona 2024-10-17

Overview

  1. Introduction: Background

  2. Legal Challenges

  3. Data and Methodology

  4. Interoperability

  5. Comparable Corpora & Cross-Linguistic Research

  6. Summary & Conclusions

Work in Progress Timeline

1. Introduction: Background

German Reference Corpus DeReKo
IDS corpora of contemporary written German (Kupietz et al. 2010, 2018)

  • established in 1964

  • aims to serve as an empirical basis for German linguistics

    • for a very broad range of applications

  • samples the current use of written German

    • since ~1956

    • is continuously expanded (now ~60 billion words)

  • takes a primordial sample approach (Kupietz et al. 2010)

    • invites users to define their own sub-corpora

Stratum sizes in DeReKo

  • relative sizes don’t matter

  • but sparsely populated strata can hurt

  • there are many such strata, most notably:

    • fiction texts

    • academic texts

2. Legal Challenges

Linguistics is in a special situation

  • our research data is usually affected by third parties' rights

    • authors, speakers, publishers …

    • none of whom are part of the scientific community (!)

  • affected rights:

    • intellectual property rights

    • general personality rights

This will not change fundamentally
Ultimately: collision of fundamental rights

Text and Data Mining Exception
EU directive 2019/790 on copyright and related rights in the Digital Single Market

  • overrides licence contracts (unlike other exceptions)

  • German transposition allows for:

    • sharing corpora within research projects

    • storing them as long as necessary for the purposes of scientific research or the verification of scientific findings

    • remuneration to collecting societies no longer required

  • however:

    • corpus data still can only be shared within a »specifically delimited circle of persons« and not within the whole linguistic community

➞ Open data models still not applicable

  • models like:

    • I make my corpus available to you and in turn, you make your corpus available to me

    • we make all our corpora available since the taxpayers financed them

  • lead to nothing:

    • my corpora = your corpora = our corpora = our corpora, which the taxpayer paid = {}

How to Share / Provide Corpora Legally and Practically?

  1. use copyright exceptions

  2. wait for expiration of copyright

  3. try to stay below the threshold of originality

  4. (conclude and) use licences

Special challenges for academic texts

  • (technical challenges)

  • relatively high proportion of CC licenses for PhD theses

  • apart from that licensing very costly:
    individual agreements with each author required

  • licensing with publishers costly

How to Share / Provide Corpora Legally and Practically?

  1. use copyright exceptions

  2. wait for expiration of copyright

  3. try to stay below the threshold of originality

  4. (conclude and) use licences

  5. solve legal problems by infrastructural means

If the data is not allowed to move …
(Gray 2003)


… build some technology
to make it usable from where it is (Kupietz et al. 2010; 2021)


»National Library as Corpus« – »KorAP@DNB«
latest cooperation (IDS+DNB) under this motto

  • § 14 DNBG (German National Library Act): Mandatory Digital Deposit

    • publishers in Germany are required by law to submit copies of their published electronic books to DNB

    • in addition the DNB must provide access to the books

    • comparable laws in some other EU countries,
      but digital deposit often not mandatory (see Roudik et al. 2018)

  • run a corpus query system at the DNB that provides access to the deposited e-books for linguists and related parties

1ˢᵗ project: German Fiction Corpus DeLiKo@DNB
sample of 26,000 books via KorAP@DNB (publicly launched last week)

Why is a National Library interested in such a cooperation?
(Or: How can I convince my National Library)

  • in Germany the DNB is partner in the National Research Data Infrastructure (NFDI) Text+

  • making the data hosted at the DNB as useful as possible to the scientific community is strongly in the DNB's own interest

  • no need to convince them

  • in addition:

    • very good connection between DNB and the German Publishers and Booksellers Association

    • top legal expertise available at the DNB, to make things possible that are legally possible

Motivations of the DNB

  • offering users an attractive, expanding and broad range of services for their research

  • strengthening the role as a partner to the scientific community

  • reaching new groups of users through expanded services

  • improving access to data especially for machine analyses

German Academic Corpus DeFoKo@DNB
today's focus

  • aim: provide an academic corpus as a counterpart to the fiction corpus

  • start with 10,000 doctoral dissertations

    • advantages:

      • no multi-column layout, article breaks, …

      • homogeneous

3. Data and Methodology

Data

  • 350,000 dissertations collected by the DNB as part of the DissOnline project

  • planned stratified sampling by:

    • domain (DDC top level)

    • year of publication

  • to arrive at 10,000 dissertations

  • what domains would make sense?
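
The planned sampling could be sketched as follows. This is only an illustration, assuming proportional allocation across (domain, year) strata with largest-remainder rounding; the allocation scheme, the record keys `ddc` and `year`, and the function name `stratified_sample` are assumptions, not the project's actual procedure.

```python
import random
from collections import defaultdict

def stratified_sample(records, target, seed=0):
    """Draw `target` records, proportionally from each (ddc, year) stratum.

    `records` is a list of dicts with the hypothetical keys "ddc" and "year".
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["ddc"], rec["year"])].append(rec)

    total = len(records)
    target = min(target, total)
    # Ideal (fractional) quota per stratum under proportional allocation.
    quotas = {key: len(items) * target / total for key, items in strata.items()}
    counts = {key: int(q) for key, q in quotas.items()}
    # Largest-remainder rounding: hand out leftover slots by fractional part.
    leftover = target - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1

    sample = []
    for key, items in strata.items():
        sample.extend(rng.sample(items, counts[key]))
    return sample
```

Proportional allocation mirrors the population; a design that guards against sparsely populated strata could instead enforce a minimum count per stratum.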

Current Test Sample Composition

  • ~100 PhD theses as PDF

  • year of publication 2000-2020

DDC Top-Level Domain          #
Computer Science              6
Philosophy and psychology     3
Religion                      1
Social Sciences              96
Language                      1
Science                       1
Technology                   11
Arts and recreation           5
History and geography         3

Conversion PDF ➞ TEI I5

TEI I5 ➞ KorAP
including tokenization and annotation

Benchmark for converting the 99-dissertation test set
1.5 million tokens

Step                                  user+sys   clock    est. for 10,000
1. PDF to TEI P5 XML                  4h17m      15m26s
   (GROBID on CPUs)
2. P5 XML to KorAP index              2h22m      3m17s
   (including tokenization, lemmatization, POS tagging with TreeTagger + morphological tagging with MarMot, and Malt dependency annotation)
Total                                            18m43s   31h11m40s

(on a server with 96 CPUs and 1.5 TB RAM)
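
The estimate for 10,000 dissertations is consistent with a linear extrapolation of the measured wall-clock time, counting the test set as roughly 100 dissertations (a scale factor of 100). The sketch below reproduces that arithmetic; the linear-scaling assumption and the helper names are illustrative and not taken from the authors' tooling.

```python
import re

def to_seconds(duration):
    """Convert strings such as '15m26s' or '31h11m40s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", duration)
    if not match or not duration:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

def to_duration(total_seconds):
    """Format seconds as e.g. '31h11m40s'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h{minutes}m{seconds}s"

# Wall-clock times of the two pipeline steps, as measured on the test set.
clock = to_seconds("15m26s") + to_seconds("3m17s")
scale = 10_000 // 100   # assumption: test set counted as ~100 dissertations

print(to_duration(clock))          # 0h18m43s
print(to_duration(clock * scale))  # 31h11m40s
```

With the exact test-set size of 99, the same extrapolation gives about 31h30m, so the figure is a rough estimate either way.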

Test Corpus accessible via KorAP

Expanded annotations

Expanded metadata

Evaluation & Quality Control

  • currently: battery of CI tests

  • What are realistic approaches?

    • if you want to do this regularly

    • without funding or staff

    • for CI and unit tests with XSLT

4. Interoperability

Interoperability on different levels
(Kupietz et al. 2022)

  • federation level:

    • support for CLARIN FCS integration using SRU protocol
      (no authorization required)

  • UI & API level:

    • QLs (query languages): Cosmas, Poliqarp+, CQP, CQL, CQL v1.2, FCSQL

    • client libraries for R (➞ tidyverse) and Python (➞ pandas)

    • de facto interoperability: more than 50,000 registered users

  • domain classification: DDC from DNB API

  • intermediate levels: KorAP XML, CoNLL-U (import & export)

  • annotations: currently STTS, TIGER – UPOS + UD planned

  • corpus source level: TEI compliant encoding and metadata

  • source code level: GNU Make, GROBID, Java, XSLT
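
As an illustration of the intermediate level, the following sketch reads CoNLL-U, the tab-separated format with ten columns per token defined by Universal Dependencies. It is a minimal stand-alone reader, not KorAP's own converter; the sample sentence and the function name `read_conllu` are made up for illustration.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")

def read_conllu(text):
    """Yield sentences as lists of token dicts from CoNLL-U text."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):    # comment / sentence-level metadata
            continue
        else:
            values = line.split("\t")
            if len(values) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(values)}")
            sentence.append(dict(zip(CONLLU_FIELDS, values)))
    if sentence:                      # input may lack a trailing blank line
        yield sentence

# Hypothetical two-token example with STTS tags in the XPOS column.
sample = (
    "# text = Danke sehr\n"
    "1\tDanke\tdanke\tINTJ\tITJ\t_\t0\troot\t_\t_\n"
    "2\tsehr\tsehr\tADV\tADV\t_\t1\tadvmod\t_\t_\n"
)
sentences = list(read_conllu(sample))
print([tok["form"] for tok in sentences[0]])
```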

Interoperable Domain Classification: DDC
Dewey Decimal Classification system – the library standard

  • structured around ten main classes covering all domains of knowledge

      • Class 000 – Computer science, information, and general works

      • Class 100 – Philosophy and psychology

      • Class 200 – Religion

      • Class 300 – Social sciences

      • Class 400 – Language

      • Class 500 – Science

      • Class 600 – Technology

      • Class 700 – Arts and recreation

      • Class 800 – Literature

      • Class 900 – History and geography

  • each main class is further structured into ten hierarchical divisions

  • each division having ten sections of increasing specificity
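
Because the main class is given by the first digit of a three-digit DDC number, mapping a notation to its top-level domain is straightforward. A minimal sketch, assuming notations arrive as strings such as "330" or "004.6"; the function name `ddc_main_class` is hypothetical, and the DNB metadata may encode DDC differently.

```python
DDC_MAIN_CLASSES = {
    "0": "Computer science, information, and general works",
    "1": "Philosophy and psychology",
    "2": "Religion",
    "3": "Social sciences",
    "4": "Language",
    "5": "Science",
    "6": "Technology",
    "7": "Arts and recreation",
    "8": "Literature",
    "9": "History and geography",
}

def ddc_main_class(notation):
    """Return (class number, label) for a DDC notation such as '004.6'."""
    digits = notation.strip().split(".")[0]
    if len(digits) != 3 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a DDC notation: {notation!r}")
    first = digits[0]
    return f"{first}00", DDC_MAIN_CLASSES[first]

print(ddc_main_class("004.6"))
print(ddc_main_class("330"))
```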

KorAP supports extensible set of Query Languages
(Bingel & Diewald 2015)


5. Comparable Corpora & Cross-Linguistic Research

European Reference Corpus EuReCo
(Kupietz et al. 2024, 2020, 2017)

  • open initiative established in 2012

  • re-use existing national and reference corpora

    • (no corpus data is moved)

  • define pairwise comparable virtual corpora dynamically

    • based on metadata properties

  • integrated corpora:

    • DeReKo, CoRoLa (Cosma & Kupietz 2019), HNC (Váradi 2002)

    • WIP: BNRC (Spassova 2023), NKJP, Kielipankki

  • CLARIN project proposal planned for 2025

    • decided at EuReCo-workshop@CaC 2023
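
The idea of defining comparable virtual corpora from metadata, without moving any data, can be illustrated with a small sketch. It assumes each corpus exposes per-text metadata as dictionaries; the field names (`domain`, `year`), the sample records and the helper `virtual_corpus` are hypothetical and do not reflect KorAP's actual query syntax or API.

```python
def virtual_corpus(metadata, **constraints):
    """Select text IDs whose metadata satisfy all constraints.

    A constraint is either a single value or a (low, high) inclusive range.
    """
    def matches(record):
        for field, wanted in constraints.items():
            value = record.get(field)
            if isinstance(wanted, tuple):
                low, high = wanted
                if value is None or not low <= value <= high:
                    return False
            elif value != wanted:
                return False
        return True
    return [rec["id"] for rec in metadata if matches(rec)]

# Hypothetical metadata of two corpora that stay at their home institutions.
german = [
    {"id": "de1", "domain": "Science", "year": 2015},
    {"id": "de2", "domain": "Fiction", "year": 2015},
]
romanian = [
    {"id": "ro1", "domain": "Science", "year": 2012},
    {"id": "ro2", "domain": "Science", "year": 1995},
]

# The same metadata definition applied to both sides yields a comparable pair.
definition = {"domain": "Science", "year": (2010, 2020)}
pair = (virtual_corpus(german, **definition),
        virtual_corpus(romanian, **definition))
print(pair)
```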

Aim: Comparable Academic Corpora

  • integrate other academic corpora into EuReCo

  • maybe following the national library cooperation route

    • however: digital deposit in many countries not mandatory

  • also via Conference of European National Librarians (CENL) Dialogue Forum National Libraries as Data

  • possible starting point:
    https://www.clarin.eu/resource-families/corpora-academic-texts

6. Summary & Conclusions

Objective: Very Large German Academic Corpus
»National Library as Corpus«

  • step 1: build and provide access to a 10,000-doctoral-dissertation corpus

  • approach: infrastructural solution to legal obstacles:

    • run KorAP instance at German National Library (DNB) to serve all data from there

  • current state:

    • ~100 test PhD theses available via KorAP in 1st version

  • next steps:

    • stratified sampling of 10,000 dissertations

    • refine metadata extraction & improve evaluation

    • add other kinds of academic texts

    • integrate multiple backend results in one KorAP UI

Thank you very much for your attention!

(And many thanks to GROBID 🙂!)

References

References I

Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):

KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Fundacja Uniwersytetu im. A., pp. 586-587.

Barbu Mititelu, V., Tufiş, D., Irimia, E., (2018):

The Reference Corpus of the Contemporary Romanian Language (CoRoLa), in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan.

Cosma, Ruxandra/Kupietz, Marc (2019):

On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române.

Diewald, Nils/Barbu Mititelu, Verginica/Kupietz, Marc (2019):

The KorAP user interface. Accessing CoRoLa via KorAP. In: Cosma, Ruxandra/Kupietz, Marc (eds.): On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române, 2019. 265-277.

Gray, Jim (2003):

Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.

GROBID (2008-2024):

GitHub. https://github.com/kermitt2/grobid

Kupietz, Marc/Diewald, Nils/Margaretha, Eliza (2022):

Building Paths to Corpus Data - A multi-level least effort and maximum return approach. In: Fišer, Darja/Witt, Andreas (eds.): CLARIN. The Infrastructure for Language Resources. DeGruyter.

Kupietz, Marc/Adrien Barbaresi/Anna Cermakova/Małgorzata Czachor/Nils Diewald/Jarle Ebeling/Rafał L. Górski/Eliza Margaretha/John Kirk/Michal Křen/Harald Lüngen/Signe Oksefjell Ebeling/Mícheál Ó Meachair/Ines Pisetta/Elaine Uí Dhonnchadha/Friedemann Vogel/Rebecca Wilm/Jiajin Xu/Rameela Yaddehige (2023):

News from the International Comparable Corpus: First launch of ICC written. In: Trawiński, Beata / Kupietz, Marc / Proost, Kristel / Zinken, Jörg (eds): 10th International Contrastive Linguistics Conference (ICLC-10). Book of Abstracts. Mannheim: IDS-Verlag · Leibniz-Institut für Deutsche Sprache.

References II

Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020):

Recent developments in the European Reference Corpus EuReCo. In: Translating and Comparing Languages: Corpus-based Insights. Selected Proceedings of the Fifth Using Corpora in Contrastive and Translation Studies Conference. Louvain-la-Neuve: Presses universitaires de Louvain, pp. 257–273.

Kupietz, Marc/Banski, Piotr/Diewald, Nils/Trawinski, Beata/Witt, Andreas (2024):

EuReCo: Not Building and Yet Using Federated Comparable Corpora for Cross-Linguistic Research. In: Zweigenbaum, Pierre/Rapp, Reinhard/Sharoff, Serge (eds.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024. Torino, Italia: ELRA and ICCL, pp. 94–103. https://aclanthology.org/2024.bucc-1.10.

Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):

The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta/Choukri, Khalid/Maegaard, Bente/Mariani, Joseph/Odijk, Jan/Piperidis, Stelios/Rosner, Mike/Tapias, Daniel (eds.): Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), pp. 1848-1854.

Roudik, P., Buchanan, K., Ahmad, T., Zhang, L., Isajanyan, N., Boring, N., Gesley, J., Levush, R., Figueroa, D., Umeda, S., Hofverberg, E., Rodriguez-Ferrand, G., Feikert-Ahalt, C., & Law Library of Congress (U.S.) (eds.). (2018):

Digital legal deposit in selected jurisdictions. The Law Library of Congress, Global Legal Research Directorate. https://www.loc.gov/item/2018299330

Spassova, Lora (2023):

Integrating a Large Bulgarian Corpus into the European Reference Corpus EuReCo. Bachelor Thesis. University of Düsseldorf.

Váradi, Tamás (2002):

The Hungarian National Corpus, in: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02). European Language Resources Association (ELRA), Las Palmas, Canary Islands - Spain.