Marc Kupietz¹ · Peter Leinen² · Nils Diewald¹
¹Leibniz Institute for the German Language (IDS)
²German National Library (DNB)

Towards a Very Large German Academic Corpus

Step 1: Building and Making Available a Corpus of 10,000 Doctoral Dissertations

Workshop on Comparable and Interoperable Corpora
of Academic Texts @CLARIN2024, Barcelona 2024-10-17

Overview

Introduction: Background
Legal Challenges
Data and Methodology
Interoperability
Comparable Corpora & Cross-Linguistic Research
Summary & Conclusions

Work in Progress Timeline

1. Introduction: Background

German Reference Corpus DeReKo
IDS corpora of contemporary written German (Kupietz et al. 2010, 2018)

established in 1964
aims to serve as an empirical basis for German linguistics
- for a very broad range of applications
samples the current use of written German
- since ~1956
- is continuously expanded (now ~60 billion words)
takes a primordial sample approach (Kupietz et al. 2010)
- invites users to define their own sub-corpora

Stratum sizes in DeReKo

relative sizes don’t matter
but sparsely populated strata can hurt
there are many such strata, most notably:
- fiction texts
- academic texts

2. Legal Challenges

Linguistics is in a special situation

our research data is usually affected by third parties' rights
- authors, speakers, publishers …
- all not part of the scientific community (!)
affected rights:
- intellectual property rights
- general personality rights

This will not change fundamentally
Ultimately: collision of fundamental rights

Text and Data Mining Exception
EU directive 2019/790 on copyright and related rights in the Digital Single Market

overrides licence contracts (unlike other exceptions)
German transposition allows for:
- sharing corpora within research projects
- storing them as long as necessary for the purposes of scientific research or the verification of scientific findings
- remuneration to collecting societies no longer required
however:
- corpus data still can only be shared within a »specifically delimited circle of persons« and not within the whole linguistic community

➞ Open-Data-Models still not applicable

models like:
- I make my corpus available to you and in turn, you make your corpus available to me
- we make all our corpora available since the taxpayers financed them
lead to nothing:
- my corpora = your corpora = our corpora = our corpora, which the taxpayer paid = {}

How to Share / Provide Corpora Legally and Practically?

use copyright exceptions
wait for expiration of copyright
try to stay below the threshold of originality
(conclude and) use licences

Special challenges for academic texts

(technical challenges)
relatively high proportion of CC licenses for PhD theses
apart from that licensing very costly:
individual agreements with each author required
licensing with publishers costly

How to Share / Provide Corpora Legally and Practically?

use copyright exceptions
wait for expiration of copyright
try to stay below the threshold of originality
(conclude and) use licences
solve legal problems by infrastructural means

If the data is not allowed to move …
(Gray 2003)

… build some technology
to make it usable from where it is (Kupietz et al. 2010; 2021)

»National Library as Corpus« – »KorAP@DNB«
latest cooperation (IDS+DNB) under this motto

§ 14 DNBG (German National Library Act): Mandatory Digital Deposit
- publishers in Germany are required by law to submit copies of their published electronic books to DNB
- in addition the DNB must provide access to the books
- comparable laws in some other EU countries,
  but digital deposit often not mandatory (see Roudik et al. 2017)
run a corpus query system at the DNB that provides access to the deposited e-books for linguists and related parties

1ˢᵗ project: German Fiction Corpus DeLiKo@DNB
sample of 26,000 books via KorAP@DNB (publicly launched last week)

publicly accessible under https://korap.dnb.de/
no login required
restriction: only ~40 words per KWIC

Why is a National Library interested in such a cooperation?
(Or: How can I convince my National Library)

in Germany the DNB is partner in the National Research Data Infrastructure (NFDI) Text+
making the data hosted at the DNB as useful as possible to the scientific community is strongly in the DNB's own interest
no need to convince them
in addition:
- very good connection between DNB and the German Publishers and Booksellers Association
- top legal expertise available at the DNB, to make things possible that are legally possible

Motivations of the DNB

offering users an attractive, expanding and broad range of services for their research
strengthening the role as a partner to the scientific community
reaching new groups of users through expanded services
improving access to data especially for machine analyses

German Academic Corpus DeFoKo@DNB
today's focus

aim: provide a corresponding academic corpus to the fiction corpus
start with 10,000 doctoral dissertations
- advantages:
  - no multi-column layout, article breaks, …
  - homogeneous

3. Data and Methodology

Data

350,000 dissertations collected by the DNB as part of the DissOnline project
planned stratified sampling by:
- domain (DDC top level)
- year of publication
to arrive at 10,000 dissertations
what domains would make sense?
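
The planned sampling could be sketched as follows. This is a minimal illustration, not the project's implementation: the record fields `ddc` and `year`, the function name, and the proportional allocation with a minimum quota per stratum are all assumptions, and the question of which domains to use remains open.

```python
import random
from collections import defaultdict


def stratified_sample(records, n_total, min_per_stratum=1, seed=0):
    """Draw roughly n_total records, stratified by (DDC top level, year).

    Allocation is proportional to stratum size (an assumption), with a
    minimum quota so that sparsely populated strata are not lost.
    """
    rng = random.Random(seed)

    # Group the records into strata keyed by domain and publication year.
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["ddc"], rec["year"])].append(rec)

    population = len(records)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        quota = max(min_per_stratum, round(n_total * len(members) / population))
        # Never ask for more records than the stratum contains.
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

Because of rounding and the minimum quota, the result can deviate slightly from `n_total`; a production version would need to rebalance the quotas.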

Current Test Sample Composition

~100 PhD theses as PDF
year of publication 2000-2020

DDC Top-Level Domain	#
Computer Science	6
Philosophy and psychology	3
Religion	1
Social Sciences	96
Language	1
Science	1
Technology	11
Arts and recreation	5
History and geography	3

Conversion PDF ➞ TEI I5

TEI I5 ➞ KorAP
including tokenization and annotation

Benchmark for converting the 99-dissertation test set
1.5 million tokens

Step	user+sys	clock	est. for 10,000
PDF to TEI P5 XML (GROBID on CPUs)	4h17m	15m26s
TEI P5 XML to KorAP index (tokenization, lemmatization and POS tagging with TreeTagger, morphological tagging with MarMoT, Malt dependency annotation)	2h22m	3m17s
Total		18m43s	31h11m40s
(on a server with 96 CPUs and 1.5 TB RAM)
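
The extrapolation in the table can be reproduced with a few lines of arithmetic. The sketch assumes purely linear scaling of wall-clock time with the number of documents; the helper names are illustrative. The reported 31h11m40s corresponds to a scaling factor of 10,000/100, i.e. treating the test set as roughly 100 documents.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def fmt(seconds):
    """Format a duration in seconds as e.g. '31h11m40s'."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


# Wall-clock times from the benchmark table.
pdf_to_tei = to_seconds(minutes=15, seconds=26)
tei_to_index = to_seconds(minutes=3, seconds=17)
total_clock = pdf_to_tei + tei_to_index

# Linear extrapolation from ~100 test documents to 10,000 (assumption).
estimate = total_clock * 10_000 / 100

print(fmt(total_clock))  # 0h18m43s
print(fmt(estimate))     # 31h11m40s
```

Scaling by 10,000/99 instead would give about 31h30m, so the difference is negligible for planning purposes.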

Test Corpus accessible via KorAP
http://korap.dnb.de/defako

Expanded annotations
http://korap.dnb.de/defako

Expanded metadata
http://korap.dnb.de/defako

Evaluation & Quality Control

currently: battery of CI tests
What are realistic approaches?
- if you want to do this regularly
- without funding or staff
- for CI and unit tests with XSLT
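
One low-cost kind of check that fits into a CI battery is a structural sanity test on each converted document. The following is only a sketch of the idea in Python rather than XSLT: it assumes the converted files use the standard TEI namespace (as GROBID output does), and the function name and the two checks are illustrative, not the tests actually in use.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def check_tei(xml_text, min_paragraphs=1):
    """Return a list of problems for one converted document (empty = passed)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    problems = []

    # The header should carry a non-empty title.
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if title is None or not "".join(title.itertext()).strip():
        problems.append("missing title")

    # The body should contain at least some non-empty paragraphs.
    paragraphs = [
        p for p in root.iterfind(".//tei:body//tei:p", TEI_NS)
        if "".join(p.itertext()).strip()
    ]
    if len(paragraphs) < min_paragraphs:
        problems.append("too few non-empty paragraphs")

    return problems
```

Checks like these catch gross conversion failures cheaply; they do not measure the quality of the extracted text itself.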

4. Interoperability

Interoperability on different levels
(Kupietz et al. 2022)

federation level:
- support for CLARIN FCS integration using SRU protocol
  (no authorization required)
UI & API level:
- QLs: COSMAS II, Poliqarp+, CQP, CQL, CQL v1.2, FCSQL
- client libraries for R (➞ tidyverse) and Python (➞ pandas)
- de facto interoperability: more than 50,000 registered users
domain classification: DDC from DNB API
intermediate levels: KorAP XML, CoNLL-U (import & export)
annotations: currently STTS, TIGER – UPOS + UD planned
corpus source level: TEI compliant encoding and metadata
source code level: GNU Make, GROBID, Java, XSLT
- https://gitlab.ids-mannheim.de/korap4dnb/defako
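
As an illustration of the intermediate level, a CoNLL-U export can be read with a few lines of standard-library Python. This is a minimal sketch: the ten tab-separated columns follow the CoNLL-U format, while the function name and the sample data are made up for the example, and comment lines are simply skipped.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments / metadata
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences
```

With the current annotations, the STTS tags would typically sit in the `xpos` column; once UPOS and UD are added, `upos`, `head` and `deprel` could be filled as well.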

Interoperable Domain Classification: DDC
Dewey Decimal Classification system – the library standard

structured around ten main classes covering all domains of knowledge
  - Class 000 – Computer science, information, and general works
  - Class 100 – Philosophy and psychology
  - Class 200 – Religion
  - Class 300 – Social sciences
  - Class 400 – Language
  - Class 500 – Science
  - Class 600 – Technology
  - Class 700 – Arts and recreation
  - Class 800 – Literature
  - Class 900 – History and geography
each main class is further divided into ten divisions
each division in turn has ten sections of increasing specificity
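
The hierarchy makes it easy to derive the top-level domain from any DDC notation. A small sketch, with a hypothetical function name and the class labels taken from the list above:

```python
DDC_MAIN_CLASSES = {
    0: "Computer science, information, and general works",
    100: "Philosophy and psychology",
    200: "Religion",
    300: "Social sciences",
    400: "Language",
    500: "Science",
    600: "Technology",
    700: "Arts and recreation",
    800: "Literature",
    900: "History and geography",
}


def ddc_levels(notation):
    """Split a DDC notation such as '515.3' into its three hierarchy levels."""
    # Only the three digits before the decimal point define class,
    # division and section.
    digits = notation.split(".")[0].strip().zfill(3)
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"not a DDC notation: {notation!r}")
    main_class = int(digits[0]) * 100
    return {
        "main_class": f"{main_class:03d}",
        "division": f"{int(digits[:2]) * 10:03d}",
        "section": digits,
        "label": DDC_MAIN_CLASSES[main_class],
    }


print(ddc_levels("515.3"))
```

For stratified sampling by domain, only `main_class` is needed; the finer levels remain available for defining sub-corpora.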

KorAP supports extensible set of Query Languages
(Bingel & Diewald 2015)
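
From a client's perspective, the query language is just a parameter of the search request. The sketch below builds such a request URL; the parameter names `q`, `ql` and `count` and the `/search` path reflect the KorAP web API as far as we are aware and should be verified against its documentation, and `api_base` is a placeholder, not a confirmed address of the DNB instance.

```python
from urllib.parse import urlencode


def korap_search_url(api_base, query, query_language="poliqarp", count=25):
    """Build a search URL for a KorAP-style API (parameter names assumed)."""
    params = {"q": query, "ql": query_language, "count": count}
    return f"{api_base.rstrip('/')}/search?{urlencode(params)}"


# Hypothetical API base; replace with the documented address of an instance.
print(korap_search_url("https://example.org/api/v1.0", "Korpus"))
```

Switching to another supported query language would then only mean changing `query_language` and writing the query in that language's syntax.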

5. Comparable Corpora & Cross-Linguistic Research

European Reference Corpus EuReCo
(Kupietz et al. 2024, 2020, 2017)

open initiative established in 2012
re-use existing national and reference corpora
- (no corpus data is moved)
define pairwise comparable virtual corpora dynamically
- based on metadata properties
integrated corpora:
- DeReKo, CoRoLa (Cosma & Kupietz 2019), HNC (Váradi 2002)
- WIP: BNRC (Spassova 2023), NKJP, Kielipankki
CLARIN project proposal planned for 2025
- decided at the EuReCo workshop @ CaC 2023
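
The idea of defining comparable virtual corpora from metadata alone can be illustrated with a deliberately simple sketch. It is not EuReCo's actual procedure: the metadata fields `domain` and `year`, the function name, and the criterion "keep only metadata combinations that occur in both corpora" are assumptions for the example. Only document identifiers are handled; no corpus text is moved.

```python
def comparable_pair(meta_a, meta_b, keys=("domain", "year")):
    """Return the document ids of two virtual corpora that share metadata.

    A document is kept if its combination of metadata values (e.g. domain
    and year) also occurs in the other corpus.
    """
    def combo(doc):
        return tuple(doc[k] for k in keys)

    shared = {combo(d) for d in meta_a} & {combo(d) for d in meta_b}
    ids_a = [d["id"] for d in meta_a if combo(d) in shared]
    ids_b = [d["id"] for d in meta_b if combo(d) in shared]
    return ids_a, ids_b
```

Each resulting id list can then be turned into a virtual corpus definition on the side of the respective corpus provider.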

Aim: Comparable Academic Corpora

integrate other academic corpora into EuReCo
maybe following the national library cooperation route
- however: digital deposit in many countries not mandatory
also via Conference of European National Librarians (CENL) Dialogue Forum National Libraries as Data
possible starting point:
https://www.clarin.eu/resource-families/corpora-academic-texts

6. Summary & Conclusions

Objective: Very Large German Academic Corpus
»National Library as Corpus«

step 1: build and provide access to a 10,000-doctoral-dissertation corpus
approach: infrastructural solution to legal obstacles:
- run KorAP instance at German National Library (DNB) to serve all data from there
current state:
- ~100 test PhD theses available via KorAP in 1st version
next steps:
- stratified sampling of 10,000 dissertations
- refine metadata extraction & improve evaluation
- add other kinds of academic texts
- integrate multiple backend results in one KorAP UI

Thank you very much for your attention!

(And many thanks to GROBID 🙂!)

References

References I

Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):

KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Fundacja Uniwersytetu im. A., pp. 586–587.

Barbu Mititelu, V., Tufiş, D., Irimia, E., (2018):

The Reference Corpus of the Contemporary Romanian Language (CoRoLa), in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan.

Cosma, Ruxandra/Kupietz, Marc (2019):

On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române.

Diewald, Nils/Barbu Mititelu, Verginica/Kupietz, Marc (2019):

The KorAP user interface. Accessing CoRoLa via KorAP. In: Cosma, Ruxandra/Kupietz, Marc (eds.): On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române, 2019. 265-277.

Gray, Jim (2003):

Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.

GROBID (2008-2024):

GitHub. https://github.com/kermitt2/grobid

Kupietz, Marc/Diewald, Nils/Margaretha, Eliza (2022):

Building Paths to Corpus Data - A multi-level least effort and maximum return approach. In: Fišer, Darja/Witt, Andreas (eds.): CLARIN. The Infrastructure for Language Resources. DeGruyter.

Kupietz, Marc/Barbaresi, Adrien/Cermakova, Anna/Czachor, Małgorzata/Diewald, Nils/Ebeling, Jarle/Górski, Rafał L./Margaretha, Eliza/Kirk, John/Křen, Michal/Lüngen, Harald/Ebeling, Signe Oksefjell/Ó Meachair, Mícheál/Pisetta, Ines/Uí Dhonnchadha, Elaine/Vogel, Friedemann/Wilm, Rebecca/Xu, Jiajin/Yaddehige, Rameela (2023):

News from the International Comparable Corpus: First launch of ICC written. In: Trawiński, Beata / Kupietz, Marc / Proost, Kristel / Zinken, Jörg (eds): 10th International Contrastive Linguistics Conference (ICLC-10). Book of Abstracts. Mannheim: IDS-Verlag · Leibniz-Institut für Deutsche Sprache.

References II

Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020):

Recent developments in the European Reference Corpus EuReCo. In: Translating and Comparing Languages: Corpus-based Insights. Selected Proceedings of the Fifth Using Corpora in Contrastive and Translation Studies Conference. Louvain-la-Neuve: Presses universitaires de Louvain, pp. 257–273.

Kupietz, Marc/Bański, Piotr/Diewald, Nils/Trawiński, Beata/Witt, Andreas (2024):

EuReCo: Not Building and Yet Using Federated Comparable Corpora for Cross-Linguistic Research. In: Zweigenbaum, Pierre/Rapp, Reinhard/Sharoff, Serge (eds.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024. Torino, Italia: ELRA and ICCL, pp. 94–103. https://aclanthology.org/2024.bucc-1.10.

Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):

The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta/Choukri, Khalid/Maegaard, Bente/Mariani, Joseph/Odijk, Jan/Piperidis, Stelios/Rosner, Mike/Tapias, Daniel (eds.): Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), pp. 1848–1854.

Roudik, P., Buchanan, K., Ahmad, T., Zhang, L., Isajanyan, N., Boring, N., Gesley, J., Levush, R., Figueroa, D., Umeda, S., Hofverberg, E., Rodriguez-Ferrand, G., Feikert-Ahalt, C., & Law Library of Congress (U.S.) (eds.). (2018):

Digital legal deposit in selected jurisdictions. The Law Library of Congress, Global Legal Research Directorate. https://www.loc.gov/item/2018299330

Spassova, Lora (2023):

Integrating a Large Bulgarian Corpus into the European Reference Corpus EuReCo. Bachelor Thesis. University of Düsseldorf.

Váradi, Tamás (2002):

The Hungarian National Corpus, in: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02). European Language Resources Association (ELRA), Las Palmas, Canary Islands - Spain.
