Marc Kupietz, Nils Diewald, Eliza Margaretha
Leibniz Institute for the German Language
16th Edition of the International Conference on Linguistic Resources and Tools for Natural Language Processing – Iași and online, 2021-12-13
corpora often cannot be copied, because of
IPR and license restrictions
their size
interpreting corpus data can be pretty complex
how to analyze corpora depends on the research question and is itself subject of ongoing research
»curse of dimensionality« (Bellman 1953)
LNRE distributions with linguistically interesting phenomena somewhere in the long tail
having the properties of a social artefact, language is a moving target
it's not possible to implement all desired methods in a corpus analysis platform
Linguistic research data is usually affected by third parties' rights
authors and publishers
not part of the scientific community
affected rights:
intellectual property rights
database rights
general personality rights
open-data-models from other disciplines are not transferable to linguistics
this will not change fundamentally
models from other disciplines:
I make my corpus available to you and in turn, you make your corpus available to me
we make all our corpora available since the taxpayers financed them
lead to nothing:
my corpora = your corpora = our corpora = our corpora, which the taxpayer paid for = {}
...
report on approaches to make corpus data as actually usable as possible
despite the aforementioned challenges
at feasible costs
used at the Leibniz Institute for the German Language (IDS),
in the context of
German Reference Corpus DeReKo
corpus analysis platform KorAP
construction started in 1964
aims to serve as an empirical basis for German linguistics
has ~ 40,000 registered users
samples the current use of written German
since ~1956
is continually expanded
18,000 DeReKo users via COSMAS II platform
mostly German linguists
growing interest in more sophisticated applications
from more and more corpus linguistics experts
and from NLP / computational linguistics
that had completely switched to corpus based statistical approaches
CLARIN infrastructure was well established
knowing that we will not be able to provide all ever desired functionalities,
can we use the funding somehow
to maximize the usability of the valuable corpus data?
also for sophisticated individual applications that we would hardly be able to implement and maintain ourselves?
for linguists
and ideally also for other users from other DH disciplines?
without interfering with our licences and the interests of right holders?
while keeping the follow-up costs manageable?
tackle the legal, scientific, and economical challenges
with an infrastructural approach
partially implemented in an OSS project
roughly following one of the basic principles of grid computing postulated by Jim Gray (2004) ...
provide a virtual machine for user supplied code
with controlled output
but that turned out to be too cost-intensive and unmaintainable
like most other grid inspired approaches
in the lesser resourced DH disciplines
not only for us, but particularly for the users
new approach needed, much more focused on
feasibility, maintainability, economic efficiency
KISS principle
distribute work over several shoulders
the web user interface
accessible directly or via client libraries
user interface plugins
independent access by fully customized installations
new features by source code contributions
direct access to corpus data (without KorAP)
except that simplicity and Information on Demand principle (Diewald et al. 2019) can be followed quite far
makes all backend functionality available
different query languages (Bingel & Diewald 2015)
complex expressions referring to multiple annotation layers
the definition of virtual corpora based on metadata properties
UI itself uses the API only
provides OAuth2 for authorized access to restricted data
offers access without authorization to numerical data that is not subject to copyright
accessible directly (➞ documentation)
and via client libraries for R and Python (Kupietz et al. 2020b)
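A minimal sketch of what direct API access could look like, without any client library. It only builds a search URL and sends no request. The API root path and the parameter names `q` (query), `ql` (query language) and `cq` (virtual corpus definition) are assumptions made for illustration and should be checked against the API documentation; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: API root of a KorAP instance; the path is illustrative only.
API_BASE = "https://korap.racai.ro/api/v1.0"


def build_search_url(query, query_language="poliqarp",
                     virtual_corpus=None, api_base=API_BASE):
    """Build a search URL from a query and an optional virtual corpus.

    The parameter names q, ql and cq are assumptions, not taken from the slides.
    """
    params = {"q": query, "ql": query_language}
    if virtual_corpus:
        params["cq"] = virtual_corpus
    return f"{api_base}/search?{urlencode(params)}"


url = build_search_url("[drukola/l=pandemie]", virtual_corpus="pubDate in 2010")
print(url)
```

Restricted data would additionally require an OAuth2 access token, which this sketch leaves out.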
complex, multipart queries
applications where reproducibility and replicability are required (with varied query or corpus base)
providing features that are not yet supported by KorAP's backend or UI, currently e.g.
collocation analysis
aggregation of search results
aim: make programmatic use as easy as possible
in order to also pick up linguists who have never coded before
library(RKorAPClient)
query = c("[drukola/l=epidemie]", "[drukola/l=pandemie]")
years = c(2000:2017)
new("KorAPConnection", KorAPUrl = "https://korap.racai.ro", verbose = TRUE) %>%
  frequencyQuery(query, vc = sprintf("pubDate in %d", years)) %>%
  hc_freq_by_year_ci()
#!/usr/bin/env python3
from KorAPClient import KorAPConnection, KorAPClient
import plotly.express as px
import pandas as pd

years = list(range(2000, 2018))  # 2000 to 2017 inclusive, as in the R example
query = ["[drukola/l=epidemie]", "[drukola/l=pandemie]"]

# One row per (year, lemma) combination, each with its own virtual corpus
df = pd.DataFrame({'year': years,
                   'vc': [f"pubDate in {y}" for y in years]}) \
    .merge(pd.DataFrame(query, columns=["lemma"]), how='cross')

# Query the frequencies and convert them to instances per million (ipm)
results = KorAPClient.ipm(
    KorAPConnection(KorAPUrl="https://korap.racai.ro", verbose=True)
    .frequencyQuery(df['lemma'], df['vc']))

df = pd.concat([df, results.reset_index(drop=True)], axis=1)
px.line(df, x="year", y="ipm", color="lemma").show()
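The `ipm` values plotted above are relative frequencies in instances per million tokens of the respective virtual corpus. The following offline sketch shows the normalization and the lemma × year grid without contacting a server. All counts are invented for illustration, and `ipm` here is a local helper, not the client library function.

```python
from itertools import product


def ipm(hits, tokens):
    """Relative frequency in instances per million tokens."""
    if tokens <= 0:
        raise ValueError("virtual corpus must contain at least one token")
    return hits * 1_000_000 / tokens


lemmas = ["epidemie", "pandemie"]
years = [2000, 2001]

# Invented numbers, for illustration only
tokens_per_year = {2000: 50_000_000, 2001: 40_000_000}
hits = {("epidemie", 2000): 500, ("epidemie", 2001): 200,
        ("pandemie", 2000): 25, ("pandemie", 2001): 100}

# One row per (lemma, year) pair, each with its own virtual corpus definition
rows = [{"lemma": lemma, "year": year, "vc": f"pubDate in {year}",
         "ipm": ipm(hits[(lemma, year)], tokens_per_year[year])}
        for lemma, year in product(lemmas, years)]

for row in rows:
    print(row)
```

Normalizing by the size of each yearly virtual corpus is what makes the years comparable, since the amount of text per year varies.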
library(RKorAPClient)
df <-
  new("KorAPConnection", KorAPUrl = "https://korap.racai.ro", verbose = TRUE) %>%
  collocationAnalysis(
    "[drukola/l=pune] în",
    leftContextSize = 0,
    rightContextSize = 1
  )
search input
e.g. Lemma-Expansion
definition of virtual corpora
e.g. corpus visualization
search results
e.g. export
individual matches
e.g. annotation visualizations
...
… are only required by certain user groups
also in order not to overload the UI
following the IoD principle
… should be replaceable easily with variants according to specific needs
… are not intended to be maintained by the core KorAP team
in general:
much easier and more maintainable to add a plugin
than to develop a whole new UI based on the API
not yet released:
query expansion with the inverse lemmatizer glemm
it's hard to maintain multiple running platform instances
but it's impossible to create the one jack of all trades
with dynamical meta-configuration functions to switch between different corpora, authentication backends, …
we are running different platforms (COSMAS, KorAP, …)
and several instances of KorAP with different configurations
different authorization workflows
different corpora
on different hardware
most important application field now:
contrastive linguistic studies in the context of EuReCo
open initiative founded in 2013 by the IDS and the academies in Poland, Romania and Hungary (Kupietz et al. 2017)
to sustainably address the need for comparable corpora
re-using existing national and reference corpora,
by joining them just virtually
and defining virtual comparable sub-corpora dynamically based on metadata property distributions
<pune> în <NN> / CoRoLa

NN | logDice | EN (~DeepL) |
---|---|---|
pericol | 11,16 | Danger |
aplicare | 10,74 | Application |
mișcare | 10,63 | Move |
discuție | 10,07 | Discussion |
funcțiune | 9,97 | Function |
evidență | 9,64 | Highlight |
practică | 8,95 | Practice |
executare | 8,85 | Version |
scenă | 8,81 | Scene |
Vânzare | 8,51 | Sale |
circulație | 8,44 | Circulation |
valoare | 8,31 | Value |
slujba | 8,24 | Job |
lumină | 7,88 | Light |
vedere | 7,26 | View |
discuția | 7,11 | Discussion |
JOC | 7,10 | Game |
libertate | 7,04 | Freedom |
relație | 6,87 | Relationship |
balanță | 6,79 | Balance |
situația | 6,55 | Situation |
borcane | 6,48 | Glasses |
serviciul | 6,41 | Service |
umbră | 6,23 | Shadow |
legătură | 6,20 | Link |
primejdie | 6,13 | Emergency |
posesie | 6,03 | Possession |
față | 6,02 | Face |

in <NN> <setzen> / vc_drukola

NN | logDice |
---|---|
Gang | 10,84 |
Szene | 10,59 |
Brand | 10,12 |
Kenntnis | 9,55 |
Bewegung | 9,44 |
Verbindung | 9,16 |
Marsch | 9,07 |
Kraft | 8,41 |
Beziehung | 7,80 |
Umlauf | 7,70 |
Anführungszeichen | 7,40 |
Flammen | 6,59 |
Relation | 6,39 |
Untersuchungshaft | 6,38 |
Klammern | 6,12 |
Betrieb | 5,92 |
Stand | 5,90 |
Erstaunen | 5,75 |
Bezug | 5,51 |
Vollzug | 5,13 |
Anführungsstriche | 5,06 |
Gänsefüßchen | 4,74 |
Auslieferungshaft | 4,42 |
Parallele | 4,39 |
Vergleich | 4,38 |
Verkehr | 4,28 |
Pose | 4,15 |
Positur | 4,10 |
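
The logDice scores in the two tables above can be read with the usual definition of the measure in mind: 14 + log2(2·f(x,y) / (f(x) + f(y))), with a theoretical maximum of 14. A minimal sketch follows, with invented frequencies. The client library may compute the score differently in detail, for example by accounting for the context window size.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score from raw frequencies.

    f_xy: co-occurrence frequency of node and collocate
    f_x:  frequency of the node
    f_y:  frequency of the collocate
    """
    if min(f_xy, f_x, f_y) <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented counts, for illustration only
print(log_dice(1000, 3000, 1000))  # node 3000x, collocate 1000x, together 1000x
print(log_dice(5, 5, 5))           # the two words only ever occur together
```

Because the score depends only on the three frequencies and not on corpus size, values from corpora of different sizes, such as the Romanian and German ones here, remain roughly comparable.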
contrastive studies still require lots of experimentation
applications heavily depend on properties of the used corpora
available metadata categories
POS annotations
but even more on their respective languages
some frequently used, popular features might rise to the UI
with the corresponding backend support
but probably nothing in the near future
let anyone create new features by suggesting source code extensions
apart from bug fixes, mainly aimed at larger projects
ideally avoiding forks
also to support external developers, we use Gerrit Code Review on top of git
used by Google, SAP, LibreOffice, Wikimedia, …
superior for discussing code contributions compared to GitHub pull requests
last resort if all other levels are not applicable
typical application scenarios
sophisticated corpus and quantitative linguistic applications
that require specialized language models
very costly, wrt.
expert staff
hardware use
only possible upon request with a kind of application
users can choose between different corpus data formats
TEI I5
KorAP-XML
CoNLL-U
Metadata SQL-DB
user gets copyright-free sample data
adapts their code to the data
sends their code via a shared git repo
IDS staff applies the code and sends back the results
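The workflow above can be sketched as a small user script that is developed against sample data and returns only aggregate numbers. The example assumes CoNLL-U input (ten tab-separated columns per token line). The sample sentences are invented, and `lemma_counts` is a hypothetical helper, not part of any KorAP tool.

```python
from collections import Counter

# Invented, copyright-free sample in CoNLL-U format
SAMPLE_CONLLU = (
    "# sent_id = 1\n"
    "1\tDer\tder\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tHund\tHund\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tbellt\tbellen\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tHunde\tHund\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbellen\tbellen\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def lemma_counts(conllu_text, upos=None):
    """Count lemmas, optionally restricted to one universal POS tag."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword token ranges and empty nodes
        if upos is None or cols[3] == upos:
            counts[cols[2]] += 1
    return counts


print(lemma_counts(SAMPLE_CONLLU, upos="NOUN"))
```

Since the script emits only frequency counts, its results on the full corpus contain no running text and can be returned to the user.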
we have designed the model only partly like this
for the most part, it has gradually solidified and established itself over the last few years
almost all users inside and outside could easily be convinced that an approach like this makes sense
by far the most popular and successful level
adopted by many power users of DeReKo
including the German Council for Orthography
more and more often included in teaching
also from the side of programming courses
manageable costs
very early stage
more examples needed
looking forward to the first student project
for the everybody-runs-their-own-KorAP scenario:
multiple instances for multiple languages
works very well, but requires some maintenance effort
for us running multiple KorAP instances
more DevOps automation urgently needed
still only a very small proportion of external code contributions
third-party funding for larger external projects apparently difficult
no funding schemes for combinations of research software development with linguistic research
nevertheless very important level
also to be able to channelise wishes and demands
could be mostly avoided by suggesting the API level
at least some of the cost intensive work could be carried out by the users themselves on the API level
very satisfied with the remaining projects
bottleneck currently rather hardware than staff
the procedure could perhaps be improved by switching to more formal cooperation applications
putting the computation near the data (ptcntd) and multi-staged approaches are the ways to go to make corpus data as usable as possible
our multi-staged model is probably not much more than a formalization of what we would be doing anyway
but the model seems to help us make the right decisions quickly
also concerning the level-raising of popular functionalities
it also seems to convince users and allows them to engage with one of the options on offer more easily
Bański, P., Fischer, P. M., Frick, E., Ketzan, E., Kupietz, M., Schnober, C., Schonefeld, O., Witt, A. (2012):
The New IDS Corpus Analysis Platform: Challenges and Prospects. In: Calzolari, N. et al. (eds.): Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey, May 2012. European Language Resources Association (ELRA), 2012: 2905-2911.
Bański, P., Diewald, N., Hanl, M., Kupietz, M. and A. Witt (2014):
Access Control by Query Rewriting: the Case of KorAP. In: Proceedings of the 9th conference on the Language Resources and Evaluation Conference (LREC 2014), European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014: 3817-3822.
Bingel, J. and Diewald, N. (2015):
KoralQuery – a General Corpus Query Protocol. In: Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015, Vilnius, Lithuania, May 11-13, pp. 1-5.
Cosma, R., Cristea, D., Kupietz, M., Tufiş, D., Witt, A. (2016):
DRuKoLA – Towards Contrastive German-Romanian Research based on Comparable Corpora. In: Bański, P. et al. (eds.): 4th Workshop on Challenges in the Management of Large Corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), 2016: 28-32.
Cristea, Dan/Diewald, Nils/Haja, Gabriela/Mărănduc, Cătălina/Barbu Mititelu, Verginica/Onofrei, Mihaela (2019):
How to find a shining needle in the haystack. Querying CoRoLa: solutions and perspectives. In: Cosma, Ruxandra/Kupietz, Marc (eds.), On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo, Revue Roumaine de Linguistique, 64(3). Editura Academiei Române, Bucharest, Romania.
Diewald, Nils/Barbu Mititelu, Verginica/Kupietz, Marc (2019):
The KorAP user interface. Accessing CoRoLa via KorAP. In: Cosma, Ruxandra/Kupietz, Marc (eds.): On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române, 2019. pp. 265-277.
Diewald, Nils and Margaretha, Eliza (2016):
Krill: KorAP search and analysis engine. In: Journal for Language Technology and Computational Linguistics (JLCL), 31 (1). 63-80.
Gray, Jim (2004):
Distributed Computing Economics. In: Herbert A., Jones K.S. (eds) Computer Systems. Monographs in Computer Science. Springer, New York, NY
Kupietz, Marc / Diewald, Nils / Margaretha, Eliza (forthcoming):
Building paths to corpus data - A multi-level least effort and maximum return approach. In Fišer, Darja / Witt, Andreas (eds.): The CLARIN Book. DeGruyter (forthcoming 2022).
Kupietz, Marc / Trawiński, Beata (forthcoming):
Neue Perspektiven für kontrastive Korpuslinguistik: Das Europäische Referenzkorpus EuReCo. In: Akten des XIV. Kongresses der Internationalen Vereinigung für Germanische Sprach- und Literaturwissenschaft (IVG). Peter Lang (to appear in 2022)
Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020):
Recent developments in the European Reference Corpus EuReCo. In: Granger, Sylviane/Lefer, Marie-Aude (eds.): Translating and Comparing Languages: Corpus-based Insights. (= Corpora and Language in Use, Proceedings 6). Louvain-la-Neuve: Presses universitaires de Louvain, 2020. pp. 257-273.
Kupietz, Marc/Diewald, Nils/Margaretha, Eliza (2020b):
RKorAPClient: An R Package for Accessing the German Reference Corpus DeReKo via KorAP. In: Calzolari, Nicoletta/Béchet, Frédéric/Blache, Philippe/Choukri, Khalid/Cieri, Christopher/Declerck, Thierry/Goggi, Sara/Isahara, Hitoshi/Maegaard, Bente/Mariani, Joseph/Mazo, Hélène/Moreno, Asuncion/Odijk, Jan/Piperidis, Stelios (eds.): Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France. Paris: European Language Resources Association, 2020. pp. 7016-7021.
Kupietz, Marc/Diewald, Nils/Fankhauser, Peter (2018):
How to Get the Computation Near the Data: Improving data accessibility to, and reusability of analysis functions in corpus query platforms. In: Bański, Piotr/Kupietz, Marc/Barbaresi, Adrien/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Witt, Andreas (eds.): Proceedings of the LREC 2018 Workshop “Challenges in the Management of Large Corpora (CMLC-6)”. 07 May 2018 – Miyazaki, Japan. Paris: ELRA. pp. 20-25.
Kupietz, Marc/Cosma, Ruxandra/Cristea, Dan/Diewald, Nils/Trawiński, Beata/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2018b):
Recent developments in the European Reference Corpus (EuReCo). In: Granger, Sylviane/Lefer, Marie-Aude/Aguiar de Souza Penha Marion, Laura (eds.): Using Corpora in Contrastive and Translation Studies Conference (5th edition). Book of Abstracts. Louvain-la-Neuve: CECL, 2018. pp. 101-103.
Kupietz, M., Belica, C., Keibel, H. and Witt, A. (2010):
The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of LREC 2010. 1848-1854.
Kupietz, M., Lüngen, H., Bański, P. and Belica, C. (2014):
Maximizing the Potential of Very Large Corpora. In: Kupietz, M., Biber, H., Lüngen, H., Bański, P., Breiteneder, E., Mörth, K., Witt, A., Takhsha, J. (eds.): Proceedings of the LREC-2014-Workshop Challenges in the Management of Large Corpora (CMLC2). Reykjavik: ELRA, 1–6.
Kupietz, M., Witt, A., Bański, P., Tufiş, D., Cristea, D., Váradi, T. (2017):
EuReCo – Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research. In: Bański, P. et al. (eds.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017. Birmingham, 24 July 2017. Mannheim: Institut für Deutsche Sprache, 2017: 15-19.
Váradi, T. (2002):
The Hungarian National Corpus. In Rodríguez, M. & Araujo, C. (eds) Proceedings of LREC 2002, Las Palmas / Paris: ELRA, 385–389.