Church, K. W. and Hanks, P. (1990): Word association norms, mutual information, and lexicography. Comput. Linguist. 16, 1 (March 1990), 22-29.
[1] Daille, B. (1994): Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
[2] Thanopoulos, A., Fakotakis, N., Kokkinakis, G. (2002): Comparative evaluation of collocation extraction metrics. In: Proc. of LREC 2002: 620–625.
Bouma, Gerlof (2009): Normalized (pointwise) mutual information in collocation extraction. In Proceedings of GSCL.
Dunning, T. (1993): Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993), 61-74.
Evert, Stefan (2004): The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, IMS, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714.
Rychlý, Pavel (2008): A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008, 6–9.
DeReKoVecs (Fankhauser & Kupietz 2017, 2019, 2022; Kupietz et al. 2018) is the new open lab of the Corpus Linguistics group at IDS Mannheim. Like the Collocation Database CCDB (Keibel & Belica 2007, Belica 2011), DeReKoVecs is used to investigate and compare measures, dimension reduction procedures, visualizations, etc., in order to track down detailed paradigmatic and syntagmatic relations between words based on their use in very large corpora such as the German Reference Corpus DeReKo (Kupietz et al. 2010, 2018) or the Reference Corpus of the Contemporary Romanian Language CoRoLa (Barbu Mititelu et al. 2018, Cristea et al. 2017).
The models used here are based, on the one hand, on wang2vec (Ling et al. 2015), an extension of word2vec (Mikolov et al. 2013), and, on the other hand, on simple co-occurrence counts and analysis methods that operate on these counts.
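To make the count-based side concrete, here is a minimal Python sketch of how an association score can be derived from simple co-occurrence counts. It is an illustration only and not the DeReKoVecs implementation: the function names, the right-hand context window and the toy sentence are assumptions. The formulas follow pointwise mutual information as in Church & Hanks (1990) and its normalized variant as in Bouma (2009).

```python
import math
from collections import Counter


def count_cooccurrences(tokens, window=2):
    """Count words and ordered (word, right-neighbour) pairs.

    Hypothetical helper: each word is paired with the `window` tokens
    that follow it; real systems may use other window definitions.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        for neighbour in tokens[i + 1 : i + 1 + window]:
            pair_counts[(word, neighbour)] += 1
    return word_counts, pair_counts


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2(P(x,y) / (P(x) * P(y))).

    All probabilities are estimated as frequency / n (an assumption).
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def npmi(f_xy, f_x, f_y, n):
    """Normalized PMI: PMI divided by -log2(P(x,y)).

    Assumes 0 < f_xy < n; otherwise the denominator is zero or undefined.
    """
    return pmi(f_xy, f_x, f_y, n) / -math.log2(f_xy / n)


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat slept".split()
    words, pairs = count_cooccurrences(tokens, window=1)
    pair = ("the", "cat")
    score = pmi(pairs[pair], words["the"], words["cat"], len(tokens))
    print(pair, round(score, 3))
```

A score of 0 means the two words co-occur exactly as often as expected under independence; positive scores indicate attraction. On very small counts PMI is known to overrate rare pairs, which is one reason several alternative measures are compared.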
Please note that, unlike e.g. KorAP, we cannot operate DeReKoVecs as a service with high reliability. DeReKoVecs is intended only as a glass laboratory. Things can change without notice at any time, and even a complete outage over a longer period cannot be ruled out. If you would like to use DeReKoVecs for your own research in the longer term, please contact us.
Barbu Mititelu, Verginica/Tufiş, Dan/Irimia, Elena (2018):
The Reference Corpus of the Contemporary Romanian Language (CoRoLa). In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), pp. 1178–1185.
Belica, Cyril (2011):
Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Andrea Abel, Renata Zanin (eds.), Korpora in Lehre und Forschung, pp. 155-178. Bozen-Bolzano University Press. Freie Universität Bozen-Bolzano.
Cristea, Dan/Gîfu, Daniela/Moruz, Alex/Onofrei, Mihaela/Pistol, Laura/Scutelnicu, Andrei/Bolea, Cecilia (2017):
An Insight into the Corpus of Contemporary Romanian. In: Memoirs of the Scientific Sections / Memoriile Secțiilor Științifice, Series IV, Tome XL, Publishing House of the Romanian Academy, pp. 67-84, ISSN 1224-1407, ISSN (online) 2343-7049.
Fankhauser, Peter/Kupietz, Marc (2022):
Count-Based and Predictive Language Models for Exploring DeReKo. In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA. pp. 27-31.
Fankhauser, Peter/Kupietz, Marc (2017):
Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.
Fankhauser, Peter/Kupietz, Marc (2019):
Analyzing domain specific word embeddings for a large corpus of contemporary German. In: International Corpus Linguistics Conference, Cardiff, Wales, UK, July 22-26, 2019. 6 pp.
Keibel, H., Belica, C. (2007):
CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.
Kupietz, M., Belica, C., Keibel, H., Witt, A. (2010):
The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.
Kupietz, M., Lüngen, H., Kamocki, P., Witt, A. (2018):
German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al. (eds.): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360.
Ling, W., Dyer, C., Black, A., & Trancoso, I. (2015):
Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J. (2013):
Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.