----------------------------------------------------------------------------------------------------------
IDS Wikipedia Corpus Converter
----------------------------------------------------------------------------------------------------------
developed at Institut für Deutsche Sprache, Mannheim.
This tool is licensed under GPL v3.

Eliza Margaretha (margaretha@ids-mannheim.de)
Harald Lüngen (luengen@ids-mannheim.de)
----------------------------------------------------------------------------------------------------------

1. Introduction

The IDS Wikipedia corpus converter generates Wikipedia corpora in I5 format. The conversion is done
in two stages:

1. Wikitext to WikiXML conversion (WikiXMLConverter)
   A Wikipedia dump written in wikitext is converted into XML. The result of the conversion is a
   collection of XML Wikipages, namely a WikiXML corpus.

2. WikiXML to I5 conversion (WikiI5Converter)
   The second stage takes the WikiXML pages, transforms the XML into I5, and assembles all the
   pages into a single file.

2. Data

All the necessary data and tools for the Wikipedia corpus conversion can be downloaded from:
http://corpora.ids-mannheim.de/pub/tools/. The table below describes the files in the folder.

Table 1.1 List of files in tools/
_________________________________________________________________________________________________________________
| Filename                                              | Description                                           |
|_______________________________________________________|_______________________________________________________|
| dewiki-20130728-sample.xml                            | A small sample of a Wikipedia dump                    |
| german-inflectives.xml                                | A list of interaction words occurring as escaped tags |
| WikiXMLCorpusIndexer.sh                               | A script to create an index of a WikiXML corpus       |
| 2013/WikiI5Converter-0.0.1.jar                        | A Java library for converting WikiXML to I5           |
| 2013/WikiI5Converter-0.0.1-javadoc.jar                | Javadoc for the WikiI5Converter source code           |
| 2013/WikiI5Converter-0.0.1-sources.jar                | Java source code of the WikiI5Converter               |
| 2013/WikiXMLConverter-0.0.1-jar-with-dependencies.jar | A Java library for converting Wikitext to WikiXML     |
| 2013/WikiXMLConverter-0.0.1-javadoc.jar               | Javadoc for the WikiXMLConverter source code          |
| 2013/WikiXMLConverter-0.0.1-sources.jar               | Java source code of the WikiXMLConverter              |
|_______________________________________________________|_______________________________________________________|

The list of interaction words occurring as escaped tag names is language-specific. It can optionally
be used when converting WikiXML to I5. We provide a list for the German Wikipedia discussions
(german-inflectives.xml). It has the following structure:

    abgreif auf-die-Nägel-blas ...

3. Instructions

This section describes the steps to generate a Wikipedia corpus in I5:

1. Generate a WikiXML corpus by converting wikitext into XML and transforming the Wikipedia dump
   into a set of WikiXML pages.
2. Prepare an index of the WikiXML corpus, which is needed by the WikiI5Converter.
3. Generate a Wiki I5 corpus by converting WikiXML to I5 and merging the Wiki pages into a single
   I5 file.

3.1. Converting Wikitext to WikiXML

WikiXMLConverter takes the following arguments. The arguments -l, -w and -o are mandatory; -t is
optional.

-l  Two-letter language code of the Wikipedia dump. The converter supports the following
    languages: German (de), French (fr), Hungarian (hu), Italian (it), Polish (pl) and
    Norwegian (no). Other languages can be added by using the Java API and instantiating the
    LanguageProperties class.
    For example: -l de

-w  Wikipedia dump file.
    For example: -w dewiki-20130728-sample.xml

-t  Wikipage type. The converter supports these Wikipedia page types:
    [articles | discussions | all]. The default is all, i.e. both WikiXML articles and
    discussions will be generated.

-o  The XML output (root) directory. The sub-directories articles/ and discussions/ are
    generated automatically according to the Wikipage type.
    For example: -o xml-de/

Examples:

1. Generate only WikiXML articles:

   java -jar WikiXMLConverter-0.0.1-jar-with-dependencies.jar -l de -w dewiki-20130728-sample.xml -t articles -o xml-de/

2. Generate both article and discussion pages in XML:

   java -jar WikiXMLConverter-0.0.1-jar-with-dependencies.jar -l de -w dewiki-20130728-sample.xml -o xml-de/

3.2. Indexing WikiXML corpus

An index of the WikiXML pages in the WikiXML corpus is needed for converting the corpus into a
Wiki I5 corpus. To create such an index, use WikiXMLCorpusIndexer.sh, which takes 3 mandatory
parameters:

   [Wikipage type] [WikiXML corpus folder] [output file]

Examples:

   WikiXMLCorpusIndexer.sh articles xml-de/ articleIndex.xml
   WikiXMLCorpusIndexer.sh discussions xml-de/ discussionIndex.xml

An article index has the following structure:

    179 ... ... 1 ... ...

3.3. Converting WikiXML to I5

The WikiI5Converter requires the output (the WikiXML pages) of the Wikitext-to-WikiXML converter.
It also needs Saxon-EE and commons-cli-1.2.jar. We used Saxon-EE 9.4.0.3J. Saxon-EE, together with
its licence (saxon-license.lic), and commons-cli-1.2.jar must be placed in lib/ in the same
directory as the WikiI5Converter.
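
Because a missing library only shows up once the converter is started, it can help to check the
lib/ folder first. The following is a minimal sketch, not part of the distributed tools:
check_i5_lib is a hypothetical helper name, and since the file name of the Saxon-EE jar is not
stated in this README, the check only assumes that it starts with "saxon".

```shell
#!/bin/sh
# Sketch: verify that lib/ contains the files required by WikiI5Converter.
# check_i5_lib is a hypothetical helper, not shipped with the converter.
check_i5_lib() {
    lib_dir="${1:-lib}"
    missing=0

    # These two file names are the ones given in this README.
    for f in saxon-license.lic commons-cli-1.2.jar; do
        if [ ! -f "$lib_dir/$f" ]; then
            echo "missing: $lib_dir/$f" >&2
            missing=1
        fi
    done

    # Assumption: the Saxon-EE jar has a name starting with "saxon".
    found_saxon=0
    for jar in "$lib_dir"/saxon*.jar; do
        if [ -f "$jar" ]; then
            found_saxon=1
        fi
    done
    if [ "$found_saxon" -eq 0 ]; then
        echo "missing: a Saxon-EE jar in $lib_dir/" >&2
        missing=1
    fi

    # Returns 0 when everything was found, 1 otherwise.
    return "$missing"
}
```

Calling check_i5_lib lib from the directory that holds WikiI5Converter-0.0.1.jar prints one line
per missing file and returns a non-zero status if anything is absent.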

WikiI5Converter has the following mandatory arguments:

-x  The folder path of the WikiXML articles or discussions.
    For example: -x xml-de/articles

-t  The type of the pages [articles or discussions].
    For example: -t articles

-i  The WikiXML article or discussion index.
    For example: -i articleIndex.xml

-w  The filename of the Wikipedia dump, starting with:
    [2 character language code]wiki-[year][month][day]
    For example: -w dewiki-20130728-pages-meta-current.xml

-o  The output file.
    For example: -o i5/dewiki-20130728-articles.i5

In addition, the following arguments are optional:

-e  The encoding of the output file [UTF-8 or ISO-8859-1]. The default is UTF-8.
    For example: -e UTF-8

-inf  The inflective list.
      For example: -inf german-inflectives.xml

Examples:

1. Generate Wikipedia articles in I5 with inflectives:

   java -jar WikiI5Converter-0.0.1.jar -x xml-de/articles -t articles -i articleIndex.xml -w dewiki-20130728-sample.xml -o i5/dewiki-20130728-articles.i5 -inf german-inflectives.xml

2. Generate Wikipedia discussions in I5 without inflectives:

   java -jar WikiI5Converter-0.0.1.jar -x xml-de/discussions -t discussions -i discussionIndex.xml -w dewiki-20130728-sample.xml -o i5/dewiki-20130728-discussions.i5 -e UTF-8

4. References

Margaretha, E., and Lüngen, H. (2014). Building linguistic corpora from Wikipedia articles and
discussions. Journal for Language Technology and Computational Linguistics (JLCL), 2/2014.

The source code of these tools is also available on GitHub
(https://github.com/IDS-Mannheim/Wikipedia-Corpus-Converter).
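
As a closing note, the three steps of section 3 can be chained in one script. The following is
only a sketch for article pages: run_pipeline and DRY_RUN are hypothetical names introduced here,
while the jar names, script name and flags are copied from the examples above. It assumes that the
jars and WikiXMLCorpusIndexer.sh sit in the current directory, and it creates i5/ beforehand as a
precaution, since this README does not say whether the converter creates it.

```shell
#!/bin/sh
# Sketch: chain the three steps of section 3 for article pages.
# run_pipeline and DRY_RUN are hypothetical names, not part of the tools.
# Assumes the jars and WikiXMLCorpusIndexer.sh are in the current directory.
run_pipeline() {
    lang="$1"   # two-letter language code, e.g. de
    dump="$2"   # Wikipedia dump file, e.g. dewiki-20130728-sample.xml

    # The dump filename must start with [language code]wiki-[year][month][day];
    # that prefix is reused for the name of the I5 output file.
    prefix=$(basename "$dump" | sed -n 's/^\([a-z][a-z]wiki-[0-9]\{8\}\).*/\1/p')
    if [ -z "$prefix" ]; then
        echo "unexpected dump filename: $dump" >&2
        return 1
    fi

    # With DRY_RUN set, each command is printed instead of executed.
    run=""
    if [ -n "${DRY_RUN:-}" ]; then
        run="echo"
    fi

    # Step 1: wikitext to WikiXML.
    $run java -jar WikiXMLConverter-0.0.1-jar-with-dependencies.jar \
        -l "$lang" -w "$dump" -t articles -o "xml-$lang/" || return 1

    # Step 2: index the WikiXML articles.
    $run ./WikiXMLCorpusIndexer.sh articles "xml-$lang/" articleIndex.xml || return 1

    # Step 3: WikiXML to I5 (output directory created as a precaution).
    $run mkdir -p i5 || return 1
    $run java -jar WikiI5Converter-0.0.1.jar -x "xml-$lang/articles" -t articles \
        -i articleIndex.xml -w "$dump" -o "i5/${prefix}-articles.i5" || return 1
}
```

Running DRY_RUN=1 run_pipeline de dewiki-20130728-sample.xml prints the commands without starting
Java, which is a cheap way to check the paths before a long conversion.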