THE GRAMMATICAL INTERPRETATION OF RUSSIAN INFLECTED FORMS USING A STEM DICTIONARY

by J. McDANIEL and S. WHELAN, National Physical Laboratory, Teddington, England

INTRODUCTION

The NPL Russian-English automatic dictionary is organised on a stem-paradigm basis wherein there is for most nouns and adjectives a single entry for all their inflected forms and for most verbs only one or two entries. This is in contrast to the full-form type of dictionary organisation wherein each inflected form of every word has a separate entry. The decision to organise our dictionary on this basis was made so as to be able to accommodate it on the magnetic tape store available to us on the ACE digital electronic computer of our laboratory and, further, to minimise the look-up time per word on the computer without complicating the look-up procedure too much or investing too much programming effort in its compilation. The word content of the dictionary initially is to be 15,000 words from the Harvard University Automatic Dictionary. Our dictionary will have an average of about 1.5 entries per word, whereas a full-form dictionary would have about ten times that average.

The operation of our stem-paradigm dictionary involves two extra processing steps as compared with the full-form type dictionary. Firstly, words referred to the dictionary are reduced to their stems so that they may be matched against the corresponding dictionary stem entries and, secondly, after matching of stems, that part of the referred word split off to give the stem requires interpretation to determine its grammatical significance for that stem. The first process is known as affix-splitting and consists of matching the end of a referred word against a list of recognised affixes having grammatical significance. The process is fully described in a companion paper to this. We shall refer to the results of these papers where necessary. The second process, affix
interpretation, is the subject of this paper. The extra grammatical properties of the referred word revealed by affix identification, in addition to those identifiable in the stem of the word, are as follows for nouns, adjectives and verbs:

NOUNS: Number and case
ADJECTIVES: Number, case, gender, short or long form
VERBS: Person, number, tense, gender, mood, voice, and, for participles only, case and short or long form

Of course, not all combinations of these properties can occur. The majority of pronoun forms are treated like adjectives. The remaining pronoun forms and all indeclinable words are referred to full-form type dictionary entries, and do not participate in affix identification, although they undergo the splitting process.

Affix interpretation is necessary for all stem type entries as its results form the basis of systems of syntactic analysis designed to improve a word-for-stem type "translation" of Russian into English. Rules of English inflection, insertion of prepositions and auxiliaries, suppression of Russian equivalents and variations of word order will all require the affix interpretation results.

PRINCIPLE OF INTERPRETATION

The splitting process consists in matching the endings of text words against a list of affixes, and splitting off any matched affixes, so that the interpretation problem may be stated as the problem of giving a grammatical significance to each of these recognised affixes when they are found. Now some of the affixes will have varying significance depending on the stem from which they have been split. For instance, one of the affixes in the list is -A, and this can have five different interpretations:

Genitive singular when split from some masculine noun stems.
Genitive singular and nominative plural when split from some other masculine noun stems and from neuter noun stems.
Nominative singular when split from
feminine noun stems.
Feminine short form when split from adjective and participle stems.
Present gerund when split from verb stems.

So for these ambiguous affixes (they are mostly noun affixes) it is necessary to check the stem type from which the affix has been split before giving the grammatical significance.

There is a further check, on the validity of a given split, which can be conveniently made during interpretation. This is to check that the matched dictionary stem includes the split-off affix in the declension or conjugation intended to be associated with it in the dictionary compilation stage. We call this check reconciliation of stem and affix, and it is necessary because of the occurrence of stem homographs and also because of the possibility of a text word whose true stem is not entered in the dictionary being falsely split and the resulting stem matching with a dictionary stem.

We combine interpretation and reconciliation in one operation, making use of a paradigm indicator associated with each stem, and one or more role indicators associated with each affix. By speaking of the paradigm of a stem, we mean that set of our recognised affixes, all of which combine with that stem to form valid inflectional forms of one Russian word. Thus each stem entry in the dictionary contains a computer word, known as the paradigm indicator word (PIW), which indicates by a binary pattern the paradigm of that stem. There are three different formats for the PIW for noun, adjective and verb stems. The verb format is used for two types of verb stems, but in each case it represents a different set of endings. This was only necessary in practice because one computer word (the ACE word is 48 binary digits (bits) long) is not long enough to represent all the verbal affixes.

We shall consider the noun format of the PIW as a specific example. The word is divided into fields, one for each of the case and number
combinations of nouns. Accusative plural is excluded, as its endings follow those of nominative plural or genitive plural depending on the animation of the noun. In each field, a bit position is associated with each affix that conveys the significance of that field with a noun stem. The noun format is shown in Figure 1. (# is our symbol for the null affix.) In the accusative singular field, only the feminine affixes are shown, the masculine and neuter affixes being implicit from the nominative singular and genitive singular fields and the animation marker in bit position 43. We could have repeated the masculine and neuter, nominative and genitive singular endings in the accusative singular field, but this would have required more bit positions than are available in an ACE word. So simply by indicating the animation of a noun stem, we can restrict the paradigm format to within one ACE word.

The PIW for a particular noun stem is formed in general by inserting a binary digit 1 in the bit position corresponding to the appropriate affix in each field. For example, consider the stem entry and PIW resulting from the Russian word whose nominative singular is STOL (table). The stem entry will be STOL- and the set of affixes which give all the inflected forms of STOL is #, A, U, E, OM, Y, OV, AM, AKH, AMI. The PIW will thus have "ones" in positions 1, 11, 15, 19, 21, 26, 32, 37, 39 and 41. The absence of a "one" in bit position 43 indicates the inanimate nature of the stem and hence implicitly indicates the accusative singular and accusative plural endings. A stem which takes alternative affixes in a given field will have "ones" in the bit positions of both affixes e.g.
the stem VOLOS (hair) has the alternative affixes -Y and -A in the nominative plural form. Where a stem is not common to all inflected forms of a word, only those fields to which that stem applies will have a "one" in them e.g. the stem BRAT- (brother) applies to the singular inflected forms only (1, 11, 15, 19, 21, 43) while the stem BRAT'- applies to the plural forms (29, 33, 36, 38, 40, 43).

The formats for adjectives and verbs are shown in Figure 2 and in principle are similar to the noun format. They all have more fields than the noun format, but have much less variety of affixes within each field. The two verb formats have identical fields, but mostly different affixes in those fields. They include fields for participle affixes, but the affixes in these fields are only the participle stem-building affixes. However, as participle adjectival endings follow a perfectly regular pattern, they need not be explicitly stated in the PIW.

Nearly all nouns and adjectives will require only one stem and PIW to represent all their inflected forms. Approximately two-thirds of Russian verbs will need only one stem, most of the rest requiring two stems, and only the irregular verbs more than two.

The PIW are compiled by the computer from data sheets (dictionary entry forms), one of which is manually completed for each word to be entered into the dictionary. There is a different data sheet for each of several broad classes of noun declension, so as to limit the linguistic decisions to be made in completing the sheets, but all noun data sheets refer to the one standard format for the noun PIW. There are similar data sheets for adjectives and the two types of verbs, in these cases only one type of data sheet per format, because of the lesser variety of inflection.

With the provision of a PIW in each stem entry in the dictionary, the problem of interpretation of an affix which has occurred on a given stem as a text
word is resolved into spotlighting the occurrences (if any) of that affix in the PIW for that stem and noting the fields (grammatical properties) in which they occur. This is most easily done by having, for that affix, a masking pattern containing a "one" bit corresponding to each occurrence of it in the PIW format. Then, by performing a "logical and" operation between this mask and the PIW of the given stem, the result will contain a "one" bit in each field where that affix has significance for the given stem. Of course, if the result were zero, this would mean that the affix and stem were incompatible i.e. the stem did not combine with the affix in any meaningful inflection. This situation may arise with stem homographs and with words whose true stems are not yet compiled into the dictionary and are falsely split. In the latter case the PIW would not contain the falsely split affix.

The masking pattern referred to above we call the role indicator word (RIW) for the given affix. Some affixes have significance with more than one of the PIW formats, and for these there will need to be more than one RIW e.g.
the affix -I has significance for and appears in each of the four PIW formats, so it will have four RIW. In order to be able to match the appropriate RIW to a given PIW in an interpretation, the format types are given a type number (digits 47 and 48) and the RIW which relate to these types are given the corresponding type number. There are identical I and E verb RIW for each of 10 verbal affixes (U, JU, I, J, ', JTE, 'TE, A, JA, ENN) and so we save some space in storing the RIW by having only one verb RIW for each of these 10 and indicating its dual utility.

Let us consider two examples of interpretation of noun forms, AVTOMOBILI and NEDELI, which would be matched against the dictionary stems AVTOMOBIL- and NEDEL- respectively, with -I as the affix to be interpreted in both cases. The PIW for the noun stem AVTOMOBIL- and the noun type RIW for -I would be as shown in Figure 3. The "logical and" of these two computer words would give a "one" bit in position 28 only i.e. in the nominative plural field. The PIW for NEDEL- is also shown in Figure 3 and the result of "anding" this word with the RIW for -I would be a "one" bit in positions 14 and 28 i.e. in the genitive singular and nominative plural fields.
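
The masking operation in these two examples can be imitated on a modern machine. The following Python sketch is illustrative only and is not the ACE program. It assumes that bit position p of the 48-bit word can be modelled as bit p-1 of an integer, and the names word_from_positions, positions_of and interpret are invented for the illustration. The PIW for STOL- uses the bit positions listed earlier in the text. The words for AVTOMOBIL-, NEDEL- and the noun RIW for the affix -I are hypothetical partial patterns: only positions 14 and 28 are taken from the text, since the full patterns of Figure 3 are not reproduced here.

```python
# Sketch only: bit position p (1..48) of an ACE word is modelled as bit p-1
# of a Python int. This mapping is an assumption, not the real ACE layout.
WORD_BITS = 48


def word_from_positions(positions):
    """Build a 48-bit word with a 'one' in each listed bit position."""
    word = 0
    for p in positions:
        if not 1 <= p <= WORD_BITS:
            raise ValueError(f"bit position {p} outside 1..{WORD_BITS}")
        word |= 1 << (p - 1)
    return word


def positions_of(word):
    """Return the sorted bit positions that hold a 'one'."""
    return [p for p in range(1, WORD_BITS + 1) if (word >> (p - 1)) & 1]


def interpret(piw, riw):
    """Logical-and of a PIW and an RIW.

    An empty result means the stem and affix are incompatible, i.e.
    reconciliation of stem and affix fails.
    """
    return positions_of(piw & riw)


# Field names for the only two positions the text identifies by name.
FIELD_NAMES = {14: "genitive singular", 28: "nominative plural"}

# PIW for STOL-, bit positions as listed in the text.
PIW_STOL = word_from_positions([1, 11, 15, 19, 21, 26, 32, 37, 39, 41])

# Hypothetical, partial words: only bits 14 and 28 come from the text.
RIW_I_NOUN = word_from_positions([14, 28])
PIW_AVTOMOBIL = word_from_positions([28])
PIW_NEDEL = word_from_positions([14, 28])

for stem, piw in [("AVTOMOBIL-", PIW_AVTOMOBIL),
                  ("NEDEL-", PIW_NEDEL),
                  ("STOL-", PIW_STOL)]:
    hits = interpret(piw, RIW_I_NOUN)
    readings = [FIELD_NAMES[p] for p in hits] or ["incompatible"]
    print(stem, hits, readings)
```

With these assumed patterns the stem STOL- yields an empty result for -I, which corresponds to the zero result described above for an incompatible stem and affix, since -I is not among the affixes listed for STOL.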