Sample J32 from Hugh Kelly and Ted Ziehe, "Glossary Lookup Made Easy" Proceedings of the National Symposium on Machine Translation, University of California at Los Angeles. Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1961. Pp. 326-331. Reproduction permitted for any purpose of the U.S. Government. 0010-1750 A part of the XML version of the Brown Corpus1,990 words 21 symbols 48 formulasJ32

Hugh Kelly and Ted Ziehe, "Glossary Lookup Made Easy" Proceedings of the National Symposium on Machine Translation, University of California at Los Angeles. Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1961. Pp. 326-331. Reproduction permitted for any purpose of the U.S. Government. 0010-1750

Typographical Error: voume [0050] give [for gives] [0390]

Header auto-generated for TEI version

The many linguistic techniques for reducing the amount of dictionary information that have been proposed all organize the dictionary's contents around prefixes , stems , suffixes , etc. . A significant reduction in the voume of store information is thus realized , especially for a highly inflected language such as Russian . For English the reduction in size is less striking . This approach requires that : ( 1 ) each text word be separated into smaller elements to establish a correspondence between the occurrence and dictionary entries , and ( 2 ) the information retrieved from several entries in the dictionary be synthesized into a description of the particular word . The logical scheme used to accomplish the former influences the placement of information in the dictionary file . Implementation of the latter requires storage of information needed only for synthesis .

We suggest the application of certain data-processing techniques as a solution to the problem . But first , we must define two terms so that their meaning will be clearly understood : form -- any unique sequence of alphabetic characters that can appear in a language preceded and followed by a space ; ; occurrence -- an instance of a form in text .

We propose a method for selecting only dictionary information required by the text being translated and a means for passing the information directly to the occurrences in text . We accomplish this by compiling a list of text forms as text is read by the computer . A random-storage scheme , based on the spelling of forms , provides an economical way to compile this text-form list . Dictionary forms found to match forms in the text list are marked . A location in the computer store is also named for each marked form ; ; dictionary information about the form stored at this location can be retrieved directly by occurrences of the form in text . Finally , information is retrieved from the dictionary as required by stages of the translation process -- the grammatical description for sentence-structure determination , equivalent-choice information for semantic analysis , and target-language equivalents for output construction .

The dictionary is a form dictionary , at least in the sense that complete forms are used as the basis for matching text occurrences with dictionary entries . Also , the dictionary is divided into at least two parts : the list of dictionary forms and the file of information that pertains to these forms . A more detailed description of dictionary operations -- text lookup and dictionary modification -- gives a clearer picture .

Text lookup , as we will describe it , consists of three steps . The first is compiling a list of text forms , assigning an information cell to each , and replacing text occurrences with the information cell assigned to the form of each occurrence . For this step the computer memory is separated into three regions : cells in the W-region are used for storage of the forms in the text-form list ; ; cells in the X-region and Y region are reserved as information cells for text forms .

When an occurrence Af is isolated during text reading , a random memory address Af , the address of a cell in the X-region , is computed from the form of Af . Let Af denote the form of Af . If cell Af has not previously been assigned as the information cell of a form in the text-form list , it is now assigned as the information cell of Af . The form itself is stored in the next available cells of the W-region , beginning in cell Af . The address Af and the number of cells required to store the form are written in Af ; ; the information cell Af is saved to represent the text occurrence . Text reading continues with the next occurrence .

Let us assume that Af is identical to the form of an occurrence Af which preceded Af in the text . When this situation exists , the address Af will equal Af which was produced from Af . If Af was assigned as the information cell for Af , the routine can detect that Af is identical to Af by comparing Af with the form stored at location Af . The address Af is stored in the cell Af . When , as in this case , the two forms match , the address Af is saved to represent the occurrence Af . Text reading continues with the next occurrence .

A third situation is possible . The formula for computing random addresses from the form of each occurrence will not give a distinct address for each distinct form . Thus , when more than one distinct form leads to a particular cell in the X-region , a chain of information cells must be created to accommodate the forms , one cell in the chain for each form . If Af leads to an address Af that is equal to the address computed from Af , even though Af does not match Af , the chain of information cells is extended from Af by storing the address of the next available cell in the Y-region , Af , in Af . The cell Af becomes the second information cell in the chain and is assigned as the information cell of Af . A third cell can be added by storing the address of another Y-cell in Af ; ; similarly , as many cells are added as are required . Each information cell in the chain contains the address of the Y-cell where the form to which it is assigned is stored . Each cell except the last in the chain also contains the address of the Y-cell that is the next element of the chain ; ; the absence of such a link in the last cell indicates the end of the chain . Hence , when the address Af is computed from Af , the cell Af and all Y-cells in its chain must be inspected to determine whether Af is already in the form list or whether it should be added to the form list and the chain . When the information cell for Af has been determined , it is saved as a representation of Af . Text reading continues with the next occurrence .

Text reading is terminated when a pre-determined number of forms have been stored in the text-form list . This initiates the second step of glossary lookup -- connecting the information cell of forms in the text-form list to dictionary forms . Each form represented by the dictionary is looked up in the text-form list . Each time a dictionary form matches a text form , the information cell of the matching text form is saved . The number of dictionary forms skipped since the last one matched is also saved . These two pieces of information for each dictionary form that is matched by a text form constitute the table of dictionary usage . If each text form is marked when matched with a dictionary form , the text forms not contained in the dictionary can be identified when all dictionary forms have been read . The appropriate action for handling these forms can be taken at that time .

Each dictionary form is looked up in the text-form list by the same method used to look up a new text occurrence in the form list during text reading . A random address Af that lies within the X-region of memory mentioned earlier is computed from the i-th dictionary form . If cell Af is an information cell , it and any information cells in the Y-region that have been linked to Af each contain an address in the W-region where a potentially matching form is stored . The dictionary form is compared with each of these text forms . When a match is found , an entry is made in the table of dictionary usage . If cell Af is not an information cell we conclude that the i-th dictionary form is not in the text list .

These two steps essentially complete the lookup operation . The final step merely uses the table of dictionary usage to select the dictionary information that pertains to each form matched in the text-form list , and uses the list of information cells recorded in text order to attach the appropriate information to each occurrence in text . The list of text forms in the W-region of memory and the contents of the information cells in the X and Y-regions are no longer required . Only the assignment of the information cells is important .

The first stage of translation after glossary lookup is structural analysis of the input text . The grammatical description of each occurrence in the text must be retrieved from the dictionary to permit such an analysis . A description of this process will serve to illustrate how any type of information can be retrieved from the dictionary and attached to each text occurrence .

The grammatical descriptions of all forms in the dictionary are recorded in a separate part of the dictionary file . The order is identical to the ordering of the forms they describe . When entries are being retrieved from this file , the table of dictionary usage indicates which entries to skip and which entries to store in the computer . This selection-rejection process takes place as the file is read . Each entry that is selected for storage is written into the next available cells of the Aj . The address of the first cell and the number of cells used is written in the information cell for the form . ( The address of the information cell is also supplied by the table of dictionary usage . ) When the complete file has been read , the grammatical descriptions for all text forms found in the dictionary have been stored in the W-region ; ; the information cell assigned to each text form contains the address of the grammatical description of the form it represents . Hence , the description of each text occurrence can be retrieved by reading the list of text-ordered information-cell addresses and outputting the description indicated by the information cell for each occurrence .

The only requirements on dictionary information made by the text-lookup operation are that each form represented by the dictionary be available for lookup in the text-form list and that information for each form be available in a sequence identical with the sequence of the forms . This leaves the ordering of entries variable . ( Here an entry is a form plus the information that pertains to it .

Two very useful ways for modifying a form-dictionary are the addition to the dictionary of complete paradigms rather than single forms and the application of a single change to more than one dictionary form . The former is intended to decrease the amount of work necessary to extend dictionary coverage . The latter is useful for modifying information about some or all forms of a word , hence reducing the work required to improve dictionary contents . Applying the techniques developed at Harvard for generating a paradigm from a representative form and its classification , we can add all forms of a word to the dictionary at once . An extension of the principle would permit entering a grammatical description of each form . Equivalents could be assigned to the paradigm either at the time it is added to the dictionary or after the word has been studied in context . Thus , one can think of a dictionary entry as a word rather than a form .

If all forms of a paradigm are grouped together within the dictionary , a considerable reduction in the amount of information required is possible . For example , the inflected forms of a word can be represented , insofar as regular inflection allows , by a stem and a set of endings to be attached . ( Indeed , the set of endings can be replaced by the name of a set of endings . ) The full forms can be derived from such information just prior to the lookup of the form in the text-form list . Similarly , if the equivalents for the forms of a word do not vary , the equivalents need be entered only once with an indication that they apply to each form . The dictionary system is in no way dependent upon such summarization or designed around it . When irregularity and variation prevent summarizing , information is written in complete detail . Entries are summarized only when by doing so the amount of information retained in the dictionary is reduced and the time required for dictionary operations is decreased .