SCI/Specifications/SCI in action/Parser

Revision as of 19:47, 16 January 2009 by Timofonic (talk | contribs) (Merging of the SCI documentation. Work in progress. Formatting needs improving.)

The Parser

Vocabulary file formats

Originally written by Lars Skovlund

The main vocabulary (VOCAB.000)

The file begins with a list of 26 offsets. Each entry corresponds to a letter of the (English) alphabet and points to the first word starting with that letter; the offset is 0 if no word starts with that letter. If an input word starts with a letter, the interpreter uses this table to jump directly to the relevant part of the vocabulary. The table is not strictly necessary, but it speeds up the lookup somewhat.
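
The lookup through the offset table might be sketched as follows. The text does not state the width or byte order of the offsets, so 16-bit little-endian values are an assumption here, and the function names are made up for illustration.

```python
import struct

def read_offset_table(data):
    """Read the 26 leading offsets (ASSUMED to be 16-bit little-endian)."""
    return list(struct.unpack_from("<26H", data, 0))

def first_word_offset(offsets, word):
    """Return the offset of the first entry sharing the word's initial letter.

    Returns None when the word does not start with a-z or when the table
    entry is 0 (no words start with that letter).
    """
    first = word[:1].lower()
    if not ("a" <= first <= "z"):
        return None  # non-alphabetic input: the table cannot help
    offset = offsets[ord(first) - ord("a")]
    return offset or None
```

A caller would start scanning the word list at the returned offset and fall back to some other strategy when the result is `None`.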

After the offset table come the actual words. A word definition consists of two parts: the text of the word, compressed in a special way, and a 24-bit (yes, three-byte) ID. The ID is divided into two 12-bit quantities: a word class mask (grammatically speaking) and a group number. The class mask is used, among other things, for throwing away unnecessary words. "Take book", for instance, is a valid sentence in parser'ese, while it isn't in English.
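
As a rough illustration, the three ID bytes could be split like this. The byte order and which half holds the class are not spelled out in the text; this sketch assumes a big-endian value with the class mask in the upper 12 bits and the group in the lower 12 bits. `split_word_id` is a hypothetical helper name.

```python
def split_word_id(id_bytes):
    """Split a 3-byte word ID into (class_mask, group).

    ASSUMPTION: the bytes form one big-endian 24-bit value whose upper
    12 bits are the class mask and whose lower 12 bits are the group.
    """
    if len(id_bytes) != 3:
        raise ValueError("a word ID is exactly three bytes long")
    value = int.from_bytes(id_bytes, "big")
    class_mask = value >> 12
    group = value & 0xFFF
    return class_mask, group
```

Under these assumptions, the bytes `10 00 2A` would decode to class mask 0x100 and group 0x02A.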

The possible values are arranged as a bit field to allow for class masks (see below). Only one bit is actually tested by the interpreter. If a word class equals 0xff ("anyword"), the word is excluded (allowing for parser'ese). The values are as follows:

0x001
Number (not found in the vocabulary, set internally)
0x002
Special
0x004
Special
0x008
Special[1]
0x010
Preposition
0x020
Article
0x040
Qualifying adjective
0x080
Relative pronoun
0x100
Noun
0x200
Indicative verb (such as "is", "went", as opposed to _do_ this or that, which is imperative)
0x400
Adverb
0x800
Imperative verb
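
Because the classes are single bits, a word can belong to several classes at once and a mask can select several classes in one test. A minimal sketch in Python, using the values from the list above (the enum and member names are invented for illustration and do not come from the interpreter):

```python
from enum import IntFlag

class WordClass(IntFlag):
    """Word class bits as listed above; the names are illustrative only."""
    NUMBER = 0x001
    SPECIAL_1 = 0x002
    SPECIAL_2 = 0x004
    SPECIAL_3 = 0x008
    PREPOSITION = 0x010
    ARTICLE = 0x020
    QUALIFYING_ADJECTIVE = 0x040
    RELATIVE_PRONOUN = 0x080
    NOUN = 0x100
    INDICATIVE_VERB = 0x200
    ADVERB = 0x400
    IMPERATIVE_VERB = 0x800

def matches(word_class, mask):
    """True if the word belongs to at least one class selected by the mask."""
    return bool(word_class & mask)
```

For example, a word stored with class 0x900 would match both a noun mask and an imperative verb mask.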

The group number is used to implement synonyms (words with the same meaning), as well as by the Said instruction to identify words. There is also a way of using synonyms in code; see the appropriate document.

The compression works as follows: each string starts with a byte-sized copy count, which gives the number of characters retained from the start of the previous string. The remaining text follows in plain 7-bit ASCII. The last character of the text has its high bit set; there is no null terminator.

Here is an example of the compression scheme:

apple 0,appl\0xE5


The copy count is 0 because we assume that "apple" is the first word beginning with an "a" (not likely, though!). 0xE5 is 0x65 (the ASCII value for 'e') | 0x80. Now consider the next word:

athlete 1,thlet\0xE5


Here, the initial letter is identical to that of its predecessor, so the copy count is 1. Another example:

atrocious 2,rociou\0xF3
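
The three examples can be decoded with a short routine along these lines. This is only a sketch of the scheme as described: `decode_text` is a hypothetical name, and the demo stream leaves out the 3-byte ID that follows each word in the real file.

```python
def decode_text(data, pos, previous):
    """Decode one compressed word starting at data[pos].

    `previous` is the previously decoded word. Returns (word, next_pos),
    where next_pos points just past the final, high-bit-marked character.
    """
    copy_count = data[pos]
    pos += 1
    chars = list(previous[:copy_count])  # characters retained from the predecessor
    while True:
        byte = data[pos]
        pos += 1
        chars.append(chr(byte & 0x7F))   # strip the end-of-word marker bit
        if byte & 0x80:                  # high bit set: this was the last character
            break
    return "".join(chars), pos

# The three example words back to back (IDs omitted for brevity).
stream = b"\x00appl\xe5" + b"\x01thlet\xe5" + b"\x02rociou\xf3"
pos, word, words = 0, "", []
while pos < len(stream):
    word, pos = decode_text(stream, pos, word)
    words.append(word)
print(words)  # ['apple', 'athlete', 'atrocious']
```

When reading an actual vocabulary, the caller would consume the three ID bytes after each call before decoding the next word.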


The suffix vocabulary (VOCAB.901)

For the following section, a reference to a grammar book may be advisable.

The suffix vocabulary is structurally much simpler. It consists of variably-sized records with this layout:

NULL-TERMINATED Suffix string
WORD The class mask for the suffix
NULL-TERMINATED Reduced string
WORD The output word class


The suffix vocabulary is used by the interpreter to parse compound words and other words which consist of more than one part. For instance, a simple plural noun like "enemies" is reduced to its singular form "enemy", "stunning" is converted to "stun", etc. The point is that the interpreter gets a second chance at figuring out the meaning if the word cannot be identified as entered. A word whose reduced form belongs to a different word class does not lose its own class; the correct class is always retained. Thus "carefully", an adverb, is reduced to its adjectival form "careful" and found in the vocabulary as such, but it is still marked as an adverb.

The suffix vocabulary consists of variably-sized records with this layout:

NULL-TERMINATED Suffix string
WORD The output word class
NULL-TERMINATED Reduced string
WORD The allowed class mask for the reduced string

An asterisk (*) represents the word stem. Taking the above example with "enemies", the interpreter finds this record:

*ies
0x100
*y
0x100

Word class 0x100 being a noun.

The interpreter then tries to replace "enemies" with "enemy" and finds that word in the vocabulary. "Enemy" is a noun (class 0x100), which is what the suffix vocabulary requires. Since the lookup succeeded, the word class is set to the output value (which is, incidentally, also 0x100).
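
The reduction step might be sketched as below. It assumes the field order shown just before the "enemies" example (suffix string, output class, reduced string, allowed class mask) and a hypothetical `vocabulary` dictionary mapping words to class masks. The `*ly` rule in the usage lines is made up to mirror the "carefully" case and is not taken from a real VOCAB.901.

```python
from collections import namedtuple

# Field order is an assumption based on the record layout given above.
SuffixRule = namedtuple("SuffixRule", "suffix output_class reduced allowed_mask")

def apply_suffix_rule(word, rule, vocabulary):
    """Try one suffix rule; return (reduced_word, output_class) or None.

    `vocabulary` is a hypothetical dict mapping known words to class masks.
    """
    suffix = rule.suffix.lstrip("*")        # "*ies" -> "ies"
    replacement = rule.reduced.lstrip("*")  # "*y"   -> "y"
    if not word.endswith(suffix) or len(word) <= len(suffix):
        return None                         # rule does not apply to this word
    stem = word[:len(word) - len(suffix)]
    candidate = stem + replacement
    word_class = vocabulary.get(candidate)
    if word_class is None or not (word_class & rule.allowed_mask):
        return None                         # unknown word or wrong class
    return candidate, rule.output_class

vocabulary = {"enemy": 0x100, "careful": 0x040}
plural = SuffixRule("*ies", 0x100, "*y", 0x100)
adverb = SuffixRule("*ly", 0x400, "*", 0x040)  # hypothetical rule
print(apply_suffix_rule("enemies", plural, vocabulary))    # ('enemy', 256)
print(apply_suffix_rule("carefully", adverb, vocabulary))  # ('careful', 1024)
```

Note how the second call finds "careful" as an adjective (0x040) but reports the adverb class (0x400), matching the behaviour described for "carefully".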

Numbers

If the word turns out to be a number (written with digits, that is), and that number is not listed explicitly in the vocabulary, it gets an ID of 0xFFD and a word class of 0x001 (the "Number" class listed above).

The tree vocabulary (VOCAB.900)

This vocabulary is used solely for building parse trees. It consists of a series of word values which end up in the data nodes on the tree. It doesn't make much sense without the original parsing code.

  1. The three special classes are apparently used for words with very specific semantics, such as "if", "not", "and" etc. It is not yet known whether they receive special treatment by the parser.