Difference between revisions of "SCI/Specifications/SCI in action/Parser"

From ScummVM :: Wiki
Jump to navigation Jump to search
(Merging of the SCI documentation. Work in progress.)
 
(14 intermediate revisions by 2 users not shown)
Line 1: Line 1:
''Document conversion incomplete. Work in progress.''
=The Parser=
=The Parser=


Line 22: Line 20:
;0x080 : Relative pronoun
;0x080 : Relative pronoun
;0x100 : Noun
;0x100 : Noun
;0x200 : Indicative verb )such as "is","went" as opposed to _do_ this or that, which is imperative)
;0x200 : Indicative verb (such as "is","went" as opposed to _do_ this or that, which is imperative)
;0x400 : Adverb
;0x400 : Adverb
;0x800 : Imperative verb
;0x800 : Imperative verb
Line 54: Line 52:
|}
|}
<br>
<br>
===The suffix vocabulary (VOCAB.901)===
===The suffix vocabulary (VOCAB.901)===
For the following section, a reference to a grammar book may be advisable.
For the following section, a reference to a grammar book may be advisable.
Line 155: Line 154:
After tokenizing, looking up, and finally aliasing the data found in the parsed input string, the interpreter proceeds to build a parse tree <!-- <math>T_\pi</math> --> <i>T</i><sub>&pi;</sub> from the input tokens
After tokenizing, looking up, and finally aliasing the data found in the parsed input string, the interpreter proceeds to build a parse tree <!-- <math>T_\pi</math> --> <i>T</i><sub>&pi;</sub> from the input tokens


<br><i>I :=</i> &omega;<sub>0</sub>, &omega;<sub>1</sub>,&omega;<sub>2</sub> &hellip; &omega;<sub>n-1</sub> <!-- Mediawiki LaTeX <math>I:=~\omega_0,\omega_1,\omega_2 \ldots \omega_{n-1}</math> --><br>
:<math>I~:=~\omega_0,\omega_1,\omega_2 \ldots \omega_{n-1}</math>


where
where


* <i>&omega;<sub>j</sub> &isin; W</i>
* <math>\omega_j \in W</math>, with <math>W</math> being the set of all words;
* <i>&gamma;<sub>j</sub> &isin; &Gamma;</i>
* <math>\gamma_j \in \Gamma</math>, with <math>\Gamma</math> being the set of all word groups;
* <i>&mu;<sub>j</sub> &isin; 2<sup>c</sup></i>
* <math>\mu_j \in 2^C</math>, with <math>C</math> being the set of all class masks <nowiki>{0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80,0x100}</nowiki>, and <math>2^C</math> being the set of all subsets of <math>C</math>;
* <i>&omega;<sub>j</sub> = (&gamma;<sub>j</sub>, &mu;<sub>j</sub></i>
* <math>\omega_j = (\gamma_j, \mu_j)</math>, with <math>\gamma_j</math> being the word group <math>\omega_j</math> belongs to, and <math>\mu_j</math> being its class mask.
<!-- Math formulas
 
* <math>\omega_j \in W</math>
Note that elements of <math>2^C</math> (i.e. sets of class masks) can be identified with the ORed value of these class masks. So the set <nowiki>{0x2,0x4,0x80}</nowiki> can be identified with the value 0x86.
*<math>\gamma_j \in \Gamma</math>
*<math>\mu_j \in 2^C</math>
*<math>\omega_j = (\gamma_j, \mu_j</math> -->


With <i>W</i> being the set of all words, <i>&Gamma;</i> being the set of all word groups, <i>C</i> being the set of all class masks <nowiki>{1,2,4,8,10,20,40,80,100}</nowiki>, &gamma;<sub>j</sub> being the word group <i>&omega;<sub>j</sub></i> belongs to, and <i>&omega;<sub>j</sub></i> being its class mask, as described above.
For the following sections, we define two maps and a set:


For the following sections, we define
* <math>\operatorname{group}: W \to \Gamma  : (\gamma,\mu)  \mapsto \gamma</math>
* <math>\operatorname{classes}: W \to 2^C : (\gamma,\mu) \mapsto \mu</math>
* <math>C_x = \{ \omega \in W \mid x \in \operatorname{classes}(\omega)\}</math>


* group <i>W &rarr; &Gamma;.group : (&gamma;,&mu;) (&gamma;,&mu;) &#8614; &gamma;</i>
To do that, it uses the class masks <math>M</math> as input for a pushdown automaton (PDA) A built from a parser grammar; if <math>M</math> was accepted by <math>A</math>, the parse tree <math>T_pi</math> will be built from the matching syntax tree to represent the semantics.
* classes: <i>W &rarr; C</i>.classes: (&gamma;,&mu;) &#8614; &mu;
* <i>C<sub>x</sub> = {&omega;|&omega; &isin; class(&omega;)}</i>


To do that, it uses the class masks <i>M</i> as input for a pushdown automaton (PDA) A built from a parser grammar; if <i>M</i> was accepted by <i>A</i>, the parse tree <i>T</i><sub>&pi;</sub> will be built from the matching syntax tree to represent the semantics.  
The PDA is defined by a grammar <math>G = (V,\Sigma,P,s)</math> where
* <math>V</math> is finite set of <i>nonterminal symbols</i>,
* <math>\Sigma</math> is a finite set <i>terminal symbols</i> that is disjoint from <math>V</math>,
* <math>P</math> is a finite set of <i>production rules</i>,
* <math>s \in V</math> is the <i>start symbol</i>.


The PDA is defined by a grammar <i>G = (V,&Sigma;,P,s)</i> most of which, along with its semantics, is stored in <tt>vocab.900</tt>. This resource contains a parser rule at every 20 bytes, starting with a non-terminal symbol &upsilon; (one word) and a null-terminated list of up to five tuples <i>&sigma;<sub>i</sub>,m<sub>i</sub></i> , both of which are words. In these tuples, <i>m<sub>i</sub></i> is a terminal or non-terminal symbol (determined by <i>&sigma;<sub>i</sub></i> ), and <i>&sigma;<sub>i</sub></i> is the meaning of <i>m<sub>i</sub></i> :
Most of this grammar, along with its semantics, is stored in <tt>vocab.900</tt>. This resource contains a parser rule at every 20 bytes, starting with a non-terminal symbol <math>\upsilon</math> (one word) and a null-terminated list of up to five tuples <math>(\sigma_i, m_i)</math>, both of which are words. In these tuples, <math>m_i</math> is a terminal or non-terminal symbol (determined by <math>\sigma_i</math> ), and <math>\sigma_i</math> is the meaning of <math>m_i</math> as described in the following table:


{| border="1"  
{| border="1" style="margin: 1em auto 1em auto"
|<i>&sigma;<sub>i</sub></i>
|<math>\sigma_i</math>
|Type
|Type
|Meaning
|Meaning
Line 204: Line 204:
|0x146
|0x146
|Terminal
|Terminal
|Match on class mask: Matches if <span style="white-space:nowrap"><i>m<sub>i</sub> &isin; classes(&omega;<sub>j</sub>)</i></span>
|Match on class mask: Matches if <math>m_i\in\operatorname{classes}(\omega_j)</math>
|----
|----
|0x14d
|0x14d
|Terminal
|Terminal
|Match on word group: Matches if <span style="white-space:nowrap"><i>m<sub>i</sub> &#61; group(&omega;<sub>j</sub>)</i></span>
|Match on word group: Matches if <math>m_i=\operatorname{group}(\omega_j)</math>
|----
|----
|0x154
|0x154
Line 216: Line 216:
|}
|}


With the notable exception of the first rule, these rules constitute  <span style="white-space:nowrap"><i>P. V := {&chi;|&exist;R &isin; P.&chi; &exist R}</i></span>; typically, <span style="white-space:nowrap"><i> V = {0&chi;12&fnof;&hellip;0&chi;13&fnof;}&sdot; s = m<sub>0</sub></i></span> of the first rule encountered; in all games observed, it was set to 0x13c. &Sigma; contains all word groups and class masks. For the sake of simplicity, we will consider rules matching composite class masks to be several rules. Here is a simplified example of what such a grammar might look like (the hexadecimal prefix '0x' is omitted for brevity):
With the notable exception of the first rule, these rules constitute  the set of production rules <math>P</math>. Furthermore we have <math>V := \{x \mid \exists R \in P : x \in R\}</math>; typically, <math>V = \{ \mathrm{0x12f}, \ldots, \mathrm{0x13f}\}</math> and  <math>s = m_0</math> of the first rule encountered; in all games observed, it was set to 0x13c. <math>\Sigma</math> contains all word groups and class masks. For the sake of simplicity, we will consider rules matching composite class masks to be several rules. Here is a simplified example of what such a grammar might look like (the hexadecimal prefix '0x' is omitted for brevity):


In addition to this grammar, each right-hand non-terminal <i>m<sub>i</sub></i> carries its semantic value <i>&rho;<sub>i</sub></i> , which is not relevant for constructing a syntax tree, but must be considered for the semantic tree <i>T</i><sub>&pi;</sub>. These values were omitted in the example above. As in the example above, the grammar is a context-free (type 2) grammar, almost in Chomsky Normal Form (CNF) in SCI; constructing a grammar with CNF rules from it would be trivial.<ref>FreeSCI constructs a GNF (Greibach Normal Form) representation from these rules for parsing.</ref>
In addition to this grammar, each right-hand non-terminal <math>m_i</math> carries its semantic value <math>\rho_i</math> , which is not relevant for constructing a syntax tree, but must be considered for the semantic tree <math>T_\pi</math>. These values were omitted in the example above. As in the example above, the grammar is a context-free (type 2) grammar, almost in Chomsky Normal Form (CNF) in SCI; constructing a grammar with CNF rules from it would be trivial.<ref>FreeSCI constructs a GNF (Greibach Normal Form) representation from these rules for parsing.</ref>


Obviously, G is an ambiguous grammar. In SCI, rule precedence is implied by rule order, so the resulting left derivation tree is well-defined (in the example, it would be defined by <i>D<sub>0</sub></i>.<ref>In FreeSCI, you can use the ”parse” console command to retreive all possible left derivation trees.</ref>
Obviously, <math>G</math> is an ambiguous grammar. In SCI, rule precedence is implied by rule order, so the resulting left derivation tree is well-defined (in the Parser example further down, it would be defined by <math>D_0</math>.<ref>In FreeSCI, you can use the ”parse” console command to retrieve all possible left derivation trees.</ref>
<br>
<br>


====Parser grammar example====
====Parser grammar example====
<span style="white-space:nowrap">G = &lang;{&rang;12&fnof;&hellip;13e,(C<sub>1</sub>,C<sub>2</sub>,C<sub>4</sub>,&hellip;,C<sub>1</sub>00},P,13c)</span>
<math>G = \langle \{ \mathrm{12f}, \ldots, \mathrm{13e} \}, \{ C_1, C_2, C_4, \ldots, C_{100} \}, P, \mathrm{13c} \rangle</math>
{| <i>
{| <i>
|----
|----
|
|P = {
|13c
|13c
|&#8614;
|&rarr;
|13b 134
|13b 134
|----
|----
|
|
|13c
|13c
|&#8614;
|&rarr;
|13b 13d 133
|13b 13d 133
|----
|----
|
|
|13c
|13c
|&#8614;
|&rarr;
|13b 13d
|13b 13d
|----
|----
|
|
|13c
|13c
|&#8614;
|&rarr;
|13c
|13b
|----
|----
|
|
|13c
|13c
|&#8614;
|&rarr;
|13b 13d 13b 13d
|13b 13d 13b 13d
|----
|----
|
|
|13b
|13b
|&#8614;
|&rarr;
|131 134
|131 134
|----
|----
|
|
|13b
|13b
|&#8614;
|&rarr;
|131 13d  13d
|131 13d  13d
|----
|----
|rowspan=2| P = {
|
|13b
|13b
|&#8614;
|&rarr;
|131
|131
|----
|----
|
|13d
|13d
|&#8614;
|&rarr;
|134
|134
|----
|----
|
|
|131
|131
|&#8614;  
|&rarr;  
|C<sub>8</sub>0
|C<sub>80</sub>
|----
|----
|
|
|133
|133
|&#8614;
|&rarr;
|C<sub>2</sub>0
|C<sub>20</sub>
|----
|----
|
|
|134
|134
|&#8614;
|&rarr;
|C<sub>1</sub>0}
|C<sub>10</sub> }
|----
|----
|}</i>
|}</i>


===Semantics===
===Semantics===
This is important, since the parser does much more than just accept or discard input. Using the semantic tags applied to each non-terminal on the right-hand side of a rule, it constructs what I will call the semantic parse tree <i>T</i><sub>&pi;</sub> , which attempts to describe what the input means. For each non-terminal rule
This is important, since the parser does much more than just accept or discard input. Using the semantic tags applied to each non-terminal on the right-hand side of a rule, it constructs what I will call the semantic parse tree <math>T_\pi</math> , which attempts to describe what the input means. For each non-terminal rule


<i>r = &upsilon;<sub>0</sub> &#8614; &upsilon;<sub>1</sub>&upsilon;<sub>2</sub>&hellip;&upsilon;<sub>n</sub></i>
<math>r = v_0 \to v_1 v_2 \ldots v_n</math>


there are semantic tags <i>&sigma;<sub>r,1</sub>,&sigma;<sub>r,2</sub>&hellip;&sigma;<sub>r,n</sub> &isin; S</i> , as explained above. <i>T</i><sub>&pi;</sub> is now constructed from the resulting derivation and the semantic tags assiociated with each non-terminal of the rule used. The construction algorithm is explained below with <i>T</i><sub>&pi;</sub> being constructed from nodes, which have the following structure:
there are semantic tags <math>\sigma_{r,1}, \sigma_{r,2} \ldots \sigma_{r,n} \in S</math>, as explained above. <math>T_\pi</math> is now constructed from the resulting derivation and the semantic tags assiociated with each non-terminal of the rule used. The construction algorithm is explained below with <math>T_\pi</math> being constructed from nodes, which have the following recursive structure:


<i>NODE = {&#9671;} &cup; S &#215; V &#215; (NODE &cup; &Gamma;);*</i>
<math>
\newcommand{\Node}{\textrm{NODE}}
\Node = \{\diamond\} \cup S \times V \times (\Node \cup \Gamma)*;</math>


Where S is the set of possible semantic values, and V is the set of non-terminals as defined in the grammar. We will also use the sequence <i>&gamma;<sub>0</sub>,&gamma<sub>1</sub>,&gamma<sub>2</sub> &hellip; &gamma<sub>k-1</sub></i> which will represent the word groups the input tokens belonged to (in the exact order they were accepted), and the sequence <i>r<sub>0</sub>,r<sub>1</sub>,r<sub>2</sub> &hellip; r<sub>l-1</sub></i> , which will be the list of rules used to create the left derivation tree as described in the previous section.
Where <math>S</math> is the set of possible semantic values, and <math>V</math> is the set of non-terminals as defined in the grammar. We will also use the sequence <math>\gamma_{0}, \gamma_{1}, \gamma_{2} \ldots \gamma_{k-1}</math> which will represent the word groups the input tokens belonged to (in the exact order they were accepted), and the sequence <math>r_{0}, r_{1}, r_{2} \ldots r_{l-1}</math>, which will be the list of rules used to create the left derivation tree as described in the previous section.


 
<pre>
<syntax type="?">
  Helper function sci_said_recursive: S &times; V &times; (V &cup; &Sigma;)* &rarr; NODE
              Helper function sci_said_recursive: S \times V \times (V \cup \Sigma)* \to \Node
  Parameters: s &isin; S, Rule r &isin; V &times; (V &cup; &Sigma;): v0 &rarr; v1 v2 ... vi
              Parameters: s \in S, Rule r \in V \times (V \cup \Sigma): v0 \to v1 v2 ... vi
  cnmr = cnr
              cnmr = cnr
  NODE n := s, v0
              \Node n := s, v0
  FOR j := 1 TO i
              FOR j := 1 TO i
    IF (vj &isin; &Sigma;) THEN
                        IF (vj \in \Sigma) THEN
      n := n, &gamma;cn&gamma;
                                n := n, \gammacn\gamma
      cn&gamma; := cn&gamma; + 1
                                cn\gamma := cn\gamma + 1
    ELSE
                        ELSE
      cnoldr := cnr
                                cnoldr := cnr
      cnr := cnr + 1
                                cnr := cnr + 1
      n := n, sci_said_recursive(&Sigma;rmr,j, rcnoldr)
                                n := n, sci_said_recursive(\sigmarmr,j, rcnoldr)
    FI
                        FI
  ROF
              ROF
  RETURN (n)
              RETURN (n)
 
 
  Helper function get_children: NODE &rarr; NODE*
 
          get_children((s, v, n0, n1 ... nm)) := n0, n1 ... nm
              Helper function get_children: \Node \to \Node*
 
                              get_children((s, v, n0, n1 ... nm)) := n0, n1 ... nm
 
 
  Algorithm SCI-SAID-TREE
 
  cn&gamma; := 0;
              Algorithm SCI-SAID-TREE
  cnr := 1;
              cn\gamma := 0;
  ntemp := ntemp, SCI-SAID-RECURSIVE(0, r0)
              cnr := 1;
  root(T&Pi;) := (141, 13f, get_children(ntemp))
              ntemp := ntemp, SCI-SAID-RECURSIVE(0, r0)
</pre>
              root(T\Pi) := (141, 13f, get_children(ntemp))
</syntax>


Here is an example, based on the previous one:
Here is an example, based on the previous one:
Line 333: Line 334:
====Example: Parser example====
====Example: Parser example====
Parse is called with <i>"open door"</i>.
Parse is called with <i>"open door"</i>.
* <i>"open" &isin; &lang;842,{C<sub>8</sub>0}&rang;</i> (an imperative word of the word group 0x842)
* <math>\textrm{"open"} \in \langle 842, \{ C_{80} \} \rangle</math> (an imperative word of the word group 0x842)
* <i>"door" &isin; &lang;917,{C<sub>1</sub>0}&rang; (a substantive of the word group 0x917)</i>
* <math>\textrm{"door"} \in \langle 917, \{ C_{10} \} \rangle</math> (a substantive of the word group 0x917)
* <i>I = &lang;842,{C<sub>8</sub>0}&rang;,&lang;917,{C<sub>1</sub>0}&rang;</i>
* <math>I = \langle 842, \{C_{80} \} \rangle, \langle 917, \{C_{10} \} \rangle</math>


<i>I</i> is clearly accepted by automatons based on the grammar described above, There are two possible derivations:
<math>I</math> is clearly accepted by automatons based on the grammar described above, There are two possible derivations:


{|
{|
|D<sub>0</sub> = 13c
|colspan="2" align="center"| D<sub>0</sub> = 13c
|width="50em"|&nbsp;
|colspan="2" align="center"| D<sub>1</sub> = 13c
|----
|----
|(13c &#8614; 13b134)
|(13c &#8614; 13b 134)
|&#8866; 13b 134
|&#8866; 13b 134
|
|(13c &#8614; 13b)
|&#8866; 13b
|----
|----
|(13b &#8614; 131)
|(13b &#8614; 131)
|&#8866; 131 134
|
|(13b &#8614; 131 134)
|&#8866; 131 134
|&#8866; 131 134
|----
|----
|(131 &#8614; C<sub>80</sub>
|&#8866; C<sub>80</sub> 134
|
|(131 &#8614; C<sub>80</sub>
|(131 &#8614; C<sub>80</sub>
|&#8866; C<sub>80</sub> 134
|&#8866; C<sub>80</sub> 134
Line 353: Line 365:
|(134 &#8614; C<sub>10</sub>
|(134 &#8614; C<sub>10</sub>
|&#8866; C<sub>80</sub> C<sub>10</sub>
|&#8866; C<sub>80</sub> C<sub>10</sub>
|----
|
|}
 
<br>
 
{|
|D<sub>1</sub> = 13c
|----
|(13c &#8614; 13b)
|&#8866; 13b
|----
|(13b &#8614; 131134)
|&#8866; 131 134
|----
|(131 &#8614; C<sub>80</sub>
|&#8866; C<sub>80</sub> 134
|----
|(134 &#8614; C<sub>10</sub>
|(134 &#8614; C<sub>10</sub>
|&#8866; C<sub>80</sub> C<sub>10</sub>
|&#8866; C<sub>80</sub> C<sub>10</sub>
|----
|----
|}
|}


====Example: Semantic tree example====
====Example: Semantic tree example====
Line 452: Line 447:
This sequence of operators and word groups is now used to build the Said tree <i>T</i><sub>&Sigma;</sub> . I will describe the algorithm used to generate <i>T</i><sub>&Sigma;</sub> by providing a grammar <i>G</i><sub>&Sigma;</sub> , with <i>L(G<sub>&Sigma;</sub>) </i>containing all valid Said specs. The semantics will be provided under each rule with a double arrow:
This sequence of operators and word groups is now used to build the Said tree <i>T</i><sub>&Sigma;</sub> . I will describe the algorithm used to generate <i>T</i><sub>&Sigma;</sub> by providing a grammar <i>G</i><sub>&Sigma;</sub> , with <i>L(G<sub>&Sigma;</sub>) </i>containing all valid Said specs. The semantics will be provided under each rule with a double arrow:


<syntax type="?">
<pre>
         G\Sigma = ({saidspec, optcont, leftspec, midspec, rightspec, word, cwordset, wordset, expr, cwordrefset, wordrefset, recref}, \Gamma, P, saidspec)
         G\Sigma = ({saidspec, optcont, leftspec, midspec, rightspec, word, cwordset, wordset, expr, cwordrefset, wordrefset, recref}, \Gamma, P, saidspec)


Line 543: Line 538:
                                 \Rightarrow (141 144 (144 14f wordset))
                                 \Rightarrow (141 144 (144 14f wordset))
         }
         }
</syntax>
</pre>


==Matching the trees==
==Matching the trees==
The exact algorithm used to compare <i>T</i><sub>&pi;</sub> to <i>T</i><sub>&Sigma;</sub> is not known yet. The one described here is based on the approximation used in FreeSCI, which is very similar to the original SCI one.
The exact algorithm used to compare <math>T_\pi</math> to <math>T_\Sigma</math> is not known yet. The one described here is based on the approximation used in FreeSCI, which is very similar to the original SCI one.
 
First, we need to describe a set of functions for traversing the nodes of <math>T_\Sigma</math> and <math>T_\pi</math>, and doing some work. They will be operating on the sets <math>\mathbb{N}</math> (all non-negative integral numbers), <math>\mathbb{B}</math> (Booleans), and <math>\mathrm{Node}</math> (which we defined earlier).
 
<math>
\newcommand{\tuple}[1]{\langle #1 \rangle}
\newcommand{\nat}{\mathbb{N}}
\newcommand{\bool}{\mathbb{B}}
\newcommand{\Btt}{\textrm{tt}}
\newcommand{\Bff}{\textrm{ff}}
\newcommand{\Node}{\mathrm{Node}}
\begin{align*}
\textrm{first}: & \Node \to S \\
\textrm{first}: & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto s \\
\\
\textrm{second}: & \Node \to V \\
\textrm{second}: & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto v \\
\\
\textrm{word}: & \Node \to \Gamma \\
\textrm{word}: & \tuple{s, v, \gamma} \mapsto \gamma \\
\\
\textrm{children}: & \Node \to \Node* \\
\textrm{children}: & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto \{ m\in \Node \mid m\in\{ n_{0}, n_{1} \ldots n_{i} \} \} \\
\\
\textrm{all-children}: & \Node \to \Node* \\
\textrm{all-children}: & n \mapsto \textrm{children}(n) \cup \{ m \mid \exists l \in\textrm{all-children}(n) . m\in l \} \\
\\
\textrm{is-word} : & \Node \to \bool \\
\textrm{is-word} : & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto \Btt \iff (i = 0) \land n_{0} \in \Gamma \\
\textrm{verify-sentence-part-elements}: & \Node \times \Node \to \bool \\
\textrm{verify-sentence-part-elements}: & \tuple{n_{p}, n_{s}} \mapsto \Btt \iff (\textrm{first}(n_{s} = 152) \land \\
& ((\forall m \in \Node.\textrm{verify-sentence-part-elements}(m, n_{s}) \iff \\
& \{ w \mid \exists t \in \textrm{all-children}(m).w = \textrm{word}(t)\} = \emptyset) \lor \\
& \exists m \in \textrm{children}(n_{s}).\textrm{verify-sentence-part-elements}(m, n_{s})) ) \lor \\
& ((\textrm{second}(n_{s}) = 153) \land (\exists m.m \in \textrm{children}(n_{s}).(\exists o \in \textrm{all-children}(n_{s}).\\
& (\textrm{first}(o) = \textrm{first}(n_{p})) \land \textrm{word}(o) = \textrm{word}(m))) ) \lor \\
& ((\textrm{second}(n_{s}) \in \{144, 14c\}) \land
(\exists m \in \textrm{children}(n_{s}).\textrm{verify-sentence-part}(m, n_{s}))) \\
\\
\textrm{verify-sentence-part}: & \Node \times \Node \to \bool \\
\textrm{verify-sentence-part}: & \tuple{n_{p}, n_{s}} \mapsto \Btt \iff \forall n \in \textrm{children}(n_{s}):\exists m\in\textrm{children}(n_{p}). \\
& (\textrm{first}(m) = \textrm{first}(n)) \land \textrm{verify-sentence-part-elements}(n, m) \\
\\
\textrm{verify-sentence-part-brackets}: & \Node \times \Node \to \bool \\
\textrm{verify-sentence-part-brackets}: & \tuple{n_{p}, n_{s}} \mapsto \Btt \iff
(\textrm{first}(n_{p}) = 152 \land
(\forall m\in \Node. (\textrm{first}(m) = \textrm{first}(n_{s})) \land\\
& (\textrm{second}(m) = \textrm{second}(n_{s})). \textrm{verify-sentence-part}(n_{p}, m) \iff\\
& \{ w \mid \exists t \in \textrm{all-children}(m).w = \textrm{word}(t)\} = \emptyset)) \lor \\
& ((\textrm{first}(n_{p}) \in \{141, 142, 143\}) \land
    \textrm{verify-sentence-part}(n_{p}, n_{s})) \\
\end{align*}
</math>
 
With these functions, we can now define an algorithm for augmenting <math>T_\pi</math> and <math>T_\Sigma</math>:
 
  Algorithm SCI-AUGMENT:
  matched := tt
  claim_on_match := tt
  FOREACH n &isin; root(<math>T_\Sigma</math>)
    IF ((first(n) = 14b) &and; (second(n) = f900)) THEN
      claim_on_match := ff
    ELSE IF &not;verify_sentence_part_brackets(n, root(<math>T_\pi</math>)) THEN
      matched := ff
    ENDIF
  ENDFOREACH


First, we need to describe a set of functions for traversing the nodes of <i>T</i><sub>&Sigma;</sub> and <i>T</i><sub>&pi;</sub> , and doing some work. They will be operating on the sets &#8469; (all non-negative integral numbers), &#120121; (Booleans), and <i>NODE</i> (which we defined earlier).  
Augmenting succeeded if matched = tt; in this case, <math>T_\pi</math> is one of the trees accepted by the description provided by <math>T_\Sigma</math>. In this case, Said() will return 1. It will also claim the event previously provided to Parse(), unless claim_on_match = ff.


==Notes==
==Notes==
<references/>
<references/>

Latest revision as of 15:00, 10 February 2010

The Parser

Vocabulary file formats

Originally written by Lars Skovlund

The main vocabulary (VOCAB.000)

The file begins with a list of 26 offsets. Each index corresponds to a letter in the (English) alphabet, and points to the first word starting with that letter. The offset is set to 0 if no words start with that letter. If an input word starts with an alphabetical letter, this table is used to speed up the vocabulary searching - though not strictly necessary, this speeds up the lookup process somewhat.

After the offset table are the actual words. A word definition consists of two parts: The actual text of the word, compressed in a special way, and a 24-bit (yes, three bytes) ID. The ID divided in 2 12-bit quantities, a word class (grammatically speaking) mask, and a group number. The class mask is used, among other things, for throwing away unnecessary words. "Take book", for instance, is a valid sentence in parser'ese, while it isn't in English.

The possible values are arranged as a bit field to allow for class masks, see later. Only one bit is actually tested by the interpreter. If a word class equals to 0xff ("anyword"), the word is excluded (allowing for parser'ese). The values go like this:

0x001
Number (not found in the vocabulary, set internally)
0x002
Special
0x004
Special
0x008
Special[1]
0x010
Preposition
0x020
Article
0x040
Qualifying adjective
0x080
Relative pronoun
0x100
Noun
0x200
Indicative verb (such as "is","went" as opposed to _do_ this or that, which is imperative)
0x400
Adverb
0x800
Imperative verb

The group number is used to implement synonyms (words with the same meaning), as well as by the Said instruction to identify words. There is also a way of using synonyms in code, see the appropriate document.

The compression works in this way: Each string starts with a byte-sized copy count. This many characters are retained from the previous string. The actual text comes after, in normal low-ascii. The last character in the text has its high bit set (no null termination!).

Here is an example of the compression scheme:

apple 0,appl\0xE5


The byte count is 0 because we assume that "apple" is the first word beginning with an a (not likely, though!). 0xE5 is 0x65 (the ascii value for 'e') | 0x80. Watch now the next word:

athlete 1,thlet\0xE5


Here, the initial letter is identical to that of its predecessor, so the copy count is 1. Another example:

atrocious 2,rociou\0xF3


The suffix vocabulary (VOCAB.901)

For the following section, a reference to a grammar book may be advisable.

The suffix vocabulary is structurally much simpler. It consists of variably-sized records with this layout:

NULL-TERMINATED Suffix string
WORD The class mask for the suffix
NULL-TERMINATED Reduced string
WORD The output word class


The suffix vocabulary is used by the interpreter in order to parse compound words, and other words which consist of more than one part. For instance, a simple plural noun like "enemies" is reduced to its singular form "enemy", "stunning" is converted to "stun" etc. The point is that the interpreter gets a second chance at figuring out the meaning if the word can not be identified as entered. A word which changes its class does might end up as a different word class, the correct class is always retained. Thus, "carefully", an adverb, is reduced to its adjectival form "careful", and found in the vocabulary as such, but it is still marked as an adverb.

The suffix vocabulary consists of variably-sized records with this layout:

NULL-TERMINATED Suffix string
WORD The output word class
NULL-TERMINATED Reduced string
WORD The allowed class mask for the reduced

An asterisk (*) represents the word stem. Taking the above example with "enemies", the interpreter finds this record:

*ies
0x100
*y
0x100

Word class 0x100 being a noun.

The interpreter then tries to replace "enemies" with "enemy" and finds that word in the vocabulary. "Enemy" is a noun (class 1), which it is also supposed to be, according to the suffix vocabulary. Since we succeeded, the word class is set to the output value (which is, incidentally, also 1).

Numbers

If the word turns out to be a number (written with numbers, that is), and that number is not listed explicitly in the vocabulary, it gets an ID of 0xFFD, and a word class of 0x100.

The tree vocabulary (VOCAB.900)

This vocabulary is used solely for building parse trees. It consists of a series of word values which end up in the data nodes on the tree. It doesn't make much sense without the original parsing code.

The black box: The magic behind Sierra's text parser

Original document by Lars Skovlund. Document incomplete by the author.

This document describes the process of parsing user input and relating it to game actions. This document does not describe the process of the user typing his command; only the "behind-the-scenes" work is described, hence the title.

The process of parsing is two-fold, mainly for speed reasons. The Parse kernel function takes the actual input string and generates a special "said" event (type 0x80) from it. This function is only called once per line. Parse can either accept or reject the input.

A rejection can only occur if Parse fails to identify a word in the sentence.

Even if Parse accepts the sentence, it does not need to make sense. Still, syntax checks are made - see later.

Assuming that the parsing succeeded, the User object (which encapsulates the parser) then goes on to call the relevant event handlers. These event hand- lerrs in turn call the Said kernel function. This function is potentially called hundreds or even thousands of times, so it must execute as quickly as possible. Said simply determines from the pre-parsed input line whether or not a specific command is desired.

The Parse function must always work on an internal copy of the actual string, because the user must be able to recall his exact last input using the F3 key. The parser's first step is to convert the input line to pure lower case. This is because the vocabulary words are entered in lower case. The parser then searches the main vocabulary (VOCAB.000), hoping to find the word.

This doesn't necessarily happen yet. Consider, for example, the meaning of the word "carefully", which does not appear in the vocabulary, but is found anyway. This is due to the so-called suffix vocabulary, which is discussed in another document.

If the word still can't be found, the interpreter copies the failing word into a buffer temporarily allocated on the heap (remember, the interpreter operates on its own local buffers). It then calls the Game::wordFail method which prints an appropriate message. The interpreter then deallocates the buffer and exits (it does, however, still return an event. The claimed property of that event is set to TRUE to indicate that the event has already been responded to (error message printed)).

If the interpreter succeeds in identifying all the words, it then goes on to check the syntax of the sentence - it builds a parse tree. See the appropri- ate document.

If the syntax of the sentence is invalid, the interpreter calls Game::syntaxFail, passing the entire input line. As for the error situation, the event is claimed.

As mentioned in the beginning of this text, this function generates an event. This event, apart from its type id, does not contain any data. Rather, all pertinent data is kept in the interpreter.

The Said kernel function is called for each command which the game might respond to at any given time. Its only parameter is a pointer to a said information block which resides in script space. This block is described below (see the Said specs section).

The Said function first does some sanity checking on the event pointer which Parse stored earlier. It must be a said event (type property), and it must not have been handled by an earlier call to Said (claimed property).

It then word-extends the passed said block into a temporary buffer (command codes are byte-sized, remember?). This is supposedly just for convenience/speed, and not really needed.

The Parse tree

This and the two following sections borrow some ideas and structures from abstract language theory. Readers might want to consider related literature.

Most of the information explained here was gathered by Lars Skovlund, and, before that, Dark Minister.

After tokenizing, looking up, and finally aliasing the data found in the parsed input string, the interpreter proceeds to build a parse tree Tπ from the input tokens

[math]\displaystyle{ I~:=~\omega_0,\omega_1,\omega_2 \ldots \omega_{n-1} }[/math]

where

  • [math]\displaystyle{ \omega_j \in W }[/math], with [math]\displaystyle{ W }[/math] being the set of all words;
  • [math]\displaystyle{ \gamma_j \in \Gamma }[/math], with [math]\displaystyle{ \Gamma }[/math] being the set of all word groups;
  • [math]\displaystyle{ \mu_j \in 2^C }[/math], with [math]\displaystyle{ C }[/math] being the set of all class masks {0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80,0x100}, and [math]\displaystyle{ 2^C }[/math] being the set of all subsets of [math]\displaystyle{ C }[/math];
  • [math]\displaystyle{ \omega_j = (\gamma_j, \mu_j) }[/math], with [math]\displaystyle{ \gamma_j }[/math] being the word group [math]\displaystyle{ \omega_j }[/math] belongs to, and [math]\displaystyle{ \mu_j }[/math] being its class mask.

Note that elements of [math]\displaystyle{ 2^C }[/math] (i.e. sets of class masks) can be identified with the ORed value of these class masks. So the set {0x2,0x4,0x80} can be identified with the value 0x86.

For the following sections, we define two maps and a set:

  • [math]\displaystyle{ \operatorname{group}: W \to \Gamma  : (\gamma,\mu) \mapsto \gamma }[/math]
  • [math]\displaystyle{ \operatorname{classes}: W \to 2^C : (\gamma,\mu) \mapsto \mu }[/math]
  • [math]\displaystyle{ C_x = \{ \omega \in W \mid x \in \operatorname{classes}(\omega)\} }[/math]

To do that, it uses the class masks [math]\displaystyle{ M }[/math] as input for a pushdown automaton (PDA) A built from a parser grammar; if [math]\displaystyle{ M }[/math] was accepted by [math]\displaystyle{ A }[/math], the parse tree [math]\displaystyle{ T_pi }[/math] will be built from the matching syntax tree to represent the semantics.

The PDA is defined by a grammar [math]\displaystyle{ G = (V,\Sigma,P,s) }[/math] where

  • [math]\displaystyle{ V }[/math] is finite set of nonterminal symbols,
  • [math]\displaystyle{ \Sigma }[/math] is a finite set terminal symbols that is disjoint from [math]\displaystyle{ V }[/math],
  • [math]\displaystyle{ P }[/math] is a finite set of production rules,
  • [math]\displaystyle{ s \in V }[/math] is the start symbol.

Most of this grammar, along with its semantics, is stored in vocab.900. This resource contains a parser rule at every 20 bytes, starting with a non-terminal symbol [math]\displaystyle{ \upsilon }[/math] (one word) and a null-terminated list of up to five tuples [math]\displaystyle{ (\sigma_i, m_i) }[/math], both of which are words. In these tuples, [math]\displaystyle{ m_i }[/math] is a terminal or non-terminal symbol (determined by [math]\displaystyle{ \sigma_i }[/math] ), and [math]\displaystyle{ \sigma_i }[/math] is the meaning of [math]\displaystyle{ m_i }[/math] as described in the following table:

[math]\displaystyle{ \sigma_i }[/math] Type Meaning
0x141 Non-terminal Predicate part: his identifies the first part of a sentence
0x142 Non-terminal Subject part: This identifies the second part of a sentence
0x143 Non-terminal Suffix part: This identifies the third and last part of a sentence
0x144 Non-terminal Reference part: This identifies words that reference another word in the same sentence part
0x146 Terminal Match on class mask: Matches if [math]\displaystyle{ m_i\in\operatorname{classes}(\omega_j) }[/math]
0x14d Terminal Match on word group: Matches if [math]\displaystyle{ m_i=\operatorname{group}(\omega_j) }[/math]
0x154 Terminal "Force storage": Apparently, this was only used for debugging

With the notable exception of the first rule, these rules constitute the set of production rules [math]\displaystyle{ P }[/math]. Furthermore we have [math]\displaystyle{ V := \{x \mid \exists R \in P : x \in R\} }[/math]; typically, [math]\displaystyle{ V = \{ \mathrm{0x12f}, \ldots, \mathrm{0x13f}\} }[/math] and [math]\displaystyle{ s = m_0 }[/math] of the first rule encountered; in all games observed, it was set to 0x13c. [math]\displaystyle{ \Sigma }[/math] contains all word groups and class masks. For the sake of simplicity, we will consider rules matching composite class masks to be several rules. Here is a simplified example of what such a grammar might look like (the hexadecimal prefix '0x' is omitted for brevity):

In addition to this grammar, each right-hand non-terminal [math]\displaystyle{ m_i }[/math] carries its semantic value [math]\displaystyle{ \rho_i }[/math] , which is not relevant for constructing a syntax tree, but must be considered for the semantic tree [math]\displaystyle{ T_\pi }[/math]. These values were omitted in the example above. As in the example above, the grammar is a context-free (type 2) grammar, almost in Chomsky Normal Form (CNF) in SCI; constructing a grammar with CNF rules from it would be trivial.[2]

Obviously, [math]\displaystyle{ G }[/math] is an ambiguous grammar. In SCI, rule precedence is implied by rule order, so the resulting left derivation tree is well-defined (in the Parser example further down, it would be defined by [math]\displaystyle{ D_0 }[/math].[3]

Parser grammar example

[math]\displaystyle{ G = \langle \{ \mathrm{12f}, \ldots, \mathrm{13e} \}, \{ C_1, C_2, C_4, \ldots, C_{100} \}, P, \mathrm{13c} \rangle }[/math]

P = { 13c 13b 134
13c 13b 13d 133
13c 13b 13d
13c 13b
13c 13b 13d 13b 13d
13b 131 134
13b 131 13d 13d
13b 131
13d 134
131 C80
133 C20
134 C10 }

Semantics

This is important, since the parser does much more than just accept or discard input. Using the semantic tags applied to each non-terminal on the right-hand side of a rule, it constructs what I will call the semantic parse tree [math]\displaystyle{ T_\pi }[/math] , which attempts to describe what the input means. For each non-terminal rule

[math]\displaystyle{ r = v_0 \to v_1 v_2 \ldots v_n }[/math]

there are semantic tags [math]\displaystyle{ \sigma_{r,1}, \sigma_{r,2} \ldots \sigma_{r,n} \in S }[/math], as explained above. [math]\displaystyle{ T_\pi }[/math] is now constructed from the resulting derivation and the semantic tags assiociated with each non-terminal of the rule used. The construction algorithm is explained below with [math]\displaystyle{ T_\pi }[/math] being constructed from nodes, which have the following recursive structure:

[math]\displaystyle{ \newcommand{\Node}{\textrm{NODE}} \Node = \{\diamond\} \cup S \times V \times (\Node \cup \Gamma)*; }[/math]

Where [math]\displaystyle{ S }[/math] is the set of possible semantic values, and [math]\displaystyle{ V }[/math] is the set of non-terminals as defined in the grammar. We will also use the sequence [math]\displaystyle{ \gamma_{0}, \gamma_{1}, \gamma_{2} \ldots \gamma_{k-1} }[/math] which will represent the word groups the input tokens belonged to (in the exact order they were accepted), and the sequence [math]\displaystyle{ r_{0}, r_{1}, r_{2} \ldots r_{l-1} }[/math], which will be the list of rules used to create the left derivation tree as described in the previous section.

  Helper function sci_said_recursive: S × V × (V ∪ Σ)* → NODE
  Parameters: s ∈ S, Rule r ∈ V × (V ∪ Σ): v0 → v1 v2 ... vi
  cnmr = cnr
  NODE n := s, v0
  FOR j := 1 TO i
    IF (vj ∈ Σ) THEN
      n := n, γcnγ
      cnγ := cnγ + 1
    ELSE
      cnoldr := cnr
      cnr := cnr + 1
      n := n, sci_said_recursive(Σrmr,j, rcnoldr)
    FI
  ROF
  RETURN (n)
  
  Helper function get_children: NODE → NODE*
          get_children((s, v, n0, n1 ... nm)) := n0, n1 ... nm
  
  
  Algorithm SCI-SAID-TREE
  cnγ := 0;
  cnr := 1;
  ntemp := ntemp, SCI-SAID-RECURSIVE(0, r0)
  root(TΠ) := (141, 13f, get_children(ntemp))

Here is an example, based on the previous one:

Example: Parser example

Parse is called with "open door".

  • [math]\displaystyle{ \textrm{"open"} \in \langle 842, \{ C_{80} \} \rangle }[/math] (an imperative word of the word group 0x842)
  • [math]\displaystyle{ \textrm{"door"} \in \langle 917, \{ C_{10} \} \rangle }[/math] (a substantive of the word group 0x917)
  • [math]\displaystyle{ I = \langle 842, \{C_{80} \} \rangle, \langle 917, \{C_{10} \} \rangle }[/math]

[math]\displaystyle{ I }[/math] is clearly accepted by automatons based on the grammar described above, There are two possible derivations:

D0 = 13c   D1 = 13c
(13c ↦ 13b 134) ⊢ 13b 134 (13c ↦ 13b) ⊢ 13b
(13b ↦ 131) ⊢ 131 134 (13b ↦ 131 134) ⊢ 131 134
(131 ↦ C80 ⊢ C80 134 (131 ↦ C80 ⊢ C80 134
(134 ↦ C10 ⊢ C80 C10 (134 ↦ C10 ⊢ C80 C10

Example: Semantic tree example

  • k = 2
  • γ0 = 842
  • γ1 = 917
  • l = 4
  • r0 = 13c ↦ 13b 134
  • σr0,1 = 141
  • σr0,2 = 142
  • r1 = 13b ↦ 131
  • σ11,1 = 141
  • r2 = 131 ↦ C80
  • r3 = 134 ↦ C10

The resulting tree would look like this:

(141 13f
	(141 13b
		(141 131 842)
	)
	(142 134 917)
)

Said specs

To test what the player wanted to say, SCI compares Tπ with a second tree, TΣ , which is built from a so-called Said spec. A Said spec is a variable-sized block in SCI memory which consists of a set of byte-sized operators and special tokens (stored in the range 0xf0 to 0xf9) and word groups (in big-endian notation, so that they don't conflict with the operators); it is terminated by the special token 0xff. The meanings of the operators and special tokens are as follows:

Operator Byte representation Meaning
, f0 "OR". Used to specify alternatives to words, such as "take, get".
& f1 Unknown. Probably used for debugging.
/ f2 Sentence part separator. Only two of these tokens may be used, since sentences are split into a maximum of three parts.
( f3 Used together with ')' for grouping
) f4 See '('
[ f5 Used together with ']' for optional grouping. "[a]" means "either a or nothing".
] f6 See '['.
# f7 Unknown assumed to have been used for debugging, if at all.
< f8 Semantic reference operator (as in "get < up").
> f9 Instructs Said() not to claim the event passed to the previous Parse() call on a match. Used for succesive matching.


This sequence of operators and word groups is now used to build the Said tree TΣ . I will describe the algorithm used to generate TΣ by providing a grammar GΣ , with L(GΣ) containing all valid Said specs. The semantics will be provided under each rule with a double arrow:

        G\Sigma = ({saidspec, optcont, leftspec, midspec, rightspec, word, cwordset, wordset, expr, cwordrefset, wordrefset, recref}, \Gamma, P, saidspec)

        P := {
          saidspec \to   leftspec optcont
                                \Rightarrow (141 13f leftspec optcont)
                        | leftspec midspec optcont
                                \Rightarrow (141 13f leftspec midspec optcont)
                        | leftspec midspec rightspec optcont
                                \Rightarrow (141 13f leftspec midspec rightspec optcont)



          optcont \to   e
                                \Rightarrow
                        | >
                                \Rightarrow (14b f900 f900)



          leftspec \to  e
                                \Rightarrow
                        | expr
                                \Rightarrow (141 149 expr)



          midspec \to    / expr
                                \Rightarrow (142 14a expr)
                        | [ / expr ]
                                \Rightarrow (152 142 (142 14a expr))
                        | /
                                \Rightarrow



          rightspec \to  / expr
                                \Rightarrow (143 14a expr)
                        | [ / expr ]
                                \Rightarrow (152 143 (143 14a expr))
                        | /
                                \Rightarrow


          word \to       \gamma \in \Gamma
                                \Rightarrow (141 153 \gamma)


          cwordset \to   wordset
                                \Rightarrow (141 14f wordset)
                        | [ wordset ]
                                \Rightarrow (141 14f (152 14c (141 14f wordset)))


          wordset \to    word
                                \Rightarrow word
                        | ( expr )
                                \Rightarrow (141 14c expr)
                        | wordset , wordset
                                \Rightarrow wordset wordset
                        | wordset , [ wordset ]
                                \Rightarrow wordset wordset


          expr \to               cwordset cwordrefset
                                \Rightarrow cwordset cwordrefset
                        | cwordset
                                \Rightarrow cwordset
                        | cwordrefset
                                \Rightarrow cwordrefset
        

          cwordrefset \to        wordrefset
                                \Rightarrow wordrefset
                        | [ wordrefset ]
                                \Rightarrow (152 144 wordrefset)


          wordrefset \to        < wordset recref
                                \Rightarrow (144 14f word) recref
                        | < wordset
                                \Rightarrow (144 14f word)
                        | < [ wordset ]
                                \Rightarrow (152 144 (144 14f wordset))


          recref \to    < wordset recref
                                \Rightarrow (141 144 (144 14f wordset) recref)
                        | < wordset
                                \Rightarrow (141 144 (144 14f wordset))
        }

Matching the trees

The exact algorithm used to compare [math]\displaystyle{ T_\pi }[/math] to [math]\displaystyle{ T_\Sigma }[/math] is not known yet. The one described here is based on the approximation used in FreeSCI, which is very similar to the original SCI one.

First, we need to describe a set of functions for traversing the nodes of [math]\displaystyle{ T_\Sigma }[/math] and [math]\displaystyle{ T_\pi }[/math], and doing some work. They will be operating on the sets [math]\displaystyle{ \mathbb{N} }[/math] (all non-negative integral numbers), [math]\displaystyle{ \mathbb{B} }[/math] (Booleans), and [math]\displaystyle{ \mathrm{Node} }[/math] (which we defined earlier).

[math]\displaystyle{ \newcommand{\tuple}[1]{\langle #1 \rangle} \newcommand{\nat}{\mathbb{N}} \newcommand{\bool}{\mathbb{B}} \newcommand{\Btt}{\textrm{tt}} \newcommand{\Bff}{\textrm{ff}} \newcommand{\Node}{\mathrm{Node}} \begin{align*} \textrm{first}: & \Node \to S \\ \textrm{first}: & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto s \\ \\ \textrm{second}: & \Node \to V \\ \textrm{second}: & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto v \\ \\ \textrm{word}: & \Node \to \Gamma \\ \textrm{word}: & \tuple{s, v, \gamma} \mapsto \gamma \\ \\ \textrm{children}: & \Node \to \Node* \\ \textrm{children}: & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto \{ m\in \Node \mid m\in\{ n_{0}, n_{1} \ldots n_{i} \} \} \\ \\ \textrm{all-children}: & \Node \to \Node* \\ \textrm{all-children}: & n \mapsto \textrm{children}(n) \cup \{ m \mid \exists l \in\textrm{all-children}(n) . m\in l \} \\ \\ \textrm{is-word} : & \Node \to \bool \\ \textrm{is-word} : & \tuple{s, v, n_{0}, n_{1} \ldots n_{i}} \mapsto \Btt \iff (i = 0) \land n_{0} \in \Gamma \\ \textrm{verify-sentence-part-elements}: & \Node \times \Node \to \bool \\ \textrm{verify-sentence-part-elements}: & \tuple{n_{p}, n_{s}} \mapsto \Btt \iff (\textrm{first}(n_{s} = 152) \land \\ & ((\forall m \in \Node.\textrm{verify-sentence-part-elements}(m, n_{s}) \iff \\ & \{ w \mid \exists t \in \textrm{all-children}(m).w = \textrm{word}(t)\} = \emptyset) \lor \\ & \exists m \in \textrm{children}(n_{s}).\textrm{verify-sentence-part-elements}(m, n_{s})) ) \lor \\ & ((\textrm{second}(n_{s}) = 153) \land (\exists m.m \in \textrm{children}(n_{s}).(\exists o \in \textrm{all-children}(n_{s}).\\ & (\textrm{first}(o) = \textrm{first}(n_{p})) \land \textrm{word}(o) = \textrm{word}(m))) ) \lor \\ & ((\textrm{second}(n_{s}) \in \{144, 14c\}) \land (\exists m \in \textrm{children}(n_{s}).\textrm{verify-sentence-part}(m, n_{s}))) \\ \\ \textrm{verify-sentence-part}: & \Node \times \Node \to \bool \\ \textrm{verify-sentence-part}: & \tuple{n_{p}, n_{s}} \mapsto \Btt \iff \forall n \in \textrm{children}(n_{s}):\exists m\in\textrm{children}(n_{p}). \\ & (\textrm{first}(m) = \textrm{first}(n)) \land \textrm{verify-sentence-part-elements}(n, m) \\ \\ \textrm{verify-sentence-part-brackets}: & \Node \times \Node \to \bool \\ \textrm{verify-sentence-part-brackets}: & \tuple{n_{p}, n_{s}} \mapsto \Btt \iff (\textrm{first}(n_{p}) = 152 \land (\forall m\in \Node. (\textrm{first}(m) = \textrm{first}(n_{s})) \land\\ & (\textrm{second}(m) = \textrm{second}(n_{s})). \textrm{verify-sentence-part}(n_{p}, m) \iff\\ & \{ w \mid \exists t \in \textrm{all-children}(m).w = \textrm{word}(t)\} = \emptyset)) \lor \\ & ((\textrm{first}(n_{p}) \in \{141, 142, 143\}) \land \textrm{verify-sentence-part}(n_{p}, n_{s})) \\ \end{align*} }[/math]

With these functions, we can now define an algorithm for augmenting [math]\displaystyle{ T_\pi }[/math] and [math]\displaystyle{ T_\Sigma }[/math]:

 Algorithm SCI-AUGMENT:
 matched := tt
 claim_on_match := tt
 FOREACH n ∈ root([math]\displaystyle{ T_\Sigma }[/math])
   IF ((first(n) = 14b) ∧ (second(n) = f900)) THEN
     claim_on_match := ff
   ELSE IF ¬verify_sentence_part_brackets(n, root([math]\displaystyle{ T_\pi }[/math])) THEN
     matched := ff
   ENDIF
 ENDFOREACH

Augmenting succeeded if matched = tt; in this case, [math]\displaystyle{ T_\pi }[/math] is one of the trees accepted by the description provided by [math]\displaystyle{ T_\Sigma }[/math]. In this case, Said() will return 1. It will also claim the event previously provided to Parse(), unless claim_on_match = ff.

Notes

  1. The three special classes are apparently used for words with very specific semantics, such as "if", "not", "and" etc. It is unknown as of yet whether they receive special treatment by the parser.
  2. FreeSCI constructs a GNF (Greibach Normal Form) representation from these rules for parsing.
  3. In FreeSCI, you can use the ”parse” console command to retrieve all possible left derivation trees.