1,079
edits
(→The Parse tree: more tweaks and corrections) |
(→The Parse tree: Started to convert things to <math>; clarified some parts of the text) |
||
Line 154: | Line 154: | ||
After tokenizing, looking up, and finally aliasing the data found in the parsed input string, the interpreter proceeds to build a parse tree <!-- <math>T_\pi</math> --> <i>T</i><sub>π</sub> from the input tokens | After tokenizing, looking up, and finally aliasing the data found in the parsed input string, the interpreter proceeds to build a parse tree <!-- <math>T_\pi</math> --> <i>T</i><sub>π</sub> from the input tokens | ||
: | :<math>I~:=~\omega_0,\omega_1,\omega_2 \ldots \omega_{n-1}</math> | ||
where | where | ||
* < | * <math>\omega_j \in W</math>, with <math>W</math> being the set of all words; | ||
* < | * <math>\gamma_j \in \Gamma</math>, with <math>\Gamma</math> being the set of all word groups; | ||
* < | * <math>\mu_j \in 2^C</math>, with <math>C</math> being the set of all class masks <nowiki>{0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80,0x100}</nowiki>, and <math>2^C</math> being the set of all subsets of <math>C</math>; | ||
* < | * <math>\omega_j = (\gamma_j, \mu_j)</math>, with <math>\gamma_j</math> being the word group <math>\omega_j</math> belongs to, and <math>\mu_j</math> being its class mask. | ||
Note that elements of <math>2^C</math> (i.e. sets of class masks) can be identified with the ORed value of these class masks. So the set <nowiki>{0x2,0x4,0x80}</nowiki> can be identified with the value 0x86. | |||
For the following sections, we define two maps and a set: | |||
* <math>\operatorname{group}: W \to \Gamma : (\gamma,\mu) \mapsto \gamma</math> | |||
* <math>\operatorname{classes}: W \to 2^C : (\gamma,\mu) \mapsto \mu</math> | |||
* <math>C_x = \{ \omega \in W \mid x \in \operatorname{classes}(\omega)\}</math> | |||
To do that, it uses the class masks <math>M</math> as input for a pushdown automaton (PDA) A built from a parser grammar; if <math>M</math> was accepted by <math>A</math>, the parse tree <math>T_pi</math> will be built from the matching syntax tree to represent the semantics. | |||
The PDA is defined by a grammar <math>G = (V,\Sigma,P,s)</math> where | |||
* <math>V</math> is finite set of <i>nonterminal symbols</i>, | |||
* <math>\Sigma</math> is a finite set <i>terminal symbols</i> that is disjoint from <math>V</math>, | |||
* <math>P</math> is a finite set of <i>production rules</i>, | |||
* <math>s \in V</math> is the <i>start symbol</i>. | |||
Most of this grammar, along with its semantics, is stored in <tt>vocab.900</tt>. This resource contains a parser rule at every 20 bytes, starting with a non-terminal symbol <math>\upsilon</math> (one word) and a null-terminated list of up to five tuples <math>(\sigma_i, m_i)</math>, both of which are words. In these tuples, <math>m_i</math> is a terminal or non-terminal symbol (determined by <math>\sigma_i</math> ), and <math>\sigma_i</math> is the meaning of <math>m_i</math> as described in the following table: | |||
{| border="1" style="margin: 1em auto 1em auto" | |||
|<math>\sigma_i</math> | |||
|Type | |Type | ||
|Meaning | |Meaning | ||
Line 202: | Line 204: | ||
|0x146 | |0x146 | ||
|Terminal | |Terminal | ||
|Match on class mask: Matches if < | |Match on class mask: Matches if <math>m_i\in\operatorname{classes}(\omega_j)</math> | ||
|---- | |---- | ||
|0x14d | |0x14d | ||
|Terminal | |Terminal | ||
|Match on word group: Matches if < | |Match on word group: Matches if <math>m_i=\operatorname{group}(\omega_j)</math> | ||
|---- | |---- | ||
|0x154 | |0x154 | ||
Line 214: | Line 216: | ||
|} | |} | ||
With the notable exception of the first rule, these rules constitute < | With the notable exception of the first rule, these rules constitute the set of production rules <math>P</math>. Furthermore we have <math>V := \{x \mid \exists R \in P : x \in R\}</math>; typically, <math>V = \{ \mathrm{0x12f}, \ldots, \mathrm{0x13f}\}</math> and <math>s = m_0</math> of the first rule encountered; in all games observed, it was set to 0x13c. <math>\Sigma</math> contains all word groups and class masks. For the sake of simplicity, we will consider rules matching composite class masks to be several rules. Here is a simplified example of what such a grammar might look like (the hexadecimal prefix '0x' is omitted for brevity): | ||
In addition to this grammar, each right-hand non-terminal <i>m<sub>i</sub></i> carries its semantic value <i>ρ<sub>i</sub></i> , which is not relevant for constructing a syntax tree, but must be considered for the semantic tree <i>T</i><sub>π</sub>. These values were omitted in the example above. As in the example above, the grammar is a context-free (type 2) grammar, almost in Chomsky Normal Form (CNF) in SCI; constructing a grammar with CNF rules from it would be trivial.<ref>FreeSCI constructs a GNF (Greibach Normal Form) representation from these rules for parsing.</ref> | In addition to this grammar, each right-hand non-terminal <i>m<sub>i</sub></i> carries its semantic value <i>ρ<sub>i</sub></i> , which is not relevant for constructing a syntax tree, but must be considered for the semantic tree <i>T</i><sub>π</sub>. These values were omitted in the example above. As in the example above, the grammar is a context-free (type 2) grammar, almost in Chomsky Normal Form (CNF) in SCI; constructing a grammar with CNF rules from it would be trivial.<ref>FreeSCI constructs a GNF (Greibach Normal Form) representation from these rules for parsing.</ref> |
edits