Open main menu

Difference between revisions of "SCI/Specifications/SCI in action/Parser"

→‎The Parse tree: Started to convert things to <math>; clarified some parts of the text
(→‎The Parse tree: more tweaks and corrections)
(→‎The Parse tree: Started to convert things to <math>; clarified some parts of the text)
Line 154: Line 154:
After tokenizing, looking up, and finally aliasing the data found in the parsed input string, the interpreter proceeds to build a parse tree <!-- <math>T_\pi</math> --> <i>T</i><sub>&pi;</sub> from the input tokens
After tokenizing, looking up, and finally aliasing the data found in the parsed input string, the interpreter proceeds to build a parse tree <!-- <math>T_\pi</math> --> <i>T</i><sub>&pi;</sub> from the input tokens


:<i>I :=</i> &omega;<sub>0</sub>, &omega;<sub>1</sub>, &omega;<sub>2</sub> &hellip; &omega;<sub>n-1</sub> <!-- Mediawiki LaTeX <math>I:=~\omega_0,\omega_1,\omega_2 \ldots \omega_{n-1}</math> --><br>
:<math>I~:=~\omega_0,\omega_1,\omega_2 \ldots \omega_{n-1}</math>


where
where


* <i>&omega;<sub>j</sub> &isin; W</i>, with <i>W</i> being the set of all words;
* <math>\omega_j \in W</math>, with <math>W</math> being the set of all words;
* <i>&gamma;<sub>j</sub> &isin; &Gamma;</i>, with <i>&Gamma;</i> being the set of all word groups;
* <math>\gamma_j \in \Gamma</math>, with <math>\Gamma</math> being the set of all word groups;
* <i>&mu;<sub>j</sub> &isin; 2<sup>C</sup> </i>, with <i>C</i> being the set of all class masks <nowiki>{0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80,0x100}</nowiki>, and <i>2<sup>C</sup></i> being the set of all subsets of <i>C</i>;
* <math>\mu_j \in 2^C</math>, with <math>C</math> being the set of all class masks <nowiki>{0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80,0x100}</nowiki>, and <math>2^C</math> being the set of all subsets of <math>C</math>;
* <i>&omega;<sub>j</sub> = (&gamma;<sub>j</sub>, &mu;<sub>j</sub>)</i>, with <i>&gamma;<sub>j</sub></i> being the word group <i>&omega;<sub>j</sub></i> belongs to, and <i>&mu;<sub>j</sub></i> being its class mask.
* <math>\omega_j = (\gamma_j, \mu_j)</math>, with <math>\gamma_j</math> being the word group <math>\omega_j</math> belongs to, and <math>\mu_j</math> being its class mask.
<!-- Math formulas
* <math>\omega_j \in W</math>
*<math>\gamma_j \in \Gamma</math>
*<math>\mu_j \in 2^C</math>
*<math>\omega_j = (\gamma_j, \mu_j</math> -->
Note that elements of <i>2<sup>C</sup></i> (i.e. sets of class masks) can be identified with the ORed value of these class masks. So the set <nowiki>{0x2,0x4,0x80}</nowiki> can be identified with the value 0x86.


For the following sections, we define
Note that elements of <math>2^C</math> (i.e. sets of class masks) can be identified with the ORed value of these class masks. So the set <nowiki>{0x2,0x4,0x80}</nowiki> can be identified with the value 0x86.


* group: <i>W &rarr; &Gamma;  : (&gamma;,&mu;)  &#8614; &gamma;</i>
For the following sections, we define two maps and a set:
* classes: <i>W &rarr; 2<sup>C</sup> : (&gamma;,&mu;) &#8614; &mu;</i>
* <i>C<sub>x</sub> = {&omega; &isin; W | x &isin; classes(&omega;)}</i>


To do that, it uses the class masks <i>M</i> as input for a pushdown automaton (PDA) A built from a parser grammar; if <i>M</i> was accepted by <i>A</i>, the parse tree <i>T</i><sub>&pi;</sub> will be built from the matching syntax tree to represent the semantics.
* <math>\operatorname{group}: W \to \Gamma  : (\gamma,\mu) \mapsto \gamma</math>
* <math>\operatorname{classes}: W \to 2^C : (\gamma,\mu) \mapsto \mu</math>
* <math>C_x = \{ \omega \in W \mid x \in \operatorname{classes}(\omega)\}</math>


The PDA is defined by a grammar <i>G = (V,&Sigma;,P,s)</i> most of which, along with its semantics, is stored in <tt>vocab.900</tt>. This resource contains a parser rule at every 20 bytes, starting with a non-terminal symbol &upsilon; (one word) and a null-terminated list of up to five tuples <i>&sigma;<sub>i</sub>,m<sub>i</sub></i> , both of which are words. In these tuples, <i>m<sub>i</sub></i> is a terminal or non-terminal symbol (determined by <i>&sigma;<sub>i</sub></i> ), and <i>&sigma;<sub>i</sub></i> is the meaning of <i>m<sub>i</sub></i> :
To do that, it uses the class masks <math>M</math> as input for a pushdown automaton (PDA) A built from a parser grammar; if <math>M</math> was accepted by <math>A</math>, the parse tree <math>T_pi</math> will be built from the matching syntax tree to represent the semantics.


{| border="1"
The PDA is defined by a grammar <math>G = (V,\Sigma,P,s)</math> where
|<i>&sigma;<sub>i</sub></i>
* <math>V</math> is finite set of <i>nonterminal symbols</i>,
* <math>\Sigma</math> is a finite set <i>terminal symbols</i> that is disjoint from <math>V</math>,
* <math>P</math> is a finite set of <i>production rules</i>,
* <math>s \in V</math> is the <i>start symbol</i>.
 
Most of this grammar, along with its semantics, is stored in <tt>vocab.900</tt>. This resource contains a parser rule at every 20 bytes, starting with a non-terminal symbol <math>\upsilon</math> (one word) and a null-terminated list of up to five tuples <math>(\sigma_i, m_i)</math>, both of which are words. In these tuples, <math>m_i</math> is a terminal or non-terminal symbol (determined by <math>\sigma_i</math> ), and <math>\sigma_i</math> is the meaning of <math>m_i</math> as described in the following table:
 
{| border="1" style="margin: 1em auto 1em auto"
|<math>\sigma_i</math>
|Type
|Type
|Meaning
|Meaning
Line 202: Line 204:
|0x146
|0x146
|Terminal
|Terminal
|Match on class mask: Matches if <span style="white-space:nowrap"><i>m<sub>i</sub> &isin; classes(&omega;<sub>j</sub>)</i></span>
|Match on class mask: Matches if <math>m_i\in\operatorname{classes}(\omega_j)</math>
|----
|----
|0x14d
|0x14d
|Terminal
|Terminal
|Match on word group: Matches if <span style="white-space:nowrap"><i>m<sub>i</sub> = group(&omega;<sub>j</sub>)</i></span>
|Match on word group: Matches if <math>m_i=\operatorname{group}(\omega_j)</math>
|----
|----
|0x154
|0x154
Line 214: Line 216:
|}
|}


With the notable exception of the first rule, these rules constitute  <span style="white-space:nowrap"><i>P. V := {x | &exist;R &isin; P. x &isin; R}</i></span>; typically, <span style="white-space:nowrap"><i> V = {0x12f,&hellip;,0x13f}&sdot; s = m<sub>0</sub></i></span> of the first rule encountered; in all games observed, it was set to 0x13c. &Sigma; contains all word groups and class masks. For the sake of simplicity, we will consider rules matching composite class masks to be several rules. Here is a simplified example of what such a grammar might look like (the hexadecimal prefix '0x' is omitted for brevity):
With the notable exception of the first rule, these rules constitute  the set of production rules <math>P</math>. Furthermore we have <math>V := \{x \mid \exists R \in P x \in R\}</math>; typically, <math>V = \{ \mathrm{0x12f}, \ldots, \mathrm{0x13f}\}</math> and  <math>s = m_0</math> of the first rule encountered; in all games observed, it was set to 0x13c. <math>\Sigma</math> contains all word groups and class masks. For the sake of simplicity, we will consider rules matching composite class masks to be several rules. Here is a simplified example of what such a grammar might look like (the hexadecimal prefix '0x' is omitted for brevity):


In addition to this grammar, each right-hand non-terminal <i>m<sub>i</sub></i> carries its semantic value <i>&rho;<sub>i</sub></i> , which is not relevant for constructing a syntax tree, but must be considered for the semantic tree <i>T</i><sub>&pi;</sub>. These values were omitted in the example above. As in the example above, the grammar is a context-free (type 2) grammar, almost in Chomsky Normal Form (CNF) in SCI; constructing a grammar with CNF rules from it would be trivial.<ref>FreeSCI constructs a GNF (Greibach Normal Form) representation from these rules for parsing.</ref>
In addition to this grammar, each right-hand non-terminal <i>m<sub>i</sub></i> carries its semantic value <i>&rho;<sub>i</sub></i> , which is not relevant for constructing a syntax tree, but must be considered for the semantic tree <i>T</i><sub>&pi;</sub>. These values were omitted in the example above. As in the example above, the grammar is a context-free (type 2) grammar, almost in Chomsky Normal Form (CNF) in SCI; constructing a grammar with CNF rules from it would be trivial.<ref>FreeSCI constructs a GNF (Greibach Normal Form) representation from these rules for parsing.</ref>
1,079

edits