A grammar for a translator generated with the STT is an STT Grammar. Either the native .stt format or XML can be used to express the required structure.
Though XML can be used to write a grammar, it is far more verbose than the native .stt format. Since the abstract structures of an XML grammar instance and an .stt grammar instance are interchangeable, no formal description of the XML format is given; consult the DTD and the examples in the distribution. The use of XML was basically a bootstrap mechanism. It is still occasionally required when some part of the translation machinery is broken during development, disabling the native pathway. XML instances must conform to the grammar.dtd document type.
A grammar file consists of a set of sections, some of which are optional. Each section consists of one or more statements terminated by a semicolon.
Comments and whitespace are discarded. Comments are typical unix-style; they start with a pound sign (#) and end with a newline.
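The sections, each described below, are: the grammar declaration, properties, terminal declarations, terminal definitions, nonterminal declarations, nonterminal definitions, the accept statement, context declarations, context definitions, and the start context.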
The grammar declaration defines the name of the grammar and the version. It looks like this:
    # format: this is <NAME> version <VERSION>;
    this is syntacs version 0.1.0;
Properties are key:value pairs that are stored in a hashtable and used throughout grammar processing. See section Properties for a listing of these properties. Property values are enclosed in double quotes.
    # format: property <NAME> = "<VALUE>";
    property namespace = "com.inxar.syntacs.translator.regexp";
Terminals need to be declared before they can be defined. A declaration establishes that name as a terminal. There may be multiple terminal statements, each of which may declare multiple names.
Terminals and nonterminals share the same namespace, meaning a terminal and a nonterminal cannot have the same name. By convention, terminal identifiers are all caps and nonterminal identifiers are capitalized, but this is left to the preference of the grammar author.
    # format: terminal <NAME>;
    # format: terminal <NAME>, <NAME>, <NAME>;
    terminal IDENT;
    terminal T1, T2;
Terminal definitions are regular definitions; they associate a name with an expression. A regular expression is enclosed in double-quotes; whitespace within the string is insignificant. See section Regular Expression Syntax for how regular expressions are written in STT.
    # format: <TERMINAL> matches "regexp";
    IDENT matches " [_a-zA-Z0-9] [-_a-zA-Z0-9]* ";
Nonterminal declarations are identical to terminal declarations with the exception of the keyword. Nonterminal identifiers are by convention capitalized.
    # format: nonterminal <NAME>;
    # format: nonterminal <NAME>, <NAME>, <NAME>;
    nonterminal Goal;
    nonterminal IdentList, Name, Statement;
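Because terminals and nonterminals share one namespace, a grammar author may want to check the two sets of declared names for clashes. The following is a minimal Python sketch of such a check, not STT's own validation code; the function name `shared_names` and the deliberately clashing name in the sample data are assumptions made for illustration.

```python
def shared_names(terminals, nonterminals):
    """Return the names declared both as a terminal and as a nonterminal.

    Hypothetical helper for illustration; it is not part of STT.
    """
    return sorted(set(terminals) & set(nonterminals))


# Names taken from the declaration examples, plus one deliberate clash ("Name").
terminals = ["IDENT", "T1", "T2", "Name"]
nonterminals = ["Goal", "IdentList", "Name", "Statement"]
clashes = shared_names(terminals, nonterminals)
```

An empty result means the two declaration sets can coexist in the shared namespace.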
Nonterminal definitions are productions: each production relates a nonterminal to a sequence of grammar symbols (terminals or nonterminals). When that sequence appears at the top of the parse stack, the parser reduces it to the nonterminal named in the production (i.e. the nonterminal definition).
    # format: reduce <NONTERMINAL> when <SYMBOL> <SYMBOL> <SYMBOL>;
    reduce Term when Term PLUS Factor;
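The effect of a reduction on the parse stack can be sketched in a few lines of Python. This is only an illustration of the idea, under the assumption that the stack is a plain list of symbol names; a real LR parser also tracks states and lookahead, and the helper name `try_reduce` is hypothetical rather than part of STT.

```python
def try_reduce(stack, nonterminal, rhs):
    """Replace `rhs` with `nonterminal` if `rhs` sits on top of `stack`.

    Mutates `stack` in place and returns True when a reduction happened.
    """
    n = len(rhs)
    if n > len(stack):
        return False
    start = len(stack) - n
    if stack[start:] != list(rhs):
        return False
    del stack[start:]          # pop the matched symbols
    stack.append(nonterminal)  # push the nonterminal in their place
    return True


# Mirrors the production: reduce Term when Term PLUS Factor;
stack = ["Term", "PLUS", "Factor"]
reduced = try_reduce(stack, "Term", ["Term", "PLUS", "Factor"])

# A stack whose top does not match the production is left untouched.
other = ["Term", "PLUS"]
not_reduced = try_reduce(other, "Term", ["Term", "PLUS", "Factor"])
```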
The accept section consists of a single statement naming the "goal symbol" that must be reduced for the grammar to signal acceptance of the input. The goal symbol must be a declared nonterminal. By convention it is named "Goal".
    # format: accept when <NONTERMINAL>;
    accept when Goal;
The context declarations and definitions are optional. See section Lexical Context for an explanation of what a "context" is.
The context declarations section is similar to the terminal declarations section and nonterminal declarations section.
    # format: context <NAME>;
    # format: context <NAME>, <NAME>, <NAME>;
    context comment;
    context special1, special2;
Identifiers used for contexts have their own namespace; each one must be unique only within the set of context declarations. The context names "default" and "all" have special meaning.
A context definition determines what subset of terminals in the full set of terminals is included in the context. If a terminal is included within a particular context, its corresponding DFA will recognize the appropriate character sequence (given the opportunity).
Each context definition statement consists of a context name and a list of one or more context stack instructions. A stack instruction can say one of three things: "when terminal X is matched, do nothing", "when terminal X is matched, switch into context Y", and "when terminal X is matched, return to the previous context".
A terminal listed without a keyword gets an implicit PEEK instruction; the lexer context does not change.
    # Implicit PEEK instructions for R, S, and T in context "default".
    default includes R, S, T;
A shifts instruction changes the lexer context to the named context.
    # Implicit PEEK instruction for X; PUSH for Y in context "default".
    default includes X, Y shifts special;
An unshifts instruction changes the lexer context to the previous context.
    # POP instruction for Z in context "special".
    special includes Z unshifts;
The following example demonstrates the use of context switching through context stack instructions:
    # format: <NAME> includes <INSTRUCTION>;
    terminal WHITESPACE, START_COMMENT, COMMENT_DATA, END_COMMENT;
    context default, comment;
    default includes START_COMMENT shifts comment, WHITESPACE;
    comment includes COMMENT_DATA, END_COMMENT unshifts;
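To make the stack behavior of the example above concrete, here is a small Python simulation of the PEEK, PUSH, and POP instructions. It is a sketch under stated assumptions, not the STT lexer: the dictionary encoding of the context definitions, the function name `run_contexts`, and the error handling are all invented for illustration.

```python
# Hypothetical encoding of the example's context definitions:
# context name -> {terminal: instruction}, where an instruction is
# ("peek",), ("push", target_context) or ("pop",).
CONTEXTS = {
    "default": {
        "START_COMMENT": ("push", "comment"),  # START_COMMENT shifts comment
        "WHITESPACE": ("peek",),               # no keyword: implicit PEEK
    },
    "comment": {
        "COMMENT_DATA": ("peek",),
        "END_COMMENT": ("pop",),               # END_COMMENT unshifts
    },
}


def run_contexts(terminals, contexts, start="default"):
    """Return the active context after each matched terminal."""
    stack = [start]
    trace = []
    for terminal in terminals:
        instructions = contexts[stack[-1]]
        if terminal not in instructions:
            raise ValueError(f"{terminal} is not included in context {stack[-1]!r}")
        op = instructions[terminal]
        if op[0] == "push":
            stack.append(op[1])
        elif op[0] == "pop":
            if len(stack) == 1:
                raise ValueError("no previous context to return to")
            stack.pop()
        # "peek" leaves the stack unchanged.
        trace.append(stack[-1])
    return trace


trace = run_contexts(
    ["WHITESPACE", "START_COMMENT", "COMMENT_DATA", "END_COMMENT", "WHITESPACE"],
    CONTEXTS,
)
```

In this run the lexer enters "comment" when START_COMMENT is matched and returns to "default" when END_COMMENT is matched.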
The start context section defines what the starting context will be. When it is omitted, the starting context is "default".
    # format: start with context <NAME>;
    start with context special;
After the grammar is parsed, some processing is done to initialize each context with the terminals that will be included in it.
The simplest case is when no context information has been explicitly provided -- the grammar consists of terminal and nonterminal declarations/definitions only.
In this circumstance, the processor implicitly adds in a single context "default", and all terminals are added to that context. The lexer acts on the corresponding DFA and no context switching is done.
    terminal WHITESPACE, DATA;
    nonterminal data;
    reduce data when DATA;
    accept when data;
In the explicit case, the user declares one or more contexts. The "default" context is always implicitly declared, but declaring it explicitly is not an error.
    terminal WHITESPACE, DATA, START_QUOTE, QUOTE_DATA, END_QUOTE;
    nonterminal data, quote;
    context quoted_context;
    default includes WHITESPACE, DATA, START_QUOTE shifts quoted_context;
    quoted_context includes WHITESPACE, QUOTE_DATA, END_QUOTE unshifts;
The "all" context is special in that it does not actually refer to a real context (a DFA), but rather is a syntactic convenience. Terminals included in the "all" context are placed into every other context after the grammar is parsed. The all context does not have to be declared.
    terminal WHITESPACE, DATA, START_QUOTE, QUOTE_DATA, END_QUOTE;
    nonterminal data, quote;
    context quoted_context;
    all includes WHITESPACE;
    default includes DATA, START_QUOTE shifts quoted_context;
    quoted_context includes QUOTE_DATA, END_QUOTE unshifts;
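The context initialization rules described in this section can be summarized in a short Python sketch. It assumes the grammar has already been parsed into a list of terminal names and a dictionary mapping each context name to the terminals it includes. The function name `initialize_contexts` and this data layout are hypothetical, and the handling of corner cases (for example, a grammar that defines only "all") is a guess rather than documented STT behavior.

```python
def initialize_contexts(terminals, defined):
    """Compute the terminals included in each real context.

    terminals: every declared terminal name.
    defined:   context name -> terminal names from the context definitions.
    """
    if not defined:
        # No context information: a single implicit "default" context
        # that includes every terminal.
        return {"default": list(terminals)}

    # "all" is a syntactic convenience, not a real context.
    contexts = {
        name: list(members) for name, members in defined.items() if name != "all"
    }
    # "default" is always implicitly declared.
    contexts.setdefault("default", [])

    # Terminals in "all" are placed into every other context.
    for terminal in defined.get("all", []):
        for members in contexts.values():
            if terminal not in members:
                members.append(terminal)
    return contexts


all_terminals = ["WHITESPACE", "DATA", "START_QUOTE", "QUOTE_DATA", "END_QUOTE"]

# The "all" example above.
expanded = initialize_contexts(
    all_terminals,
    {
        "all": ["WHITESPACE"],
        "default": ["DATA", "START_QUOTE"],
        "quoted_context": ["QUOTE_DATA", "END_QUOTE"],
    },
)

# The simplest case: no context information at all.
implicit = initialize_contexts(["WHITESPACE", "DATA"], {})
```

With these inputs, `expanded` contains WHITESPACE in both "default" and "quoted_context", matching the explicit version of the same grammar shown earlier in this section.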