The concept of lexer states was introduced by the flex scanner generator as a means to subdivide the lexer into multiple layers such that only a subset of the token definitions can be matched at any particular time. This is useful when a syntax has widely variable "topography", that is, when certain sections of a file are syntactically very different from one another. Attempts to cram all the required regular expressions into a single lexer state (a single DFA) result in collisions within the grammar. Under these circumstances it becomes advantageous to partition the token definitions into subsets and switch between them at the appropriate time.
Note: I would imagine that the concept of lexer states was almost certainly known well before flex came along, but my knowledge of the subject is pretty limited, so don't quote me on historical accuracy.
In the STT, this notion of lexer states is recast as the term lexical context, or more commonly just context for short. Note, however, that its meaning here is unrelated to the term "context-free parsing"; that is, "lexical context" does not imply any sort of "context-ful parsing". This is because the notion of lexical context does not affect the theory or algorithms used in the parsing phase (syntactic analysis).
It is useful to remember the formula that one lexical context equals one DFA. With this in mind, we can restate what it means to be a lexical analyzer in the STT: a lexical analyzer is a machine that transforms a stream of characters into a stream of tokens; to accomplish its job it requires a set of contexts and a context transition function. The transition function is a mapping from (context, token) --> context.
Therefore, when a DFA says "hey, we found the beginning of a comment!", the context transition function is consulted to check if the context needs to be changed.
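As a concrete illustration, here is a minimal sketch in Python of how such a transition function might be consulted. The context and token names (DEFAULT, COMMENT, COMMENT_BEGIN, and so on) are invented for this example and are not part of any real STT grammar.

    # Hypothetical context transition table: (context, token) -> next context.
    # Context and token names here are made up for illustration only.
    TRANSITIONS = {
        ("DEFAULT", "COMMENT_BEGIN"): "COMMENT",
        ("COMMENT", "COMMENT_END"): "DEFAULT",
    }

    def next_context(current_context, token_type):
        """Return the context to use after seeing token_type."""
        # If no entry exists, stay in the current context.
        return TRANSITIONS.get((current_context, token_type), current_context)

    # Example: the DEFAULT DFA just matched the start of a comment.
    assert next_context("DEFAULT", "COMMENT_BEGIN") == "COMMENT"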
STT lexers have an additional feature in that they maintain a context stack. I fibbed when I said that the transition function is a mapping from (context, token) --> context; it is actually a mapping (context, token) --> context action, where the context action is a context stack instruction. A context action is a tuple (instruction, register), and hence the full definition of the transition function is a tuple-to-tuple mapping (context, token) --> (instruction, register).
The instruction is a constant that is one of:

PEEK: This is the no-op instruction; it says to change contexts to the one you are currently in, meaning "do nothing". The register for a PEEK instruction does not hold a meaningful value.

PUSH: This instruction switches to the context named in the register and places it on the top of the stack.

POP: This instruction removes the top element of the context stack, throws it away, and then switches into the context at the new top of the stack. This has the effect of going back to the context you were in before a PUSH. The register for a POP instruction does not hold a meaningful value.
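To make the stack behavior concrete, the following is a small sketch in Python of how a lexer driver might interpret these actions. It is not the STT implementation; the class and context names are invented solely to illustrate the PEEK/PUSH/POP semantics described above.

    # Illustrative only: a tiny context-stack interpreter for the
    # (instruction, register) actions described above. Names are invented.
    PEEK, PUSH, POP = "PEEK", "PUSH", "POP"

    class ContextStack:
        def __init__(self, start_context):
            self._stack = [start_context]

        def current(self):
            return self._stack[-1]

        def apply(self, instruction, register=None):
            if instruction == PEEK:
                pass                            # no-op: stay in the current context
            elif instruction == PUSH:
                self._stack.append(register)    # enter the context named in the register
            elif instruction == POP:
                self._stack.pop()               # return to the previous context
            return self.current()

    # Example: entering and leaving a comment context (hypothetical names).
    stack = ContextStack("DEFAULT")
    stack.apply(PUSH, "COMMENT")   # saw the start of a comment
    stack.apply(POP)               # saw the end of the comment
    assert stack.current() == "DEFAULT"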
To take advantage of multiple lexical contexts, you must declare the names of the contexts used in your grammar and then define which terminals belong to which context, so that a finite automaton can be correctly assembled for each context. The context definition is also where PUSH and POP instructions are specified. See section Grammar Syntax for more information.