This document describes the behavior of the lexing function. A transition diagram helps provide a more formal characterization. To aid in the discussion, variables and functions are used. This reading is NOT REQUIRED for general use of the STT.
Variables are written like shell or perl variables within text for
contrast. For example: the value of $x
is
1
.
m,n
m',n'
p,q
$p
denotes current state and $q
denotes the next state.
dead
i
t
err
c
Functions are written in all UPPERCASE.
RETCH
$i
to $c
.
GOTO
($p,$c)
to $q
.
OUTPUT
$p
to $t
.
A transition diagram can be used to formally describe the lexer. This
directed graph characterizes what is known about the the character
sequence to the immediate left of the input cursor. In this context,
immediate left refers to a closed interval [$m,$n]
in the
input such that the beginning of the interval is at offset $m
,
the end of the interval is at offset $n
, and the cursor is
between offset $n
and ${n+1
}. This can be drawn
(vis means visually):
vis: |-----------) m n
The pipe character "|"
denotes all the input before the interval
being immediately considered. It includes all the tokens that have
already been characterized, or nothing (if the parse has just started).
Because it shows the starting point of an character interval currently
being lexically analyzed, it is referred to as epoch.
The parenthesis ")"
denotes the cursor, or the frontier
of the input. Because this shows the leading edge of the known lexing
universe, it is referred to as horizon.
Each character between the epoch and horizon denotes a single input
character at offset $i
.
If $c = RETCH($i)
, $q = GOTO($p,$c)
, and
$q
is not $dead
, then the character is drawn
as a dash ("-"
). Therefore, a dash is used to indicate a
character that did not move the DFA to the dead state, referred to as
a live char.
If the converse is true, that is, if $c = RETCH($i)
and
$dead = GOTO($p,$c)
, then the character is drawn as an asterisk
("*"
). Therefore, an asterisk is used to indicate a character
that killed the DFA, also called a dead char.
Moreover, if $c = RETCH($i)
, $q = GOTO($p,$c)
, $t =
OUTPUT($q)
, and $t
is not $err
, then the character is
drawn as an equals sign ("="
) rather than a dash. Therefore, an
equals sign is used to indicate a character which signals the DFA to
accept, hence the label accepting char. Note that an accepting
char does not necessarily indicate the end-boundary of a token.
This section describes the state of the lexer at each state in the transition diagram.
In the tabla-rasa state (or start state), the lexer is at epoch. This means that either the translation run has just started or that the lexer has just completed the recognition of a token and the cycle has begun anew.
vis: |
Without consuming any input, the lexer immediately moves from tabla-rasa to pluripotent.
In the pluripotent lexer state, input traversal has proceeded such that the DFA has not transitioned to the dead state, but no output has been reported from the DFA either. This means that not enough input has been to decide whether the sequence is good or bad (is a token or is an error). Therefore, the state is "pluripotent".
vis: |-----)
If the lexer is pluripotent, $q = GOTO($p,$c)
, and $t
= OUTPUT($q)
, then the lexer transitions to the open token state.
In this state, it has become certain that the interval represents a
token, but it is uncertain whether the interval represents the longest
possible match; more input must be seen to decide.
vis: |-------=) m n?
If the lexer is in the open token state and the next input char
transitions to $dead
, then the lexer transitions to the
closed token state; the end-boundary of the token is now known.
At this point the LexerListener
is notified of the matched token
at offset $m
and length $n - $m + 1
.
vis: |-------=*) m n vis: |-------=----*) m n vis: |-------=-----==-*) m n
Without consuming any input, the lexer then immediately moves back to tabla-rasa and the cycle begins again.
If the lexer is in the open token state and the lext input is
live, it moves to the phoenix token state. When this
happens, the lexer says "Well, I know that I will at least be able to
match token $t
back at the last accepting char. However, I
haven't yet hit a dead state, so I can't say for certain where the end
of the token is quite yet. But since I have now just seen a live char,
the identity of the token that is matched could change". This would be
the case with a grammar that has regular expressions matching
`hour' and `hourglass'.
Therefore, the pheonix token state represents a situation where a token y may "rise out" of the sequence matching token x.
vis: |-------=---) m n?
Analogous to open token, the open error state occurrs when
the lexer is in the pluripotent state and the next input char
takes the DFA to $dead
. It has become certain that the input
interval is not recognized by the DFA, but it is not known whether this
represents the longest possible error. Since the lexer is designed to
try to "match" the longest possible error before notifying the listener,
more input needs to be analyzed.
vis: |------*) m' n'? vis: |------***---) m' n' ? vis: |------***-----*) m' n'?
If the lexer is in the open error state and the next input
character is an accepting char, then the lexer moves to the
closed error state; it is certain that a token will be matched and
therefore can be concluded that end-boundary of the error is at
$n'
.
vis: |------*-=) m' n' mn? vis: |*********-----=) m' n' m n?
When the lexer moves to closed error, it then immediately transitions to open token since it has become certain that a token will be matched.
The pheonix error is very similar to the open token state --- when an accepting char is traversed the extent of the error sequence will become known and the lexer will move into the closed error state. The phoenix error is a bit special however. When the transition from open error to phoenix error is actuated over a live char, the beginning of token marker is set. Therefore, when this state transition occurrs, the lexer can say "if we end by finding a token, here's where it starts."
From phoenix error, the lexer can move back into open error in which case the beginning-of-token marker that was set becomes unset
As you know, the phoenix is a mythical bird that rose from the embers of its own ashes. Analogously, the pheonix state marks the starting point where a token may rise out of the ashes of an error.
vis: |------*-) m' n'? m
This section illutrates the transition diagram drawn as a directed graph. The actions that are performed by edge traversal are shown below.
/\ * /\ .-----> <ct> ---> <tr> | \/ \/ dead | | ^ | * | dead | v .-- /\ live /\ /\ __. live | <pt> <--- <ot> <--- <pl> | live `-> \/ ---> \/ acc \/ <-' acc ^ | dead * | v /\ /\ __. <ce> <--- <oe> | dead \/ acc \/ <-' ^ ^| live acc | dead |v | | /\ __. `------ <pe> | live \/ <-' -- nodes tr: tabla-rasa pl: pluripotent ot: open-token pt: phoenix-token ct: closed-token oe: open-error pe: pheonix-error ce: closed-error -- edges *: automatic (epsilon, no input consumed) live: live-char dead: dead-char acc: accepting-char
This document was generated on 6 July 2001 using texi2html 1.56k.