Lexer Internals

A Transition Diagram

Paul Cody Johnston


Table of Contents


Introduction

This document describes the behavior of the lexing function. A transition diagram helps provide a more formal characterization. To aid in the discussion, variables and functions are used. This reading is NOT REQUIRED for general use of the STT.

Variables

Variables are written like shell or perl variables within text for contrast. For example: the value of $x is 1.

m,n
Offsets used to describe the beginning and end of a token.
m',n'
Offsets used to describe the beginning and end of an error.
p,q
DFA states. In general, $p denotes current state and $q denotes the next state.
dead
A special DFA state in which all roads lead back to itself.
i
An offset in the input.
t
A symbol which uniquely identifies a token.
err
A special token symbol which indicates an error.
c
An input character.

Functions

Functions are written in all UPPERCASE.

RETCH
The input char accessor function (RETurn CHar); a mapping from $i to $c.
GOTO
The dfa transition function; a mapping from ($p,$c) to $q.
OUTPUT
The dfa output function; a mapping from $p to $t.

Visual Notation

A transition diagram can be used to formally describe the lexer. This directed graph characterizes what is known about the the character sequence to the immediate left of the input cursor. In this context, immediate left refers to a closed interval [$m,$n] in the input such that the beginning of the interval is at offset $m, the end of the interval is at offset $n, and the cursor is between offset $n and ${n+1}. This can be drawn (vis means visually):

vis: |-----------)
      m         n

Special Symbols

Epoch: `|'

The pipe character "|" denotes all the input before the interval being immediately considered. It includes all the tokens that have already been characterized, or nothing (if the parse has just started). Because it shows the starting point of an character interval currently being lexically analyzed, it is referred to as epoch.

Horizon: `)'

The parenthesis ")" denotes the cursor, or the frontier of the input. Because this shows the leading edge of the known lexing universe, it is referred to as horizon.

Character Symbols

Each character between the epoch and horizon denotes a single input character at offset $i.

Live Input Character: `-'

If $c = RETCH($i), $q = GOTO($p,$c), and $q is not $dead, then the character is drawn as a dash ("-"). Therefore, a dash is used to indicate a character that did not move the DFA to the dead state, referred to as a live char.

Dead Input Character: `*'

If the converse is true, that is, if $c = RETCH($i) and $dead = GOTO($p,$c), then the character is drawn as an asterisk ("*"). Therefore, an asterisk is used to indicate a character that killed the DFA, also called a dead char.

Accepting Input Character: `='

Moreover, if $c = RETCH($i), $q = GOTO($p,$c), $t = OUTPUT($q), and $t is not $err, then the character is drawn as an equals sign ("=") rather than a dash. Therefore, an equals sign is used to indicate a character which signals the DFA to accept, hence the label accepting char. Note that an accepting char does not necessarily indicate the end-boundary of a token.

Instantaneous States

This section describes the state of the lexer at each state in the transition diagram.

Tabla-Rasa

In the tabla-rasa state (or start state), the lexer is at epoch. This means that either the translation run has just started or that the lexer has just completed the recognition of a token and the cycle has begun anew.

vis: |

Without consuming any input, the lexer immediately moves from tabla-rasa to pluripotent.

Pluripotent

In the pluripotent lexer state, input traversal has proceeded such that the DFA has not transitioned to the dead state, but no output has been reported from the DFA either. This means that not enough input has been to decide whether the sequence is good or bad (is a token or is an error). Therefore, the state is "pluripotent".

vis: |-----)

Open Token

If the lexer is pluripotent, $q = GOTO($p,$c), and $t = OUTPUT($q), then the lexer transitions to the open token state. In this state, it has become certain that the interval represents a token, but it is uncertain whether the interval represents the longest possible match; more input must be seen to decide.

vis: |-------=)
      m      n?

Closed Token

If the lexer is in the open token state and the next input char transitions to $dead, then the lexer transitions to the closed token state; the end-boundary of the token is now known. At this point the LexerListener is notified of the matched token at offset $m and length $n - $m + 1.

vis: |-------=*)
      m      n

vis: |-------=----*)
      m      n

vis: |-------=-----==-*)
      m             n

Without consuming any input, the lexer then immediately moves back to tabla-rasa and the cycle begins again.

Pheonix Token

If the lexer is in the open token state and the lext input is live, it moves to the phoenix token state. When this happens, the lexer says "Well, I know that I will at least be able to match token $t back at the last accepting char. However, I haven't yet hit a dead state, so I can't say for certain where the end of the token is quite yet. But since I have now just seen a live char, the identity of the token that is matched could change". This would be the case with a grammar that has regular expressions matching `hour' and `hourglass'.

Therefore, the pheonix token state represents a situation where a token y may "rise out" of the sequence matching token x.

vis: |-------=---)
      m      n?

Open Error

Analogous to open token, the open error state occurrs when the lexer is in the pluripotent state and the next input char takes the DFA to $dead. It has become certain that the input interval is not recognized by the DFA, but it is not known whether this represents the longest possible error. Since the lexer is designed to try to "match" the longest possible error before notifying the listener, more input needs to be analyzed.

vis: |------*)
      m'    n'?

vis: |------***---)
      m'      n'  ?

vis: |------***-----*)
      m'            n'?

Closed Error

If the lexer is in the open error state and the next input character is an accepting char, then the lexer moves to the closed error state; it is certain that a token will be matched and therefore can be concluded that end-boundary of the error is at $n'.

vis: |------*-=)
      m'    n'
             mn?

vis: |*********-----=)
      m'      n'
               m    n?

When the lexer moves to closed error, it then immediately transitions to open token since it has become certain that a token will be matched.

Pheonix Error

The pheonix error is very similar to the open token state --- when an accepting char is traversed the extent of the error sequence will become known and the lexer will move into the closed error state. The phoenix error is a bit special however. When the transition from open error to phoenix error is actuated over a live char, the beginning of token marker is set. Therefore, when this state transition occurrs, the lexer can say "if we end by finding a token, here's where it starts."

From phoenix error, the lexer can move back into open error in which case the beginning-of-token marker that was set becomes unset

As you know, the phoenix is a mythical bird that rose from the embers of its own ashes. Analogously, the pheonix state marks the starting point where a token may rise out of the ashes of an error.

vis: |------*-)
      m'    n'?
             m

Transition Diagram

This section illutrates the transition diagram drawn as a directed graph. The actions that are performed by edge traversal are shown below.

lexer


                    /\   *    /\  
           .-----> <ct> ---> <tr> 
           |        \/        \/  
      dead |                         
           |         ^         | *
           |    dead |         v    
                                 
     .--  /\  live  /\        /\  __.
live |   <pt> <--- <ot> <--- <pl>   | live
     `->  \/  --->  \/   acc  \/  <-'
              acc                         
                     ^         | dead
                   * |         v     
                                       
                    /\        /\  __. 
                   <ce> <--- <oe>   | dead
                    \/   acc  \/  <-' 
                                  
                     ^        ^| live
                 acc |   dead |v
                     |
                     |        /\  __.
                     `------ <pe>   | live
                              \/  <-'

 -- nodes          

     tr: tabla-rasa   
     pl: pluripotent  
     ot: open-token   
     pt: phoenix-token
     ct: closed-token 
     oe: open-error   
     pe: pheonix-error
     ce: closed-error 

 -- edges 

     *:    automatic (epsilon, no input consumed)
     live: live-char
     dead: dead-char
     acc:  accepting-char


This document was generated on 6 July 2001 using texi2html 1.56k.