inf.compilers
Class LexicalAnalyzer

java.lang.Object
  extended by inf.compilers.LexicalAnalyzer

public class LexicalAnalyzer
extends java.lang.Object

This class is a relatively simple lexical analyzer that can be used to break a given input stream of characters into tokens. Token types can be defined using regular expressions. However, in order to keep matching efficient, the algorithm used for matching has certain limitations.

Token Types

Before any input can be tokenized, the token types have to be defined. This is done by defining the regular expressions that represent the individual tokens. To create a new TokenType the function createTokenType(...) has to be used. This function takes either one String, the fixed String that is the token, e.g. "while", or two Strings, the name of the TokenType and the regular expression that defines it, e.g. "integer" and "(+|-)?[0-9]+". Usually TokenTypes have to be stored for later use:

 TokenType intTT = LexicalAnalyzer.createTokenType("integer", "(+|-)?[0-9]+");
 

Fixed String Tokens

The simplest TokenType is one defined by a fixed String, i.e. it matches the given String and nothing else. This is always case-sensitive.

Single Characters

The simplest component of a regular expression is a single character. A single character may be a single input character that passes the Character.isLetterOrDigit(...) test, a backslash followed by an arbitrary character (except alphanumeric characters), or a backslash followed by a lower-case u followed by exactly 4 hexadecimal digits.

 <char> ::= 'char' | '\''char' | '\''u' hexdigit^4
 

A regular expression of this type matches the single character defined in it.

Character Classes

The other primitive component of a regular expression is a character class. This must be defined as a set of characters enclosed in square brackets '[' and ']'. The set may contain single characters (as defined above) or ranges of characters.

 <char-class> ::= '['{<char>|<char-range>}*']'
 <char-range> ::= <char>'-'<char>
 

A regular expression of this type matches a single character enumerated in the class or a single character that lies within any of the defined ranges.
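As an illustration, a token type for a single hexadecimal digit could be defined with a character class. This is only a sketch using the createTokenType(...) function described above; the name "hexdigit" and the variable hexTT are made up for the example:

```java
// hypothetical token type: matches exactly one hexadecimal digit
LexicalAnalyzer.TokenType hexTT =
    LexicalAnalyzer.createTokenType("hexdigit", "[0-9a-fA-F]");
```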

Repeated Regular Expressions

A regular expression may be followed by a repetition character to indicate that this expression may occur multiple times. The repetition characters are '?' (the expression may occur at most once), '+' (the expression must occur at least once) and '*' (the expression may occur any number of times, including not at all).

In addition, round brackets around the inner expression may be used to indicate operator precedence in case the content is a sequence or alternative set of expressions.

 <rpt-expr> ::= <expression> ['?' | '+' | '*']
 

Matching is greedy in the sense that a sub-expression of this type will try to match as much as possible of the input and there will be no backtracking. So, for example, the expression "a*a" will not match the input "aa" as the first sub-expression, "a*", already matches the whole input leaving nothing to match for the second sub-expression. However, "aa*" would match the input "aa".
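This behaviour differs from backtracking engines such as java.util.regex, where "a*a" does match "aa" because the engine gives a character back after the greedy "a*". The following sketch shows the backtracking behaviour for comparison:

```java
import java.util.regex.Pattern;

public class GreedyDemo {
    public static void main(String[] args) {
        // java.util.regex backtracks: "a*" first consumes both characters,
        // then gives one 'a' back so the trailing "a" can still match.
        System.out.println(Pattern.matches("a*a", "aa")); // true
        // Rewriting the expression as "aa*" matches in both engines,
        // including the non-backtracking LexicalAnalyzer.
        System.out.println(Pattern.matches("aa*", "aa")); // true
    }
}
```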

Sequences of Regular Expressions

A sequence of regular expressions is simply defined by the concatenation of the individual sub-expressions. Again, round brackets may be used to indicate precedence.

 <sequence> ::= <expression>+
 

Matching against a sequence simply matches against the sub-expressions in order.

Alternatives of Regular Expressions

A set of alternatives of regular expressions is defined by separating the sub-expressions from each other with the vertical bar character. Again, round brackets may be used to indicate precedence.

 <alternatives> ::= <expression>{'|'<expression>}+
 

Matching against an alternative is again done without backtracking: the first sub-expression that matches the input successfully is taken to be the chosen alternative. So, for example, the expression "(a|aa)b" will not match the input "aab" as the first alternative sub-expression, "a", already matches the first a of the input. Thus, this alternative is taken and the remaining input, "ab", does not match "b". However, "(aa|a)b" would match both "ab" and "aab".
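Again, this differs from backtracking engines such as java.util.regex, which retry the remaining alternatives after a failed match. The following sketch shows the backtracking behaviour for comparison:

```java
import java.util.regex.Pattern;

public class AlternativesDemo {
    public static void main(String[] args) {
        // A backtracking engine retries the second alternative, so
        // java.util.regex accepts "aab" for "(a|aa)b" ...
        System.out.println(Pattern.matches("(a|aa)b", "aab")); // true
        // ... while the non-backtracking LexicalAnalyzer would not.
        // Putting the longer alternative first works in both engines:
        System.out.println(Pattern.matches("(aa|a)b", "aab")); // true
        System.out.println(Pattern.matches("(aa|a)b", "ab"));  // true
    }
}
```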

Using a LexicalAnalyzer

The first step towards using the LexicalAnalyzer class is creating an instance of it. The constructor takes one argument, the size of the buffer that is used for reading characters to be tokenized. This buffer has to be large enough to hold every single token! Since a larger buffer does not decrease efficiency, the size may be set fairly generously:

 LexicalAnalyzer scanner = new LexicalAnalyzer(1024);
 

The next step is the definition of the token types the scanner knows about. This can be done using the function createTokenType(...) and the regular expression syntax as explained above:

 tType1TT = LexicalAnalyzer.createTokenType("tType1");
     ...
 tTypeNTT = LexicalAnalyzer.createTokenType("tTypeN", " ... ");
 

This LexicalAnalyzer can now be used to tokenize multiple input streams of characters. However, they have to be tokenized one after another, not in parallel! Input has to be given to the LexicalAnalyzer in the form of a Reader:

 scanner.setInput(reader);
 

Now the input can be tokenized. This often needs to be done until there is no more input available. Do not use the underlying Reader to test for this as the LexicalAnalyzer maintains its own buffer. Use the function hasMore() to test whether more characters may be tokenized:

 while (scanner.hasMore()) { ... }
 

Getting the token from the LexicalAnalyzer is a two-stage process. First we need to test whether a token of a given type can be matched from the current input position. The function matchesNext(...) performs this test. If the test succeeds the function getMatchedString(...) can be used to retrieve the String that matched the regular expression defined for the token type, and the input position is advanced accordingly:

 if (scanner.matchesNext(tTypesITT)) {
           token = scanner.getMatchedString(tTypesITT);
 }
 

Note that matchesNext(...) and getMatchedString(...) should always occur in pairs. A call to getMatchedString(...) without a prior call to matchesNext(...) or multiple calls to getMatchedString(...) after a single call to matchesNext(...) will result in an Exception. Similarly, omitting the call to getMatchedString() means the Reader for the input is not advanced and input is tokenized from the same position as before. This can be used to test for multiple token types from the current position.

Finally, the LexicalAnalyzer should be closed. This can be done at any stage, including when there are more characters to be read. Closing is important if the LexicalAnalyzer is to be reused for tokenizing another input stream, as calls to setInput(...) will fail while the current LexicalAnalyzer is not closed:

 scanner.close();
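Putting the steps above together, a complete tokenization loop might look as follows. This is only a sketch: the token types wsTT and intTT, the input text and the handling of unmatched input are assumptions for illustration, and the \u escapes are used because a plain space is not a legal single character in the regular expression syntax described above:

```java
import java.io.Reader;
import java.io.StringReader;

LexicalAnalyzer scanner = new LexicalAnalyzer(1024);
// hypothetical token types: whitespace ('\u0020' space, '\u0009' tab,
// '\u000a' newline) and signed integers
LexicalAnalyzer.TokenType wsTT =
    LexicalAnalyzer.createTokenType("ws", "[\\u0020\\u0009\\u000a]+");
LexicalAnalyzer.TokenType intTT =
    LexicalAnalyzer.createTokenType("integer", "(+|-)?[0-9]+");

Reader reader = new StringReader("12 -34 5");
scanner.setInput(reader);
while (scanner.hasMore()) {
    scanner.ignoreToken(wsTT);            // skip any whitespace
    if (!scanner.hasMore()) break;        // input may end in whitespace
    if (scanner.matchesNext(intTT)) {
        System.out.println(scanner.getMatchedString(intTT));
    } else {
        break;                            // unmatched input: give up
    }
}
scanner.close();
```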
 

White Space

To correctly keep track of the current line number, the LexicalAnalyzer needs to know which character represents the newline character. This can be set using the function setNewLineChar(...) at any time; the default value is '\n'. Furthermore, only occurrences of the newline character in whitespace are counted, so the LexicalAnalyzer needs to know which tokens are considered whitespace. The function setWhiteSpace(...) of a TokenType can be used to declare that a token is whitespace; the default is false, i.e. tokens are not considered whitespace. Only newline characters in whitespace will increase the line count.
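For example, a whitespace token type could be declared as follows. This is only a sketch: wsTT is a hypothetical token type, and the \u escapes are used because a plain space is not a legal single character in the regular expression syntax described above:

```java
// '\u0020' is space, '\u0009' tab, '\u000a' newline
LexicalAnalyzer.TokenType wsTT =
    LexicalAnalyzer.createTokenType("ws", "[\\u0020\\u0009\\u000a]+");
wsTT.setWhiteSpace(true);      // newlines inside ws tokens now count
scanner.setNewLineChar('\n');  // the default, shown for clarity
```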

todo: reading character blocks

Author:
Gerhard Wickler

Nested Class Summary
(package private) static class LexicalAnalyzer.AlternativesNode
           This class represents an alternatives node in a regular expression tree.
(package private) static class LexicalAnalyzer.CharClassNode
           This class represents a character class node in a regular expression tree.
(package private) static class LexicalAnalyzer.FixedStringNode
           This class represents a fixed string of characters.
(package private) static class LexicalAnalyzer.RegexNode
           Regular expressions are internally represented as trees.
(package private) static class LexicalAnalyzer.RegexParseState
           
(package private) static class LexicalAnalyzer.RptExprNode
           This class represents a repetition node in a regular expression tree.
(package private) static class LexicalAnalyzer.SequenceNode
           This class represents a sequence node in a regular expression tree.
(package private) static class LexicalAnalyzer.SingleCharNode
           This class represents a single character node in a regular expression tree.
static class LexicalAnalyzer.TokenType
           This class represents a single token type.
 
Constructor Summary
LexicalAnalyzer(int bufferSize)
           This constructor creates a new LexicalAnalyzer that uses an input buffer of the given number of characters.
 
Method Summary
protected  java.lang.Object clone()
          This class does not support cloning and an exception will be thrown if this method is called.
 void close()
           This function closes this LexicalAnalyzer but not its underlying Reader.
static LexicalAnalyzer.TokenType createTokenType(java.lang.String str)
           This function must be called to create a new TokenType that corresponds to a fixed String.
static LexicalAnalyzer.TokenType createTokenType(java.lang.String name, java.lang.String regex)
           This function must be called to create a new TokenType that corresponds to a regular expression.
 boolean equals(java.lang.Object obj)
          This class does not support equality testing and an exception will be thrown if this method is called.
 int getLineNr()
           This function returns the line number on which the current token begins.
 java.lang.String getMatchedString(LexicalAnalyzer.TokenType tType)
           This function returns the String that has been matched by the previous call to matchesNext(...).
 int hashCode()
          This class does not support hashing and an exception will be thrown if this method is called.
 boolean hasMore()
           This function returns whether there is more input available that may be tokenized.
 java.lang.String ignoreToken(LexicalAnalyzer.TokenType ignorableTT)
           This function can be used to ignore comments and/or whitespace.
 boolean matchesNext(LexicalAnalyzer.TokenType tType)
           This function tests whether there is more input to be read for matching from the current Reader and returns true iff a token of the given TokenType can be read next.
 java.lang.String parseToken(LexicalAnalyzer.TokenType tt, java.lang.String msg)
           This function attempts to parse a token of the given type from this LexicalAnalyzer.
 void setInput(java.io.Reader r)
           This function sets the input for this LexicalAnalyzer.
 void setNewLineChar(char nlc)
           This function can be used to set the (single) character that is used to test for new lines.
 java.lang.String toString()
          This class does not support a printable representation and an exception will be thrown if this method is called.
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

LexicalAnalyzer

public LexicalAnalyzer(int bufferSize)

This constructor creates a new LexicalAnalyzer that uses an input buffer of the given number of characters. Note that this limits the length of a token as this has to fit into the input buffer.

Parameters:
bufferSize - the number of characters that can be held in the buffer
Method Detail

clone

protected java.lang.Object clone()
                          throws java.lang.CloneNotSupportedException

This class does not support cloning and an exception will be thrown if this method is called.

Overrides:
clone in class java.lang.Object
Returns:
nothing
Throws:
java.lang.CloneNotSupportedException - will be thrown

createTokenType

public static LexicalAnalyzer.TokenType createTokenType(java.lang.String str)

This function must be called to create a new TokenType that corresponds to a fixed String.

Parameters:
str - the String to be matched (also used as the TokenType's name)
Returns:
a new TokenType that can be used for lexical analysis

createTokenType

public static LexicalAnalyzer.TokenType createTokenType(java.lang.String name,
                                                        java.lang.String regex)
                                                 throws java.text.ParseException

This function must be called to create a new TokenType that corresponds to a regular expression.

Parameters:
name - an arbitrary name for the TokenType
regex - the String representation of the defining regular expression
Returns:
a new TokenType that can be used for lexical analysis
Throws:
java.text.ParseException - if the given String (regex) is not a syntactically correct regular expression

setNewLineChar

public void setNewLineChar(char nlc)

This function can be used to set the (single) character that is used to test for new lines. Only occurrences of this character found in whitespace increase the line number counter. The default newline character is '\n'.

Parameters:
nlc - the new newline character

getLineNr

public int getLineNr()

This function returns the line number on which the current token begins.

Returns:
the current position in the input

setInput

public void setInput(java.io.Reader r)
              throws java.io.IOException

This function sets the input for this LexicalAnalyzer. Note that this LexicalAnalyzer must be closed before a new Reader can be set. This is to ensure that a LexicalAnalyzer cannot be used from multiple threads to tokenize multiple inputs at the same time. For efficiency, however, this function is not synchronized.

Parameters:
r - the Reader from which to scan tokens
Throws:
java.io.IOException - if closing the underlying Reader fails

close

public void close()

This function closes this LexicalAnalyzer but not its underlying Reader. This function must be called before new input can be set.


hasMore

public boolean hasMore()

This function returns whether there is more input available that may be tokenized.

Returns:
true iff there are more input chars available

matchesNext

public boolean matchesNext(LexicalAnalyzer.TokenType tType)
                    throws java.io.IOException

This function tests whether there is more input to be read for matching from the current Reader and returns true iff a token of the given TokenType can be read next. If so, a call to getMatchedString(...) will return this token as a String.

Parameters:
tType - the expected TokenType for which is to be tested
Returns:
true iff a token of the expected type appears next in the input
Throws:
java.io.IOException - if reading fails

getMatchedString

public java.lang.String getMatchedString(LexicalAnalyzer.TokenType tType)
                                  throws java.util.NoSuchElementException

This function returns the String that has been matched by the previous call to matchesNext(...). Note that this function must not be called without a preceding matchesNext(...) call, and a call to matchesNext(...) has to be followed by a call to this function to read the input.

Parameters:
tType - the expected TokenType that the String has matched
Returns:
the String corresponding to the last matched TokenType
Throws:
java.util.NoSuchElementException - if no such token is currently available

parseToken

public java.lang.String parseToken(LexicalAnalyzer.TokenType tt,
                                   java.lang.String msg)
                            throws java.text.ParseException,
                                   java.io.IOException

This function attempts to parse a token of the given type from this LexicalAnalyzer. If the next token does not match this type a ParseException will be thrown. The message contained in this exception always reads "expecting " + token type name + the given msg. If the next token matches the given type the matched String is returned.

Parameters:
tt - the TokenType of the token that is expected as next input
msg - the error String to be appended to the default message
Returns:
the matched String if the token type was as expected
Throws:
java.text.ParseException - if the input does not match the given TokenType
java.io.IOException - may be caused by the underlying Reader
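A typical use, assuming a previously created intTT token type and that an integer token is mandatory at the current input position, might look like this (a sketch; intTT and the message are made up for the example):

```java
// throws ParseException with message "expecting integer after '='"
// if the next token is not an integer
String number = scanner.parseToken(intTT, " after '='");
```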

ignoreToken

public java.lang.String ignoreToken(LexicalAnalyzer.TokenType ignorableTT)
                             throws java.io.IOException

This function can be used to ignore comments and/or whitespace. What constitutes a comment or whitespace is defined by the given TokenType.

Parameters:
ignorableTT - the TokenType defining a whitespace/comment
Returns:
the matched ignorable token or null if there was none
Throws:
java.io.IOException - may be caused by the underlying Reader

equals

public boolean equals(java.lang.Object obj)
               throws java.lang.UnsupportedOperationException

This class does not support equality testing and an exception will be thrown if this method is called.

Overrides:
equals in class java.lang.Object
Parameters:
obj - the object this should be compared to
Returns:
nothing
Throws:
java.lang.UnsupportedOperationException - will be thrown

hashCode

public int hashCode()
             throws java.lang.UnsupportedOperationException

This class does not support hashing and an exception will be thrown if this method is called.

Overrides:
hashCode in class java.lang.Object
Returns:
nothing
Throws:
java.lang.UnsupportedOperationException - will be thrown

toString

public java.lang.String toString()
                          throws java.lang.UnsupportedOperationException

This class does not support a printable representation and an exception will be thrown if this method is called.

Overrides:
toString in class java.lang.Object
Returns:
nothing
Throws:
java.lang.UnsupportedOperationException - will be thrown