java.lang.Object
  inf.compilers.LexicalAnalyzer
public class LexicalAnalyzer
This class is a relatively simple lexical analyzer that can be used to break a given input stream of characters into tokens. Token types can be defined using regular expressions. However, in order to keep matching efficient, the algorithm used for matching has certain limitations.
Before any input can be tokenized, the token types have to be defined. This is done by defining the regular expressions that represent the individual tokens. To create a new TokenType, the function createTokenType(...) has to be used. This function takes one or two Strings as arguments: either the fixed String that is the token, e.g. "while", or the name of the TokenType and the regular expression that defines it, e.g. "integer" and "(+|-)?[0-9]+". Usually TokenTypes have to be stored for later use:
TokenType intTT = LexicalAnalyzer.createTokenType("integer", "(+|-)?[0-9]+");
The simplest TokenType is one defined by a fixed String, i.e. it matches the given String and nothing else. This is always case-sensitive.
The simplest component of a regular expression is a single character. A single character may be a single input character that passes the Character.isLetterOrDigit(...) test, a backslash followed by an arbitrary character (except alphanumeric characters), or a backslash followed by a lower-case u followed by exactly 4 hexadecimal digits.
<char> ::= char | '\' char | '\' 'u' hexdigit^4
A regular expression of this type matches the single character defined in it.
The other primitive component of a regular expression is a character class. This must be defined as a set of characters enclosed in square brackets '[' and ']'. Inside the set can be single characters (as defined above) or ranges of characters.
<char-class> ::= '[' {<char> | <char-range>}* ']'
<char-range> ::= <char> '-' <char>
A regular expression of this type matches a single character enumerated in the class or a single character that lies within any of the defined ranges.
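As an aside, character classes and the "integer" example above can be tried out with the standard java.util.regex package (which is a different engine than this class, used here purely for illustration). Note that in java.util.regex the '+' in "(+|-)?" would be a metacharacter, so the sign alternative is written as the character class [+-]:

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    public static void main(String[] args) {
        // The document's "integer" token, "(+|-)?[0-9]+", written for
        // java.util.regex as an optional sign class plus a digit class:
        Pattern integer = Pattern.compile("[+-]?[0-9]+");
        System.out.println(integer.matcher("-42").matches());  // true
        System.out.println(integer.matcher("007").matches());  // true
        System.out.println(integer.matcher("4.2").matches());  // false

        // A class mixing single characters and ranges, e.g. the first
        // character of a typical identifier:
        Pattern identStart = Pattern.compile("[a-zA-Z_]");
        System.out.println(identStart.matcher("_").matches()); // true
        System.out.println(identStart.matcher("9").matches()); // false
    }
}
```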
A regular expression may be followed by a repetition character to indicate that this expression may occur multiple times. The repetition characters are '?' (zero or one occurrence), '+' (one or more occurrences), and '*' (zero or more occurrences).
In addition, round brackets around the inner expression may be used to indicate operator precedence in case the content is a sequence or alternative set of expressions.
<rpt-expr> ::= <expression> ['?' | '+' | '*']
Matching is greedy in the sense that a sub-expression of this type will try to match as much as possible of the input and there will be no backtracking. So, for example, the expression "a*a" will not match the input "aa" as the first sub-expression, "a*", already matches the whole input leaving nothing to match for the second sub-expression. However, "aa*" would match the input "aa".
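This greedy, no-backtracking behaviour can be reproduced in java.util.regex with possessive quantifiers; this is only an analogy (this class has its own matching engine), but it makes the "a*a" example testable:

```java
import java.util.regex.Pattern;

public class GreedyDemo {
    public static void main(String[] args) {
        // Standard java.util.regex quantifiers backtrack, so "a*a"
        // DOES match "aa" there:
        System.out.println(Pattern.matches("a*a", "aa"));   // true

        // The possessive quantifier "*+" never gives characters back,
        // which mirrors the behaviour described above: "a*+" consumes
        // the whole input, leaving nothing for the trailing "a".
        System.out.println(Pattern.matches("a*+a", "aa"));  // false

        // Putting the fixed part first avoids the problem:
        System.out.println(Pattern.matches("aa*+", "aa"));  // true
    }
}
```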
A sequence of regular expressions is simply defined by the concatenation of the individual sub-expressions. Again, round brackets may be used to indicate precedence.
<sequence> ::= <expression>+
Matching against a sequence simply matches against the sub-expressions in order.
A set of alternatives of regular expressions is defined by separating the sub-expressions from each other with the vertical bar character. Again, round brackets may be used to indicate precedence.
<alternatives> ::= <expression> {'|' <expression>}+
Matching against an alternative is again done without backtracking: the first sub-expression that represents a successful match against the input is taken to be the chosen alternative. So, for example, the expression "(a|aa)b" will not match the input "aab" as the first alternative sub-expression, "a", already matches the first a of the input. Thus, this alternative is taken and the remaining input, "ab", does not match "b". However, "(aa|a)b" would match both "ab" and "aab".
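The commit-to-the-first-alternative semantics can likewise be imitated in java.util.regex with an atomic group "(?>...)"; again, this is an analogy for illustration, not the mechanism this class uses internally:

```java
import java.util.regex.Pattern;

public class AlternativesDemo {
    public static void main(String[] args) {
        // Plain java.util.regex alternation backtracks, so "(a|aa)b"
        // matches "aab" there (the engine retries with "aa"):
        System.out.println(Pattern.matches("(a|aa)b", "aab"));   // true

        // An atomic group commits to the first successful alternative,
        // mirroring the semantics described above:
        System.out.println(Pattern.matches("(?>a|aa)b", "aab")); // false

        // Ordering the longer alternative first works in both worlds:
        System.out.println(Pattern.matches("(?>aa|a)b", "aab")); // true
        System.out.println(Pattern.matches("(?>aa|a)b", "ab"));  // true
    }
}
```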
The first step towards using the LexicalAnalyzer class is creating an instance of it. The constructor takes one argument which is the size of the buffer that is used for reading characters to be tokenized. This has to be large enough to hold every single token! As it does not decrease the efficiency, it may be set fairly generously:
LexicalAnalyzer scanner = new LexicalAnalyzer(1024);
The next step is the definition of the token types the scanner knows about. This can be done using the function createTokenType(...) and the regular expression syntax as explained above:
tType1TT = LexicalAnalyzer.createTokenType("tType1");
...
tTypeNTT = LexicalAnalyzer.createTokenType("tTypeN", " ... ");
This LexicalAnalyzer can now be used to tokenize multiple input streams of characters. However, they have to be tokenized one after another, not in parallel! Input has to be given to the LexicalAnalyzer in the form of a Reader:
scanner.setInput(reader);
Now the input can be tokenized. This often needs to be done until there is no more input available. Do not use the underlying Reader to test for this as the LexicalAnalyzer maintains its own buffer. Use the function hasMore() to test whether more characters may be tokenized:
while (scanner.hasMore()) { ... }
Getting the token from the LexicalAnalyzer is a two-stage process. First we need to test whether a token of a given type can be matched from the current input position. The function matchesNext(...) performs this test. If the test succeeds the function getMatchedString(...) can be used to retrieve the String that matched the regular expression defined for the token type, and the input position is advanced accordingly:
if (scanner.matchesNext(tTypeITT)) {
    token = scanner.getMatchedString(tTypeITT);
}
Note that matchesNext(...) and getMatchedString(...) should always occur in pairs. A call to getMatchedString(...) without a prior call to matchesNext(...) or multiple calls to getMatchedString(...) after a single call to matchesNext(...) will result in an Exception. Similarly, omitting the call to getMatchedString() means the Reader for the input is not advanced and input is tokenized from the same position as before. This can be used to test for multiple token types from the current position.
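For readers without the class at hand, the test-then-consume loop can be sketched with java.util.regex: Matcher.lookingAt() plays the role of matchesNext(...), m.group() that of getMatchedString(...), and advancing the region stands in for the internal buffer position. All names and token types here are illustrative, not part of the LexicalAnalyzer API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScannerSketch {
    // Token types tried in order: fixed strings before more general
    // expressions, mirroring how one would test TokenTypes in turn.
    static final Pattern[] TOKEN_TYPES = {
        Pattern.compile("while"),        // fixed-string token
        Pattern.compile("[a-zA-Z]+"),    // identifier
        Pattern.compile("[+-]?[0-9]+"),  // integer
        Pattern.compile(" +"),           // whitespace
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {                   // hasMore()
            boolean matched = false;
            for (Pattern tt : TOKEN_TYPES) {
                Matcher m = tt.matcher(input).region(pos, input.length());
                if (m.lookingAt()) {                     // matchesNext(tt)
                    tokens.add(m.group());               // getMatchedString(tt)
                    pos = m.end();                       // advance the input
                    matched = true;
                    break;
                }
            }
            if (!matched) throw new IllegalStateException("no token at " + pos);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("while x 42 "));
    }
}
```

Failing a lookingAt() test leaves pos untouched, just as omitting getMatchedString(...) leaves the real LexicalAnalyzer's input position unchanged, so several token types can be tried from the same position.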
Finally, the LexicalAnalyzer should be closed. This can be done at any stage, including when there are more characters to be read. Closing is important if the LexicalAnalyzer is to be reused for tokenizing another input stream: calls to setInput(...) will fail while the current LexicalAnalyzer is not closed:
scanner.close();
To correctly keep track of the current line number the LexicalAnalyzer needs to know which character represents the newline character. This can be set using the function setNewLineChar(...) at any time. The default value is '\n'. Furthermore, only occurrences of the newline character in whitespace should be counted and thus, the LexicalAnalyzer needs to know which tokens are considered whitespace. The function setWhiteSpace(...) of a TokenType can be used to declare that a token is whitespace. The default is false, i.e. tokens are not considered whitespace. Only newline characters in whitespace will increase the line count.
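The line-counting rule above can be illustrated with a small stand-alone sketch (the boolean flag stands in for TokenType.setWhiteSpace(true); this is not the actual implementation): only newline characters inside whitespace tokens advance the line number, so a newline embedded in, say, a string-literal token does not:

```java
public class LineCountDemo {
    // Returns the line number on which the token at tokenIndex begins,
    // counting '\n' (the default newline character) only inside tokens
    // that were declared to be whitespace.
    static int lineOfToken(String[] tokens, boolean[] isWhiteSpace, int tokenIndex) {
        int line = 1;
        for (int i = 0; i < tokenIndex; i++) {
            if (isWhiteSpace[i]) {
                for (char c : tokens[i].toCharArray()) {
                    if (c == '\n') line++;
                }
            }
        }
        return line;
    }

    public static void main(String[] args) {
        String[] tokens  = {"x", "\n", "y", "\"a\nb\"", "\n", "z"};
        boolean[] ws     = {false, true, false, false, true, false};
        // The newline inside the string-literal token is NOT counted,
        // so "z" begins on line 3:
        System.out.println(lineOfToken(tokens, ws, 5)); // prints 3
    }
}
```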
todo: reading character blocks
Nested Class Summary

(package private) static class LexicalAnalyzer.AlternativesNode
    This class represents an alternatives node in a regular expression tree.
(package private) static class LexicalAnalyzer.CharClassNode
    This class represents a character class node in a regular expression tree.
(package private) static class LexicalAnalyzer.FixedStringNode
    This class represents a fixed string of characters.
(package private) static class LexicalAnalyzer.RegexNode
    Regular expressions are internally represented as trees.
(package private) static class LexicalAnalyzer.RegexParseState
(package private) static class LexicalAnalyzer.RptExprNode
    This class represents a repetition node in a regular expression tree.
(package private) static class LexicalAnalyzer.SequenceNode
    This class represents a sequence node in a regular expression tree.
(package private) static class LexicalAnalyzer.SingleCharNode
    This class represents a single character node in a regular expression tree.
static class LexicalAnalyzer.TokenType
    This class represents a single token type.
Constructor Summary

LexicalAnalyzer(int bufferSize)
    This constructor creates a new LexicalAnalyzer that uses an input buffer of the given number of characters.
Method Summary

protected java.lang.Object clone()
    This class does not support cloning and an exception will be thrown if this method is called.
void close()
    This function closes this LexicalAnalyzer but not its underlying Reader.
static LexicalAnalyzer.TokenType createTokenType(java.lang.String str)
    This function must be called to create a new TokenType that corresponds to a fixed String.
static LexicalAnalyzer.TokenType createTokenType(java.lang.String name, java.lang.String regex)
    This function must be called to create a new TokenType that corresponds to a regular expression.
boolean equals(java.lang.Object obj)
    This class does not support equality testing and an exception will be thrown if this method is called.
int getLineNr()
    This function returns the line number on which the current token begins.
java.lang.String getMatchedString(LexicalAnalyzer.TokenType tType)
    This function returns the String that has been matched by the previous call to matchesNext(...).
int hashCode()
    This class does not support hashing and an exception will be thrown if this method is called.
boolean hasMore()
    This function returns whether there is more input available that may be tokenized.
java.lang.String ignoreToken(LexicalAnalyzer.TokenType ignorableTT)
    This function can be used to ignore comments and/or whitespace.
boolean matchesNext(LexicalAnalyzer.TokenType tType)
    This function tests whether there is more input to be read for matching from the current Reader and returns true iff a token of the given TokenType can be read next.
java.lang.String parseToken(LexicalAnalyzer.TokenType tt, java.lang.String msg)
    This function attempts to parse a token of the given type from this LexicalAnalyzer.
void setInput(java.io.Reader r)
    This function sets the input for this LexicalAnalyzer.
void setNewLineChar(char nlc)
    This function can be used to set the (single) character that is used to test for new lines.
java.lang.String toString()
    This class does not support a printable representation and an exception will be thrown if this method is called.
Methods inherited from class java.lang.Object

finalize, getClass, notify, notifyAll, wait, wait, wait
Constructor Detail

public LexicalAnalyzer(int bufferSize)

This constructor creates a new LexicalAnalyzer that uses an input buffer of the given number of characters. Note that this limits the length of a token, as a token has to fit into the input buffer.

Parameters:
    bufferSize - the number of characters that can be held in the buffer

Method Detail
protected java.lang.Object clone() throws java.lang.CloneNotSupportedException

This class does not support cloning and an exception will be thrown if this method is called.

Overrides:
    clone in class java.lang.Object
Throws:
    java.lang.CloneNotSupportedException - will be thrown

public static LexicalAnalyzer.TokenType createTokenType(java.lang.String str)

This function must be called to create a new TokenType that corresponds to a fixed String.

Parameters:
    str - the String to be matched (also used as the TokenType's name)
public static LexicalAnalyzer.TokenType createTokenType(java.lang.String name, java.lang.String regex) throws java.text.ParseException

This function must be called to create a new TokenType that corresponds to a regular expression.

Parameters:
    name - an arbitrary name for the TokenType
    regex - the String representation of the defining regular expression
Throws:
    java.text.ParseException - if the given String (regex) is not a syntactically correct regular expression

public void setNewLineChar(char nlc)

This function can be used to set the (single) character that is used to test for new lines. When found in whitespace, and only in whitespace, the line number counter is increased. The default newline character is '\n'.

Parameters:
    nlc - the new newline character

public int getLineNr()

This function returns the line number on which the current token begins.
public void setInput(java.io.Reader r) throws java.io.IOException

This function sets the input for this LexicalAnalyzer. Note that this LexicalAnalyzer must be closed before a new Reader can be set. This is to ensure that a LexicalAnalyzer cannot be used from multiple Threads to tokenize multiple inputs at the same time. For efficiency this function is not synchronized, however.

Parameters:
    r - the Reader from which to scan tokens
Throws:
    java.io.IOException - if closing the underlying Reader fails

public void close()

This function closes this LexicalAnalyzer but not its underlying Reader. This function must be called before new input can be set.

public boolean hasMore()

This function returns whether there is more input available that may be tokenized.
public boolean matchesNext(LexicalAnalyzer.TokenType tType) throws java.io.IOException

This function tests whether there is more input to be read for matching from the current Reader and returns true iff a token of the given TokenType can be read next. If so, a call to getMatchedString(...) will return this token as a String.

Parameters:
    tType - the TokenType to test for
Throws:
    java.io.IOException - if reading fails

public java.lang.String getMatchedString(LexicalAnalyzer.TokenType tType) throws java.util.NoSuchElementException

This function returns the String that has been matched by the previous call to matchesNext(...). Note that this function must not be called without a preceding matchesNext(...) call, and a call to matchesNext(...) has to be followed by a call to this function to read the input.

Parameters:
    tType - the expected TokenType that the String has matched
Throws:
    java.util.NoSuchElementException - if no such token is currently available

public java.lang.String parseToken(LexicalAnalyzer.TokenType tt, java.lang.String msg) throws java.text.ParseException, java.io.IOException
This function attempts to parse a token of the given type from this LexicalAnalyzer. If the next token does not match this type a ParseException will be thrown. The message contained in this exception always reads "expecting " + token type name + given msg. If the next token matches the given type the matched String is returned.

Parameters:
    tt - the TokenType of the token that is expected as next input
    msg - the error String to be appended to the default message
Throws:
    java.text.ParseException - if the input does not match the given TokenType
    java.io.IOException - may be caused by the underlying Reader

public java.lang.String ignoreToken(LexicalAnalyzer.TokenType ignorableTT) throws java.io.IOException

This function can be used to ignore comments and/or whitespace. What constitutes a comment or whitespace is defined by the given TokenType.

Parameters:
    ignorableTT - the TokenType defining a whitespace/comment
Throws:
    java.io.IOException - may be caused by the underlying Reader

public boolean equals(java.lang.Object obj) throws java.lang.UnsupportedOperationException

This class does not support equality testing and an exception will be thrown if this method is called.

Overrides:
    equals in class java.lang.Object
Parameters:
    obj - the object this should be compared to
Throws:
    java.lang.UnsupportedOperationException - will be thrown

public int hashCode() throws java.lang.UnsupportedOperationException

This class does not support hashing and an exception will be thrown if this method is called.

Overrides:
    hashCode in class java.lang.Object
Throws:
    java.lang.UnsupportedOperationException - will be thrown

public java.lang.String toString() throws java.lang.UnsupportedOperationException

This class does not support a printable representation and an exception will be thrown if this method is called.

Overrides:
    toString in class java.lang.Object
Throws:
    java.lang.UnsupportedOperationException - will be thrown