Lex stands for Lexical Analyzer. Lex is a tool for generating scanners. Scanners are programs that recognize lexical patterns in text. These lexical patterns (or regular expressions) are defined in a particular syntax. A matched regular expression may have an associated action, which may include returning a token. When the generated scanner receives input in the form of a file or text, it attempts to match the text against the regular expressions. It reads input one character at a time and continues until a pattern is matched; if more than one pattern matches, the longest match wins, and a tie goes to the rule listed first. If a pattern can be matched, the scanner performs the associated action. If, on the other hand, no regular expression can be matched, the default action applies: the unmatched character is copied to the output and scanning continues.
Lex and C are tightly coupled. A .lex file (Lex source files conventionally use the extension .lex or .l) is passed through the lex utility, which produces a C source file. This file is compiled to produce an executable version of the lexical analyzer.
Regular expressions in Lex
A regular expression is a pattern description using a meta language. An expression is made up of symbols. Normal symbols are characters and numbers, but there are other symbols that have special meaning in Lex. The following two tables define some of the symbols used in Lex and give a few typical examples.
Defining regular expressions in Lex
Character Meaning
A-Z, 0-9, a-z Characters and numbers that form part of the pattern.
. Matches any character except \n.
- Used to denote range. Example: A-Z implies all characters from A to Z.
[ ] A character class. Matches any character in the brackets. If the first character is ^, the class is negated and matches any character not listed. Example: [abC] matches any one of a, b, or C.
* Matches zero or more occurrences of the preceding pattern.
+ Matches one or more occurrences of the preceding pattern.
? Matches zero or one occurrences of the preceding pattern.
$ Matches end of line as the last character of the pattern.
{ } Indicates how many times a pattern can be present. Example: A{1,3} implies one to three occurrences of A may be present.
\ Used to escape meta characters. Also used to remove the special meaning of characters as defined in this table.
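^ Negation when it is the first character inside [ ]. At the start of a pattern, matches the beginning of a line.
| Logical OR between expressions.
" " Text enclosed in double quotes is matched literally, so most meta characters inside the quotes lose their special meaning.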
/ Look ahead. Matches the preceding pattern only if followed by the succeeding expression. Example: A0/1 matches A0 only if A01 is the input.
( ) Groups a series of regular expressions.
Examples of regular expressions
Regular expression Meaning
joke[rs] Matches either jokes or joker.
A{1,2}shis+ Matches Ashis, AAshis, Ashiss, AAshiss, and so on: one or two A's, then shi, then one or more s's.
(A[b-e])+ Matches one or more occurrences of A followed by any character from b to e.
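The example patterns above can be tried out without running Lex at all. The following is a minimal sketch in C that uses the POSIX regex.h interface, assuming a POSIX system. The helper name matches_whole is hypothetical and not part of Lex. POSIX extended regular expressions are not identical to Lex patterns, but the three examples in the table use only syntax the two share.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Lex): returns 1 if the whole of text
   matches pattern, 0 if it does not, and -1 if the pattern cannot be
   compiled or memory runs out. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    size_t len = strlen(pattern) + 5;   /* room for "^(", ")$" and the NUL */
    char *anchored = malloc(len);
    int rc;

    if (anchored == NULL)
        return -1;

    /* Anchor the pattern so that it must cover the entire string. */
    snprintf(anchored, len, "^(%s)$", pattern);

    rc = regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB);
    free(anchored);
    if (rc != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For instance, matches_whole("joke[rs]", "jokes") returns 1, while matches_whole("joke[rs]", "joke") returns 0 because the character class must match exactly one character.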
Tokens in Lex are declared like variable names in C. Every token has an associated expression. (Examples of tokens and expressions are given in the following table.) Using the examples in our tables, we'll build a word-counting program. Our first task will be to show how tokens are declared.
Examples of token declarations
Token Associated expression Meaning
number ([0-9])+ 1 or more occurrences of a digit
chars [A-Za-z] Any letter
blank " " A blank space
word {chars}+ 1 or more occurrences of chars
variable {chars}+{number}*{chars}*{number}* 1 or more chars, optionally followed by numbers, more chars, and more numbers
Programming in Lex
Programming in Lex can be divided into three steps:
1. Specify the pattern-associated actions in a form that Lex can understand.
2. Run Lex over this file to generate C code for the scanner.
3. Compile and link the C code to produce the executable scanner.
Note: If the scanner is part of a parser developed using Yacc, only steps 1 and 2 should be performed.
Now let's look at the kind of program format that Lex understands. A Lex program is divided into three sections: the first section has global C and Lex declarations, the second section has the patterns and their associated actions (the actions are coded in C), and the third section has supplemental C functions. main(), for example, would typically be found in the third section. These sections are delimited by %%. So, to get back to the word-counting Lex program, let's look at the composition of the various program sections.
Global C and Lex declarations
In this section we can add C variable declarations. We will declare an integer variable here for our word-counting program that holds the number of words counted by the program. We'll also perform token declarations of Lex.
Declarations for the word-counting program
%{
#include <stdio.h>   /* for printf() in main() */
int wordCount = 0;
%}
chars [A-Za-z\_\'\.\"]
numbers ([0-9])+
delim [ \n\t]
whitespace {delim}+
words {chars}+
%%
The double percent sign implies the end of this section and the beginning of the second of the three sections in Lex programming.
Lex rules for matching patterns
Let's look at the Lex rules for describing the token that we want to match. (We'll use C to define what to do when a token is matched.) Continuing with our word-counting program, here are the rules for matching tokens.
Lex rules for the word-counting program
{words} { wordCount++; /* increase the word count by one */ }
{whitespace} { /* do nothing */ }
{numbers} { /* one may want to add some processing here */ }
%%
C code
The third and final section of programming in Lex covers C function declarations (and occasionally the main function). Note that this section has to include the yywrap() function unless the scanner is linked with the Lex library, which supplies a default version. Lex has a set of functions and variables that are available to the user. One of them is yywrap. Typically, yywrap() is defined as shown in the example below.
C code section for the word-counting program
int main(void)
{
    yylex(); /* start the analysis */
    printf("No of words: %d\n", wordCount);
    return 0;
}
int yywrap(void)
{
    return 1; /* no more input */
}
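For comparison, here is a rough hand-written approximation in plain C of what the generated scanner does for the word-counting program. It is only a sketch: the function names count_words and is_word_char are hypothetical, and it assumes a word is a maximal run of letters, underscores, apostrophes, periods, and double quotes, with everything else (digits, whitespace, other characters) simply skipped.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if c is a letter or one of _ ' . "
   The check for '\0' is needed because strchr() also finds the
   terminating NUL of its first argument. */
static int is_word_char(int c)
{
    return c != '\0' && (isalpha(c) || strchr("_'.\"", c) != NULL);
}

/* Counts maximal runs of word characters in text, mirroring the
   "longest match" behaviour of a rule such as {chars}+. */
int count_words(const char *text)
{
    int count = 0;
    size_t i = 0;

    while (text[i] != '\0') {
        if (is_word_char((unsigned char)text[i])) {
            /* Consume the whole run, then count it as one word. */
            while (is_word_char((unsigned char)text[i]))
                i++;
            count++;
        } else {
            i++;    /* digit, whitespace, or anything else: skip it */
        }
    }
    return count;
}
```

The Lex version expresses the same logic declaratively: the loop that consumes a run corresponds to the + operator, and the skip branch corresponds to the rules whose actions do nothing.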
In the preceding sections we've discussed the basic elements of Lex programming, which should help you in writing simple lexical analysis programs.
Putting it all together
The .lex file is the specification for the scanner. It is presented to the Lex program as:
$ lex <file name.lex>
This produces the lex.yy.c file, which can be compiled using a C compiler. It can also be used with a parser to produce an executable, or you can include the Lex library in the link step with the option -ll.
Here are some of Lex's flags:
• -c Indicates C actions and is the default.
• -t Causes the lex.yy.c program to be written instead to standard output.
• -v Provides a two-line summary of statistics.
• -n Will not print out the -v summary.
Advanced Lex
Lex has several functions and variables that provide different information and can be used to build programs that can perform complex functions. Some of these variables and functions, along with their uses, are listed in the following tables.
Lex variables
yyin Of the type FILE*. This points to the current file being parsed by the lexer.
yyout Of the type FILE*. This points to the location where the output of the lexer will be written. By default, yyin points to standard input and yyout to standard output.
yytext The text of the matched pattern is stored in this variable (char*).
yyleng Gives the length of the matched pattern.
yylineno Provides current line number information. (May or may not be supported by the lexer.)
Lex functions
yylex() The function that starts the analysis. It is automatically generated by Lex.
yywrap() This function is called when end of file (or input) is encountered. If this function returns 1, the parsing stops; if it returns 0, the lexer assumes more input is available and continues scanning from yyin. So, this can be used to parse multiple files. Code can be written in the third section, which will allow multiple files to be parsed. The strategy is to make the yyin file pointer (see the preceding table) point to a different file until all the files are parsed. At the end, yywrap() can return 1 to indicate end of parsing.
yyless(int n) This function can be used to push back all but the first n characters of the read token.
yymore() This function tells the lexer to append the next token to the current token.
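The multiple-file strategy described for yywrap() can be sketched in C as follows. This is one possible arrangement, not the only one. The helper open_next_input and the globals inputFiles, inputCount, and nextInput mentioned in the comment are hypothetical names introduced for illustration. Only yyin and yywrap() come from Lex. The helper is plain C so it can be compiled on its own, and the yywrap() that would use it is shown in a comment because it depends on the generated scanner.

```c
#include <stdio.h>

/* Hypothetical helper: tries to open files[*next], files[*next + 1], ...
   for reading, advancing *next past every name it tries. Returns the
   first stream that opens, or NULL when no file is left. */
FILE *open_next_input(char *const *files, int count, int *next)
{
    while (*next < count) {
        FILE *fp = fopen(files[(*next)++], "r");
        if (fp != NULL)
            return fp;
    }
    return NULL;
}

/* In the third section of the Lex file, yywrap() could then look like
 * this (inputFiles, inputCount and nextInput are assumed to be globals
 * that main() fills in, for example from argv + 1 and argc - 1):
 *
 *     int yywrap(void)
 *     {
 *         FILE *fp = open_next_input(inputFiles, inputCount, &nextInput);
 *         if (fp == NULL)
 *             return 1;            // no more input: stop scanning
 *         if (yyin != NULL && yyin != stdin)
 *             fclose(yyin);        // done with the previous file
 *         yyin = fp;               // continue with the next file
 *         return 0;
 *     }
 */
```

Files that cannot be opened are silently skipped in this sketch; a real program would probably want to report them.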