Lexical analysis

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function that performs lexical analysis is called a lexical analyzer, lexer, or scanner. A lexer often exists as a single function that is called by a parser or by another function.

Lexical grammar

The specification of a programming language will often include a set of rules that defines the lexer. These rules are usually called regular expressions, and they define the set of possible character sequences that are used to form individual tokens or lexemes. Whitespace is also defined by a regular expression and influences the recognition of other tokens, but it does not itself contribute any tokens.
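
For instance, a toy lexical grammar might contain rules such as the following (the token names and patterns are illustrative assumptions, not taken from any particular language specification):

    IDENTIFIER  =  [A-Za-z_][A-Za-z0-9_]*
    NUMBER      =  [0-9]+
    COMMA       =  ,
    WHITESPACE  =  [ \t\n]+     (matched, but contributes no token)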

Token

A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes the tokens by symbol type. A token can be anything that is useful for processing an input text stream or text file.

A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each '(' is matched with a ')'.
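
A minimal sketch of this in C (the type and symbol names below are illustrative assumptions): a token simply pairs the matched string with the symbol category the lexer assigned to it, and '(' and ')' become two more categories whose balancing is checked later by the parser.

    #include <stdio.h>

    /* Illustrative symbol categories; '(' and ')' are just two more
     * token types, and checking that they balance is the parser's job. */
    enum symbol { IDENTIFIER, NUMBER, COMMA, LPAREN, RPAREN };

    struct token {
        enum symbol type;    /* symbol category assigned by the lexer */
        const char *lexeme;  /* the string of characters that was matched */
    };

    int main(void) {
        struct token t = { NUMBER, "42" };   /* the lexeme "42" categorized as a NUMBER */
        printf("%s -> category %d\n", t.lexeme, (int)t.type);
        return 0;
    }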

Consider this expression in the C programming language:

    sum = 3 + 2;

Tokenized in the following table:

    lexeme    token type
    sum       Identifier
    =         Assignment operator
    3         Integer literal
    +         Addition operator
    2         Integer literal
    ;         End of statement

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.
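
A rough hand-crafted sketch of that loop, using the POSIX regular-expression API rather than a generated automaton (the input string, token names, and patterns are assumptions made for the example): the lexer repeatedly matches one of its patterns at the current position, emits the lexeme with its category, discards whitespace, and reports an error when no pattern matches.

    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    /* Token classes and the regular expressions that define them.
     * The names and patterns are illustrative, not from any real
     * language specification.  Rules are tried in order. */
    struct rule { const char *name; const char *pattern; };

    static const struct rule rules[] = {
        { "WHITESPACE", "^[ \t\n]+" },             /* matched but discarded */
        { "NUMBER",     "^[0-9]+" },
        { "IDENTIFIER", "^[A-Za-z_][A-Za-z0-9_]*" },
        { "OPERATOR",   "^[-+*/=;]" }
    };

    int main(void) {
        const char *input = "sum = 3 + 2;";       /* example input stream */
        const char *p = input;
        size_t i, n = sizeof rules / sizeof rules[0];
        regex_t re[sizeof rules / sizeof rules[0]];

        /* Compile each token-defining regular expression once. */
        for (i = 0; i < n; i++)
            if (regcomp(&re[i], rules[i].pattern, REG_EXTENDED) != 0)
                return 1;

        while (*p != '\0') {
            regmatch_t m;
            int matched = 0;
            for (i = 0; i < n; i++) {
                if (regexec(&re[i], p, 1, &m, 0) == 0) {
                    if (strcmp(rules[i].name, "WHITESPACE") != 0)
                        printf("%-10s \"%.*s\"\n",
                               rules[i].name, (int)m.rm_eo, p);
                    p += m.rm_eo;                  /* consume the matched lexeme */
                    matched = 1;
                    break;
                }
            }
            if (!matched) {                        /* invalid token */
                fprintf(stderr, "lexical error at: %s\n", p);
                return 1;
            }
        }
        return 0;
    }

Trying the rules in a fixed order is a simplification; a generator such as lex combines all of the patterns into a single automaton and always takes the longest possible match, breaking ties by rule order.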

Tokenizing is followed by parsing. From there, the parsed data may be loaded into data structures for general use, interpretation, or compiling.

Scanner

The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows, and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule). In some languages, the lexeme creation rules are more complicated and may involve backtracking over previously read characters.
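
A minimal hand-written sketch of such a scanner in C (the input string and the token kinds are assumptions for illustration): each branch keeps consuming characters for as long as they can still belong to the current token, which is the maximal munch rule in action.

    #include <ctype.h>
    #include <stdio.h>

    /* A scanner sketch based on a simple state machine: the first
     * non-whitespace character selects the kind of token, and characters
     * are then consumed one at a time until one is found that cannot
     * belong to that token (maximal munch / longest match). */
    int main(void) {
        const char *p = "count1 = count1 + 42;";

        while (*p != '\0') {
            if (isspace((unsigned char)*p)) {          /* whitespace: skip, no token */
                p++;
            } else if (isdigit((unsigned char)*p)) {   /* integer: digits only */
                const char *start = p;
                while (isdigit((unsigned char)*p))
                    p++;
                printf("NUMBER     \"%.*s\"\n", (int)(p - start), start);
            } else if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier */
                const char *start = p;
                while (isalnum((unsigned char)*p) || *p == '_')
                    p++;
                printf("IDENTIFIER \"%.*s\"\n", (int)(p - start), start);
            } else {                                   /* single-character token */
                printf("SYMBOL     \"%c\"\n", *p);
                p++;
            }
        }
        return 0;
    }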
