Recursive Descent Parsing

For programming languages, one technique used by compilers is "recursive descent parsing" — the language is described by means of a "grammar" that is represented recursively.  Such a recursive definition naturally leads to a recursive program that processes text based on that grammar.

A very small part of any full programming language is the grammar representing an expression.  For the rest, you may encounter it in a Programming Languages course, but you most certainly will encounter it if you take the Compiler sequence for your senior concentration.

The actual text is broken into what are called "tokens".  The grammar drives how these tokens are combined.

The concepts involved are expression, term, factor, primary, and operand.  "Expression" denotes either a term or the result of the lowest precedence operators, + and –; "term" denotes either a factor or the result of the middle precedence operators, * and /; "factor" denotes either a primary or the result of the highest precedence operator, ^ [denoting exponentiation].  Finally, a primary is either a parenthesized expression or an operand (actual text representing a constant or a variable).

The associativity of each operator is controlled by the form of the recursive definition.  Right-associative operators have self-recursion to the right of the operator; left-associative operators have apparent self-recursion to the left of the operator.  This is stated as "apparent" because actual self-recursion would lead to infinite regress.

The grammar is represented by a set of "productions":  statements of what it means to be an expression, term, etc.  As written here, the grammar gives the word for the item being defined, followed by alternatives allowed by the grammar.  Word surrounded by angle brackets (like <expr>) represent categories rather than actual text, while symbols not in angle brackets represent actual text — the + - * / ^ ( and ).  I have not bothered to give the explicit grammar for constants and variables since you can use your own background in programming to provide that.

<expr>    ::=  <term> | <expr>+<term> | <expr>-<term>
<term>    ::=  <factor> | <term>*<factor> | <term>/<factor>
<factor>  ::=  <primary> | <primary>^<factor>
<primary> ::=  <operand> | (<expr>)
<operand> ::=  <constant> | <variable>

Thus "25" is an expression:  expr → term → factor → primary → operand → constant.

For each production involving a binary operator, the program logic determines which of the available productions is involved after it has processed the left operand.  Thus the code for expr, after processing the left operand, tests whether the next token is an addition-type operator; the code for term, after processing the left operand, tests whether the next token is a multiplication-type operator; and the code for factor, after processing the left operand, tests whether the next token is an exponentiation operator.  For all operators, the first step is processing the left operand.  For the right-associate operator, a simple if statement handles the exponentiation.  For the left-associative operators, however, we can handle the grammar through a while loop:  for expr, as long as the next token is "+" or "-", treat the previous work as having processed an expr, not a term, thus satisfying the grammar.

Into this logical structure goes the actual processing of the text.  For demonstration purposes, the specimen program builds a postfix expression from the parenthesized infix expression represented with no white space and allowing only single letter variable names.  The infix expression is contained in a static string accessible within the class, and the current token is represented by the current position within that string.  [If one were using white-space delimited text for tokens, an alternative representation would be required.]

// expr ::= term | expr+term | expr-term
static String expr()
{  char nxt;
   String result;

   result = term();            // first part of the postfix expression
   if ( current >= input.length() )
      return result;
// Note:  while loop gives left associativity
   nxt  = input.charAt(current);
// If next token in + -, treat result as expr, not term
   while ( nxt == '+' | nxt == '-' )
   {  current++;               // Advance to next token in input
      result += term() + nxt;  // Concatenate right operand and operator
      if ( current >= input.length() )
         break;
      nxt  = input.charAt(current);
   }
   return result;
}

The processing of term is nearly identical, of course, since the productions have the identical form.  The processing of a factor, however, involves right associativity, so that the while simply becomes an if.

// factor ::= primary | primary^factor
static String factor()
{  char nxt;
   String result;

   result = primary();         // first part of the postfix expression
   if ( current >= input.length() )
      return result;
   nxt  = input.charAt(current);
   if ( nxt == '^' )
   {  current++;               // Advance to next token in input
      result += factor() + nxt;// Concatenate right operand and operator
   }
   return result;
}

It is at the primary that we actually get an operand, or else we get a parenthesized expression.  The parentheses are, of course, discarded in the processing since the program is just generating a postfix expression.  This is where the data validation comes in, handled here by means of throwing exceptions either for missing the closing parenthesis or for encountering a non-alphabetic character for the variable name.

// primary ::= variable | (expr) [variable processed as a single letter]
static String primary()
{  char nxt;
   String result;

   if ( current >= input.length() )
      throw new RuntimeException("End of expr:  " + input);
   nxt = input.charAt(current++);     // ALWAYS consume a character
   if ( nxt == '(' )
   {  result = expr();
      if ( current >= input.length() )
         throw new RuntimeException("No close paren in " + input);
      nxt = input.charAt(current++);
      if ( nxt != ')' )
         throw new RuntimeException("No close paren in " + input);
   }
   else if ( Character.isLetter(nxt) )
      result = Character.toString(nxt);
   else
      throw new RuntimeException (nxt + " is illegal in " + input);
   return result;
}

BACK


InfixPostfix.java    The complete program, also given as text below.


InfixPostfix.java

/*
 * Recursive descent expression parsing.  Expression being parsed is
 * passed in the command line or entered by the user.  This supports
 * the operators "+ -" | "* /" | "^"
 *
 * Note that the command-line entry must be enclosed in quotation
 * marks (or else the symbol ^ disappears).
 *
 *
 * Author:  Timothy Rolfe
 */
public class InfixPostfix
{  static String input;
   static int    current = 0;

   public static void main ( String[] args )
   {  String postfix;

      if ( args.length == 0 )
      {  java.util.Scanner console = new java.util.Scanner (System.in);
         System.out.print ("Enter the expression:  ");
         input = console.nextLine();
      }
      else
      {  input = args[0];
         System.out.println("Expression:  " + input + '\n');
      }

      postfix = expr();
      System.out.println("Equivalent postfix expression:  " + postfix);
   }

   // expr == term | expr+term | expr-term
   static String expr()
   {  char nxt;
      String result;

      result = term();            // first part of the postfix expression
      if ( current >= input.length() )
         return result;
   // Note:  while loop gives left associativity
      nxt  = input.charAt(current);
   // If next token in + -, treat result as expr, not term
      while ( nxt == '+' | nxt == '-' )
      {  current++;               // Advance to next token in input
         result += term() + nxt;  // Concatenate right operand and operator
         if ( current >= input.length() )
            break;
         nxt  = input.charAt(current);
      }
      return result;
   }

   // term == factor | term*factor | term/factor
   static String term()
   {  char nxt;
      String result;

      result = factor();
      if ( current >= input.length() )
         return result;
   // Note:  while loop gives left associativity
      nxt  = input.charAt(current);
   // If next token in "*/", treat current as term, not factor
      while ( nxt == '*' | nxt == '/' )
      {  current++;          // Advance to next token in input
         result += factor() + nxt;
         if ( current >= input.length() )
            break;
         nxt  = input.charAt(current);
      }
      return result;
   }

   // factor == primary^factor | primary
   static String factor()
   {  char nxt;
      String result;

      result = primary();         // first part of the postfix expression
      if ( current >= input.length() )
         return result;
      nxt  = input.charAt(current);
      if ( nxt == '^' )
      {  current++;               // Advance to next token in input
         result += factor() + nxt;// Concatenate right operand and operator
      }
      return result;
   }

   // primary == variable | (expr) [variable processed as a single letter]
   static String primary()
   {  char nxt;
      String result;

      if ( current >= input.length() )
         throw new RuntimeException("End of expr:  " + input);
      nxt = input.charAt(current++);     // ALWAYS consume a character
      if ( nxt == '(' )
      {  result = expr();
         if ( current >= input.length() )
            throw new RuntimeException("No close paren in " + input);
         nxt = input.charAt(current++);
         if ( nxt != ')' )
            throw new RuntimeException("No close paren in " + input);
      }
      else if ( Character.isLetter(nxt) )
         result = Character.toString(nxt);
      else
         throw new RuntimeException (nxt + " is illegal in " + input);
      return result;
   }
}
/* Result of a run with command-line argument:

Expression:  (A+B-D*E*F)/(G-H)+I^J^K

Equivalent postfix expression:  AB+DE*F*-GH-/IJK^^+
*/