panthema / 2007 / stx-exparser / stx-exparser-0.7 / libstx-exparser / ExpressionParser.dox (Download File)
// $Id: ExpressionParser.dox 59 2007-07-17 14:43:23Z tb $

/*
 * STX Expression Parser C++ Framework v0.7
 * Copyright (C) 2007 Timo Bingmann
 *
 * This library is free software; you can redistribute it and/or modify it
 * under the terms of the GNU Lesser General Public License as published by the
 * Free Software Foundation; either version 2.1 of the License, or (at your
 * option) any later version.
 *
 * This library is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
 * for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this library; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 */

/** \file ExpressionParser.dox
 * File containing the mainpage for the doxygen api documentation.
 */

/** \mainpage

The STX Expression Parser provides a C++ framework, which can process
user-specified expression strings containing program-specific variables. It can
be integrated into applications to allow user-customized data selection and
filtering. The expresssion strings are intuitive SQL-like WHERE-clauses and can
contain arbitrarily complex arithmetic. At the same time the expression
processing time is guaranteed to be fast enough to safely iterate over larger
data sets.

Originally I wrote this parser framework for my study thesis (see
http://idlebox.net/blogtags/study_thesis_large_graphs). In the thesis millions
of graph edges are processed and organized in an r-tree search index
structure. Each vertex and edge had a number of attributes like "color" or
"importance". These attributes could then be used in filter expressions to
determine which edges are returned by a graph server and displayed on a remote
client. The attributes of the filtered edges could be used to calculate the
returned data set using expressions similar to SQL column selection clauses.

Four instructive example applications are included in the package. See
\ref sec_examples for more.

\section sec_web Website / API Docs / Bugs / License

The current source distribution package can be downloaded from
http://idlebox.net/2007/stx-exparser/

The library source and example programs are extensively documented using
doxygen. The compiled doxygen HTML documentation can be found at
http://idlebox.net/2007/stx-exparser/stx-exparser-0.7-doxygen/
(if you are not reading it right now).

If bugs should become known they will be posted on the above web page together
with patches or corrected versions.

The complete source code is released under the GNU Lesser General Public
License v2.1 (LGPL) which can be found in the file COPYING.

\subsection subsec_webexample Online Examples

An online interactive CGI version of the parser can be accessed at
http://idlebox.net/2007/stx-exparser/online.htt . It displays the inner
workings of the library for a given expression string.

Furthermore a number of example CSV data files can be browsed, sorted, analyzed
and filtered using the csvtool example. See
http://idlebox.net/2007/stx-exparser/csvfilter.htt for an instructive use 
of this library.

A wxWidgets demo program is located in the directory wxparserdemo. Compiled
binary versions can be found at http://idlebox.net/2007/stx-exparser/demo.htt

\section sec_boostspirit Compilation and Boost.Spirit

The expression parser's grammar is implemented using the Boost.Spirit parser
framework (see http://spirit.sourceforge.net). Therefore Boost must be
installed to compile the expression parser library. But as Boost.Spirit is a
set template includes, the resulting static library has no external
dependencies. The second purpose of this library release is to show a
reasonably complex Spirit parser grammar. See \ref subsec_grammar for more
details.

\warning Compiling the library with <tt>gcc 3.3 -O3</tt> can take very long and
a huge amount of RAM. This is due to the complex template classes created by
Boost.Spirit. Compilation with <tt>gcc 4.x</tt> is much faster.

\section sec_exprexample Expression Examples

The expression parser allows arbitrarily complex arithmetic expressions like:
\li <tt>6 + 3 * 12</tt>
\li <tt>(5 + 3) * 5.25</tt>
\li <tt>(int)(30 * 1.4)</tt>
\li <tt>(5 + 1 + 1 + 1) * (4.25 + 0.4 * 2.5 / (3.1 - 0.525 * 4))</tt>

To process program-defined data functions and variables may be included in the
expression:
\li <tt>a * 5 + 3 * b + EXP( LOGN(2) ) + COS( PI() / 2 )</tt>

To enable the expressions to be used as filters many comparison operators and
boolean logic operators are defined:
\li <tt>6 * 9 == 42</tt>
\li <tt>a >= 5 OR (42 <= field2 AND field2 <= 48) || NOT(got == "yes")</tt>

\section sec_examples Extensive Example Programs Including Source Code

The distribution contains two well documented example programs:

\li \ref example_exprcalc "exprcalc: A Simple Expression Calculator"
\li \ref example_csvfilter "csvfilter: A CSV-File Record Filter"

and a third more complex (less documented) example program:

\li \ref example_csvtool "csvtool: An Enhanced CSV-File Record Filter and Sorter"

Furthermore a user-friendly graphical demonstration application is included:

\li \ref example_wxparserdemo "wxParserDemo: wxWidgets Demo Application"

\section sec_design Short Overview of the Library's Design

\subsection subsec_anyscalar Types and the AnyScalar Class

The parser operates on following scalar types:
\li boolean
\li 8-bit '<tt>char</tt>' integer, 16-bit '<tt>short</tt>', 32-bit '<tt>integer</tt>' and 64-bit '<tt>long</tt>' integer
\li 8-bit '<tt>byte</tt>', 16-bit '<tt>word</tt>', 32-bit '<tt>dword</tt>' and 64-bit '<tt>qword</tt>' unsigned integers
\li single and double precision floating point (<tt>float</tt> and <tt>double</tt>)
\li <tt>string</tt>

These data types are processed by the library using the stx::AnyScalar
class. It contains a (type, value) pair of one of the scalar type listed
above. These scalar values can then be added, subtracted, multiplied, divided
or compared using member functions. If the two composed scalar objects are of
unequal type, then the operation is calculated in the "higher" data type (very
similar to C) and returned as such. So a small unsigned integer can be added to
a larger integer or even a string.

The reason to include the smaller integer types is based on the original
purpose of this library. In my study thesis the resulting values were
transfered over a network socket from a graph server to its drawing client. For
this purpose it was important to convert the resulting values into smaller data
types and thus reduce network traffic. For online-filtering applications this
is probably not as important.

\subsection subsec_grammar Spirit Grammar and ParseTree Processing

The expression parser grammar is an extension of the more basic "arithmetic
calculator". It is extended to recognize floating point numbers, quoted string
constants, attribute placeholders, function calls, comparison and boolean
operators.

A user given input string is parsed by Spirit into an abstract syntax tree
(AST). The AST is then processed into a tree of stx::ParseNode objects. During
the construction of the stx::ParseNode objects all constant subtrees are folded
into constant objects. This way repeated evaluation of the ParseTree is
accelerated. At the end the top stx::ParseNode is returned in a stx::ParseTree
enclosure.

The stx::ParseTree's main method is \ref stx::ParseTree::evaluate "evaluate()",
which takes a stx::SymbolTable and recursively evaluates the contained
stx::ParseNode tree using the variable and functions contained in the symbol
table. The result is returned as an stx::AnyScalar object.

\subsection subsec_further Further Details

After this abstract design discussion it is probably best to read the first
\ref example_exprcalc. It contains a comprehensible walk-through of the
libraries interface function and classes.

\author Timo Bingmann
\date 2007-07-17



\page example_exprcalc Example Application: Simple Expression Calculator

The first example application is a simple calculator. It takes the arguments on
the command line as an expresssion string, parses it and evaluates the
resulting tree using a stx::BasicSymbolTable. This page will guide you through
the workings of the example.

\section sec1 Detailed Example Code Guide

\dontinclude simple/exprcalc.cc

\skipline ExpressionParser

The include file ExpressionParser.h contains the required classes and also
includes AnyScalar.h

\skip main
\until }

The first step in main is to collect all command line parameters into a single
std::string. When invoking the exprcalc program from the command line it is
better to quote the whole expression into one argument: <tt>./exprcalc "4 + 19
* 2"</tt>. This will work correctly with special symbols processed by the shell
like <tt>&lt;</tt>, <tt>&gt;</tt> or <tt>!</tt>.

\skip parse
\until }
\until }

Next the concatenated arguments are parsed by the expression parser using
stx::parseExpression(const std::string&). The parse function returns a
stx::ParseTree object by value; the stx::ParseTree object is actually a pimpl
front-end to the top-most stx::ParseNode of the tree.

Many thing can go wrong while parsing the user input string and will result in
parseExpression throwing an exception. The top exception class used by the
parser is stx::ExpressionParserException. If the user enters an invalid
expression like <tt>"3 + "</tt> then a stx::BadSyntaxException is thrown and
will be caught by the catch-block and the application terminated after printing
an error.

If the input could successfully be parsed then the program prints the resulting
parse tree using stx::ParseTree::toString(). This function converts the
internal tree structure back to a string, which can be re-parsed into the same
tree. Usually this string will include extra brackets like <tt>"(4 + (19 * x))"</tt>.

\skip simple symbol table
\until setVariable

Then an instance of stx::BasicSymbolTable is created. stx::BasicSymbolTable is
a derivative of the abstract stx::SymbolTable which is used by the evaluation
functions to fill in variables and functions. The stx::BasicSymbolTable
includes some basic mathematical functions like sin() and exp().

For our simple calculator we define a single variable: <tt>x</tt> is set to <tt>42</tt>.

\until }
\until }
\until }
\until }

Once the symbol table is filled in, the parse tree is evaluated using it. The
evaluation result is returned as a stx::AnyScalar and printed out to the user.

During evaluation variable placeholders are filled in. If the expression
requires a non-existing variable of function, then the standard behaviour of
stx::BasicSymbolTable is to throw a stx::UnknownSymbolException.

Further exceptions can be thrown during different operations on the
stx::AnyScalar objects. These operators will throw a stx::ConversionException
if incompatible types are used. For example boolean + integer is not allowed
and will throw.

\section sec1_complete Complete Example Source Code

The example can be found in the distribution in examples/simple/

\include simple/exprcalc.cc




\page example_csvfilter Example Application: CSV-File Record Filter

The second example is a CSV-file filter. It will read a tab-delimited csv data
file with column headers and apply the given filter expression to each
line. Each line is taken as a record of variables named by the column
header. For each line the expression is evaluated and if it returns a "true"
boolean value, then the line is copied to the output; if it returns a "false"
boolean, the line is skipped. In case the expression value is not a boolean,
the value is printed on stderr.

This example program is not supposed to be an "industrial-strength" CSV file
parser. It cannot handle escaped delimiters or quoted fields. Beware to use it
directly on Excel-generated files.

\section sec_exdata CSV sample files

The distribution contains three sample CSV files for the csvfilter. Two are
taken from MySQL's world sample database, which can be found at
http://dev.mysql.com/doc/world-setup/en/world-setup.html . The data contained
in the world database is originally Copyright Statistics Finland,
http://www.stat.fi/worldinfigures . The third CSV dataset contains some key
statistic features extracted from the
CIA World Factbook (https://www.cia.gov/library/publications/the-world-factbook/).

The mysql-world-city.csv file starts like this:

\verbatim
ID	Name	CountryCode	District	Population
1	Kabul	AFG	Kabol	1780000
2	Qandahar	AFG	Qandahar	237500
3	Herat	AFG	Herat	186800
4	Mazar-e-Sharif	AFG	Balkh	127800
5	Amsterdam	NLD	Noord-Holland	731200
6	Rotterdam	NLD	Zuid-Holland	593321
7	Haag	NLD	Zuid-Holland	440900
8	Utrecht	NLD	Utrecht	234323
9	Eindhoven	NLD	Noord-Brabant	201843
10	Tilburg	NLD	Noord-Brabant	193238
...
\endverbatim

The csvfilter operates on tab (or otherwise) delimited input streams. The first
line of the file is taken to be the column header; the header names can be used
within the filter expression as variables. For each of the following data rows
the expression is evaluated with variables set to the corresponding data
fields.

An example run of the csvfilter could be:
\li <tt>./csvfilter 'Population > 1000000 && CountryCode = "USA"' < mysql-world-city.csv</tt>

This will output the following debug information on stderr: 
\verbatim
Expression string: Population > 1000000 && CountryCode = "USA"
Parsed expression: ((Population > 1000000) && (CountryCode = "USA"))
Reading CSV column headers from input
Read 5 column headers.
Processed 4079 lines, copied 9 and skipped 4070 lines
\endverbatim

and the filtered CSV lines on stdout:
\verbatim
ID      Name    CountryCode     District        Population
3793    New York        USA     New York        8008278
3794    Los Angeles     USA     California      3694820
3795    Chicago USA     Illinois        2896016
3796    Houston USA     Texas   1953631
3797    Philadelphia    USA     Pennsylvania    1517550
3798    Phoenix USA     Arizona 1321045
3799    San Diego       USA     California      1223400
3800    Dallas  USA     Texas   1188580
3801    San Antonio     USA     Texas   1144646
\endverbatim

\section sec1 Detailed Example Code Guide

\dontinclude csvfilter/csvfilter.cc

\skipline ExpressionParser
\until map

First we include the file ExpressionParser.h, which contains the required
classes, and four Standard Template Library files.

\until delimiter =

The example program uses this constant as the delimiter character. It may be
changed if required.

\until }
\until }
\until }

As state above this example program is not supposed to be an
"industrial-strength" CSV file reader. The above function reads one line for
the input file stream and splits it into a std::vector of field strings. No
escaped delimiters or quoted fields are recognized.

\until }
\until }
\until }
\until }
\until }
\until }

This is the main meat of the example: the CSVRowSymbolTable is subclassed from
stx::BasicSymbolTable. This symbol table is used to represent the possible
variables in the expression formula. It is constructed from the header row of
the CSV file and a vector containing the current data row. However the symbol
table object only contains references to the actual vector and map. This way
the main program will be able to modify the data row and re-evaluate the parse
tree using the new row.

The idea behind this set up is to minimize the number of times the input CSV
data fields are copied: the vector containing the string fields is filled once
and then only referenced via the symbol table object if the variable is
actually used.

Another way to implement the symbol table would be to used an
stx::BasicSymbolTable and set the variables using \ref
stx::BasicSymbolTable::setVariable "setVariable()".  However this would be much
slower than the reference symbol table approach shown above.

\until }
\until }
\until }

As shown in \ref example_exprcalc "the first example application" all command
line arguments are collected into a single string. Then the parser in put into
action on the expression string and produces the stx::ParseTree. This time the
parse tree will be used multiple times.

\until headers.size()
\until "\n";

The csvfilter reads the CSV file from the stdin stream. The first line is read
and saved. It is considered to contain the column headers. This header row is
inserted into the header map for faster lookup by the symbol table. The header
row is also outputted to std::cout.

The following loop then iterates over the data rows read from the CSV
input. The symbol table is constructed only once and for each row the
referenced vector "datacolumns" is refilled. Using this symbol table the parse
tree is re-evaluated for each data row.

The evaluation result is checked for a boolean type. In this case a filter
expression was given and the row is either copied to std::cout or skipped
depending on the filter's result.

If the expression returned a non-boolean type, it is taken to be some
calculation and the result is printed on std::cerr.

\until }
\until }
\until }
\until }
\until }
\until }
\until }
\until }
\until }

\section sec1_complete Complete Example Source Code

The example can be found in the distribution in examples/csvfilter/. That
directory also contains three CSV files to test the csvfilter:
mysql-world-city.csv, mysql-world-country.csv and cia-world-factbook.csv

\include csvfilter/csvfilter.cc




\page example_csvtool Example Application: Enhanced CSV-File Record Filter and Sorter

The csvtool example program is an extended version of the \ref
example_csvfilter "csvfilter" program. It buffers all matching lines from the
csv after evaluation of the filter. This buffered table is then sorted using a
"natural sort" relation and outputted to stdout. Offset and limit of the
outputted region can be given.

This tool is used on the expression parser's web site for an Online CSV Filter
Demo: http://idlebox.net/2007/stx-exparser/csvfilter.htt

\section sec1_complete Complete Example Source Code

\include csvtool/csvtool.cc




\page example_wxparserdemo Demo wxWidgets Application: wxParserDemo

The wxParserDemo application is a user-friendly graphical user interface to
demonstrate the STX Expression Parser's functions. It is located in the source
package beneath the directory wxparserdemo.

The demo allows the user to enter an expression string and press "Evaluate" to
run the parser over it. If the expression could be parsed and evaluated
correctly, the result's value and type are displayed. Otherwise an error
message is shown. The demo also has a data input grid to specify variables,
which can be used in the expression.

The demo program uses the cross-platform wxWidgets toolkit and can be compiled
on Linux, Windows and MacOSX.

Some binary executables for Linux and Windows can be downloaded from
http://idlebox.net/2007/stx-exparser/demo.htt

\section sec1_usage Short Demo Usage By Example

The demo application has one large expression input field in the upper half of
the dialog. The given expression can be parsed and evaluated by pressing
"Evaluate". These results are printed in the "Result" tag, which automatically
gets activated.

\image html wxparserdemo1.png

The "Result" tab is separated into two parts: the upper group box contains the
resulting parse tree converted back to a string; or the syntax error if the
expression is invalid. The re-stringified expression usually contains more
brackets than your input expression. It is generated by the \ref
stx::ParseTree::toString() "toString()" function of stx::ParseNode or
stx::ParseTree.

The lower part of the "Result" tab shows the evaluation result, if it could be
computed. Both type and value of the resulting AnyScalar are displayed. If the
tree could not be evaluated, e.g. if an undefined variable was used, then the
exception message is shown.

\image html wxparserdemo2.png

The "XML Tree" tab contains Boost.Spirit's debug XML output. It shows in detail
which Spirit rules are used to compose the parse tree.

\image html wxparserdemo5.png

The demo applications allows input expressions to use variable names. The
(non-standard) variables set before evaluating a parse tree are shown in the
grid input on the "Variables" tab. They can be set by entering data into the
grid. Lines with empty names are ignored.

<table align="center" border="0"><tr>
<td>
\image html wxparserdemo3.png
</td>
<td>
\image html wxparserdemo4.png
</td>
</tr></table>


*/