// $Id: ExpressionParser.dox 59 2007-07-17 14:43:23Z tb $ /* * STX Expression Parser C++ Framework v0.7 * Copyright (C) 2007 Timo Bingmann * * This library is free software; you can redistribute it and/or modify it * under the terms of the GNU Lesser General Public License as published by the * Free Software Foundation; either version 2.1 of the License, or (at your * option) any later version. * * This library is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License * for more details. * * You should have received a copy of the GNU Lesser General Public License * along with this library; if not, write to the Free Software Foundation, * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ /** \file ExpressionParser.dox * File containing the mainpage for the doxygen api documentation. */ /** \mainpage The STX Expression Parser provides a C++ framework, which can process user-specified expression strings containing program-specific variables. It can be integrated into applications to allow user-customized data selection and filtering. The expresssion strings are intuitive SQL-like WHERE-clauses and can contain arbitrarily complex arithmetic. At the same time the expression processing time is guaranteed to be fast enough to safely iterate over larger data sets. Originally I wrote this parser framework for my study thesis (see http://idlebox.net/blogtags/study_thesis_large_graphs). In the thesis millions of graph edges are processed and organized in an r-tree search index structure. Each vertex and edge had a number of attributes like "color" or "importance". These attributes could then be used in filter expressions to determine which edges are returned by a graph server and displayed on a remote client. The attributes of the filtered edges could be used to calculate the returned data set using expressions similar to SQL column selection clauses. Four instructive example applications are included in the package. See \ref sec_examples for more. \section sec_web Website / API Docs / Bugs / License The current source distribution package can be downloaded from http://idlebox.net/2007/stx-exparser/ The library source and example programs are extensively documented using doxygen. The compiled doxygen HTML documentation can be found at http://idlebox.net/2007/stx-exparser/stx-exparser-0.7-doxygen/ (if you are not reading it right now). If bugs should become known they will be posted on the above web page together with patches or corrected versions. The complete source code is released under the GNU Lesser General Public License v2.1 (LGPL) which can be found in the file COPYING. \subsection subsec_webexample Online Examples An online interactive CGI version of the parser can be accessed at http://idlebox.net/2007/stx-exparser/online.htt . It displays the inner workings of the library for a given expression string. Furthermore a number of example CSV data files can be browsed, sorted, analyzed and filtered using the csvtool example. See http://idlebox.net/2007/stx-exparser/csvfilter.htt for an instructive use of this library. A wxWidgets demo program is located in the directory wxparserdemo. Compiled binary versions can be found at http://idlebox.net/2007/stx-exparser/demo.htt \section sec_boostspirit Compilation and Boost.Spirit The expression parser's grammar is implemented using the Boost.Spirit parser framework (see http://spirit.sourceforge.net). Therefore Boost must be installed to compile the expression parser library. But as Boost.Spirit is a set template includes, the resulting static library has no external dependencies. The second purpose of this library release is to show a reasonably complex Spirit parser grammar. See \ref subsec_grammar for more details. \warning Compiling the library with <tt>gcc 3.3 -O3</tt> can take very long and a huge amount of RAM. This is due to the complex template classes created by Boost.Spirit. Compilation with <tt>gcc 4.x</tt> is much faster. \section sec_exprexample Expression Examples The expression parser allows arbitrarily complex arithmetic expressions like: \li <tt>6 + 3 * 12</tt> \li <tt>(5 + 3) * 5.25</tt> \li <tt>(int)(30 * 1.4)</tt> \li <tt>(5 + 1 + 1 + 1) * (4.25 + 0.4 * 2.5 / (3.1 - 0.525 * 4))</tt> To process program-defined data functions and variables may be included in the expression: \li <tt>a * 5 + 3 * b + EXP( LOGN(2) ) + COS( PI() / 2 )</tt> To enable the expressions to be used as filters many comparison operators and boolean logic operators are defined: \li <tt>6 * 9 == 42</tt> \li <tt>a >= 5 OR (42 <= field2 AND field2 <= 48) || NOT(got == "yes")</tt> \section sec_examples Extensive Example Programs Including Source Code The distribution contains two well documented example programs: \li \ref example_exprcalc "exprcalc: A Simple Expression Calculator" \li \ref example_csvfilter "csvfilter: A CSV-File Record Filter" and a third more complex (less documented) example program: \li \ref example_csvtool "csvtool: An Enhanced CSV-File Record Filter and Sorter" Furthermore a user-friendly graphical demonstration application is included: \li \ref example_wxparserdemo "wxParserDemo: wxWidgets Demo Application" \section sec_design Short Overview of the Library's Design \subsection subsec_anyscalar Types and the AnyScalar Class The parser operates on following scalar types: \li boolean \li 8-bit '<tt>char</tt>' integer, 16-bit '<tt>short</tt>', 32-bit '<tt>integer</tt>' and 64-bit '<tt>long</tt>' integer \li 8-bit '<tt>byte</tt>', 16-bit '<tt>word</tt>', 32-bit '<tt>dword</tt>' and 64-bit '<tt>qword</tt>' unsigned integers \li single and double precision floating point (<tt>float</tt> and <tt>double</tt>) \li <tt>string</tt> These data types are processed by the library using the stx::AnyScalar class. It contains a (type, value) pair of one of the scalar type listed above. These scalar values can then be added, subtracted, multiplied, divided or compared using member functions. If the two composed scalar objects are of unequal type, then the operation is calculated in the "higher" data type (very similar to C) and returned as such. So a small unsigned integer can be added to a larger integer or even a string. The reason to include the smaller integer types is based on the original purpose of this library. In my study thesis the resulting values were transfered over a network socket from a graph server to its drawing client. For this purpose it was important to convert the resulting values into smaller data types and thus reduce network traffic. For online-filtering applications this is probably not as important. \subsection subsec_grammar Spirit Grammar and ParseTree Processing The expression parser grammar is an extension of the more basic "arithmetic calculator". It is extended to recognize floating point numbers, quoted string constants, attribute placeholders, function calls, comparison and boolean operators. A user given input string is parsed by Spirit into an abstract syntax tree (AST). The AST is then processed into a tree of stx::ParseNode objects. During the construction of the stx::ParseNode objects all constant subtrees are folded into constant objects. This way repeated evaluation of the ParseTree is accelerated. At the end the top stx::ParseNode is returned in a stx::ParseTree enclosure. The stx::ParseTree's main method is \ref stx::ParseTree::evaluate "evaluate()", which takes a stx::SymbolTable and recursively evaluates the contained stx::ParseNode tree using the variable and functions contained in the symbol table. The result is returned as an stx::AnyScalar object. \subsection subsec_further Further Details After this abstract design discussion it is probably best to read the first \ref example_exprcalc. It contains a comprehensible walk-through of the libraries interface function and classes. \author Timo Bingmann \date 2007-07-17 \page example_exprcalc Example Application: Simple Expression Calculator The first example application is a simple calculator. It takes the arguments on the command line as an expresssion string, parses it and evaluates the resulting tree using a stx::BasicSymbolTable. This page will guide you through the workings of the example. \section sec1 Detailed Example Code Guide \dontinclude simple/exprcalc.cc \skipline ExpressionParser The include file ExpressionParser.h contains the required classes and also includes AnyScalar.h \skip main \until } The first step in main is to collect all command line parameters into a single std::string. When invoking the exprcalc program from the command line it is better to quote the whole expression into one argument: <tt>./exprcalc "4 + 19 * 2"</tt>. This will work correctly with special symbols processed by the shell like <tt><</tt>, <tt>></tt> or <tt>!</tt>. \skip parse \until } \until } Next the concatenated arguments are parsed by the expression parser using stx::parseExpression(const std::string&). The parse function returns a stx::ParseTree object by value; the stx::ParseTree object is actually a pimpl front-end to the top-most stx::ParseNode of the tree. Many thing can go wrong while parsing the user input string and will result in parseExpression throwing an exception. The top exception class used by the parser is stx::ExpressionParserException. If the user enters an invalid expression like <tt>"3 + "</tt> then a stx::BadSyntaxException is thrown and will be caught by the catch-block and the application terminated after printing an error. If the input could successfully be parsed then the program prints the resulting parse tree using stx::ParseTree::toString(). This function converts the internal tree structure back to a string, which can be re-parsed into the same tree. Usually this string will include extra brackets like <tt>"(4 + (19 * x))"</tt>. \skip simple symbol table \until setVariable Then an instance of stx::BasicSymbolTable is created. stx::BasicSymbolTable is a derivative of the abstract stx::SymbolTable which is used by the evaluation functions to fill in variables and functions. The stx::BasicSymbolTable includes some basic mathematical functions like sin() and exp(). For our simple calculator we define a single variable: <tt>x</tt> is set to <tt>42</tt>. \until } \until } \until } \until } Once the symbol table is filled in, the parse tree is evaluated using it. The evaluation result is returned as a stx::AnyScalar and printed out to the user. During evaluation variable placeholders are filled in. If the expression requires a non-existing variable of function, then the standard behaviour of stx::BasicSymbolTable is to throw a stx::UnknownSymbolException. Further exceptions can be thrown during different operations on the stx::AnyScalar objects. These operators will throw a stx::ConversionException if incompatible types are used. For example boolean + integer is not allowed and will throw. \section sec1_complete Complete Example Source Code The example can be found in the distribution in examples/simple/ \include simple/exprcalc.cc \page example_csvfilter Example Application: CSV-File Record Filter The second example is a CSV-file filter. It will read a tab-delimited csv data file with column headers and apply the given filter expression to each line. Each line is taken as a record of variables named by the column header. For each line the expression is evaluated and if it returns a "true" boolean value, then the line is copied to the output; if it returns a "false" boolean, the line is skipped. In case the expression value is not a boolean, the value is printed on stderr. This example program is not supposed to be an "industrial-strength" CSV file parser. It cannot handle escaped delimiters or quoted fields. Beware to use it directly on Excel-generated files. \section sec_exdata CSV sample files The distribution contains three sample CSV files for the csvfilter. Two are taken from MySQL's world sample database, which can be found at http://dev.mysql.com/doc/world-setup/en/world-setup.html . The data contained in the world database is originally Copyright Statistics Finland, http://www.stat.fi/worldinfigures . The third CSV dataset contains some key statistic features extracted from the CIA World Factbook (https://www.cia.gov/library/publications/the-world-factbook/). The mysql-world-city.csv file starts like this: \verbatim ID Name CountryCode District Population 1 Kabul AFG Kabol 1780000 2 Qandahar AFG Qandahar 237500 3 Herat AFG Herat 186800 4 Mazar-e-Sharif AFG Balkh 127800 5 Amsterdam NLD Noord-Holland 731200 6 Rotterdam NLD Zuid-Holland 593321 7 Haag NLD Zuid-Holland 440900 8 Utrecht NLD Utrecht 234323 9 Eindhoven NLD Noord-Brabant 201843 10 Tilburg NLD Noord-Brabant 193238 ... \endverbatim The csvfilter operates on tab (or otherwise) delimited input streams. The first line of the file is taken to be the column header; the header names can be used within the filter expression as variables. For each of the following data rows the expression is evaluated with variables set to the corresponding data fields. An example run of the csvfilter could be: \li <tt>./csvfilter 'Population > 1000000 && CountryCode = "USA"' < mysql-world-city.csv</tt> This will output the following debug information on stderr: \verbatim Expression string: Population > 1000000 && CountryCode = "USA" Parsed expression: ((Population > 1000000) && (CountryCode = "USA")) Reading CSV column headers from input Read 5 column headers. Processed 4079 lines, copied 9 and skipped 4070 lines \endverbatim and the filtered CSV lines on stdout: \verbatim ID Name CountryCode District Population 3793 New York USA New York 8008278 3794 Los Angeles USA California 3694820 3795 Chicago USA Illinois 2896016 3796 Houston USA Texas 1953631 3797 Philadelphia USA Pennsylvania 1517550 3798 Phoenix USA Arizona 1321045 3799 San Diego USA California 1223400 3800 Dallas USA Texas 1188580 3801 San Antonio USA Texas 1144646 \endverbatim \section sec1 Detailed Example Code Guide \dontinclude csvfilter/csvfilter.cc \skipline ExpressionParser \until map First we include the file ExpressionParser.h, which contains the required classes, and four Standard Template Library files. \until delimiter = The example program uses this constant as the delimiter character. It may be changed if required. \until } \until } \until } As state above this example program is not supposed to be an "industrial-strength" CSV file reader. The above function reads one line for the input file stream and splits it into a std::vector of field strings. No escaped delimiters or quoted fields are recognized. \until } \until } \until } \until } \until } \until } This is the main meat of the example: the CSVRowSymbolTable is subclassed from stx::BasicSymbolTable. This symbol table is used to represent the possible variables in the expression formula. It is constructed from the header row of the CSV file and a vector containing the current data row. However the symbol table object only contains references to the actual vector and map. This way the main program will be able to modify the data row and re-evaluate the parse tree using the new row. The idea behind this set up is to minimize the number of times the input CSV data fields are copied: the vector containing the string fields is filled once and then only referenced via the symbol table object if the variable is actually used. Another way to implement the symbol table would be to used an stx::BasicSymbolTable and set the variables using \ref stx::BasicSymbolTable::setVariable "setVariable()". However this would be much slower than the reference symbol table approach shown above. \until } \until } \until } As shown in \ref example_exprcalc "the first example application" all command line arguments are collected into a single string. Then the parser in put into action on the expression string and produces the stx::ParseTree. This time the parse tree will be used multiple times. \until headers.size() \until "\n"; The csvfilter reads the CSV file from the stdin stream. The first line is read and saved. It is considered to contain the column headers. This header row is inserted into the header map for faster lookup by the symbol table. The header row is also outputted to std::cout. The following loop then iterates over the data rows read from the CSV input. The symbol table is constructed only once and for each row the referenced vector "datacolumns" is refilled. Using this symbol table the parse tree is re-evaluated for each data row. The evaluation result is checked for a boolean type. In this case a filter expression was given and the row is either copied to std::cout or skipped depending on the filter's result. If the expression returned a non-boolean type, it is taken to be some calculation and the result is printed on std::cerr. \until } \until } \until } \until } \until } \until } \until } \until } \until } \section sec1_complete Complete Example Source Code The example can be found in the distribution in examples/csvfilter/. That directory also contains three CSV files to test the csvfilter: mysql-world-city.csv, mysql-world-country.csv and cia-world-factbook.csv \include csvfilter/csvfilter.cc \page example_csvtool Example Application: Enhanced CSV-File Record Filter and Sorter The csvtool example program is an extended version of the \ref example_csvfilter "csvfilter" program. It buffers all matching lines from the csv after evaluation of the filter. This buffered table is then sorted using a "natural sort" relation and outputted to stdout. Offset and limit of the outputted region can be given. This tool is used on the expression parser's web site for an Online CSV Filter Demo: http://idlebox.net/2007/stx-exparser/csvfilter.htt \section sec1_complete Complete Example Source Code \include csvtool/csvtool.cc \page example_wxparserdemo Demo wxWidgets Application: wxParserDemo The wxParserDemo application is a user-friendly graphical user interface to demonstrate the STX Expression Parser's functions. It is located in the source package beneath the directory wxparserdemo. The demo allows the user to enter an expression string and press "Evaluate" to run the parser over it. If the expression could be parsed and evaluated correctly, the result's value and type are displayed. Otherwise an error message is shown. The demo also has a data input grid to specify variables, which can be used in the expression. The demo program uses the cross-platform wxWidgets toolkit and can be compiled on Linux, Windows and MacOSX. Some binary executables for Linux and Windows can be downloaded from http://idlebox.net/2007/stx-exparser/demo.htt \section sec1_usage Short Demo Usage By Example The demo application has one large expression input field in the upper half of the dialog. The given expression can be parsed and evaluated by pressing "Evaluate". These results are printed in the "Result" tag, which automatically gets activated. \image html wxparserdemo1.png The "Result" tab is separated into two parts: the upper group box contains the resulting parse tree converted back to a string; or the syntax error if the expression is invalid. The re-stringified expression usually contains more brackets than your input expression. It is generated by the \ref stx::ParseTree::toString() "toString()" function of stx::ParseNode or stx::ParseTree. The lower part of the "Result" tab shows the evaluation result, if it could be computed. Both type and value of the resulting AnyScalar are displayed. If the tree could not be evaluated, e.g. if an undefined variable was used, then the exception message is shown. \image html wxparserdemo2.png The "XML Tree" tab contains Boost.Spirit's debug XML output. It shows in detail which Spirit rules are used to compose the parse tree. \image html wxparserdemo5.png The demo applications allows input expressions to use variable names. The (non-standard) variables set before evaluating a parse tree are shown in the grid input on the "Variables" tab. They can be set by entering data into the grid. Lines with empty names are ignored. <table align="center" border="0"><tr> <td> \image html wxparserdemo3.png </td> <td> \image html wxparserdemo4.png </td> </tr></table> */