It will read a tab-delimited csv data file with column headers and apply the given filter expression to each line. Each line is taken as a record of variables named by the column header. For each line the expression is evaluated and if it returns a "true" boolean value, then the line is copied to the output; if it returns a "false" boolean, the line is skipped. In case the expression value is not a boolean, the value is printed on stderr.
This example program is not supposed to be an "industrial-strength" CSV file parser. It cannot handle escaped delimiters or quoted fields. Beware to use it directly on Excel-generated files.
The mysql-world-city.csv file starts like this:
ID Name CountryCode District Population 1 Kabul AFG Kabol 1780000 2 Qandahar AFG Qandahar 237500 3 Herat AFG Herat 186800 4 Mazar-e-Sharif AFG Balkh 127800 5 Amsterdam NLD Noord-Holland 731200 6 Rotterdam NLD Zuid-Holland 593321 7 Haag NLD Zuid-Holland 440900 8 Utrecht NLD Utrecht 234323 9 Eindhoven NLD Noord-Brabant 201843 10 Tilburg NLD Noord-Brabant 193238 ...
The csvfilter operates on tab (or otherwise) delimited input streams. The first line of the file is taken to be the column header; the header names can be used within the filter expression as variables. For each of the following data rows the expression is evaluated with variables set to the corresponding data fields.
An example run of the csvfilter could be:
./csvfilter 'Population > 1000000 && CountryCode = "USA"' < mysql-world-city.csv
Expression string: Population > 1000000 && CountryCode = "USA" Parsed expression: ((Population > 1000000) && (CountryCode = "USA")) Reading CSV column headers from input Read 5 column headers. Processed 4079 lines, copied 9 and skipped 4070 lines
and the filtered CSV lines on stdout:
ID Name CountryCode District Population 3793 New York USA New York 8008278 3794 Los Angeles USA California 3694820 3795 Chicago USA Illinois 2896016 3796 Houston USA Texas 1953631 3797 Philadelphia USA Pennsylvania 1517550 3798 Phoenix USA Arizona 1321045 3799 San Diego USA California 1223400 3800 Dallas USA Texas 1188580 3801 San Antonio USA Texas 1144646
#include "ExpressionParser.h" #include <iostream> #include <string> #include <vector> #include <map>
First we include the file ExpressionParser.h, which contains the required classes, and four Standard Template Library files.
// use this as the delimiter. this can be changed to ';' or ',' if needed const char delimiter = '\t';
The example program uses this constant as the delimiter character. It may be changed if required.
// read one line from instream and split it into tab (or otherwise) delimited // columns. returns the number of columns read, 0 if eof. unsigned int read_csvline(std::istream &instream, std::vector<std::string> &columns) { columns.clear(); // read one line from the input stream std::string line; if (!std::getline(instream, line, '\n').good()) { return 0; } // parse line into tab separated columns, start with inital column columns.push_back(""); for (std::string::const_iterator si = line.begin(); si != line.end(); ++si) { if (*si == delimiter) columns.push_back(""); else // add non-delimiter to last column columns.back() += *si; } return columns.size(); }
As state above this example program is not supposed to be an "industrial-strength" CSV file reader. The above function reads one line for the input file stream and splits it into a std::vector of field strings. No escaped delimiters or quoted fields are recognized.
// subclass stx::BasicSymbolTable and return variable values from the current // csv row. the variable names are defined by the map containing the column // header. class CSVRowSymbolTable : public stx::BasicSymbolTable { public: // maps the column variable name to the vector index const std::map<std::string, unsigned int> &headersmap; // refernce to the reused data row vector. const std::vector<std::string> &datacolumns; CSVRowSymbolTable(const std::map<std::string, unsigned int> &_headersmap, const std::vector<std::string> &_datacolumns) : stx::BasicSymbolTable(), headersmap(_headersmap), datacolumns(_datacolumns) { } virtual stx::AnyScalar lookupVariable(const std::string &varname) const { // look if the variable name is defined by the CSV file std::map<std::string, unsigned int>::const_iterator varfind = headersmap.find(varname); if (varfind == headersmap.end()) { // if not, let BasicSymbolTable check if it knows it return stx::BasicSymbolTable::lookupVariable(varname); } // return the variable value from the current vector. convert it to a // stx::AnyScalar but use the automatic type recognition for input // strings. if(varfind->second < datacolumns.size()) { return stx::AnyScalar().setAutoString( datacolumns[varfind->second] ); } else { return ""; // happens when a data row has too few delimited // fields. } } };
This is the main meat of the example: the CSVRowSymbolTable is subclassed from stx::BasicSymbolTable. This symbol table is used to represent the possible variables in the expression formula. It is constructed from the header row of the CSV file and a vector containing the current data row. However the symbol table object only contains references to the actual vector and map. This way the main program will be able to modify the data row and re-evaluate the parse tree using the new row.
The idea behind this set up is to minimize the number of times the input CSV data fields are copied: the vector containing the string fields is filled once and then only referenced via the symbol table object if the variable is actually used.
Another way to implement the symbol table would be to used an stx::BasicSymbolTable and set the variables using setVariable(). However this would be much slower than the reference symbol table approach shown above.
int main(int argc, char *argv[]) { // collect expression by joining all remaining input arguments std::string args; for(int i = 1; i < argc; i++) { if (!args.empty()) args += " "; args += argv[i]; } std::cerr << "Expression string: " << args << "\n"; // parse expression into a parse tree stx::ParseTree pt; try { pt = stx::parseExpression(args); std::cerr << "Parsed expression: " << pt.toString() << "\n"; } catch (stx::ExpressionParserException &e) { std::cerr << "ExpressionParserException: " << e.what() << "\n"; return 0; }
As shown in the first example application all command line arguments are collected into a single string. Then the parser in put into action on the expression string and produces the stx::ParseTree. This time the parse tree will be used multiple times.
// read first line of CSV input as column headers std::cerr << "Reading CSV column headers from input\n"; std::vector<std::string> headers; if (read_csvline(std::cin, headers) == 0) { std::cerr << "Error read column headers: no input\n"; return 0; } std::cerr << "Read " << headers.size() << " column headers.\n"; // create a header column lookup map for CSVRowSymbolTable and output the // column header line to std::cout std::map<std::string, unsigned int> headersmap; for(unsigned int headnum = 0; headnum < headers.size(); ++headnum) { headersmap[ headers[headnum] ] = headnum; if (headnum != 0) std::cout << delimiter; std::cout << headers[headnum]; } std::cout << "\n";
The csvfilter reads the CSV file from the stdin stream. The first line is read and saved. It is considered to contain the column headers. This header row is inserted into the header map for faster lookup by the symbol table. The header row is also outputted to std::cout.
The following loop then iterates over the data rows read from the CSV input. The symbol table is constructed only once and for each row the referenced vector "datacolumns" is refilled. Using this symbol table the parse tree is re-evaluated for each data row.
The evaluation result is checked for a boolean type. In this case a filter expression was given and the row is either copied to std::cout or skipped depending on the filter's result.
If the expression returned a non-boolean type, it is taken to be some calculation and the result is printed on std::cerr.
// iterate over the data lines of the CSV input unsigned int linesprocessed = 0, linesskipped = 0; std::vector<std::string> datacolumns; CSVRowSymbolTable csvsymboltable(headersmap, datacolumns); while( read_csvline(std::cin, datacolumns) > 0 ) { // evaluate the expression for each row using the headers/datacolumns // as variables try { linesprocessed++; stx::AnyScalar val = pt.evaluate( csvsymboltable ); if (val.isBooleanType()) { if (!val.getBoolean()) { linesskipped++; continue; } } else { std::cerr << "evaluated: " << val << "\n"; } // output this data row to std::cout for(std::vector<std::string>::const_iterator coliter = datacolumns.begin(); coliter != datacolumns.end(); ++coliter) { if (coliter != datacolumns.begin()) std::cout << delimiter; std::cout << *coliter; } std::cout << "\n"; } catch (stx::UnknownSymbolException &e) { std::cerr << "evaluated: UnknownSymbolException: " << e.what() << "\n"; } catch (stx::ExpressionParserException &e) { std::cerr << "evaluated: ExpressionParserException: " << e.what() << "\n"; } } std::cerr << "Processed " << linesprocessed << " lines, " << "copied " << (linesprocessed - linesskipped) << " and " << "skipped " << linesskipped << " lines" << "\n"; }
// $Id: csvfilter.cc 59 2007-07-17 14:43:23Z tb $ /* * STX Expression Parser C++ Framework v0.7 * Copyright (C) 2007 Timo Bingmann * * This library is free software; you can redistribute it and/or modify it * under the terms of the GNU Lesser General Public License as published by the * Free Software Foundation; either version 2.1 of the License, or (at your * option) any later version. * * This library is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License * for more details. * * You should have received a copy of the GNU Lesser General Public License * along with this library; if not, write to the Free Software Foundation, * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ // CSV Parser and Filter using the Expression Parser #include "ExpressionParser.h" #include <iostream> #include <string> #include <vector> #include <map> // use this as the delimiter. this can be changed to ';' or ',' if needed const char delimiter = '\t'; // read one line from instream and split it into tab (or otherwise) delimited // columns. returns the number of columns read, 0 if eof. unsigned int read_csvline(std::istream &instream, std::vector<std::string> &columns) { columns.clear(); // read one line from the input stream std::string line; if (!std::getline(instream, line, '\n').good()) { return 0; } // parse line into tab separated columns, start with inital column columns.push_back(""); for (std::string::const_iterator si = line.begin(); si != line.end(); ++si) { if (*si == delimiter) columns.push_back(""); else // add non-delimiter to last column columns.back() += *si; } return columns.size(); } // subclass stx::BasicSymbolTable and return variable values from the current // csv row. the variable names are defined by the map containing the column // header. class CSVRowSymbolTable : public stx::BasicSymbolTable { public: // maps the column variable name to the vector index const std::map<std::string, unsigned int> &headersmap; // refernce to the reused data row vector. const std::vector<std::string> &datacolumns; CSVRowSymbolTable(const std::map<std::string, unsigned int> &_headersmap, const std::vector<std::string> &_datacolumns) : stx::BasicSymbolTable(), headersmap(_headersmap), datacolumns(_datacolumns) { } virtual stx::AnyScalar lookupVariable(const std::string &varname) const { // look if the variable name is defined by the CSV file std::map<std::string, unsigned int>::const_iterator varfind = headersmap.find(varname); if (varfind == headersmap.end()) { // if not, let BasicSymbolTable check if it knows it return stx::BasicSymbolTable::lookupVariable(varname); } // return the variable value from the current vector. convert it to a // stx::AnyScalar but use the automatic type recognition for input // strings. if(varfind->second < datacolumns.size()) { return stx::AnyScalar().setAutoString( datacolumns[varfind->second] ); } else { return ""; // happens when a data row has too few delimited // fields. } } }; int main(int argc, char *argv[]) { // collect expression by joining all remaining input arguments std::string args; for(int i = 1; i < argc; i++) { if (!args.empty()) args += " "; args += argv[i]; } std::cerr << "Expression string: " << args << "\n"; // parse expression into a parse tree stx::ParseTree pt; try { pt = stx::parseExpression(args); std::cerr << "Parsed expression: " << pt.toString() << "\n"; } catch (stx::ExpressionParserException &e) { std::cerr << "ExpressionParserException: " << e.what() << "\n"; return 0; } // read first line of CSV input as column headers std::cerr << "Reading CSV column headers from input\n"; std::vector<std::string> headers; if (read_csvline(std::cin, headers) == 0) { std::cerr << "Error read column headers: no input\n"; return 0; } std::cerr << "Read " << headers.size() << " column headers.\n"; // create a header column lookup map for CSVRowSymbolTable and output the // column header line to std::cout std::map<std::string, unsigned int> headersmap; for(unsigned int headnum = 0; headnum < headers.size(); ++headnum) { headersmap[ headers[headnum] ] = headnum; if (headnum != 0) std::cout << delimiter; std::cout << headers[headnum]; } std::cout << "\n"; // iterate over the data lines of the CSV input unsigned int linesprocessed = 0, linesskipped = 0; std::vector<std::string> datacolumns; CSVRowSymbolTable csvsymboltable(headersmap, datacolumns); while( read_csvline(std::cin, datacolumns) > 0 ) { // evaluate the expression for each row using the headers/datacolumns // as variables try { linesprocessed++; stx::AnyScalar val = pt.evaluate( csvsymboltable ); if (val.isBooleanType()) { if (!val.getBoolean()) { linesskipped++; continue; } } else { std::cerr << "evaluated: " << val << "\n"; } // output this data row to std::cout for(std::vector<std::string>::const_iterator coliter = datacolumns.begin(); coliter != datacolumns.end(); ++coliter) { if (coliter != datacolumns.begin()) std::cout << delimiter; std::cout << *coliter; } std::cout << "\n"; } catch (stx::UnknownSymbolException &e) { std::cerr << "evaluated: UnknownSymbolException: " << e.what() << "\n"; } catch (stx::ExpressionParserException &e) { std::cerr << "evaluated: ExpressionParserException: " << e.what() << "\n"; } } std::cerr << "Processed " << linesprocessed << " lines, " << "copied " << (linesprocessed - linesskipped) << " and " << "skipped " << linesskipped << " lines" << "\n"; }