Presentation "Scalable Construction of Text Indexes with Thrill" at IEEE Big Data 2018

Posted on 2018-12-12 16:00 by Timo Bingmann at Permlink with 0 Comments. Tags: #talk #thrill

Today, I gave a presentation of our paper "Scalable Construction of Text Indexes with Thrill" at the IEEE International Conference on Big Data 2018 in Seattle, WA, USA.

The slides of the presentation at the IEEE conference are available here:
slides-Scalable-Construction-of-Text-Indexes-with-Thrill.pdf .

The full paper is available from this webpage: paper-Scalable-Construction-of-Text-Indexes-with-Thrill.pdf or refer to the longer version in my dissertation on scalable suffix array construction.

Download slides-Scalable-Construction-of-Text-Indexes-with-Thrill.pdf

Abstract

The suffix array is the key to efficient solutions for myriads of string processing problems in different application domains, like data compression, data mining, or bioinformatics. With the rapid growth of available data, suffix array construction algorithms have to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms utilizing the new algorithmic big data batch processing framework Thrill, which allows scalable processing of input sizes on distributed systems in orders of magnitude that have not been considered before.

Print Quality of Print-On-Demand Books from Amazon Createspace/KDP, epubli.de, and Ingram Spark

Posted on 2018-11-05 20:00 by Timo Bingmann at Permlink with 0 Comments. Tags: #dissertation

Having just finished my PhD thesis in computer science (see the corresponding dissertation), I ventured to actually print it as a proper book. In this article I want to share some of my experience with three print-on-demand book publishers:

epubli.de (a smaller German publisher),
Amazon CreateSpace, now called Amazon Kindle Direct Publishing (KDP),
and IngramSpark, which is the self-publishing branch of Ingram.

Disclaimer: these are my experiences, yours will probably be different. And I hope print-on-demand quality will improve further in the future.

Below are macro photographs of the various print proof and other copies I received from the publishers. The photographs were taken with a Samsung Galaxy S9 smartphone with a (cheap) macro lens. Hence the colors and blur in the photos should be considered with caution, but the sharpness and detail level is sufficient for some discussion.

TL;DR: Amazon's print proof from the USA has the nicest print and color, but they only produce paperback covers. IngramSpark's prints are second best, have a lower resolution, but they produce hardcovers and consistent quality.

Amazon Print Proof from the USA

Uploading the PDF to Amazon CreateSpace is straight-forward due to the convenient web interface. The Amazon CreateSpace print proof was manufactured in Lexington, KY, USA, it was shipped three days after ordering, and arrived eleven days after ordering. For an international shipment from the USA to Germany that is a very acceptable delivery time.

The print quality of the Amazon Print Proof was in my option the best. It has the highest resolution, bright colors, and solid black. The paper and entire book has the feel of high-quality color laser printer output. Sadly, they the only produce paperback softcover books and I preferred a hardcover. On the plus side, publishing via Amazon gives you a free CreateSpace-ISBN and it is immediately listed in the world-wide Amazon catalog.

Original PDF

As a comparison to the printed books, the pictures above show the same excepts rendered as a bitmap image. Note that the layout of the words in each cover image may differ, because each cover needed to by typeset individually to the individual specifications of the publisher.

This article continues on the next page ...

Tutorial on Boost.Spirit at C++ User Group Karlsruhe

Posted on 2018-09-12 19:30 by Timo Bingmann at Permlink with 0 Comments. Tags: #talk #c++ #parsing

On September 12th, 2018, I gave another 90min talk with live-coding examples in German at the C++ User Group Karlsruhe in rooms of the Karlsruhe Institute of Technology (KIT).

This time I was asked to present a more advanced topic around C++ and libraries and I chose to present a tutorial on Boost.Spirit.

Boost.Spirit is a parser and generator template meta-programming framework and maybe one of the most crazy and advanced uses of C++. It enables one to write context-free grammars inline as C++ code, which are translated into recursive descent parsers and fully optimized by the compiler.

This powerful framework is however not easy to get started with. I hope my tutorial helps more people to skip the steep learning curve and use Boost.Spirit for securely parsing user input and other structure data.

The tutorial consisted of a set of introduction slides: slides-2018-09-12-Cpp-Meetup.pdf . Followed by a live-coding session in German which was recorded by the KIT (see below for the youtube video).

Download slides-2018-09-12-Cpp-Meetup.pdf

The extensive code examples presented in the live coding session are available on this webpage
or on github: https://github.com/bingmann/2018-cpp-spirit-parsing.

The examples can be seen as instructive templates and copy & paste sources for new development. The examples are:

Learn to walk and parse simple integers and lists.
Parse 5, [5, 42, 69, 256].
Create a parser for a simple arithmetic grammar (and part two).
Parse 5 + 6 * 9 + 42 and evaluate correctly.
Parse CSV data directly into a C++ struct.
Parse AAPL;Apple;252.50; into a struct.
Create an abstract syntax tree (AST) from arithmetic (and part two).
Parse y = 6 * 9 + 42 * x and evaluate with variables.
Ogle some more crazy examples, e.g. how to parse.
<h1>Example for <b>C++ HTML Parser<b></h1>
This HTML <b>snippet</b> parser can also interpret
*Markdown* style and enables additional tags
to <% invoke("C++", 42) %> functions.

Furthermore, a recording of the live-coding in German is available on Youtube:
https://www.youtube.com/watch?v=gYAheppw73U

Dissertation "Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools"

Posted on 2018-07-03 18:00 by Timo Bingmann at Permlink with 1 Comments. Tags: #talk #university #dissertation #frontpage

The road to a Dr. title (PhD in the Anglo-Saxon world) is often long, rough, and twisted. First you have to do original research, produce novel results, publish articles, and then write and ultimately publish a dissertation. Defending your research in the dissertation in front of a panel of professors is one of the final milestones on that journey.

On July 3rd, I successfully defended my dissertation at the Karlsruhe Institute of Technology and the dissertation text has now been published as a book.

Published Dissertation

My final dissertation PDF is available here: dissertation-Bingmann-Scalable-String-and-Suffix-Sorting.pdf .

It has also been published at the KIT library, on arXiv:1808.00963, and finally as a print-on-demand paperback from Amazon.

The published book cover's background shows a list of most common words in the Wikipedia. The words are sorted and their distinguishing prefix is marked in blue. The cover is available as a double-page PDF: dissertation-cover.pdf , and as front and back separately: dissertation-cover-front.pdf and dissertation-cover-back.pdf .

The LaTeX source code for my dissertation is available for download: dissertation-source.zip (791 KiB). The complete text is in one .tex file, all figures (except the creative commons logo) are generated from the LaTeX code.

Dissertation Defense Presentation

The slides of my presentation during the defense are available for download here: dissertation-defense-slides.pdf .

The presentation was only part of the whole defense. My actual slide set also had almost 200 more backup slides, which however were collected from all the other talks already available on this homepage. These backup slides helped greatly in the examination question after the presentation.

Download dissertation-defense-slides.pdf

This article continues on the next page ...

Note about the new tlx library of Advanced C++ Data Structures and Algorithms

Posted on 2018-05-28 18:20 by Timo Bingmann at Permlink with 0 Comments. Tags: #c++

Last year on February 19th, I started a new github repository called tlx with the goal of de-duplicating code from three projects: Thrill, STXXL, and a private project. The idea came up after a STXXL code workshop in Frankfurt (fashionably called hackathons nowadays).

Link to library: http://github.com/tlx/tlx and Doxygen Documentation

The first main common pieces of code were:

the fast loser tree implementations from MCSTL by Johannes Singler necessary for efficient multiway merging,
my die() macros for testing and run-time assertions,
a common intrusive reference counter called counting_ptr, and
simple but vital std::string manipulation functions missing from the STL.

The initial reason for tlx to come about was to consolidate all the bug fixes to the loser tree implementations that I had scattered across the three projects. Efficient multiway merging is such a fundamental task and there was no universally available C++ library that implements the tournament tree well.

A long search for an appropriate vacant user account with three letters on github lead to "tlx". This is definitely a good C++ namespace name, but to this day, it is unclear what the letters stand for. Template Libraries for CXX? The missing Library for CXX? Template Library and more eXtensions. Have your pick, someday someone will find a good official expansion.

Since its inception, tlx has grown a lot. Its goal is to consolidate algorithms and data structures from multiple projects. In a sense tlx maybe aims to be the Boost for advanced algorithms. The goals and constraints of tlx are:

To have a library of well implemented and tested advanced algorithms and things missing from the C++ STL.
Target high modularity with as little dependencies between modules as possible.
Zero external dependencies: no additional libraries are required.
Only have compile time configuration (no platform dependent checks).
Compile on all platforms with C++ -- smartphones, supercomputers, windows, maybe even embedded microcontrollers.
Attempt to never break existing interfaces.
Warning and bug-freeness on all compilers.
Keep overhead down -- small overall size such that is can be included without bloating applications.
Collect code only under the Boost license, which one of the most liberal licenses and can be used any project.

Currently, tlx contains

The fast tournament (loser) trees from MCSTL by Johannes Singler, with many fixes.
A fast intrusive reference counter called CountingPtr, which has considerably less overhead than std::shared_ptr.
Efficient and fast multiway merging algorithms from Johannes Singler, which were previously included with gcc. The tlx version has many fixes and is available for clang and MSVC++.
Many string manipulation algorithms for std::string.
An improved version of my stx-btree implementation, which is basically always a better alternative to std::map (but not std::unordered_map).
A copy of siphash for string hashing.
Efficient sequential string sorting implementation such as radix sort and multikey quicksort (described in length in my PhD thesis).

And much more, which one can find on the front page of the Doxygen Documentation