Constructing Lexical Analyzers with RE/flex
===========================================

<p align="right"><i>by Robert van Engelen, January 16, 2017.<br>Updated March 6, 2017.</i></p>

RE/flex is a free and open-source software alternative to the fast lexical analyzer [Flex](dinosaur.compilertools.net/#flex), adding many new features and capabilities compared to Flex. RE/flex generates lexical analyzers, also known as "scanners", "lexers", or "tokenizers".

Introduction
------------

This article gives an overview of RE/flex with examples that demonstrate the new features. The full documentation [can be found here.](http://www.genivia.com/doc/reflex/html)

### Licensing

RE/flex is released under a permissive open source license (BSD-3) and is available on [GitHub](https://github.com/Genivia/RE-flex) and on [SourceForge.](https://sourceforge.net/projects/re-flex)

### Speed

RE/flex is as fast as Flex (in some cases faster), and much faster than regex libraries such as Boost.Regex, C++11 std::regex, PCRE2, and RE2. For example, tokenizing a representative C source code file into 240 tokens takes:

<table class="doxtable">
<tr><th>Command / Function</th><th>Software</th><th>Time (μs)</th></tr>
<tr><td><b>reflex --fast</b></td><td><b>RE/flex</b></td><td><b>17</b></td></tr>
<tr><td>flex -+ --full</td><td>Flex</td><td>18</td></tr>
<tr><td><b>reflex --full</b></td><td><b>RE/flex</b></td><td><b>39</b></td></tr>
<tr><td>reflex -m=boost-perl</td><td>Boost.Regex (Perl mode)</td><td>288</td></tr>
<tr><td>pcre2_match()</td><td>PCRE2 (pre-compiled)</td><td>318</td></tr>
<tr><td>reflex -m=boost</td><td>Boost.Regex (POSIX mode)</td><td>486</td></tr>
<tr><td>flex -+</td><td>Flex</td><td>3968</td></tr>
<tr><td>RE2::Consume()</td><td>RE2 (pre-compiled)</td><td>5088</td></tr>
<tr><td>std::cregex_iterator()</td><td>C++11 std::regex</td><td>14784</td></tr>
</table>

Note: *Best times of 10 tests, each taking the average time in microseconds over 100 runs (using clang 8.0.0 -O2, 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3).*

### Features

Overall, the design of RE/flex is kept simple and its usage follows Flex closely. RE/flex accepts standard Flex specification syntax and supports Flex options. This means that if you have some prior experience with Flex (Lex), the learning curve to use RE/flex is low. Because RE/flex is compatible with Flex, textbooks and online resources on Flex also apply to RE/flex.
RE/flex borrows the expressive specification syntax of Flex while adding new features and more modern capabilities, such as:

- integrated support for Unicode (%option unicode), auto-detects BOM in files (UTF-8/16/32);
- optional "free space mode" (%option freespace), which considerably improves the readability of lex specifications;
- regular expressions may contain lazy quantifiers (also known as "non-greedy operators", "lazy repeats", or "reluctant quantifiers");
- regular expressions may contain word boundary anchors;
- regular expressions may contain indent/dedent markers for indent/nodent/dedent matching;
- intuitive customization of the C++ lexer class code output (%class and %init);
- efficient matching in direct code or by tables, as options;
- visualization of finite state machines (output FSM in Graphviz format for the Graphviz dot tool);
- generates scanners that are thread-safe by default;
- works with Bison and supports reentrant, bison-bridge, and bison-locations;
- the regex library is extensible with matchers derived from an abstract base matcher class;
- Boost.Regex can be used for matching as one of the regex library extensions (%option matcher=boost);
- conversion of regex expressions, for use with regex engines that lack certain regex features;
- for the brave at heart: use Boost.Regex with Perl mode matching (%option matcher=boost-perl).

Let me highlight some of these improvements briefly.

### Unicode support

In contrast to Flex, you can now use `%option unicode` to write patterns that match Unicode, such as:

    \p{Greek}+   printf("this is greek to me: %s\n", yytext);
    [$€£¥]       printf("currency in %s\n", yytext);
    🐶+          printf("a bunch of dog emojis\n");
    你好         printf("再见\n");
    よかった     printf("ここまで読よんでくれてありがとうございました\n");

In fact, RE/flex comes with standard Unicode character classes and you can also use classes that are predefined for Java, C#, and Python:

    \p{JavaIdentifierStart}\p{JavaIdentifierPart}*   return IDENT;

### Readable specs

Also new is `%option freespace` to enhance the readability of your lex specification with spacing (with this option you must always place actions in `{` and `}`):

    [-+]? [0-9] | [-+]? [1-9] [0-9]+   { yylval.num = strtol(yytext, NULL, 10);
                                         return NUM; }

### POSIX matching with lazy quantifiers

You can now use lazy quantifiers, such as `*?`, to create simple and elegant patterns:

    "/*".*?"*/"   // ignore multiline C/C++ comments
    <!--.*?-->    // ignore XML comments

### Indent, nodent and dedent matching

RE/flex integrates indent and dedent matching directly in the regular expression syntax with `\i` and `\j` escapes, so many scenarios can be covered instead of just some special cases. A simple indent/nodent/dedent matcher is:

    ^[ \t]+     out() << "| ";   // nodent: text is aligned to current indent margin
    ^[ \t]*\i   out() << "> ";   // indent: matched with \i
    ^[ \t]*\j   out() << "< ";   // dedent: matched with \j
    \j          out() << "< ";   // dedent: for each extra level dedented

### Readable source code output

The source code output produced by RE/flex is substantially more readable and understandable compared to Flex. The generated code conforms to solid OOP design principles, with template classes for lexers, matchers, and input sources.

### Compatible with Flex and Bison

RE/flex produces C++ code, but you can still use RE/flex scanners with Bison (Yacc) and any other C code when everything is compiled together with a C++ compiler. When using RE/flex as a replacement for Flex, make sure to use command-line option `--flex`. This makes RE/flex behave as Flex for C++ (i.e. Flex with option -+).

Use options `--flex` and `--bison` to output code that is compatible with Flex C code. When combined, these options generate global non-reentrant "yy" functions and variables, such as `yylex()` and `yytext`, that work seamlessly with Bison (Yacc). Also make sure to include `reflex/include` on your compiler's include path and to link your code against `reflex/lib/libreflex`.
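For example, a minimal sketch of such a build, assuming a lexer specification `lexer.l` and a Bison grammar `parser.y` (hypothetical file names):

[command]

    reflex --flex --bison lexer.l
    bison -d parser.y
    c++ -Ireflex/include -o parser parser.tab.c lex.yy.cpp reflex/lib/libreflex.a

Here `bison -d` generates `parser.tab.c` and `parser.tab.h`, and the Bison-generated parser calls the global `yylex()` produced by reflex options `--flex` and `--bison`.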
### Development status

RE/flex is production stable and comes with several examples:

- a Python 3.0 source code tokenizer
- a Java source code tokenizer
- a C/C++ source code tokenizer (but this one does not support Unicode)
- a JSON tokenizer and parser, which auto-converts UTF encodings to/from wide strings
- an XML tokenizer
- an HTML reformatter

Installation and setup
----------------------

Download RE/flex from [GitHub](https://github.com/Genivia/RE-flex) or [SourceForge](https://sourceforge.net/projects/re-flex) and build with:

[command]

    cd reflex
    ./clean.sh   # clean previous install
    ./build.sh   # build reflex tool and library

Add the location of the reflex tool to your $PATH variable:

[command]

    export PATH=$PATH:/reflex_install_path/bin

Make sure that the RE/flex header files in the include directory are accessible when compiling, i.e. with -Ireflex/include. Also link your code with reflex/lib/libreflex.a.

Optionally, install the library:

[command]

    cd reflex/lib; sudo make install

Optionally, install the reflex command (instead of export PATH):

[command]

    cd reflex/src; sudo make install

Next up: a demonstration of RE/flex through examples, starting with a simple lexer to count Unicode words. The first example keeps things very simple. No prior knowledge of RE/flex or Flex is needed to follow the examples.

First example: a simple lexer to count Unicode words {#example1}
----------------------------------------------------

### Defining the lexer class

Our lexer class will need three variables to record the total character count `ch`, word count `wd`, and line count `nl` of some given input. To declare these, we add them to a `%class` block that will be part of our generated lexer class. We also add the initializations of our lexer to an `%init` block that is used by the lexer constructor. Because we want to print the results when we're done scanning, we also add a public method `results()` to our lexer class:

    %class{
      int ch, wd, nl;
     public:
      void results() {
        cout << setw(8) << nl << setw(8) << wd << setw(8) << ch << endl;
      }
    %}

    %init{
      ch = wd = nl = 0;
    %}

The generated lexer class is named `Lexer` by default. This name can be changed with `%option lexer=NAME`.

### The main program

Our generated lexer class has an `int lex()` method to scan the current input. This input is stdin by default. To change the input we can pass an input source to the constructor, which can be a `FILE*`, an `istream` object, or a (wide) string. We want our lexer to scan files. Our main program opens a file with the file name provided as a command-line argument and instantiates our lexer class to scan it. If no command-line argument is given we scan stdin. To do so, our program constructs a `Lexer` and then invokes `lex()` to scan the entire input until EOF:

    int main(int argc, char **argv)
    {
      FILE *fd = NULL;
      if (argc > 1 && (fd = fopen(argv[1], "r")) == NULL)
        exit(EXIT_FAILURE);
      // create a lexer that consumes a file or reads stdin
      Lexer lexer(fd);
      // here we go!
      lexer.lex();
      // display the results
      lexer.results();
      if (fd)
        fclose(fd);
      return 0;
    }

The benefit of using `FILE*` (instead of `istream`) is that our lexer automatically detects a UTF [Byte Order Mark](https://en.wikipedia.org/wiki/Byte_order_mark) (BOM) in files, so it will accept ASCII and UTF-8, UTF-16, and UTF-32 encodings. Encodings are automatically normalized to UTF-8 internally for matching, so we do not need to apply any options or mess around with our lex specifications.

### Lex specification structure

When `lex()` is invoked in our main program, it will start matching patterns against the input and execute actions when patterns match. We put our patterns and actions in the "rules section" of our lex specification. A specification has the following structure:

[xml]

    DEFINITIONS
    %%
    RULES
    %%
    USER CODE

We define named patterns as macros: they are substituted in patterns. We put named patterns in the "definitions section" of our lex specification, preceded by any `%option` options that affect these patterns.

### Definitions section

For our example we need a pattern that captures newlines to bump the `nl` counter and a pattern that captures words to bump the `wd` counter. The character counter does not need a pattern: it is bumped when matching other patterns and when nothing else matches. We define two named patterns in the definitions section:

    %option unicode

    line   \r?\n
    word   (\w|\p{Punctuation})+

    %%

Unicode matching is enabled with `%option unicode`. Note that the `word` pattern matches a non-empty sequence of Unicode word characters and punctuation characters (it's really up to us to define what a "word" is; for example, the Unix "wc" utility considers everything a word that isn't a space).

### Rules section

The two named patterns defined in the definitions section are used in our rules section to trigger actions that bump our counters:

    %%

    {line}  ch += size(); ++nl;
    {word}  ch += size(); ++wd;
    .       ch += size();

    %%

The `ch` counter is increased by `size()`, the length of a match in bytes. If you want to count the total number of wide characters scanned instead of bytes, use `wsize()`.

### More user code

We should not forget that our lexer class and our main program require the inclusion of C++ definitions to compile. We put these in a `%top` block:

    %top{
      #include <cstdio>
      #include <iostream>
      #include <iomanip>
      using namespace std;
    %}

Like Flex, we could have used `%{` and `%}` instead of `%top`, but if you do so, beware that these declarations appear only after the standard includes and built-in class declarations, such as the abstract base lexer class declaration of our lexer class. Putting critical includes in a `%top` block is recommended.

### Putting it all together

We're now ready to put all of this together in an .l file, say wordcount.l, with the main program added to the user section below:

    %top{
      #include <cstdio>
      #include <iostream>
      #include <iomanip>
      using namespace std;
    %}

    %class{
      int ch, wd, nl;
     public:
      void results() {
        cout << setw(8) << nl << setw(8) << wd << setw(8) << ch << endl;
      }
    %}

    %init{
      ch = wd = nl = 0;
    %}

    %option unicode

    line   \r?\n
    word   (\w|\p{Punctuation})+

    %%

    {line}  ch += size(); ++nl;
    {word}  ch += size(); ++wd;
    .       ch += size();

    %%

    int main(int argc, char **argv)
    {
      FILE *fd = NULL;
      if (argc > 1 && (fd = fopen(argv[1], "r")) == NULL)
        exit(EXIT_FAILURE);
      // create a lexer that consumes a file or reads stdin
      Lexer lexer(fd);
      // here we go!
      lexer.lex();
      // display the results
      lexer.results();
      if (fd)
        fclose(fd);
      return 0;
    }

### Compiling the example

We run the reflex command on our wordcount.l file and compile the generated `lex.yy.cpp`. To compile, we should use the header files located in `reflex/include` and link our code with the RE/flex library `reflex/lib/libreflex.a`:

[command]

    reflex wordcount.l
    c++ -Ireflex/include -o wordcount lex.yy.cpp reflex/lib/libreflex.a

When all goes well and no errors are produced, RE/flex saves the lexer in a `lex.yy.cpp` file. To also generate a header file `lex.yy.h`, use option `--header-file`, which comes in handy when compiling separate project files.

Let's run our program on its own source code:

[command]

    ./wordcount wordcount.l
          45      98     706

You will notice a slight delay before the results are displayed. The design of RE/flex differs from Flex in that it generates a finite state machine (FSM) on demand in our program, but only just once, instead of at compile time. This is the default behavior. To fully compile the FSM ahead of run time, use `%option full`, or `%option fast` for a faster code-based FSM. All RE/flex `%option` options can also be set via command-line options. So we can use `--full` or `--fast` with the reflex command:

[command]

    reflex --full wordcount.l
    c++ -Ireflex/include -o wordcount lex.yy.cpp reflex/lib/libreflex.a

There is one last thing we can improve. The direct use of `cout` in our program is a bit cumbersome. In fact, the lexer constructor takes an optional second argument that is the output stream to write output to. This output stream is referenced via the `out()` method of the lexer. I recommend using `out()` instead of `cout` in the `results()` method. This way, `echo()` (or `ECHO`) will also echo matches to the specified output stream.

RE/flex versus Flex actions {#actions}
---------------------------

After reading the first example, you may wonder why RE/flex has new methods and variables compared to Flex, such as `size()` instead of `yyleng` or `YYLeng()`. In fact, you do not need to rewrite lex specifications that are intended for Flex: with RE/flex option `--flex` you can still use the old "yy" Flex functions and variables. However, there are additional built-in RE/flex methods that Flex does not support.
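For example, the word-counting rules from the first example can be written in either style (a small illustrative sketch; with option `--flex` both forms are available side by side):

    %%

    {line}  ch += size(); ++nl;   // RE/flex action: size() is the match size in bytes
    {word}  ch += yyleng; ++wd;   // classic Flex action enabled by --flex: yyleng, yytext
    .       ch += size();

    %%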
A selection of the most common RE/flex actions and their equivalent Flex actions (enabled with option `--flex`) is shown in this table:

<table class="doxtable">
<tr><th>RE/flex action</th><th>Flex action</th><th>Result</th></tr>
<tr><td><code>text()</code></td><td><code>YYText()</code>, <code>yytext</code></td><td><code>\0</code>-terminated text match</td></tr>
<tr><td><code>size()</code></td><td><code>YYLeng()</code>, <code>yyleng</code></td><td>size of the match in number of bytes</td></tr>
<tr><td><code>wsize()</code></td><td><em>n/a</em></td><td>number of wide chars matched</td></tr>
<tr><td><code>lineno()</code></td><td><code>yylineno</code></td><td>line number of match (&gt;=1)</td></tr>
<tr><td><code>columno()</code></td><td><em>n/a</em></td><td>column number of match (&gt;=0)</td></tr>
<tr><td><code>echo()</code></td><td><code>ECHO</code></td><td><code>out().write(text(), size())</code></td></tr>
<tr><td><code>start()</code></td><td><code>YY_START</code></td><td>get current start condition</td></tr>
<tr><td><code>start(n)</code></td><td><code>BEGIN n</code></td><td>set start condition to <code>n</code></td></tr>
<tr><td><code>push_state(n)</code></td><td><code>yy_push_state(n)</code></td><td>push current state, start <code>n</code></td></tr>
<tr><td><code>pop_state()</code></td><td><code>yy_pop_state()</code></td><td>pop state and make it current</td></tr>
<tr><td><code>top_state()</code></td><td><code>yy_top_state()</code></td><td>get top state start condition</td></tr>
<tr><td><code>in(i)</code></td><td><code>yyin = &amp;i</code></td><td>set input to <code>reflex::Input i</code></td></tr>
<tr><td><code>in()</code></td><td><code>*yyin</code></td><td>get <code>reflex::Input</code> object</td></tr>
<tr><td><code>out(s)</code></td><td><code>yyout = &amp;s</code></td><td>set output to <code>std::ostream s</code></td></tr>
<tr><td><code>out()</code></td><td><code>*yyout</code></td><td>get <code>std::ostream</code> object</td></tr>
<tr><td><code>out().write(s, n)</code></td><td><code>LexerOutput(s, n)</code></td><td>output chars <code>s[0..n-1]</code></td></tr>
<tr><td><code>out().put(c)</code></td><td><code>output(c)</code></td><td>output char <code>c</code></td></tr>
<tr><td><code>matcher().accept()</code></td><td><code>yy_act</code></td><td>number of the matched rule</td></tr>
<tr><td><code>matcher().text()</code></td><td><code>YYText()</code>, <code>yytext</code></td><td>same as <code>text()</code></td></tr>
<tr><td><code>matcher().wtext()</code></td><td><em>n/a</em></td><td><code>std::wstring</code> match</td></tr>
<tr><td><code>matcher().size()</code></td><td><code>YYLeng()</code>, <code>yyleng</code></td><td>same as <code>size()</code></td></tr>
<tr><td><code>matcher().wsize()</code></td><td><em>n/a</em></td><td>size of wide string match</td></tr>
<tr><td><code>matcher().input()</code></td><td><code>yyinput()</code></td><td>get next char from input</td></tr>
<tr><td><code>matcher().winput()</code></td><td><em>n/a</em></td><td>get wide character from input</td></tr>
<tr><td><code>matcher().unput(c)</code></td><td><code>unput(c)</code></td><td>put back char <code>c</code></td></tr>
<tr><td><code>matcher().peek()</code></td><td><em>n/a</em></td><td>peek at next char on input</td></tr>
<tr><td><code>matcher().more()</code></td><td><code>yymore()</code></td><td>concat next match to the match</td></tr>
<tr><td><code>matcher().less(n)</code></td><td><code>yyless(n)</code>
</td><td>shrink match's length to <code>n</code></td></tr>
<tr><td><code>matcher().first()</code></td><td><em>n/a</em></td><td>first pos of match in input</td></tr>
<tr><td><code>matcher().last()</code></td><td><em>n/a</em></td><td>last pos+1 of match in input</td></tr>
<tr><td><code>matcher().rest()</code></td><td><em>n/a</em></td><td>get rest of input until end</td></tr>
<tr><td><code>matcher().at_bob()</code></td><td><em>n/a</em></td><td>true if at the begin of input</td></tr>
<tr><td><code>matcher().at_end()</code></td><td><em>n/a</em></td><td>true if at the end of input</td></tr>
<tr><td><code>matcher().at_bol()</code></td><td><code>YY_AT_BOL()</code></td><td>true if at begin of a newline</td></tr>
</table>

See the [RE/flex documentation](http://www.genivia.com/doc/reflex/html) for more details.

Second example: a JSON parser and writer {#example2}
----------------------------------------

In this example, I will demonstrate several useful techniques with RE/flex to develop a compact parser for JSON.

### JSON

[JSON](http://json.org) is a simple data format to exchange structured information. JSON and its implementations have been heavily [criticized](http://seriot.ch/parsing_json.php) for lack of standardization. No worries though: the implementation I'm presenting here was successfully tested and validated against the JSON data patterns in that critical article and therefore conforms to the so-called JSON "standard".

### A tagged JSON data structure

To hold JSON values we need a tagged data structure. A tagged data structure holds values of various types depending on the tag, which can be used as a value selector. A JSON value is one of undefined (UND), null (NUL), boolean (BOO), number (NUM), string (STR), array (ARR), or object (OBJ), or it is an error (ERR), which is used when parsing failed. These tags and the values we need to store are defined by our `JSON` class:

    class JSON {
     public:
      JSON() : tag(UND) { }
      enum Tag { UND, NUL, BOO, NUM, STR, ARR, OBJ, ERR } tag;
      bool                        boolean;
      double                      number;
      std::wstring                string;
      std::vector<JSON>           array;
      std::map<std::wstring,JSON> object;
    };

JSON arrays are stored as vectors and objects are maps. Because Unicode must be supported, strings and object property names are wide strings.

### Lexer class

To tokenize JSON we should define tokens and token values. Any choice of numeric token encoding will do, but let's define tokens for JSON constants to mean something intuitive, using ASCII code `'0'` for null, `'t'` for true, `'f'` for false, `'#'` for a literal number, and `'$'` for a string literal. Tokens for punctuation are their corresponding ASCII codes, such as `'['` for array and `'{'` for object, and others that we put in a character class `[][}{,:]` to match. Note that we simply put an unescaped `]` up front in the bracket list because the list can't be empty. Tokens `'#'` and `'$'` have token values, which we will store in our lexer class instance as lexer state variables:

    %class{
     protected:
      double number;       // token value for token '#' (number)
      std::wstring string; // token value for token '$' (string)
    %}

The tokens and token values are returned by actions in the rules section of the lex specification. The definitions and rules sections of our lexer are mostly straightforward, except for scanning string literals, which must be converted to wide strings "on the go":

    %o flex freespace dotall unicode

    digit   [0-9]
    digit1  [1-9]
    digits  {digit}+
    int     -? {digit} | -? {digit1} {digits}
    frac    \. {digits}
    exp     [eE] [-+]? {digits}
    number  {int} {frac}? {exp}?
    %x STRING

    %%

    [ \t\r\n]              { /* ignore white space */ }
    [][}{,:]               { return yytext[0]; }
    null                   { return '0'; }
    true                   { return 't'; }
    false                  { return 'f'; }
    {number}               { number = strtod(yytext, NULL); return '#'; }
    \"                     { string.clear(); BEGIN STRING; }

    <STRING>{
    \"                     { BEGIN INITIAL; return '$'; }
    \\ ["\\/]              { string.push_back(yytext[1]); }
    \\ b                   { string.push_back('\b'); }
    \\ f                   { string.push_back('\f'); }
    \\ n                   { string.push_back('\n'); }
    \\ r                   { string.push_back('\r'); }
    \\ t                   { string.push_back('\t'); }
    \\ u [[:xdigit:]]{4}   { string.push_back(strtoul(yytext + 2, NULL, 16)); }
    []-\x7f\x20-[]         { string.push_back(yytext[0]); }
    \p{Non_ASCII_Unicode}  { string.push_back(reflex::utf8(yytext)); }
    }

    <*> .                  { return '!'; /* error */ }

    %%

Let's break this down, shall we?

1. Option `flex` allows us to use the classic "yy" Flex actions and variables such as `yytext` (see the table in the previous section of this article). Option `freespace` improves regex readability by permitting spacing. Option `dotall` forces `.` (dot) to match any character, including newline. Option `unicode` makes `.` (dot) also match Unicode characters, and `\p{Non_ASCII_Unicode}` matches any UTF-8 content that isn't ASCII.
2. The JSON lexical structure is simply copied here from the JSON "standard" into the definitions section with named patterns (macros).
3. As you can see, tokens are returned to the parser by actions in the rules section.
4. The `STRING` condition, declared as an exclusive start condition with `%x STRING`, has several patterns corresponding to the JSON "standard" string contents, which the `STRING` patterns and actions convert to a wide string.
5. The `reflex::utf8(yytext)` call converts the UTF-8-encoded `yytext` matched by `\p{Non_ASCII_Unicode}` to a single wide character.
6. Any non-matched character in any state `<*>.` results in an error token `'!'` being returned to the parser.
7. The `number` and `string` variables are lexer class variables defined by the `%class` block described earlier.

### Parser class

RE/flex generates a `yyFlexLexer` class for this specification, since we use option `%o flex`. Now that we have a lexer class `yyFlexLexer` that returns tokens and holds token values, the next step is to implement a so-called "recursive descent parser" for the JSON syntax. Since the parser needs to invoke the lexer `yylex()` function to retrieve tokens and access token values from the lexer state, we simply derive the JSON parser from the `yyFlexLexer` class. Parsing a value into a JSON data structure with `parse(JSON& data)` amounts to calling `yylex()` and setting the JSON data to the appropriate tagged value:

    class JSONParser : public yyFlexLexer {
     public:
      JSONParser(FILE *fd = NULL) : yyFlexLexer(fd), level(0) { }
      JSON::Tag parse(JSON& data)
      {
        int token = yylex();
        switch (token) {
          case '0': return data.tag = JSON::NUL;
          case 't': data.boolean = true;  return data.tag = JSON::BOO;
          case 'f': data.boolean = false; return data.tag = JSON::BOO;
          case '#': data.number = number; return data.tag = JSON::NUM;
          case '$': data.string = string; return data.tag = JSON::STR;
          case '[': ++level; return parse_array(data);  // bump nesting level, balanced by --level
          case '{': ++level; return parse_object(data); // bump nesting level, balanced by --level
          default : return error(token, data);
        }
      }
     protected:
      ...
      size_t level;
    };

Parsing JSON arrays is done by a protected function `parse_array(JSON& data)` that calls `parse(item)` repeatedly in a loop to parse the array items:

    protected:
      JSON::Tag parse_array(JSON& data)
      {
        for (size_t len = 0; ; ++len) {
          JSON item;
          switch (parse(item)) {
            case JSON::NUL:
            case JSON::BOO:
            case JSON::NUM:
            case JSON::STR:
            case JSON::ARR:
            case JSON::OBJ: data.array.push_back(item); break;
            case JSON::UND: --level; return len == 0 ? data.tag = JSON::ARR : JSON::ERR;
            default       : return JSON::ERR;
          }
          int token = yylex();
          if (token == ']') {
            --level;
            return data.tag = JSON::ARR;
          }
          if (token != ',')
            return JSON::ERR;
        }
        return JSON::ERR;
      }

Capturing empty arrays `[]` properly requires some logic, because `parse(item)` should return `UND` in that case and not an error `ERR`. This is handled by the error function that is called by `parse(JSON& data)` when looking at a `]`:

    protected:
      JSON::Tag error(int token, JSON& data)
      {
        data.tag = JSON::ERR;
        switch (token) {
          case ']': return level > 0 ? JSON::UND : JSON::ERR;
          default : return JSON::ERR;
        }
      }

Parsing JSON objects is done by a protected function `parse_object(JSON& data)` that calls `parse(item)` repeatedly in a loop to parse property values created with `data.object[string]` (the string token value of the lexer):

    protected:
      JSON::Tag parse_object(JSON& data)
      {
        for (size_t len = 0; ; ++len) {
          int token = yylex();
          if (len == 0 && token == '}') {
            --level;
            return data.tag = JSON::OBJ;
          }
          if (token != '$')
            return JSON::ERR;
          JSON& item = data.object[string];
          if (yylex() != ':')
            return JSON::ERR;
          switch (parse(item)) {
            case JSON::NUL:
            case JSON::BOO:
            case JSON::NUM:
            case JSON::STR:
            case JSON::ARR:
            case JSON::OBJ: break;
            default       : return JSON::ERR;
          }
          token = yylex();
          if (token == '}') {
            --level;
            return data.tag = JSON::OBJ;
          }
          if (token != ',')
            return JSON::ERR;
        }
        return JSON::ERR;
      }

### Writing JSON data to a stream

The `<<` operator can be overloaded to write objects to a stream with a custom format. We use this mechanism to write JSON data to a stream. The JSON data is displayed based on its tag:

    std::ostream& operator<<(std::ostream& os, const JSON& data)
    {
      switch (data.tag) {
        case JSON::NUL: os << "null"; break;
        case JSON::BOO: os << (data.boolean ? "true" : "false"); break;
        case JSON::NUM: os << data.number; break;
        case JSON::STR: print_string(os, data.string); break;
        case JSON::ARR:
          os << "[";
          for (std::vector<JSON>::const_iterator i = data.array.begin(); i != data.array.end(); ++i)
            os << (i != data.array.begin() ? "," : "") << *i;
          os << "]";
          break;
        case JSON::OBJ:
          os << "{";
          for (std::map<std::wstring,JSON>::const_iterator i = data.object.begin(); i != data.object.end(); ++i) {
            if (i != data.object.begin())
              os << ",";
            print_string(os, i->first);
            os << ":" << i->second;
          }
          os << "}";
          break;
        default: os << "undefined"; break;
      }
      return os;
    }

Writing JSON strings to a stream is done with `print_string()`, which escapes control characters and converts wide characters to UTF-8:

    void print_string(std::ostream& os, const std::wstring& s)
    {
      os << "\"";
      for (std::wstring::const_iterator i = s.begin(); i != s.end(); ++i) {
        switch (*i) {
          case '"' :
          case '\\': os << "\\" << static_cast<char>(*i); break;
          case '\b': os << "\\b"; break;
          case '\f': os << "\\f"; break;
          case '\n': os << "\\n"; break;
          case '\r': os << "\\r"; break;
          case '\t': os << "\\t"; break;
          default:
            if (*i >= '\x20' && *i <= '\x7f') {
              // emit printable char
              os << static_cast<char>(*i);
            }
            else if (*i < 0x20) {
              // emit \u00xx for control codes
              os << "\\u" << std::internal << std::setw(4) << std::setfill('0') << std::hex << *i << std::dec;
            }
            else {
              // else emit UTF-8
              char buf[8];
              buf[reflex::utf8(*i, buf)] = '\0'; // convert to UTF-8 and make \0-terminated
              os << buf;
            }
        }
      }
      os << "\"";
    }

### The main program

Our main program simply parses JSON and displays it. The code is self-explanatory:

    int main(int argc, char **argv)
    {
      FILE *fd = NULL;
      if (argc > 1 && (fd = fopen(argv[1], "r")) == NULL)
        exit(EXIT_FAILURE);
      JSON data;
      // parse JSON from stdin or a file provided as a command-line argument (FILE* permits UTF BOM detection)
      JSONParser parser(fd);
      if (parser.parse(data) == JSON::ERR) {
        std::cerr << "Error at (" << parser.lineno() << "," << parser.columno() << ") when looking at " << parser.text() << std::endl;
        exit(EXIT_FAILURE);
      }
      // print JSON value
      std::cout << data;
      if (fd)
        fclose(fd);
      return 0;
    }

The benefit of deriving the parser from the lexer class shows here: we can use the `text()`, `lineno()`, and `columno()` lexer methods in our program to report the details of an error. Note that using `yytext` here would not work outside a lexer method scope, but the Flex-like `YYText()` is fine.

Finally, if our test program is part of the .l specification file's user section, then we need to make sure to include the proper C++ definitions by putting these in our .l specification:

    %top{
      #include <stdlib.h>
      #include <vector>
      #include <map>
    %}

Tips and tricks {#tricks}
---------------

### Speeding things up

Use reflex option `--full` to include the full FSM in the generated code. This avoids the startup overhead of constructing an internal FSM from the specified regular expressions when your scanner starts.

Use reflex option `--fast` to include direct code in the generated code. This produces an FSM in C++ code instead of tables and may be faster.

The FSM encoded in tables and in code is compressed to reduce the number of edges by combining overlapping character ranges. This means that the tables and code may not appear to be precise deterministic finite state machines (DFAs).

### Using Boost.Regex

To use Boost.Regex as a matcher engine, use reflex option `--matcher=boost`. This assumes that you've installed Boost.Regex. Optimization options `--full` and `--fast` do not apply.

If you feel adventurous, use reflex option `--matcher=boost-perl` to use Perl-based matching instead of POSIX-based matching. Beware that this changes the matching behavior from the POSIX *leftmost longest match* to the *greedy leftmost* match as in Perl.
### Using RE/flex to convert regex

If you are looking to convert regular expressions with Unicode and UTF-8 to plain 8-bit regex forms, use reflex option `--regexp-file` to generate a text file with the converted Unicode-based expressions (don't forget to use `%option unicode`). These converted expressions can be used in a C/C++ application directly, e.g. with Boost.Regex to scan strings. For example, `\p{Greek}+` translates to the regex "(?m)((?:\\xcd[\\xb0-\\xb3]|\\xcd[\\xb5-\\xb7]|\\xcd[\\xba-\\xbd]|\\xcd\\xbf|\\xce\\x84|\\xce\\x86|\\xce[\\x88-\\x8a]|\\xce\\x8c|\\xce[\\x8e-\\xa1]|\\xce[\\xa3-\\xbf]|\\xcf[\\x80-\\xa1]|\\xcf[\\xb0-\\xbf]|\\xe1\\xb4[\\xa6-\\xaa]|\\xe1\\xb5[\\x9d-\\xa1]|\\xe1\\xb5[\\xa6-\\xaa]|\\xe1\\xb6\\xbf|\\xe1\\xbc[\\x80-\\x95]|\\xe1\\xbc[\\x98-\\x9d]|\\xe1(\\xbc[\\xa0-\\xbf]|\\xbd[\\x80-\\x85])|\\xe1\\xbd[\\x88-\\x8d]|\\xe1\\xbd[\\x90-\\x97]|\\xe1\\xbd\\x99|\\xe1\\xbd\\x9b|\\xe1\\xbd\\x9d|\\xe1\\xbd[\\x9f-\\xbd]|\\xe1\\xbe[\\x80-\\xb4]|\\xe1(\\xbe[\\xb6-\\xbf]|\\xbf[\\x80-\\x84])|\\xe1\\xbf[\\x86-\\x93]|\\xe1\\xbf[\\x96-\\x9b]|\\xe1\\xbf[\\x9d-\\xaf]|\\xe1\\xbf[\\xb2-\\xb4]|\\xe1\\xbf[\\xb6-\\xbe]|\\xe2\\x84\\xa6|\\xea\\xad\\xa5|\\xf0\\x90(\\x85[\\x80-\\xbf]|\\x86[\\x80-\\x8c])|\\xf0\\x90\\x86\\xa0|\\xf0\\x9d(\\x88[\\x80-\\xbf]|\\x89[\\x80-\\x85]))+)". You can put this to work with any regex matcher that takes UTF-8 input to search for Greek words.

You can also convert expressive regex forms to a regex that a particular regex engine library accepts by using the converter directly in your code. For example, to match Unicode with the efficient RE/flex regex library:

    #include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern

    // a pattern with Unicode, allocated static so the internal FSM is constructed only once:
    static const reflex::Pattern pattern(reflex::Matcher::convert("[\\p{Greek}\\p{Zs}\\pP]+"));

    // use a Matcher to check if sentence is in Greek:
    if (reflex::Matcher(pattern, sentence).matches())
      std::cout << "This is Greek" << std::endl;

### Visualizing FSMs

To visualize the FSMs defined by the lex specification rules, use reflex option `--graphs-file`. This generates Graphviz files that can be converted to PDF or to other graphics formats with the [Graphviz dot](http://www.graphviz.org) tool. This option does not work when Boost.Regex is selected as the matcher engine.

For example, the JSON example rules consist of the INITIAL and STRING start condition states. Each start condition state has an FSM, shown below.

![The INITIAL start condition state FSM for the JSON example](images/reflex-json-initial.png)

![The STRING start condition state FSM for the JSON example](images/reflex-json-string.png)

### Changing input sources

Lexers can read from `FILE*`, `istream`, 8-bit strings, and wide strings. Pass the input as the first argument to the Lexer class constructor, for example:

    Lexer lexer("How now brown cow");

or with the `--flex` option:

    yyFlexLexer lexer("How now brown cow");

The lexer class accepts an optional second argument `ostream` that the `echo()` (Flex `ECHO`) and `out()` (Flex `yyout`) methods use. To change the input on the fly, use the lexer class `in(i)` method, where `i` is a `FILE*`, an `istream`, an 8-bit string, or a wide string. Similarly, to change the output stream, use the lexer class `out(o)` method with an `ostream` parameter.
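Putting these together, a minimal sketch that scans a string and then switches input on the fly (assuming the generated `Lexer` class; `ss` collects anything the lexer writes via `out()` or `echo()`):

    #include <sstream>

    std::ostringstream ss;                 // collects output written via out() and echo()
    Lexer lexer("How now brown cow", ss);  // first argument: input; second argument: output stream
    lexer.lex();                           // scan the string input
    lexer.in("Another input to scan");     // change the input on the fly
    lexer.lex();                           // scan the new input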
### Debugging

Use reflex option `-d` to debug your scanner. This produces messages on stdout showing the rules and actions invoked.

### Default ECHO

It is generally a good idea to use RE/flex option `-s` to remove the default rule that echoes unmatched text. Flex produces an error message when text is unmatched, while RE/flex remains silent and skips to the next match.

Algorithms {#algos}
----------

The RE/flex implementation uses new variations of algorithms and data structures. To build RE/flex I've improved a classic algorithm to convert regular expressions directly to deterministic finite automata (DFAs) using "follow sets", described in [Compilers: Principles, Techniques, and Tools](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools) and based on the work by R.F. McNaughton and H. Yamada, "Regular Expressions and State Graphs for Automata", IEEE Transactions on Electronic Computing 9 (1960). The improved algorithm handles lazy quantifiers (non-greedy operators) for the optional operator `??` and the repetitions `+?`, `*?`, and `{n,m}?`, which wasn't possible before without constructing an intermediate non-deterministic finite automaton (NFA) (e.g. using submatch-tracking variants of Thompson's algorithm, as used by RE2 and PCRE and described in the excellent article [Regular Expression Matching Can Be Simple And Fast](https://swtch.com/~rsc/regexp/regexp1.html)).

DFAs are constructed efficiently from follow sets (which form the basis of DFA state transitions) by inserting DFA states (sets of positions) in a binary tree. DFA edges are represented by overlapping range sets `reflex::ORanges`, a data structure derived from `std::set` that efficiently merges adjacent ranges automatically when multiple sets are joined (range set union) or split (range set intersection). This turns out to be more space efficient than bit vectors to represent sets, especially for wide character ranges that can range up to U+10FFFF. The time efficiency is comparable to bit vectors. The final finite state machines use compressed edges to minimize the number of conditional jumps.

Concluding remarks {#conclusions}
------------------

The idea behind lexical analyzer generators dates back to the 70s and even earlier. Basically, a lexical analyzer generator takes a set of regular expression patterns and actions. The generated lexical analyzer scans the input from beginning to end while matching the patterns and executing actions. It's amazing that this technology has stayed very relevant to this day. Perhaps the Lex manual is right in stating that "The asteroid to kill this dinosaur is still in orbit."

Lexical analyzer generators are typically used together with the Berkeley Yacc parser generator or with GNU Bison (a version of Yacc). A Yacc or Bison parser invokes the scanner as a "tokenizer" to retrieve "tokens" to parse. These tokens are produced by the scanner's actions when the input is scanned.

After teaching Compiler Construction courses for over 17 years, I have often looked at alternatives for Flex. But most alternatives introduce a new specification syntax and aren't fully compatible with Bison (bison-bridge and bison-locations). I wasn't ready to swap out the course textbooks (or write my own), so I needed a generator that accepts Flex specifications. I also wanted a generator that offers features that remove the limitations of Flex and its alternatives.
So I decided to write my own lexical analyzer generator that is compatible with Flex. Getting the highest performance possible wasn't the primary goal. Machines are damn fast these days anyway, and tokenizers take only a fraction of a compiler's execution time. A higher priority is getting more modern capabilities designed and implemented with reasonably good performance.

What about using Boost.Regex as a matcher engine in scanners and tokenizers? Many have contemplated that option. RE/flex permits this and simplifies the process of producing Boost.Regex patterns to be used for scanning. However, Boost.Regex should be used in POSIX mode and generally runs slower than the RE/flex regex library.

<p align="right"><i>Copyright (c) 2017, Robert van Engelen, Genivia Inc. All rights reserved.</i></p>