reflex.cpp File Reference

updated Tue Oct 1 2024 by Robert van Engelen
 

RE/flex scanner generator replacement for Flex/Lex.

#include "reflex.h"

Macros

#define WITH_BOOST_PARTIAL_MATCH_BUG
 Work around the Boost.Regex partial_match bug by forcing the generated scanner to buffer all input.
 

Functions

int fopen_s (FILE **file, const char *name, const char *mode)
 Safer fopen_s()
 
char char_tolower (char c)
 Convert to lower case.
 
static std::string file_ext (std::string &name, const char *ext)
 Add file extension if not present; modifies the string argument and returns a copy.
 
int main (int argc, char **argv)
 Main program: instantiates the Reflex class and runs Reflex::main(argc, argv).
 

Variables

static const char * options_table []
 Table with command-line reflex options and lex specification %options.
 
static const Reflex::Library library_table []
 Table with regex library properties.
 

Detailed Description

RE/flex scanner generator replacement for Flex/Lex.

Author
Robert van Engelen - engelen@genivia.com

Macro Definition Documentation

#define WITH_BOOST_PARTIAL_MATCH_BUG

Work around the Boost.Regex partial_match bug by forcing the generated scanner to buffer all input.

Function Documentation

char char_tolower ( char  c)
inline

Convert to lower case.

Returns
lower case char
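
As an illustration of what a single-character lower-case conversion can look like, here is a minimal sketch. The name `char_tolower_sketch` is hypothetical, and the body is an assumption, not necessarily how reflex.cpp implements `char_tolower`. It casts through `unsigned char` because `std::tolower` is only defined for values representable as `unsigned char` or `EOF`.

```cpp
#include <cctype>

// Sketch only (assumed implementation, not taken from reflex.cpp):
// lower-case one char. The cast to unsigned char avoids undefined
// behavior for negative char values passed to std::tolower.
inline char char_tolower_sketch(char c)
{
  return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}
```

In the default "C" locale this maps 'A' to 'a' and leaves digits and lower-case letters unchanged.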

static std::string file_ext ( std::string &  name,
const char *  ext 
)
static

Add file extension if not present; modifies the string argument and returns a copy.

Returns
copy of file name string with extension ext
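
The documented behavior (append the extension only when it is missing, modify the argument in place, and return a copy) could be sketched as follows. The function name `file_ext_sketch` is hypothetical and the body is an assumption written to match the description above, not the code in reflex.cpp.

```cpp
#include <cstring>
#include <string>

// Sketch only (assumed implementation): append "." + ext to name unless
// name already ends with ".ext", then return a copy of the updated name.
static std::string file_ext_sketch(std::string& name, const char *ext)
{
  const std::size_t n = name.size();
  const std::size_t m = std::strlen(ext);
  // The name has the extension if it is longer than ext, has a '.' right
  // before the last m characters, and those m characters equal ext.
  const bool has_ext =
      n > m && name[n - m - 1] == '.' && name.compare(n - m, m, ext) == 0;
  if (!has_ext)
    name.append(".").append(ext);
  return name;
}
```

Under these assumptions, calling it on a string holding "lexer" with ext "cpp" changes the string to "lexer.cpp" and returns the same value, while a string already holding "lexer.cpp" is left unchanged.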

int fopen_s ( FILE **  file,
const char *  name,
const char *  mode 
)
inline

Safer fopen_s()
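
A common way to offer an `fopen_s`-style interface on platforms that lack it is a thin wrapper around `fopen`. The sketch below assumes that approach; the name `open_file` is hypothetical (chosen to avoid clashing with a platform-provided `fopen_s`), and the error convention (0 on success, nonzero on failure) is an assumption rather than documented behavior of reflex.cpp.

```cpp
#include <cerrno>
#include <cstdio>

// Sketch only (assumed implementation): open a file and report success
// as 0, failure as a nonzero error code, storing the stream in *file.
inline int open_file(FILE **file, const char *name, const char *mode)
{
#if defined(_MSC_VER)
  // MSVC provides fopen_s with this signature.
  return ::fopen_s(file, name, mode);
#else
  errno = 0;
  *file = std::fopen(name, mode);
  if (*file != nullptr)
    return 0;
  // fopen is not required by the C standard to set errno, so fall back
  // to a generic nonzero value when it did not.
  return errno != 0 ? errno : -1;
#endif
}
```

A caller would then test the return value instead of comparing the stream pointer to null.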

int main ( int  argc,
char **  argv 
)

Main program: instantiates the Reflex class and runs Reflex::main(argc, argv).

Variable Documentation

const Reflex::Library library_table[]
static

Table with regex library properties.

This table is extensible and new regex libraries may be added. Each regex library is described by:

  • a unique name that is used for specifying the matcher=NAME option
  • the header file to be included
  • the pattern type or class used by the matcher class
  • the matcher class
  • the regex library signature

A regex library signature is a string of the form "decls:escapes?+.", see reflex::convert.

The optional "decls:" part specifies which modifiers and other special (?...) constructs are supported:

  • non-capturing group (?:...) is supported
  • one or more of "imsx" specify which (?imsx) modifiers are supported:
      • 'i' specifies that (?i...) case-insensitive matching is supported
      • 'm' specifies that (?m...) multiline mode is supported for the ^ and $ anchors
      • 's' specifies that (?s...) dotall mode is supported
      • 'x' specifies that (?x...) freespace mode is supported
  • # specifies that (?#...) comments are supported
  • = specifies that (?=...) lookahead is supported
  • < specifies that (?<...) lookbehind is supported
  • ! specifies that negative lookahead (?!...) and negative lookbehind (?<!...) are supported
  • ^ specifies that (?^...) negative (reflex) patterns are supported

The "escapes" characters specify which standard escapes are supported:

  • a for \a (BEL U+0007)
  • b for \b (BS U+0008) in brackets [\b] only AND the \b word boundary
  • c for \cX control character specified by X modulo 32
  • d for \d ASCII digit [0-9]
  • e for \e ESC U+001B
  • f for \f FF U+000C
  • h for \h ASCII blank [ \t] (SP U+0020 or TAB U+0009)
  • i for \i reflex indent anchor
  • j for \j reflex dedent anchor
  • k for \k reflex undent anchor
  • l for \l ASCII lower case letter [a-z]
  • n for \n LF U+000A
  • p for \p{C} Unicode character classes, also implies Unicode \x{X}, \l, \u, \d, \s, \w
  • r for \r CR U+000D
  • s for \s space (SP, TAB, LF, VT, FF, or CR)
  • t for \t TAB U+0009
  • u for \u ASCII upper case letter [A-Z] (when not followed by {XXXX})
  • v for \v VT U+000B
  • w for \w ASCII word-like character [0-9A-Z_a-z]
  • x for \xXX 8-bit character encoding in hexadecimal
  • y for \y word boundary
  • z for \z end of input anchor
  • ` for \` begin of input anchor
  • ' for \' end of input anchor
  • < for \< left word boundary
  • > for \> right word boundary
  • A for \A begin of input anchor
  • B for \B non-word boundary
  • D for \D ASCII non-digit [^0-9]
  • H for \H ASCII non-blank [^ \t]
  • L for \L ASCII non-lower case letter [^a-z]
  • N for \N not a newline
  • P for \P{C} Unicode inverse character classes, see 'p'
  • Q for \Q...\E quotations
  • R for \R Unicode line break
  • S for \S ASCII non-space (no SP, TAB, LF, VT, FF, or CR)
  • U for \U ASCII non-upper case letter [^A-Z]
  • W for \W ASCII non-word-like character [^0-9A-Z_a-z]
  • X for \X any Unicode character
  • Z for \Z end of input anchor, before the final line break
  • 0 for \0nnn 8-bit character encoding in octal (requires a leading 0)
  • '1' to '9' for backreferences (not applicable to lexer specifications)
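
The \cX entry in the list above defines the control character as X modulo 32. A minimal sketch of that arithmetic follows; the helper name `control_char` is hypothetical and is not part of reflex.

```cpp
// Sketch only: map the X in \cX to a control code by taking the
// character code of X modulo 32, as described in the escapes list.
inline char control_char(char x)
{
  return static_cast<char>(static_cast<unsigned char>(x) % 32);
}
```

With ASCII codes, 'A' (65) and 'a' (97) both map to 1, 'J' (74) maps to 10 (LF), and '[' (91) maps to 27 (ESC).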

Note that 'p' is a special case to support Unicode-based matchers that natively support UTF-8 patterns and Unicode classes \p{C}, \P{C}, \w, \W, \d, \D, \l, \L, \u, \U, \N, and \x{X}. In effect, 'p' prevents conversion of Unicode patterns to UTF-8. This special case does not support {NAME} expansions in bracket lists such as [a-z||{upper}] and {lower}{+}{upper} used in lexer specifications.

The optional "?+" characters specify lazy and possessive quantifier support:

  • ? lazy quantifiers for repeats are supported
  • + possessive quantifiers for repeats are supported

The optional "." (dot) specifies that dot matches any character except newline. A dot is implied by the presence of the 's' modifier, and can be omitted in that case.
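
To make the "decls:escapes?+." layout concrete, the following sketch splits a signature string into its parts. The struct `SignatureSketch` and the function `parse_signature` are hypothetical names introduced for illustration; reflex itself may inspect the signature differently (see reflex::convert). As a simplification, the sketch treats any '?', '+', or '.' after the colon as a flag rather than requiring them at the end of the string.

```cpp
#include <string>

// Hypothetical holder for the parts of a "decls:escapes?+." signature.
struct SignatureSketch {
  std::string decls;        // modifiers and (?...) constructs before ':'
  std::string escapes;      // supported escape characters after ':'
  bool lazy = false;        // '?' present: lazy quantifiers supported
  bool possessive = false;  // '+' present: possessive quantifiers supported
  bool dot = false;         // '.' present, or implied by 's' in decls
};

// Sketch only: split a signature into decls, escapes, and flags.
inline SignatureSketch parse_signature(const std::string& sig)
{
  SignatureSketch s;
  std::size_t start = 0;
  const std::size_t colon = sig.find(':');
  if (colon != std::string::npos) {
    // The optional "decls:" part is everything before the first colon.
    s.decls = sig.substr(0, colon);
    start = colon + 1;
  }
  for (std::size_t i = start; i < sig.size(); ++i) {
    switch (sig[i]) {
      case '?': s.lazy = true; break;
      case '+': s.possessive = true; break;
      case '.': s.dot = true; break;
      default:  s.escapes += sig[i]; break;
    }
  }
  // A dot is implied by the presence of the 's' modifier.
  if (s.decls.find('s') != std::string::npos)
    s.dot = true;
  return s;
}
```

For the made-up signature "imsx#=:abdnw?+", this yields decls "imsx#=", escapes "abdnw", lazy and possessive support, and an implied dot because 's' is among the modifiers.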

const char* options_table[]
static

Table with command-line reflex options and lex specification %options.

The table consists of option names with hyphens replaced by underscores.
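
Because the table stores names with underscores, an option written with hyphens has to be normalized before it can be compared against a table entry. A minimal sketch of that normalization, assuming a simple character replacement, is shown here; the helper name `option_key` is hypothetical and the option name in the note below is used only as an illustration.

```cpp
#include <algorithm>
#include <string>

// Sketch only: convert a hyphenated option name to the underscore form
// used for table entries.
inline std::string option_key(std::string name)
{
  std::replace(name.begin(), name.end(), '-', '_');
  return name;
}
```

For example, `option_key("case-insensitive")` returns "case_insensitive", and a name without hyphens is returned unchanged.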