
*by Dr. Robert van Engelen, Genivia Research Labs, July 16, 2019. Updated April 15, 2023 and May 28, 2023.*

```
the quick brown fox jumps over the lazy dog
^^^         ^^^                ^^^  ^   ^^^
```

We ignore shorter matches that are part of longer matches. So `do` is not a match in `dog`, since we want to match `dog`. We also ignore word boundaries in the text; for example, `own` matches part of `brown`. This makes the search harder, since we cannot simply inspect and match the words between spaces.

Existing state-of-the-art methods are fast, such as [Bitap](https://en.wikipedia.org/wiki/Bitap_algorithm) to find a single matching string in text and [Hyperscan](https://github.com/intel/hyperscan), which essentially uses Bitap "buckets" and hashing to find matches of multiple string patterns. Bitap slides a window over the searched text to predict matches based on the letters it has shifted into the window. The window length of Bitap is the minimum length among all string patterns we search for. Short Bitap windows generate many false positives; in the worst case the shortest string among all string patterns is just one letter long, which severely limits the performance of Bitap. For example, Bitap finds as many as 10 potential match locations in the example text for the string patterns we search:

```
the quick brown fox jumps over the lazy dog
^ ^ ^ ^ ^ ^ ^ ^ ^ ^
```

These potential matches marked `^` correspond to the letters with which the patterns start, i.e. `a`, `t`, `d`, `o` and `e` are starting letters. The remaining part of each string pattern is ignored and must be matched separately afterwards. Hyperscan essentially uses Bitap buckets, which means that additional optimizations can be applied by separating the string patterns into different buckets depending on their properties. The number of buckets is bounded by the SIMD architectural constraints of the machine for which Hyperscan is optimized. However, as a Bitap-based method, having a few short strings among the set of string patterns hampers the performance of Hyperscan.

We can do better than Bitap-based methods. To understand how, consider associating each letter in the text with a letter in the string patterns at the corresponding offsets, like so:

```
the quick brown fox jumps over the lazy dog
^^^         ^^^                ^^^  ^   ^^^
012         012                012  0   012
```

Here we have matching letters at offsets in the strings we search:

- at offset 0 we match one of the letters `a`, `t`, `d`, `o`, `e` of the pattern `a|an|the|do|dog|own|end`

- at offset 1 we match one of the letters `n`, `h`, `o`, `w` of the pattern `an|the|do|dog|own|end`

- at offset 2 we match one of the letters `e`, `g`, `n`, `d` of the pattern `the|dog|own|end`

The trick is to predict matches based on all available information, meaning the letters at word (string pattern) offsets 0, 1 and 2 (and higher for longer words). To do so, we define a function `predictmatch` that returns `TRUE` if a word (string pattern) possibly matches the text in a window and otherwise returns `FALSE`. We also define two functions `matchbit` and `acceptbit` that can be implemented as arrays or matrices. These functions take a character `c` and an offset `k` and return `matchbit(c, k) = 1` if `word[k] = c` for any word in the set of string patterns, and `acceptbit(c, k) = 1` if any word ends at offset `k` with `c`. With these two functions, `predictmatch` is defined as follows in pseudo code to predict string pattern matches up to 4 characters long against a sliding window of length 4:

```
func predictmatch(window[0:3])
  var c0 = window[0]
  var c1 = window[1]
  var c2 = window[2]
  var c3 = window[3]
  if acceptbit(c0, 0) then return TRUE
  if matchbit(c0, 0) then
    if acceptbit(c1, 1) then return TRUE
    if matchbit(c1, 1) then
      if acceptbit(c2, 2) then return TRUE
      if matchbit(c2, 2) then
        if matchbit(c3, 3) then return TRUE
  return FALSE
```

We will eliminate control flow and replace it with logical operations on bits. For a window of size 4, we need 8 bits (twice the window size). The 8 bits are ordered as follows, where `!` means the logical complement (the not operator):

```
bit7 = ! acceptbit(c0, 0)
bit6 = ! matchbit(c0, 0)
bit5 = ! acceptbit(c1, 1)
bit4 = ! matchbit(c1, 1)
bit3 = ! acceptbit(c2, 2)
bit2 = ! matchbit(c2, 2)
bit1 = ! matchbit(c3, 3)
bit0 = unused, assigned 1
```

The `predictmatch` computation now becomes:

```
func predictmatch(window[0:3])
  var c0 = window[0]
  var c1 = window[1]
  var c2 = window[2]
  var c3 = window[3]
  var bit7 = ! acceptbit(c0, 0)
  var bit6 = ! matchbit(c0, 0)
  var bit5 = ! acceptbit(c1, 1)
  var bit4 = ! matchbit(c1, 1)
  var bit3 = ! acceptbit(c2, 2)
  var bit2 = ! matchbit(c2, 2)
  var bit1 = ! matchbit(c3, 3)
  var bit0 = 1
  if bit7 == 0 then return TRUE
  if bit6 == 0 then
    if bit5 == 0 then return TRUE
    if bit4 == 0 then
      if bit3 == 0 then return TRUE
      if bit2 == 0 then
        if bit1 == 0 then return TRUE
  return FALSE
```

What did we gain with this change? Nothing, it seems. However, we can now simplify `predictmatch` by removing all expensive branches and replacing them with bit-wise logical operations:

```
bits = (mem[c0] & 0xc0) | (mem[c1] & 0x30) | (mem[c2] & 0x0c) | (mem[c3] & 0x03)
return ((bits >> 6 | bits >> 4 | bits >> 2 | bits >> 1) != 0x7f)
```

where the `mem[]` array represents `matchbit` and `acceptbit` compactly using 8 bits, or 2 bits per offset for 4 offsets, computed as follows. Note that this simplification should improve execution speed, because the eliminated branches are not very predictable and bit-wise logical operations on 8 bits are relatively cheap.

Parameter `k` is the window size of a PM-*k* method, e.g. `k = 4` in our example to illustrate how PM-4 works. Let |Σ| be the alphabet size, e.g. |Σ| is 256 for 8-bit characters. The objective is to assign `mem[i, j]` the value 0, 1, 2 or 3, such that even values indicate `matchbit(c, i) = 1` and values less than 2 indicate `acceptbit(c, i) = 1`:

```
for i = 0 to k-1
  for j = 0 to |Σ|-1
    mem[i, j] = 3
for each word in the set of words
  var m = min(|word|, k)
  for i = 0 to m - 2
    mem[i, word[i]] &= 2
  mem[m - 1, word[m - 1]] = 0
```

When this initialization step is completed, we obtain `mem[]` from the set of words (string patterns). To compute `predictmatch` efficiently for any window size `k`, we define:

```
func predictmatch(mem[0:k-1, 0:|Σ|-1], window[0:k-1])
  var d = 0
  for i = 0 to k - 1
    d |= mem[i, window[i]] << 2*(k-i-1)
  var t = d
  for i = 0 to k - 3
    d |= d >> 2
  d = (d >> 1) | t
  return (d != (1 << 2*k) - 1)
```

The `predictmatch` function returns `TRUE` when a potential match is found in the window.
Searching for a match of a word in the given text `data` with `predictmatch` is performed as follows:

```
func search(mem[0:k-1, 0:|Σ|-1], data)
  for i = 0 to |data| - k
    if predictmatch(mem, data[i:i+k-1]) then
      if data[i:i+k-1] matches a word in the set of words then
        return i
  return -1
```

This function returns the index of the match in the text `data`, or -1 when no match is found. Note that the last `k-1` characters of the text are not searched, because the window cannot extend beyond the data. Searching all of the data requires either padding it with `k-1` dummy characters that cannot match, or performing a partial `predictmatch` on fewer characters as we near the end of the text. Note that the next step in `search` is to verify a possible match against the set of words. This last step can be expensive when `predictmatch` produces many false positives, that is, when few words actually match but the number of predictions is much higher.

PM-*k* with hashing
-------------------

Hashing combines multiple characters into a single hash value. This effectively correlates the characters that make up a word (a string pattern). The words `own` and `end` are in our set of words. The simple `predictmatch` function finds potential matches for words that are not in the set, such as `ond` and `ewn` for example, which are false positives. Hashing improves the accuracy of the predicted match by lowering the false positive rate: the words `own` and `end` no longer produce false matches for `ond` and `ewn`. False positives are not entirely excluded, however, unless the hash is perfect, which requires 32-bit hash values to match windows of 4 characters. The resulting `mem[]` array would be too large to be practical. Therefore, it is important to find a sweet spot that reduces false positives as much as possible with a small range of hash values.
The updated pseudo-code `predictmatch` function with hashing is defined as follows:

```
func predictmatch(mem[0:k-1, 0:2^H-1], window[0:k-1])
  const salt = ... // optional, some constant between 0 and 2^H-1
  var h = salt
  var d = 0
  for i = 0 to k - 1
    h = hash(h, window[i])
    d |= mem[i, h] << 2*(k-i-1)
  var t = d
  for i = 0 to k - 3
    d |= d >> 2
  d = (d >> 1) | t
  return (d != (1 << 2*k) - 1)
```

The only difference is that a hash value `h` is introduced, computed with `h = hash(h, window[i])` over the text in the window, and `mem` is indexed by `h` instead of by a character. Correspondingly, `mem[]` is constructed and indexed by hash values instead of single characters. Hash values range from `0` to `2^H-1`. Salting the hash is optional; it may or may not improve the accuracy, depending on the hash function `hash` chosen.

We do not need to hash the first character in the window, so we can change the `i` loop above to run from 1 after initializing `h = window[0]`. This means that we can use `mem[]` to quickly find a starting character in the text data first and then call `predictmatch`:

```
func search(mem[0:k-1, 0:2^H-1], data)
  for i = 0 to |data| - k
    if mem[0, data[i]] != 3 then
      if predictmatch(mem, data[i:i+k-1]) then
        if data[i:i+k-1] matches a word in the set of words then
          return i
  return -1
```

This approach reduces computational overhead.

PM-*k* with hashing and Bitap
-----------------------------

Combining PM-*k* with Bitap can reduce the false positive rate with almost no overhead if the number of strings searched is limited. The idea is to filter possible matches with Bitap prior to `predictmatch`. This approach works well when we search a small set of string patterns and/or when the string patterns are all longer than *k* characters. If a single string among the string patterns is shorter than a few characters, then the benefit of Bitap quickly evaporates; it offers no advantage at all if one string pattern consists of just one character.
Also, if the number of strings to search is high, then Bitap produces many false positives and is not effective as a pre-filter for PM-*k*. In the implementation section I introduce a threshold value to determine if Bitap is useful, based on the minimum string pattern length and the Bitap "entropy". When I mention "entropy", think of the number and complexity of the string patterns to search: the more strings we have and the more complex they are, the higher the entropy. Bitap is effective if the entropy is sufficiently low.

The next section compares the performance of PM-4 methods with hashing and Bitap. The performance comparisons were performed with ugrep and with a [PM-4 implementation in C.](#PM-4-implementation-in-C)

Performance results
-------------------

Determining the performance of multi-string matching and search methods can have many pitfalls. It depends on a test set consisting of a representative corpus to search and representative search patterns. Therefore, I make no absolute claim that the method is always faster than any other method. We also do not want to cherry-pick a specific test set to beat other methods, just as other method implementers could do to try to beat PM-4 in very specific cases. Instead, we should randomize our experiments as much as possible to obtain a fair assessment of performance. It turns out that PM-4 significantly improves the search speed compared to Bitap and Hyperscan for representative string patterns.

To conduct performance experiments, I chose a 100,000,000 byte [English Wikipedia dump](files/enwik8.zip). This text file contains 1,128,023 words, mostly English words, some names, abbreviations and numbers.
The word length frequency of the 348,191 [distinct words](files/words.zip) in this file is depicted in the graph shown below.

The following performance test was conducted on a recent, commonly available MacBook Pro with clang 12.0.0 -O2 on a 2.9 GHz Intel Core i7 with 16 GB 2133 MHz LPDDR3, running MacOS. The best times of 30 runs are shown, measured under minimal machine load. The comparison covers ugrep with the new search method presented in this article, Hyperscan's simple grep tool, ripgrep 13.0.0 and GNU grep 3.10. The data labelled "length *N* and up" represent the elapsed search time for a random set of 1,000 string patterns of length *N* and longer, picked at random from the 348,191 words, where the probability distribution of the lengths corresponds to the word length frequencies in the graph shown above. Each set labelled "length *N* and up" contains at least one string of length *N*.

From this bar chart we observe that the new PM-*k* multi-string search method implemented in ugrep performs very well across the board. The performance gap is smaller for longer strings. This is not entirely surprising, because Hyperscan uses Bitap-based methods, which work best for a set of strings that all have a minimal length: the shortest string pattern determines the Bitap predictability of matches, which is lower for short string patterns.
The hash function used in this experiment is a PM-4 shift-and-xor hash named `<<3/<<3/<<3`, see my [PM-4 implementation in C,](#PM-4-implementation-in-C) which consists of three operations to hash inputs `a` and `b` and a prior hash value `h`:

```c
#define hash1(a,b) (uint16_t)((((uint16_t)(a) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
#define hash2(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
#define hash3(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
```

PM-4+hash significantly improves the match prediction rate of `predictmatch` compared to PM-4 and Bitap for arbitrary strings picked at random from the 348,191 distinct words, where the probability distribution of the lengths corresponds to the word length frequencies. The log-log graph below shows these results for 2 words, 4 words, 8 words, and so on up to 1024 words searched.

The results show the average taken over 100 runs of random string searches. Higher values are better. The top rate in this graph is 1, meaning perfect prediction; 0.1 is a 10% prediction rate and 0.01 is a 1% prediction rate. Observe that combining Bitap with hashed PM-4 (PM-4+hash+Bitap) gives the highest prediction rates.

A clearer picture of the efficacy of PM-4 emerges when we restrict the length of the searched words to 4 characters. In the following comparison, words of 1 to 4 characters long are randomly picked from the 348,191 distinct words. The log-log graph is shown below for 2 words, 4 words, 8 words, and so on up to 1024 words.

To evaluate the effect of the hashing method on prediction rates, I compare the performance of three hash functions. I tested many more, but picked these three as the better performing ones. The `<<3/<<3/<<3` hash function I chose is rather simple: it shifts and xors four bytes to produce 10-bit hash values to index `mem[]`.
We expect this hash function to give better runtime performance with low computational overhead, but it may produce more false positives than more advanced hash functions that are computationally more expensive. A more traditional hash function is the `*33/*33/*33` hash that multiplies the previous hash value by 33 and adds the next character:

```c
#define hash1(a,b) (uint16_t)((((uint16_t)(a) << 5) + (uint16_t)(a) + (uint16_t)(b)) & (HASH_MAX - 1))
#define hash2(h,b) (uint16_t)((((uint16_t)(h) << 5) + (uint16_t)(h) + (uint16_t)(b)) & (HASH_MAX - 1))
#define hash3(h,b) (uint16_t)((((uint16_t)(h) << 5) + (uint16_t)(h) + (uint16_t)(b)) & (HASH_MAX - 1))
```

I also compared the results to an alternative "cheap" hash function `<<6/<<3/<<0` that puts more emphasis on the first two characters:

```c
#define hash1(a,b) (uint16_t)((((uint16_t)(a) << 6) ^ (uint16_t)(b)) & (HASH_MAX - 1))
#define hash2(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
#define hash3(h,b) (uint16_t)((((uint16_t)(h) << 0) ^ (uint16_t)(b)) & (HASH_MAX - 1))
```

The match prediction rates for these three hash functions used with PM-4+hash+Bitap are shown in the log-log graph below for 2 patterns, 4 patterns, 8 patterns, and so on up to 1024 patterns randomly picked from the 348,191 distinct words. The results are not conclusive, although they indicate that `*33/*33/*33` performs better.

When the length of the searched words is restricted to 4 characters, we get a better picture of the efficacy of the three hash methods on prediction rates. These results show the average taken over 100 runs of randomly picked words up to 4 characters long. The difference is more pronounced, with the `*33/*33/*33` hash function performing better in this case. However, the `*33/*33/*33` hash function is typically more expensive to compute than the `<<3/<<3/<<3` and `<<6/<<3/<<0` hashes.
Therefore, ugrep uses the `<<3/<<3/<<3` hash, which offers good runtime performance and reasonably good match prediction in general for words of arbitrary length.

PM-*k* correctness proof
------------------------

In this section I prove the correctness of `predictmatch` with a short and simple constructive proof. The objective is to show that the logic-based version of `predictmatch` is semantically identical to the control-flow-based version. Consider the case PM-4 with a window of 4 characters for `predictmatch`:

```
bits = (mem[c0] & 0xc0) | (mem[c1] & 0x30) | (mem[c2] & 0x0c) | (mem[c3] & 0x03)
return ((bits >> 6 | bits >> 4 | bits >> 2 | bits >> 1) != 0x7f)
```

Now align the `bits` patterns as follows, where the digits denote the `bits` values indexed from 0 (low order) to 7 (high order) and `-` represents a zero bit:

```
bits >> 6 = [-,-,-,-,-,-,7,6]
bits >> 4 = [-,-,-,-,7,6,5,4]
bits >> 2 = [-,-,7,6,5,4,3,2]
bits >> 1 = [-,7,6,5,4,3,2,1]
```

Note that `predictmatch` returns `TRUE` when

```
bits >> 6 = [-,-,-,-,-,-,7,6]
bits >> 4 = [-,-,-,-,7,6,5,4]
bits >> 2 = [-,-,7,6,5,4,3,2]
bits >> 1 = [-,7,6,5,4,3,2,1]
               ^   ^   ^   ^
                \   \   \   \__ if bit6, bit4, bit2, bit1 are all zero then return TRUE
                 \   \   \_____ if bit6, bit4, bit3 are all zero then return TRUE
                  \   \________ if bit6, bit5 are all zero then return TRUE
                   \___________ if bit7 is zero then return TRUE
```

That is,

```
if bit7 is zero then return TRUE
if bit6, bit5 are all zero then return TRUE
if bit6, bit4, bit3 are all zero then return TRUE
if bit6, bit4, bit2, bit1 are all zero then return TRUE
return FALSE
```

By the initialization of `mem[]` as defined earlier, the 8 `bits` are assigned

```
bits = [bit7,bit6,bit5,bit4,bit3,bit2,bit1,bit0]
```

where

```
bit7 = ! acceptbit(c0, 0)
bit6 = ! matchbit(c0, 0)
bit5 = ! acceptbit(c1, 1)
bit4 = ! matchbit(c1, 1)
bit3 = ! acceptbit(c2, 2)
bit2 = ! matchbit(c2, 2)
bit1 = ! matchbit(c3, 3)
bit0 = unused, assigned 1
```

Combining the above, we have

```
if acceptbit(c0, 0) then return TRUE
if matchbit(c0, 0) and acceptbit(c1, 1) then return TRUE
if matchbit(c0, 0) and matchbit(c1, 1) and acceptbit(c2, 2) then return TRUE
if matchbit(c0, 0) and matchbit(c1, 1) and matchbit(c2, 2) and matchbit(c3, 3) then return TRUE
return FALSE
```

which is semantically identical to

```
if acceptbit(c0, 0) then return TRUE
if matchbit(c0, 0) then
  if acceptbit(c1, 1) then return TRUE
  if matchbit(c1, 1) then
    if acceptbit(c2, 2) then return TRUE
    if matchbit(c2, 2) then
      if matchbit(c3, 3) then return TRUE
return FALSE
```

This completes the simple constructive proof for the case `k = 4`. It is not difficult to see that the proof generalizes to any positive integer `k`.

PM-4 implementation in C
------------------------

We start by slightly optimizing the pseudo code

```
return ((bits >> 6 | bits >> 4 | bits >> 2 | bits >> 1) != 0x7f)
```

to the factored form

```
return (((((((bits >> 2) | bits) >> 2) | bits) >> 1) | bits) != 0xff)
```

where it is assumed that `bit0` in `bits` is assigned 1, see my earlier comment in this article.
An implementation of `predictmatch` in C with a very simple, computationally efficient `<<3/<<3/<<3` hash function that produces reasonably high match prediction rates as discussed:

```c
#define HASH_MAX 4096

#define hash1(a,b) (uint16_t)((((uint16_t)(a) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
#define hash2(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))
#define hash3(h,b) (uint16_t)((((uint16_t)(h) << 3) ^ (uint16_t)(b)) & (HASH_MAX - 1))

static inline int predictmatch(uint8_t mem[], const char *window)
{
  uint8_t  b0 = window[0];
  uint8_t  b1 = window[1];
  uint8_t  b2 = window[2];
  uint8_t  b3 = window[3];
  uint16_t h1 = hash1(b0, b1);
  uint16_t h2 = hash2(h1, b2);
  uint16_t h3 = hash3(h2, b3);
  uint8_t  a0 = mem[b0];
  uint8_t  a1 = mem[h1];
  uint8_t  a2 = mem[h2];
  uint8_t  a3 = mem[h3];
  uint8_t  b  = (a0 & 0xc0) | (a1 & 0x30) | (a2 & 0x0c) | (a3 & 0x03);
  uint8_t  m  = ((((((b >> 2) | b) >> 2) | b) >> 1) | b);
  return m != 0xff;
}
```

The initialization of `mem[]` with a set of `n` string patterns is done as follows:

```c
void init(int n, const char **patterns, uint8_t mem[])
{
  int i;
  memset(mem, 0xff, sizeof(uint8_t)*HASH_MAX);
  for (i = 0; i < n; ++i)
  {
    uint8_t  b0 = (uint8_t)patterns[i][0];
    uint8_t  b1 = b0 ? (uint8_t)patterns[i][1] : 0;
    uint8_t  b2 = b1 ? (uint8_t)patterns[i][2] : 0;
    uint8_t  b3 = b2 ? (uint8_t)patterns[i][3] : 0;
    uint16_t h1 = hash1(b0, b1);
    uint16_t h2 = hash2(h1, b2);
    uint16_t h3 = hash3(h2, b3);
    mem[b0] &= 0x3f | ((b1 != 0) << 7);
    mem[h1] &= 0xcf | ((b2 != 0) << 5);
    mem[h2] &= 0xf3 | ((b3 != 0) << 3);
    mem[h3] &= 0xfc;
  }
  mem[0] &= 0x3f;
}
```

The `mem[0] &= 0x3f` statement flags `\0` as a predicted match to detect a terminating `\0` in the text data to search.
A program to search a file for matching string patterns reads the specified file into memory and uses the remaining command-line arguments as the string patterns to search:

```c
int main(int argc, const char **argv)
{
  uint8_t mem[HASH_MAX];
  size_t hits = 0, total = 0;
  const char *filename = argv[1];
  char *buffer, *ptr, *end;
  struct stat st;
  FILE *fp = fopen(filename, "r");
  if (fp == NULL || fstat(fileno(fp), &st) != 0)
  {
    perror("cannot stat file");
    exit(1);
  }
  buffer = (char*)malloc(st.st_size + 4); // +4 to add \0 and padding
  if (fread(buffer, st.st_size, 1, fp) != 1)
  {
    perror("cannot read file");
    exit(1);
  }
  fclose(fp);
  buffer[st.st_size] = '\0'; // make \0-terminated
  init(argc - 2, &argv[2], mem);
  ptr = buffer;
  end = buffer + st.st_size;
  while (1)
  {
    while ((mem[(uint8_t)*ptr] & 0xc0) == 0xc0)
      ptr++;
    if (ptr >= end)
      break;
    if (predictmatch(mem, ptr))
    {
      size_t len = match(argc - 2, &argv[2], ptr);
      if (len > 0)
      {
        printf("match at %zu\n", ptr - buffer);
        ++hits;
        ptr += len - 1;
      }
      ++total;
    }
    ++ptr;
  }
  printf("hits=%zu rate=%.2f%%\n", hits, (float)hits/total*100);
}
```

A simple and inefficient `match` function can be defined as

```c
size_t match(int n, const char **patterns, const char *ptr)
{
  size_t len;
  int i;
  for (i = 0; i < n; ++i)
    if (*ptr == *patterns[i] && memcmp(patterns[i], ptr, (len = strlen(patterns[i]))) == 0)
      return len;
  return 0;
}
```

This `match` version is not efficient for large sets of string patterns. It is best to use a tree, DFA or regex to match multiple strings.

PM-4 with Bitap implementation in C
-----------------------------------

Adding Bitap to PM-4 in C is almost trivial.
The initialization is updated as follows to include `bitap` array initialization for the string patterns to search:

```c
int init(int n, const char **patterns, uint8_t mem[], uint16_t bitap[], int *entropy)
{
  int i, j, min = 16;
  memset(mem, 0xff, sizeof(uint8_t)*HASH_MAX);
  memset(bitap, 0xff, sizeof(uint16_t)*256);
  *entropy = 0;
  for (i = 0; i < n; ++i)
  {
    uint8_t  b0 = (uint8_t)patterns[i][0];
    uint8_t  b1 = b0 ? (uint8_t)patterns[i][1] : 0;
    uint8_t  b2 = b1 ? (uint8_t)patterns[i][2] : 0;
    uint8_t  b3 = b2 ? (uint8_t)patterns[i][3] : 0;
    uint16_t h1 = hash1(b0, b1);
    uint16_t h2 = hash2(h1, b2);
    uint16_t h3 = hash3(h2, b3);
    mem[b0] &= 0x3f | ((b1 != 0) << 7);
    mem[h1] &= 0xcf | ((b2 != 0) << 5);
    mem[h2] &= 0xf3 | ((b3 != 0) << 3);
    mem[h3] &= 0xfc;
    for (j = 0; j < min; ++j)
    {
      uint8_t b = patterns[i][j];
      if (b == 0)
        break;
      bitap[b] &= ~(1 << j);
    }
    min = j;
  }
  mem[0] &= 0x3f;
  for (i = 0; i < 256; ++i)
  {
    bitap[i] |= ~((1 << min) - 1);
    for (j = 0; j < min; ++j)
      *entropy += ((bitap[i] & (1 << j)) == 0);
  }
  *entropy /= min;
  return min;
}
```

The function returns the minimum length of the string patterns for Bitap. The `min` variable is limited to 16, the number of bits in the Bitap bit vector. The `entropy` parameter measures the complexity of the string patterns, capped to `min` length. A small entropy makes Bitap more effective, since the expected false positive rate will be low. If the entropy is too high, Bitap should not be used. An entropy threshold of 15 or less appears optimal for ugrep based on performance benchmarks. This threshold roughly corresponds to the point of convergence of the PM-4+hash prediction rate with the PM-4+hash+Bitap prediction rate in the log-log prediction rate graph shown earlier: for sets of 16 string patterns and greater there is negligible difference in the prediction rate.
The search program is updated as follows to include a Bitap filter in the loop when the entropy is sufficiently low:

```c
int main(int argc, const char **argv)
{
  uint8_t mem[HASH_MAX];
  uint16_t bitap[256];
  uint16_t bits, mask;
  int min, entropy;
  size_t hits = 0, total = 0, bitaps = 0;
  const char *filename = argv[1];
  char *buffer, *ptr, *end;
  struct stat st;
  FILE *fp = fopen(filename, "r");
  if (fp == NULL || fstat(fileno(fp), &st) != 0)
  {
    perror("cannot stat file");
    exit(1);
  }
  buffer = (char*)malloc(st.st_size + 4); // add \0 and padding
  if (fread(buffer, st.st_size, 1, fp) != 1)
  {
    perror("cannot read file");
    exit(1);
  }
  fclose(fp);
  buffer[st.st_size] = '\0';
  min = init(argc - 2, &argv[2], mem, bitap, &entropy);
  bits = ~0;
  mask = 1 << (min - 1);
  ptr = buffer;
  end = buffer + st.st_size - min + 1;
  if (entropy < 14)
  {
    while (1)
    {
      while (ptr < end && ((bits = (bits << 1) | bitap[(uint8_t)*ptr]) & mask) != 0)
        ptr++;
      if (ptr >= end)
        break;
      ++bitaps;
      if (predictmatch(mem, ptr - min + 1))
      {
        size_t len = match(argc - 2, &argv[2], ptr - min + 1);
        if (len > 0)
        {
          printf("match at %zu\n", ptr - min + 1 - buffer);
          ++hits;
          ptr += len - min;
          bits = ~0;
        }
        ++total;
      }
      ++ptr;
    }
  }
  else
  {
    while (1)
    {
      while ((mem[(uint8_t)*ptr] & 0xc0) == 0xc0)
        ptr++;
      if (ptr >= end)
        break;
      if (predictmatch(mem, ptr))
      {
        size_t len = match(argc - 2, &argv[2], ptr);
        if (len > 0)
        {
          printf("match at %zu\n", ptr - buffer);
          ++hits;
          ptr += len - 1;
        }
        ++total;
      }
      ++ptr;
    }
  }
  printf("hits=%zu rate=%.2f%% (bitap rate=%.2f%%)\n",
         hits, (float)hits/total*100, (float)hits/bitaps*100);
}
```

This combination benefits from `predictmatch` to predict matches fairly accurately for short string patterns and from Bitap to improve the prediction for long string patterns.

PM-*k* hardware implementation
------------------------------

An implementation in hardware with logic gates greatly accelerates the performance of PM-*k*.
The diagram below roughly outlines a possible implementation in hardware of PM-4 with hashing.

Assuming the window is placed over the string `ABXK` in the input, the matcher predicts a possible match by hashing the input characters (1) from left to right as clocked by (4). The memorized hashed patterns are stored in four memories `mem` (5), each with a fixed number of addressable entries `A`, addressed by the hash outputs `H`. The `mem` outputs `acceptbit` as `D1` and `matchbit` as `D0`, which are gated through a set of OR gates (6). The outputs are combined by the NAND gate (7) to output a match prediction (3).

Prior to matching, all string patterns are "learned" by the memories `mem` by hashing the string presented on the input, for example the string pattern `AB`. The control (2) lines are activated as shown, where the third line (`d` for `A` and `0` for `B`) represents `acceptbit` for `D1`, and `D2` for `matchbit` is driven low (6) to write to `mem` (5).

Summary
-------

This article introduced the PM-*k* multi-string approximate search method. Another way to look at PM-*k* is to consider it a class of methods that can be combined with existing multi-string search methods; one such example is PM-4 combined with hashing and Bitap. Multi-string predictive matching with PM-4, hashing and Bitap boosts the performance of multi-string and regex pattern search in ugrep. The implementation in ugrep demonstrates that the search performance beats other state-of-the-art search tools and methods. The generalization to regex patterns, as implemented by ugrep, becomes obvious when considering the fact that we can generate all strings up to *k* characters long from the regex pattern provided. This is feasible when *k* is not too large. Therefore, ugrep uses PM-4 with hashing and Bitap to predict matches to optimize the performance of the DFA-based POSIX regex matcher.
Downloads
---------

- [ugrep file search utility](get-ugrep.html)
- [English Wikipedia dump](files/enwik8.zip) and [all unique words extracted](files/words.zip) and sorted
- [C source code of PM-4 with hashing and Bitap](files/pm4-bitap.c) BSD-3 licensed
- [C++ source code to randomly pick words from a file](files/pick.cpp) BSD-3 licensed

*Copyright (c) 2023, Robert van Engelen, Genivia Inc. All rights reserved.*