PrevNext
Rare
 0/42

String Searching

Authors: Benjamin Qi, Siyong Huang, Dustin Miao, Mihnea Brebenel

Knuth-Morris-Pratt and Z Algorithms (and a few more related topics).

Edit This Page
Resources
CPC

String Matching, KMP, Tries

CP2

Single String

A Note on Notation:

For a string SS:

  • S|S| denotes the size of string SS
  • S[i]S[i] denotes the character at index ii starting from 00
  • S[l:r]S[l:r] denotes the substring beginning at index ll and ending at index rr
  • S[:r]S[:r] is equivalent to S[0:r]S[0:r], represents the prefix ending at rr
  • S[l:]S[l:] is equivalent to S[l:S1]S[l:|S| - 1], represents the suffix beginning of ll.
  • S+TS + T denotes concactinating TT to the end of SS. Note that this implies that addition is non-commutative.

Knuth-Morris-Pratt Algorithm

Define an array πS\pi_S of size S|S| such that πS[i]\pi_S[i] is equal to the length of the longest nontrivial suffix of the prefix ending at position ii that coincides with a prefix of the entire string. Formally,

πS[i]=max{k1k<i and S[0:k1]S[i(k1):i]}\pi_S[i] = \max \{k \: | \: 1 \leq k < i \text{ and } S[0:k - 1] \equiv S[i - (k - 1): i] \}

In other words, for a given index ii, we would like to compute the length of the longest substring that ends at ii, such that this string also happens to be a prefix of the entire string. One such string that satisfies this criteria is the prefix ending at ii; we will be disregarding this solution for obvious reasons.

For instance, for S=“abcabcd"S = \text{``abcabcd"}, πS=[0,0,0,1,2,3,0]\pi_S = [0, 0, 0, 1, 2, 3, 0], and the prefix function of S=“aabaaab"S = \text{``aabaaab"} is πS=[0,1,0,1,2,2,3]\pi_S = [0, 1, 0, 1, 2, 2, 3]. In the second example, πS[4]=2\pi_S[4] = 2 because the prefix of length 22 (“ab")\text{``ab"}) is equivalent to the substring of length 22 that ends at index 44. In the same way, πS[6]=3\pi_S[6] = 3 because the prefix of length 33 (“abb"\text{``abb"}) is equal to the substring of length 33 that ends at index 66. For both of these samples, there is no longer substring that satisfies these criteria.

The purpose of the KMP algorithm is to efficiently compute the πS\pi_S array in linear time. Suppose we have already computed the πS\pi_S array for indices 0i0\dots i, and need to compute the value for index i+1i + 1.

Firstly, note that between πS[i]\pi_S[i] and πS[i+1]\pi_S[i + 1], πS[i+1]\pi_S[i + 1] can be at most one greater. This occurs when S[πS[i]]=S[i+1]S[\pi_S[i]] = S[i + 1].

KMP Example 1

In the example above, πS[i]=5\pi_S[i] = 5, meaning that the suffix of length 55 is equivalent to a prefix of length 55 of the entire string. It follows that if the character at position 55 of the string is equal to the character at position i+1i + 1, then the match is simply extended by a single character. Thus, πS[i+1]=πS[i]+1=6\pi_S[i + 1] = \pi_S[i] + 1 = 6.

In the general case, however, this is not necessarily true. That is to say, S[πS[i]]S[i+1]S[\pi_S[i]] \neq S[i + 1]. Thus, we need to find the largest index j<πS[i]j < \pi_S[i] such that the prefix property holds (ie S[:j1]S[ij+1:i]S[:j - 1] \equiv S[i - j + 1:i]). For such a length jj, we repeat the procedure in the first example by comparing characters at indicies jj and i+1i + 1: if the two are equal, then we can conclude our search and assign πS[i+1]=j+1\pi_S[i + 1] = j + 1, and otherwise, we find the next smallest jj and repeat. Indeed, notice that the first example is simply the case where jj begins as πS[i]\pi_S[i].

KMP Example 2

In the second example above, we let j=2j = 2.

The only thing that remains is to be able to efficiently find all the jj that we might possibly need. To recap, if the position we're currently at is jj, to handle transitions we need to find the largest index kk that satisfies the prefix property S[:k1]S[jk+1:j]S[:k - 1] \equiv S[j - k + 1 : j]. Since j<ij < i, this value is simply πS[j1]\pi_S[j - 1], a value that has already been computed. All that remains is to handle the case where j=0j = 0. If S[0]=S[i+1]S[0] = S[i + 1], πS[i+1]=1\pi_S[i + 1] = 1, otherwise πS[i+1]=0\pi_S[i + 1] = 0.

C++

vector<int> pi(const string &s) {
int n = (int)s.size();
vector<int> pi_s(n);
for (int i = 1, j = 0; i < n; i++) {
while (j > 0 && s[j] != s[i]) { j = pi_s[j - 1]; }
if (s[i] == s[j]) { j++; }
pi_s[i] = j;
}
return pi_s;
}

Python

from typing import List
def pi(s: str) -> List[int]:
n = len(s)
pi_s = [0] * n
j = 0
for i in range(1, n):
while j > 0 and s[j] != s[i]:
j = pi_s[j - 1]
if s[i] == s[j]:
j += 1
pi_s[i] = j
return pi_s

Claim: The KMP algorithm runs in O(n)\mathcal{O}(n) for computing the πS\pi_S array on a string SS of length nn.

Proof: Note that jj doesn't actually change through multiple iterations. This is because on iteration ii, we assign j=πS[i1]j = \pi_S[i - 1]. However, in the previous iteration, we assign πS[i1]\pi_S[i - 1] to be jj. Furthermore, note that jj is always non-negative. In each iteration of ii, jj is only increased by at most 11 in the if statement. Since jj remains non-negative and is only increased a constant amount per iteration, it follows that jj can only decrease by at most nn times through all iterations of ii. Since the inner loop is completely governed by jj, the overall complexity amortizes to O(n)\mathcal{O}(n). \blacksquare

Problems

StatusSourceProblem NameDifficultyTags
CSESVery Easy
Show TagsKMP, Z
POIEasy
Show TagsKMP, Strings
Baltic OINormal
Show TagsKMP, Strings
Old GoldHard
Show TagsKMP, Strings
POIHard
Show TagsKMP, Strings
CEOIHard
Show TagsKMP
POIVery Hard
Show TagsKMP
POIVery Hard
Show TagsKMP

Z Algorithm

Focus Problem – try your best to solve this problem before continuing!

Explanation

The Z-algorithm is very similar to KMP, but it uses a different function than π\pi and it has an interesting different application than string matching.

Instead of using π\pi, it uses the z-function. Given a position, this function gives the length of the longest string that's both the prefix of SS and of the suffix of SS starting at the given position.

Here's some examples of what this function might look like:

  • aabxaayaab \rightarrow Z=[10,1,0,0,2,1,0,3,1,0]Z=[10,1,0,0,2,1,0,3,1,0]
  • aabxaabxcaabxaabxay \rightarrow Z=[18,1,0,0,4,1,0,0,0,8,1,0,0,5,1,0,0,1,0]Z=[18,1,0,0,4,1,0,0,0,8,1,0,0,5,1,0,0,1,0]

Let's also take a closer look at Z9=8Z_9=8 (0-indexed) for the second string. The value for this position is 88 because that's the longest common prefix between the string itself aabxaabxcaabxaabxay and the suffix starting at position 99 aabxaabxay (also 0-indexed).

To efficiently compute this array, we maintain the [l,r][l, r] interval such that Sl...rS_{l...r} is also a prefix, i.e. Zl=rl+1Z_l=r-l+1.

Say we have a position ii anywhere in [l,r][l,r]. We would then have these two cases:

  1. If i+Zil<ri + Z_{i-l} < r, we know that Zi=ZilZ_i = Z_{i-l}.
  2. Otherwise, i+Zilri + Z_{i-l} \geq r, meaning that the answer can expand beyond rr. Thus, we compare character by character from there on.

Implementation

Time Complexity: O(N)\mathcal{O}(N)

C++

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<int> z_function(const string &s) {
vector<int> z(s.size());
z[0] = s.size();

Python

from typing import List
def z_function(s: str) -> List[int]:
n = len(s)
z = [0] * n
z[0] = n
l, r = 0, 0
for i in range(1, n):
z[i] = max(0, min(z[i - l], r - i + 1))
StatusSourceProblem NameDifficultyTags
YSVery Easy
Show TagsZ
CSESVery Easy
Show TagsKMP, Z
CFNormal
Show TagsDP, Strings
CFNormal
Show TagsZ
CFHard

Palindromes

Manacher

Focus Problem – try your best to solve this problem before continuing!

Manacher's Algorithm functions similarly to the Z-Algorithm. It determines the longest palindrome centered at each character.

Let's denote dpidp_i as the maximum diameter of a palindrome centered at ii. Manacher's algorithm makes use of the previously determined dpjdp_j, where j<ij < i incalculating dpidp_i. The main idea is that for a palindrome centered at ii with the borders leftleft and rightright the dpjdp_j (i<jrighti < j \le right) values are - probably - mirrors of the dpkdp_k (leftk<ileft \le k < i) values on the left side of the palindrome. Probably because for some jj the maximum palindrome might cross the right border. This way the algorithm only considers the palindrome centers that could lead to the expansion of the maximum palindrome.

Time complexity: O(N)\mathcal{O}(N)

C++

#include <bits/stdc++.h>
using namespace std;
string menacher(string s) {
// Preprocess the input so it can handle even length palindromes
string arr;
for (int i = 0; i < s.size(); i++) {
arr.push_back('#');
arr.push_back(s[i]);
}

Don't Forget!

If s[l, r] is a palindrome, then s[l+1, r-1] is as well.

StatusSourceProblem NameDifficultyTags
CFNormal
Show TagsStrings
CFNormal
Show TagsStrings
CFHard
Show TagsPrefix Sums, Strings

Palindromic Tree

A Palindromic Tree is a tree-like data structure that behaves similarly to KMP. Unlike KMP, in which the only empty state is 00, the Palindromic Tree has two empty states: length 00, and length 1-1. This is because appending a character to a palindrome increases the length by 22, meaning a single character palindrome must have been created from a palindrome of length 1-1

StatusSourceProblem NameDifficultyTags
APIOEasy
CFHard
Show TagsPrefix Sums, Strings
MMCCVery Hard

Multiple Strings

Tries

Focus Problem – try your best to solve this problem before continuing!

A trie is a tree-like data structure that stores strings. Each node is a string, and each edge is a character.

The root is the empty string, and every node is represented by the characters along the path from the root to that node. This means that every prefix of a string is an ancestor of that string's node.

C++

#include <bits/stdc++.h>
using namespace std;
const int NMAX = 5e3;
const int WMAX = 1e6;
const int MOD = 1e9 + 7;
int trie[WMAX][26];
int node_count;
bool stop[WMAX];
StatusSourceProblem NameDifficultyTags
COCIVery Easy
Show TagsDFS, Strings, Trie
IOIVery Easy
Show TagsDFS, Strings, Trie
CFNormal
Show TagsMST, Trie
YSEasy
Show TagsGreedy, Trie
GoldNormal
Show TagsStrings, Trie
CFNormal
Show TagsStrings, Trie
COCINormal
Show TagsTrie
ACNormal
CFNormal
Show TagsSmall to Large, Tree, Trie
CFNormal
Show TagsBitmasks, Tree, Trie
IZhOHard
Show TagsGreedy, Trie
JOIHard
Show TagsBIT, Trie
CFHard
Show TagsTree, Trie

Aho-Corasick

Focus Problem – try your best to solve this problem before continuing!

The Aho-Corasick algorithm stores the pattern words in a trie structure, described above. It uses the trie to transition from a state to anoother. Similar to KMP algorithm, we want to reuse the information we have already processed.

A suffix link or failure link for a node uu is a special edge that points to the longest proper suffix of the string corresponding to node uu. The suffix links for the root and all its immediate children point to the root node. For all other nodes uu with parent pp and letter cc on the edge pup \rightarrow u the suffix link can be computed by following pp's failure link and transitioning to letter cc from there.

While processing the string SS the algorithm maintains the current node in the trie such that word formed in the node is equal to the longest suffix ending in ii.

For example, when transitioning from ii to i+1i+1 in SS there only are two choices:

  1. If nodenode does have an outgoing edge with letter Si+1S_{i+1}, then move down the edge.
  2. Otherwise, follow the failure link of SiS_i and transition to letter Si+1S_{i+1} from there.

The image below depicts how the structure looks like for the words [a,ag,c,caa,gag,gc,gca][a, ag, c, caa, gag, gc, gca].

Trie An Aho-Corasick trie with failure links as light edges.

There is a special case when some words are substring of other words in the wordlist. This could lead to some problems depending on the implementation. Dictionary links can solve this problem. They act like suffix links that point to the first suffix that is also a word in the wordlist. The code below constructs the structure using a BFS.

Time Complexity: O(mσ)\mathcal{O}(m\sigma) - where mm is the size of the alphaber and σ\sigma the size of the alphabet

C++

#include <bits/stdc++.h>
using namespace std;
const int MAX_N = 6e5;
const int SIGMA = 26;
int n;
string s;
// The number of nodes in trie
StatusSourceProblem NameDifficultyTags
CFEasy
Show TagsStrings
GoldNormal
Show TagsStrings
CFNormal
Show TagsStrings

Problems

StatusSourceProblem NameDifficultyTags
CSESNormal
CSESNormal
CSESNormal

Module Progress:

Join the USACO Forum!

Stuck on a problem, or don't understand a module? Join the USACO Forum and get help from other competitive programmers!

PrevNext