Single String Knuth-Morris-Pratt Algorithm Problems Z Algorithm Explanation Implementation Palindromes Manacher Palindromic Tree Multiple Strings Tries Aho-Corasick Problems

Rare

String Searching

Authors: Benjamin Qi, Siyong Huang, Dustin Miao, Mihnea Brebenel

Knuth-Morris-Pratt and Z Algorithms (and a few more related topics).

Prerequisites

Silver - Graph Traversal

Resources
	CPC	11 - Strings	String Matching, KMP, Tries
	CP2	6.4 - String Matching

Single String

A Note on Notation:

For a string $S$ :

$|S|$ denotes the size of string $S$
$S[i]$ denotes the character at index $i$ starting from $0$
$S[l:r]$ denotes the substring beginning at index $l$ and ending at index $r$ (inclusive)
$S[:r]$ is equivalent to $S[0:r]$ and represents the prefix ending at $r$
$S[l:]$ is equivalent to $S[l:|S| - 1]$ and represents the suffix beginning at $l$
$S + T$ denotes concatenating $T$ to the end of $S$. Note that this implies that addition is non-commutative.
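Because the notation above uses inclusive right endpoints while many languages use half-open slices, it can help to see the correspondence written out. The helper names below (`substring`, `prefix`, `suffix`) are purely illustrative and are not used elsewhere in this module.

```python
def substring(s: str, l: int, r: int) -> str:
	"""S[l:r] in the notation above: both endpoints are included."""
	return s[l : r + 1]


def prefix(s: str, r: int) -> str:
	"""S[:r], the prefix ending at index r (inclusive)."""
	return s[: r + 1]


def suffix(s: str, l: int) -> str:
	"""S[l:], the suffix beginning at index l."""
	return s[l:]


s = "abcabcd"
print(substring(s, 1, 3))  # the characters at indices 1, 2 and 3
print(prefix(s, 2), suffix(s, 4))
```

Note the `r + 1` in the slices: forgetting it is a common source of off-by-one errors when translating the formulas in this module into code.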

Knuth-Morris-Pratt Algorithm

Resources
	cp-algo	Prefix Function
	PAPS	14.2 - String Matching
	GFG	KMP Algorithm
	TC	String Searching

Define an array $\pi_S$ of size $|S|$ such that $\pi_S[i]$ is equal to the length of the longest proper suffix of the prefix ending at position $i$ that coincides with a prefix of the entire string. Formally,

\pi_S[i] = \max \{k \: | \: 0 \leq k \leq i \text{ and } S[0:k - 1] \equiv S[i - (k - 1): i] \}

Here $k = 0$ corresponds to the empty string, which always matches.

In other words, for a given index $i$, we would like to compute the length of the longest substring that ends at $i$ and also happens to be a prefix of the entire string. One string that satisfies this criterion is the whole prefix ending at $i$; we disregard this solution because it matches trivially and carries no information.

For instance, for $S = \text{``abcabcd"}$, $\pi_S = [0, 0, 0, 1, 2, 3, 0]$, and the prefix function of $S = \text{``aabaaab"}$ is $\pi_S = [0, 1, 0, 1, 2, 2, 3]$. In the second example, $\pi_S[4] = 2$ because the prefix of length $2$ ($\text{``aa"}$) is equal to the substring of length $2$ that ends at index $4$. In the same way, $\pi_S[6] = 3$ because the prefix of length $3$ ($\text{``aab"}$) is equal to the substring of length $3$ that ends at index $6$. In both cases, there is no longer substring that satisfies these criteria.
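The definition can be checked directly with a brute-force implementation. The sketch below (the name `naive_pi` is illustrative) tries every candidate length from longest to shortest, so it runs in cubic time and is only meant for verifying small examples such as the two above.

```python
from typing import List


def naive_pi(s: str) -> List[int]:
	"""Prefix function computed straight from the definition (slow)."""
	n = len(s)
	res = [0] * n
	for i in range(n):
		# Try proper lengths k = i, i - 1, ..., 1 and keep the first match
		for k in range(i, 0, -1):
			if s[:k] == s[i - k + 1 : i + 1]:
				res[i] = k
				break
	return res


print(naive_pi("abcabcd"))  # [0, 0, 0, 1, 2, 3, 0]
print(naive_pi("aabaaab"))  # [0, 1, 0, 1, 2, 2, 3]
```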

The purpose of the KMP algorithm is to efficiently compute the $\pi_S$ array in linear time. Suppose we have already computed the $\pi_S$ array for indices $0\dots i$ , and need to compute the value for index $i + 1$ .

First, note that $\pi_S[i + 1] \le \pi_S[i] + 1$: moving from index $i$ to $i + 1$ can increase the value by at most one, because removing the last character of a match ending at $i + 1$ leaves a match ending at $i$. Equality holds when $S[\pi_S[i]] = S[i + 1]$.

For example, suppose $\pi_S[i] = 5$, meaning that the suffix of length $5$ ending at $i$ is equal to the prefix of length $5$ of the entire string. If the character at position $5$ of the string is equal to the character at position $i + 1$, then the match is simply extended by a single character. Thus, $\pi_S[i + 1] = \pi_S[i] + 1 = 6$.

In the general case, however, we may have $S[\pi_S[i]] \neq S[i + 1]$. Then we need to find the largest length $j < \pi_S[i]$ such that the prefix property holds (i.e. $S[:j - 1] \equiv S[i - j + 1:i]$). For such a length $j$, we repeat the procedure from the first case by comparing the characters at indices $j$ and $i + 1$: if the two are equal, we can conclude our search and assign $\pi_S[i + 1] = j + 1$; otherwise, we find the next smaller such $j$ and repeat. Indeed, the first case is simply the one where $j$ begins as $\pi_S[i]$.

In the second example above, we let $j = 2$ .

The only thing that remains is to efficiently find all the $j$ that we might possibly need. To recap, if the length we're currently at is $j$, then to handle transitions we need to find the largest length $k < j$ that satisfies the prefix property $S[:k - 1] \equiv S[j - k : j - 1]$. Since $j \le i$, this value is simply $\pi_S[j - 1]$, a value that has already been computed. All that remains is to handle the case where $j = 0$. If $S[0] = S[i + 1]$, then $\pi_S[i + 1] = 1$; otherwise $\pi_S[i + 1] = 0$.

C++

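#include <string>
#include <vector>
using namespace std;

// Returns the prefix function of s
vector<int> pi(const string &s) {
	int n = (int)s.size();
	vector<int> pi_s(n);
	// j is the length of the current match, carried over from the previous index
	for (int i = 1, j = 0; i < n; i++) {
		// fall back to shorter matches until s[i] can extend one
		while (j > 0 && s[j] != s[i]) { j = pi_s[j - 1]; }
		if (s[i] == s[j]) { j++; }
		pi_s[i] = j;
	}
	return pi_s;
}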

Python

from typing import List


def pi(s: str) -> List[int]:
	n = len(s)
	pi_s = [0] * n
	j = 0
	for i in range(1, n):
		while j > 0 and s[j] != s[i]:
			j = pi_s[j - 1]
		if s[i] == s[j]:
			j += 1
		pi_s[i] = j
	return pi_s

Claim: The KMP algorithm runs in $\mathcal{O}(n)$ for computing the $\pi_S$ array on a string $S$ of length $n$ .

Proof: Note that $j$ carries over between iterations rather than being recomputed: at the start of iteration $i$ we have $j = \pi_S[i - 1]$, because the previous iteration ended by assigning $\pi_S[i - 1] = j$. Furthermore, $j$ is always non-negative, and in each iteration of $i$ it is increased by at most $1$ (in the if statement). Each pass through the while loop strictly decreases $j$, since $\pi_S[j - 1] < j$. Because $j$ starts at $0$, never becomes negative, and increases at most $n$ times in total, it can decrease at most $n$ times over all iterations of $i$. Since the inner loop is completely governed by $j$, the overall complexity amortizes to $\mathcal{O}(n)$. $\blacksquare$
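One common use of the prefix function is locating every occurrence of a pattern in a text. The following is a minimal sketch, assuming a separator character that appears in neither string; the names `prefix_function` and `find_occurrences` and the `sep` parameter are illustrative choices, not part of the original module.

```python
from typing import List


def prefix_function(s: str) -> List[int]:
	"""Same algorithm as the pi() function above."""
	pi_s = [0] * len(s)
	j = 0
	for i in range(1, len(s)):
		while j > 0 and s[j] != s[i]:
			j = pi_s[j - 1]
		if s[i] == s[j]:
			j += 1
		pi_s[i] = j
	return pi_s


def find_occurrences(text: str, pattern: str, sep: str = "\x00") -> List[int]:
	"""Starting indices of every occurrence of pattern in text.

	Assumption: sep occurs in neither text nor pattern, so no match
	can ever be longer than the pattern itself.
	"""
	m = len(pattern)
	if m == 0:
		return []
	pi_s = prefix_function(pattern + sep + text)
	# A value of m at combined index i means the pattern ends there;
	# that occurrence starts at text index i - 2 * m.
	return [i - 2 * m for i in range(m + 1, len(pi_s)) if pi_s[i] == m]


print(find_occurrences("saippuakauppias", "pp"))  # [3, 10]
```

Since the combined string has length $|P| + |T| + 1$, the whole search stays linear in the total input size.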

Problems

Source	Problem Name	Difficulty	Tags
CSES	String Matching	Very Easy	KMP, Z
POI	2006 - Periods of Words	Easy	KMP, Strings
Baltic OI	2019 - Necklace	Normal	KMP, Strings
Old Gold	Cow Patterns	Hard	KMP, Strings
POI	2005 - Template	Hard	KMP, Strings
CEOI	2011 - Matching	Hard	KMP
POI	2012 - Prefixuffix	Very Hard	KMP
POI	2011 - Periodicity	Very Hard	KMP

Z Algorithm

Finding Periods

CSES - Normal

Focus Problem – try your best to solve this problem before continuing!

Resources
	cp-algo	Z Function
	CPH	26.4 - Z-algorithm
	CF	Z Algorithm

Explanation

The Z-algorithm is very similar to KMP, but it uses a different function than $\pi$ and has interesting applications beyond string matching.

Instead of using $\pi$, it uses the Z-function. Given a position $i$, this function gives the length of the longest string that is both a prefix of $S$ and a prefix of the suffix of $S$ starting at $i$.

Here are some examples of what this function might look like:

aabxaayaab $\rightarrow$ $Z=[10,1,0,0,2,1,0,3,1,0]$
aabxaabxcaabxaabxay $\rightarrow$ $Z=[19,1,0,0,4,1,0,0,0,8,1,0,0,5,1,0,0,1,0]$

Let's also take a closer look at $Z_9=8$ (0-indexed) for the second string. The value for this position is $8$ because that's the longest common prefix between the string itself aabxaabxcaabxaabxay and the suffix starting at position $9$ aabxaabxay (also 0-indexed).
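As with the prefix function, the definition can be checked with a direct quadratic implementation. This sketch (the name `naive_z` is illustrative) simply extends each match one character at a time and, like the implementations in this module, treats the value at position $0$ as the full length of the string.

```python
from typing import List


def naive_z(s: str) -> List[int]:
	"""Z-function computed straight from the definition (quadratic time)."""
	n = len(s)
	z = [0] * n
	for i in range(n):
		# Extend the common prefix of s and s[i:] one character at a time
		while i + z[i] < n and s[z[i]] == s[i + z[i]]:
			z[i] += 1
	return z


print(naive_z("aabxaayaab"))  # [10, 1, 0, 0, 2, 1, 0, 3, 1, 0]
print(naive_z("aabxaabxcaabxaabxay")[9])  # 8
```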

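To efficiently compute this array, we maintain the interval $[l, r]$ with the largest $r$ found so far such that $S_{l...r}$ is also a prefix, i.e. $Z_l=r-l+1$.

Say we have a position $i$ anywhere in $[l,r]$. Since $S_{l...r}$ equals the prefix of the same length, position $i$ mirrors position $i - l$, and we have these two cases:

If $i + Z_{i-l} \le r$, the mirrored match ends inside the interval, so we know that $Z_i = Z_{i-l}$.
Otherwise, $i + Z_{i-l} > r$, so we only know that $Z_i \geq r - i + 1$ and the answer can expand beyond $r$. Thus, we compare character by character from position $r + 1$ on.

If $i$ lies outside $[l, r]$, nothing is known yet and we compare character by character from the start.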

Implementation

Time Complexity: $\mathcal{O}(N)$

C++

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

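vector<int> z_function(const string &s) {
	vector<int> z(s.size());
	z[0] = s.size();
	// Remainder reconstructed (source listing is truncated here); [l, r] is the rightmost prefix match
	int n = s.size(), l = 0, r = 0;
	for (int i = 1; i < n; i++) {
		z[i] = max(0, min(z[i - l], r - i + 1));
		while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
			l = i; r = i + z[i]; z[i]++;
		}
	}
	return z;
}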

Python

from typing import List


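def z_function(s: str) -> List[int]:
	n = len(s)
	z = [0] * n
	z[0] = n
	# [l, r] is the rightmost segment found so far that matches a prefix of s
	l, r = 0, 0
	for i in range(1, n):
		z[i] = max(0, min(z[i - l], r - i + 1))
		# Remainder reconstructed (the source listing is truncated here)
		while i + z[i] < n and s[z[i]] == s[i + z[i]]:
			l = i
			r = i + z[i]
			z[i] += 1
	return z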

Source	Problem Name	Difficulty	Tags
YS	Z Algorithm	Very Easy	Z
CSES	String Matching	Very Easy	KMP, Z
CF	Vasya and Big Integers	Normal	DP, Strings
CF	Prefixes and Suffixes	Normal	Z
CF	Concatenation with Intersection	Hard

Palindromes

Manacher

Longest Palindrome

CSES - Easy

Focus Problem – try your best to solve this problem before continuing!

Resources
	HR	Manacher's Algorithm
	Medium	Manacher’s Algorithm: Longest Palindromic Substring
	cp-algo	Manacher's Algorithm

Manacher's Algorithm functions similarly to the Z-Algorithm. It determines the longest palindrome centered at each character.

Let's denote $dp_i$ as the maximum diameter of a palindrome centered at $i$. Manacher's algorithm makes use of the previously determined $dp_j$, where $j < i$, in calculating $dp_i$. The main idea is that for a palindrome centered at $i$ with the borders $left$ and $right$, the values $dp_j$ ($i < j \le right$) are usually mirrors of the values $dp_k$ ($left \le k < i$) on the left side of the palindrome. Only usually, because for some $j$ the maximum palindrome might cross the right border, in which case the mirrored value is just a lower bound and must be extended by direct comparison. This way the algorithm only expands palindromes at centers that could extend past the current right border.

Time complexity: $\mathcal{O}(N)$
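The mirroring idea can be sketched compactly in Python. This is an illustrative version, not the module's own implementation: it interleaves `#` between characters so that even-length palindromes also get a center, stores a radius instead of a diameter, and assumes `#` does not occur in the input. The names `longest_palindrome`, `radius`, `center`, and `right` are assumptions made for this example.

```python
def longest_palindrome(s: str) -> str:
	"""Longest palindromic substring of s (assumes '#' does not occur in s)."""
	# Interleave '#' so that every palindrome in t has odd length
	t = "#" + "#".join(s) + "#"
	n = len(t)
	# radius[i]: t[i - radius[i] .. i + radius[i]] is a palindrome
	radius = [0] * n
	# The known palindrome reaching furthest right is centered at `center`
	# and ends at index `right`
	center = right = 0
	for i in range(n):
		if i < right:
			# Start from the mirrored value, capped at the right border
			radius[i] = min(right - i, radius[2 * center - i])
		# Extend by direct comparison past what the mirror guarantees
		while (
			i - radius[i] - 1 >= 0
			and i + radius[i] + 1 < n
			and t[i - radius[i] - 1] == t[i + radius[i] + 1]
		):
			radius[i] += 1
		if i + radius[i] > right:
			center, right = i, i + radius[i]
	best = max(range(n), key=lambda i: radius[i])
	# A radius in t equals the palindrome's length in s
	start = (best - radius[best]) // 2
	return s[start : start + radius[best]]


print(longest_palindrome("cabbad"))  # abba
```

Each successful comparison in the while loop pushes `right` forward, and `right` never decreases, which is where the linear bound comes from.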

C++

#include <bits/stdc++.h>
using namespace std;

string menacher(string s) {
	// Preprocess the input so it can handle even length palindromes
	string arr;
	for (int i = 0; i < s.size(); i++) {
		arr.push_back('#');
		arr.push_back(s[i]);
	}

Don't Forget!

If s[l, r] is a palindrome, then s[l+1, r-1] is as well.

Source	Problem Name	Difficulty	Tags
CF	Sonya and Matrix Beauty	Normal	Strings
CF	Prefix-Suffix Palindrome	Normal	Strings
CF	Palisection	Hard	Prefix Sums, Strings

Palindromic Tree

A Palindromic Tree is a tree-like data structure that behaves similarly to KMP. Unlike KMP, in which the only empty state is $0$, the Palindromic Tree has two empty states: length $0$ and length $-1$. This is because adding a character to both ends of a palindrome increases its length by $2$, meaning a single-character palindrome must have been created from a palindrome of length $-1$.
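To make the two empty states concrete, here is a minimal sketch of a palindromic tree that counts the distinct palindromic substrings of a string. It follows the standard construction described in the resources below; the class layout and the names `PalindromicTree`, `add`, and `count_distinct_palindromes` are illustrative assumptions.

```python
class PalindromicTree:
	def __init__(self):
		# Node 0: imaginary palindrome of length -1; node 1: empty palindrome
		self.length = [-1, 0]
		self.link = [0, 0]  # suffix links; the empty palindrome links to node 0
		self.next = [{}, {}]  # next[v][c]: the palindrome c + v + c
		self.s = []
		self.last = 1  # longest palindromic suffix of the text so far

	def _find(self, v: int, i: int) -> int:
		"""Follow suffix links from v until s[i] can be added on both sides."""
		while True:
			j = i - self.length[v] - 1
			if j >= 0 and self.s[j] == self.s[i]:
				return v  # node 0 always succeeds, since then j == i
			v = self.link[v]

	def add(self, c: str) -> bool:
		"""Append c; return True if a new distinct palindrome appeared."""
		self.s.append(c)
		i = len(self.s) - 1
		cur = self._find(self.last, i)
		created = c not in self.next[cur]
		if created:
			new = len(self.length)
			self.length.append(self.length[cur] + 2)
			self.next.append({})
			if self.length[new] == 1:
				self.link.append(1)  # single characters link to the empty palindrome
			else:
				self.link.append(self.next[self._find(self.link[cur], i)][c])
			self.next[cur][c] = new
		self.last = self.next[cur][c]
		return created


def count_distinct_palindromes(s: str) -> int:
	tree = PalindromicTree()
	return sum(tree.add(c) for c in s)


print(count_distinct_palindromes("abacaba"))  # 7
```

Each appended character creates at most one new node, so a string of length $n$ has at most $n$ distinct palindromic substrings.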

Resources
	CF	adamant - Palindromic Tree
	adilet.org	Palindromic Tree

Source	Problem Name	Difficulty	Tags
APIO	2014 - Palindrome	Easy
CF	Palisection	Hard	Prefix Sums, Strings
MMCC	Momoka	Very Hard

Multiple Strings

Tries

Word Combinations

CSES - Easy

Focus Problem – try your best to solve this problem before continuing!

Resources
	CPH	26.2
	CF	Algorithm Gym
	PAPS	14.1 - Tries

A trie is a tree-like data structure that stores strings. Each node represents a string, and each edge is labeled with a character.

The root is the empty string, and every node is represented by the characters along the path from the root to that node. This means that every prefix of a string is an ancestor of that string's node.
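A minimal sketch of this structure, using one dictionary of children per node rather than a fixed-size array. The class and method names (`Trie`, `insert`, `contains`, `has_prefix`) are illustrative and not taken from the original module.

```python
class Trie:
	def __init__(self):
		self.children = [{}]  # children[v][ch] = index of the child node
		self.is_word = [False]  # is_word[v]: does an inserted word end at v?

	def insert(self, word: str) -> None:
		node = 0  # node 0 is the root, i.e. the empty string
		for ch in word:
			if ch not in self.children[node]:
				self.children[node][ch] = len(self.children)
				self.children.append({})
				self.is_word.append(False)
			node = self.children[node][ch]
		self.is_word[node] = True

	def _walk(self, s: str):
		"""Return the node reached by reading s from the root, or None."""
		node = 0
		for ch in s:
			node = self.children[node].get(ch)
			if node is None:
				return None
		return node

	def contains(self, word: str) -> bool:
		node = self._walk(word)
		return node is not None and self.is_word[node]

	def has_prefix(self, prefix: str) -> bool:
		return self._walk(prefix) is not None


trie = Trie()
for w in ["car", "card", "care", "dog"]:
	trie.insert(w)
print(trie.contains("car"), trie.contains("ca"), trie.has_prefix("ca"))
# True False True
```

Dictionaries keep memory proportional to the number of edges; a fixed array of size 26 per node is usually faster when the alphabet is known to be lowercase letters.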

C++

#include <bits/stdc++.h>
using namespace std;

const int NMAX = 5e3;
const int WMAX = 1e6;
const int MOD = 1e9 + 7;

int trie[WMAX][26];
int node_count;
bool stop[WMAX];

Source	Problem Name	Difficulty	Tags
COCI	Vlak	Very Easy	DFS, Strings, Trie
IOI	Type Printer	Very Easy	DFS, Strings, Trie
CF	Xor-MST	Normal	MST, Trie
YS	Set XOR-Min	Easy	Greedy, Trie
Gold	Find and Replace	Normal	Strings, Trie
CF	Old Berland Language	Normal	Strings, Trie
COCI	2020 - Klasika	Normal	Trie
AC	XOR Game	Normal
CF	Short Code	Normal	Small to Large, Tree, Trie
CF	Beautiful Subarrays	Normal	Bitmasks, Tree, Trie
IZhO	2012 - XOR	Hard	Greedy, Trie
JOI	2016 - Selling RNA Strands	Hard	BIT, Trie
CF	Tree and XOR	Hard	Tree, Trie

Aho-Corasick

Finding Patterns

CSES - Hard

Focus Problem – try your best to solve this problem before continuing!

Resources
	cp-algo	Aho Corasick
	CF	adamant - Aho-Corasick
	GFG	Aho-Corasick for Pattern Searching

The Aho-Corasick algorithm stores the pattern words in a trie structure, as described above. It uses the trie to transition from one state to another. As in the KMP algorithm, we want to reuse the information we have already processed.

A suffix link or failure link for a node $u$ is a special edge that points to the longest proper suffix of the string corresponding to node $u$ that is also present in the trie. The suffix links for the root and all its immediate children point to the root node. For all other nodes $u$ with parent $p$ and letter $c$ on the edge $p \rightarrow u$, the suffix link can be computed by following $p$'s failure link and transitioning on letter $c$ from there, following further failure links if no such transition exists.

While processing the string $S$, the algorithm maintains the current node in the trie such that the word formed by that node is the longest suffix of $S[:i]$ that is present in the trie.

When transitioning from $i$ to $i+1$ in $S$, there are only two choices:

If $node$ has an outgoing edge with letter $S_{i+1}$, then move down that edge.
Otherwise, follow the failure link of $node$ and try to transition on letter $S_{i+1}$ from there, repeating until such an edge exists or the root is reached.

The image below depicts what the structure looks like for the words $[a, ag, c, caa, gag, gc, gca]$.

An Aho-Corasick trie with failure links as light edges.

There is a special case when some words are substrings of other words in the wordlist. Depending on the implementation, occurrences of the shorter words can then be missed. Dictionary links solve this problem: they act like suffix links, but point to the nearest suffix that is also a word in the wordlist. The code below constructs the structure using a BFS.

Time Complexity: $\mathcal{O}(m\sigma)$ - where $m$ is the number of nodes in the trie (at most the total length of the pattern words) and $\sigma$ is the size of the alphabet
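The pieces described above (trie, failure links, dictionary links, BFS construction) can be put together in a short sketch. This is an illustrative version rather than the module's implementation: it stores children in dictionaries and follows failure links on demand instead of precomputing a full $\sigma$-wide transition table, and the names `build_automaton` and `find_all` are assumptions made for this example. Empty pattern words are assumed not to occur.

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(words: List[str]):
	children: List[Dict[str, int]] = [{}]
	out: List[List[int]] = [[]]  # out[v]: indices of words ending exactly at v
	for idx, w in enumerate(words):
		v = 0
		for ch in w:
			if ch not in children[v]:
				children[v][ch] = len(children)
				children.append({})
				out.append([])
			v = children[v][ch]
		out[v].append(idx)
	fail = [0] * len(children)
	dict_link = [0] * len(children)  # nearest proper suffix that is a word (0 = none)
	# Depth-1 nodes keep fail = 0; process deeper nodes in BFS order
	queue = deque(children[0].values())
	while queue:
		u = queue.popleft()
		for ch, v in children[u].items():
			f = fail[u]
			while f and ch not in children[f]:
				f = fail[f]
			fail[v] = children[f].get(ch, 0)
			dict_link[v] = fail[v] if out[fail[v]] else dict_link[fail[v]]
			queue.append(v)
	return children, fail, dict_link, out


def find_all(text: str, words: List[str]) -> List[Tuple[int, str]]:
	"""All (start index, word) pairs for occurrences of the words in text."""
	children, fail, dict_link, out = build_automaton(words)
	hits = []
	v = 0
	for i, ch in enumerate(text):
		while v and ch not in children[v]:
			v = fail[v]
		v = children[v].get(ch, 0)
		# Report the word at v (if any), then every word reachable by dictionary links
		u = v if out[v] else dict_link[v]
		while u:
			for idx in out[u]:
				hits.append((i - len(words[idx]) + 1, words[idx]))
			u = dict_link[u]
	return hits


words = ["a", "ag", "c", "caa", "gag", "gc", "gca"]
print(sorted(find_all("gcaa", words)))
```

Reporting through dictionary links means the time spent per text position is proportional to the number of matches ending there, rather than to the length of the failure-link chain.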

C++

#include <bits/stdc++.h>

using namespace std;

const int MAX_N = 6e5;
const int SIGMA = 26;

int n;
string s;
// The number of nodes in trie

Source	Problem Name	Difficulty	Tags
CF	Frequency of String	Easy	Strings
Gold	Censoring	Normal	Strings
CF	You Are Given Some Strings...	Normal	Strings

Problems

Source	Problem Name	Difficulty
CSES	Finding Borders	Normal
CSES	Minimal Rotation	Normal
CSES	Maximum Xor Subarray	Normal
