Skip to content

Longest Common Substring & Subsequence Finder

Longest common substring (17 chars)

ps over the lazy

The quick brown fox jumps over the lazy dog
The slow brown fox leaps over the lazy cat

Estimates for educational purposes — not financial, medical, or legal advice. See terms.

Find the longest shared text between two inputs using the classic dynamic-programming algorithms for longest common substring (contiguous match) and longest common subsequence (non-contiguous). Paste two texts, pick which kind of match you want, and the result highlights the matching text in both inputs.

The two modes

Longest common substring finds the longest run of characters that appears the same in both strings, contiguously. The shared text must be an unbroken sequence in both inputs. “The quick brown fox” and “The slow brown fox” share the substring ” brown fox” (10 characters including the leading space) because those characters appear consecutively in both.

Longest common subsequence finds the longest sequence of characters that appears in both strings in the same order but not necessarily contiguous. “abcXdef” and “aYbZcWdef” share the subsequence “abcdef” (6 characters) because you can pick those letters in the right order from both strings, even though they’re interrupted by other characters.

The substring is always a subset of the subsequence — the longest substring is never longer than the longest subsequence for the same input pair. They become equal only when the common text really is contiguous in both strings.

The algorithms

Both use a 2-dimensional dynamic-programming table of size $(m+1) \times (n+1)$, where $m$ and $n$ are the lengths of the two input strings. Each cell $(i, j)$ in the table stores a piece of the answer for a prefix of the inputs.

For substring, cell $(i, j)$ holds the length of the longest common substring ending at position $i$ in string A and position $j$ in string B:

dp(i,j)={dp(i1,j1)+1if Ai=Bj0otherwise\text{dp}(i, j) = \begin{cases} \text{dp}(i-1, j-1) + 1 & \text{if } A_i = B_j \\ 0 & \text{otherwise} \end{cases}

Whenever characters match, you extend the run; whenever they don’t, the run breaks and the cell is zero. Track the global maximum as you fill the table, then read off the match by backing up from the cell with the maximum value.

For subsequence, cell $(i, j)$ holds the length of the longest common subsequence of the first $i$ characters of A and the first $j$ of B:

dp(i,j)={dp(i1,j1)+1if Ai=Bjmax(dp(i1,j), dp(i,j1))otherwise\text{dp}(i, j) = \begin{cases} \text{dp}(i-1, j-1) + 1 & \text{if } A_i = B_j \\ \max(\text{dp}(i-1, j),\ \text{dp}(i, j-1)) & \text{otherwise} \end{cases}

Whenever characters match, you inherit the diagonal value plus one; otherwise you take the best of dropping a character from either string. The final answer lives in cell $(m, n)$, and you reconstruct the actual subsequence by walking back through the table.

Both algorithms run in $O(m \cdot n)$ time and memory.

Example: plagiarism detection

Two student essays share the sentence “the industrial revolution fundamentally transformed western economies by mechanising production and concentrating capital.” The substring mode finds that exact 102-character match if both essays contain it. In real plagiarism work, a 50+ character shared contiguous substring is strong evidence that one author saw the other’s text — genuinely independent prose rarely produces such long coincidental overlaps.

Example: code diff

Given two versions of a function, the subsequence mode finds the longest sequence of lines (or characters, depending on how you tokenise) that are unchanged between the versions. Everything not in the LCS is either an insertion, deletion, or modification. This is the core of how git diff and most other diff tools compute what changed between two files — the LCS tells you what stayed the same, and the delta is everything else.

Example: similar but not identical titles

“The Lord of the Rings: The Fellowship of the Ring” and “The Lord of the Rings: The Two Towers” share the substring “The Lord of the Rings: The ” (27 characters). The subsequence would go further — adding any common-ordered characters from the divergent tails — but for “is this the same franchise?” questions, the substring is more discriminating.

What this tool does not do

It doesn’t compute a similarity percentage — for that, use the similarity percentage checker, which normalises the match length against the input sizes across four algorithms.

It doesn’t handle fuzzy or approximate matching (Levenshtein distance, edit distance). Those are different problems with different algorithms; this tool finds exact character matches only.

It doesn’t produce a full diff output (inserted/deleted/unchanged line counts). The subsequence length tells you what’s shared, but producing a full diff requires tracking the non-matching edits too — for that, the text diff tool gives the line-by-line changed/unchanged split.

Frequently asked questions

What is the difference between substring and subsequence?

A substring is a contiguous run of characters — 'abcd' and 'efabcdgh' share the substring 'abcd' because those four characters appear next to each other in both strings. A subsequence is a sequence of characters in the same order but NOT necessarily contiguous — 'abcd' and 'axbycdz' share the subsequence 'abcd' because those four characters appear in both strings in the same left-to-right order, even though they are separated by other characters in the second. The substring is always shorter than or equal to the subsequence; they are equal only when the common part is genuinely contiguous in both strings.

What is this useful for?

Plagiarism detection — if two essays share a long contiguous substring, that's strong evidence one was copied from the other. Code diff debugging — finding the unchanged block between two versions of a file helps you focus on the actually-edited regions. DNA sequence alignment — subsequences are how biologists find shared genetic material between species. Text deduplication — identifying near-duplicate paragraphs in a large corpus. Diff tools (git diff, GNU diff) use variants of the LCS algorithm to compute what changed between two files.

How does the algorithm work?

Both versions use dynamic programming with a 2-dimensional table. For substring: entry (i, j) stores the length of the longest common substring ending exactly at position i in string A and position j in string B. If the characters match, the value is (i−1, j−1) + 1; otherwise it's 0 (the run is broken). For subsequence: entry (i, j) stores the length of the longest common subsequence considering the first i characters of A and the first j of B. If characters match, value is (i−1, j−1) + 1; otherwise it's the maximum of (i−1, j) and (i, j−1). Both run in O(m × n) time and memory, where m and n are the string lengths.

When there are ties, which match does the tool return?

For substring: the leftmost match encountered during the scan of string A. For subsequence: the one reconstructed by a canonical backtrack through the DP table, which biases toward taking matches that are in the lower-right corner of the table first. Neither is 'more correct' than any other — they're all equally valid maximal matches. If you need to find all maximal matches, the algorithm can be extended to backtrack multiple paths, but this tool just reports one.

How long can the input strings be?

The algorithm is O(m × n) in both time and memory, so input sizes of a few thousand characters each run comfortably in the browser. Ten thousand × ten thousand is 100 million operations and 100 million cells in the DP table, which is slow but still works. For genuinely long inputs (megabytes), specialised algorithms with better memory profiles exist (Hirschberg's for LCS subsequence, suffix arrays for substring), but they're not needed for the use cases this tool targets.