Compare two strings and get four different similarity scores side by side, each computed by a different algorithm with a different strength. Paste two texts and the results update live.
The four algorithms
Levenshtein — the classic edit distance. Counts the minimum number of single-character insertions, deletions, or substitutions to turn one string into the other. The similarity score is $1 - \text{distance} / \max(|a|, |b|)$, so 100% means no edits needed. Good general-purpose character-level similarity. Weakness: on long inputs, one added paragraph at the end inflates the distance proportionally even if the earlier text is identical.
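As a rough illustration of the definition above, here is a minimal Python sketch of the distance and the normalized score. The function names are illustrative, not the tool's own code, and treating two empty strings as 100% similar is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so each row stays short
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein(a, b) / longest
```

Under these assumptions, `levenshtein_similarity("MARTHA", "MARHTA")` is about 0.67. The weakness with appended text also shows up directly: `levenshtein_similarity("abc", "abcxxxxxxxxx")` is 0.25, even though the first string is a perfect prefix of the second.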
Jaro-Winkler — designed for short-string record linkage (matching names across databases). The base Jaro score counts matching characters within a sliding window and penalizes transpositions. Winkler’s extension adds a prefix bonus: strings sharing their first few characters score higher. “MARTHA” and “MARHTA” score 96% under Jaro-Winkler because they share a 3-character prefix; plain Jaro gives 94%. The algorithm is tuned for the case where you’re looking for typos in a name or ID field, not for comparing longer texts.
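The window, transposition, and prefix steps can be sketched in Python as follows. This is one common formulation, not necessarily the tool's: the prefix scale of 0.1 and the 4-character prefix cap are the usual defaults and are assumptions here, and some implementations only apply the prefix bonus when the Jaro score exceeds a threshold, which this sketch does not do.

```python
def jaro(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only if they sit within this distance.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    base = jaro(a, b)
    prefix = 0  # length of the shared prefix, capped at max_prefix
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

For “MARTHA” and “MARHTA” all six characters match with one transposition, so `jaro` gives about 0.944, and the 3-character prefix lifts `jaro_winkler` to about 0.961.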
Sørensen-Dice coefficient: set overlap on character bigrams. Extract all 2-character windows from each string, then count how many appear in both. The formula is $2 |A \cap B| / (|A| + |B|)$, where $A$ and $B$ are the bigram sets of the two strings. Good at detecting near-duplicates and works naturally on word-like inputs. “night” and “nacht” share exactly one bigram (“ht”) out of four in each, scoring 25%. The score depends only on which bigrams occur, not on where they occur, so strings built from the same chunks in a different order still score high.
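A short Python sketch of the bigram overlap, for illustration only. It counts repeated bigrams as a multiset (via `Counter`) rather than as a plain set, which is an assumption; the two variants agree whenever no bigram repeats within a string, as in the example here. The handling of strings shorter than two characters is also an assumption.

```python
from collections import Counter


def bigrams(s: str) -> Counter:
    """All 2-character windows of s, with multiplicity."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a: str, b: str) -> float:
    """2 * |A ∩ B| / (|A| + |B|) over character bigrams."""
    A, B = bigrams(a), bigrams(b)
    total = sum(A.values()) + sum(B.values())
    if total == 0:
        # Assumption: with no bigrams at all, fall back to exact equality.
        return 1.0 if a == b else 0.0
    overlap = sum((A & B).values())  # multiset intersection
    return 2 * overlap / total
```

`dice("night", "nacht")` evaluates to 0.25: the only shared bigram is “ht”, and each string contributes four bigrams to the denominator.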
Cosine similarity — treats each text as a bag of words, counts word frequencies, and computes the cosine of the angle between the two frequency vectors. Lowercases and strips punctuation before counting. Order-independent: “cat and dog” and “dog and cat” score 100%. Word-level, not character-level — two strings with no shared words score 0% regardless of how similar they are letter-by-letter.
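The bag-of-words computation can be sketched in Python like this. The tokenizer is an assumption: it lowercases the text and keeps runs of ASCII letters, digits, and apostrophes as words, which may differ from the tool's exact punctuation rules.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    # Assumed tokenizer: lowercase, then split on anything that is not
    # an ASCII letter, digit, or apostrophe.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: str, b: str) -> float:
    va, vb = word_counts(a), word_counts(b)
    # Dot product over the words of a; a Counter returns 0 for missing words.
    dot = sum(count * vb[word] for word, count in va.items())
    norm_sq_a = sum(c * c for c in va.values())
    norm_sq_b = sum(c * c for c in vb.values())
    if norm_sq_a == 0 or norm_sq_b == 0:
        return 0.0  # assumption: a text with no words matches nothing
    return dot / math.sqrt(norm_sq_a * norm_sq_b)
```

With this sketch, `cosine("cat and dog", "dog and cat")` is 1.0 because the two frequency vectors are identical, and `cosine("The cat sat on the mat", "The mat was sat on by the cat")` is about 0.89.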
Example: typo detection
“MARTHA” vs “MARHTA” (one character transposition in the middle):
- Levenshtein: 67% (2 edits out of 6-char max)
- Jaro-Winkler: 96% (shared MAR prefix boosts the score)
- Dice: 40% (2 of 5 bigrams shared: MA and AR)
- Cosine: 0% (treated as one unique word each — they don’t match)
Jaro-Winkler is the right algorithm for this case. Levenshtein understates how similar they are because the transposition costs 2 edits, not 1. Cosine is useless because the inputs are single “words” that don’t match.
Example: paraphrase detection
“The cat sat on the mat” vs “A cat is sitting on the mat”:
- Levenshtein: 59% (11 edits out of a 27-character max)
- Jaro-Winkler: 78%
- Dice: 68%
- Cosine: 67% (shared: the, cat, on, mat; unshared: sat, a, is, sitting)
Cosine and Dice reflect the word overlap; Levenshtein understates the similarity because “sat” and “is sitting” differ character-by-character even though they express the same idea.
Example: reordering
“The cat sat on the mat” vs “The mat was sat on by the cat”:
- Levenshtein: ~42% (lots of character-level changes)
- Jaro-Winkler: 74%
- Dice: 51%
- Cosine: 89% (every word of the first sentence appears in the second with the same count; only “was” and “by” are new)
Cosine is the right algorithm here — the two sentences use almost the same words in different orders, and cosine treats them as nearly identical. The character-level algorithms see the reordering as a lot of edits.
Example: completely different texts
“The weather is nice today” vs “Quantum entanglement preserves coherence”:
- Levenshtein: ~10%
- Jaro-Winkler: ~45% (coincidental character matches within window)
- Dice: ~5%
- Cosine: 0% (no shared words)
Cosine is the most discriminating here — it sees no shared vocabulary and returns zero. Jaro-Winkler’s window-based matching is misleading on long unrelated strings because it counts coincidental character proximity.
Picking the right algorithm
Short cheat sheet:
- Matching names, IDs, short tags: Jaro-Winkler
- Near-duplicate detection on words or phrases: Sørensen-Dice
- Paraphrase detection on longer texts: Cosine
- General-purpose character similarity baseline: Levenshtein
If you can’t decide, show all four — that’s what this tool does, and the disagreement between them is itself informative.
What this tool does not do
It doesn’t handle semantic similarity — two sentences with no shared words can mean the same thing (“it’s raining” vs “water is falling from the sky”), and none of these algorithms will notice. For semantic similarity you need an embedding model (sentence-BERT, OpenAI embeddings, etc.), which is a completely different class of tool.
It doesn’t compute longest common substring or produce a full diff — use the longest common substring finder for the shared-block length, or the text diff tool for the line-by-line breakdown.
It doesn’t support weighted comparison where some words matter more than others (TF-IDF). Cosine treats every word equally, which is fine for short texts but can be misleading for long documents where common stop words dominate the count.