String Similarity Percentage Checker

Compare two strings and get four different similarity scores side by side, each computed by a different algorithm with a different strength. Paste two texts and the results update live.

The four algorithms

Levenshtein — the classic edit distance. Counts the minimum number of single-character insertions, deletions, or substitutions to turn one string into the other. The similarity score is $1 - \text{distance} / \max(|a|, |b|)$, so 100% means no edits needed. Good general-purpose character-level similarity. Weakness: on long inputs, one added paragraph at the end inflates the distance proportionally even if the earlier text is identical.
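The definition above can be sketched in a few lines of Python. This is a minimal illustration using the standard row-by-row dynamic-programming table, not necessarily how this tool computes it; the function names are hypothetical, and treating two empty strings as 100% similar is an assumption.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max(|a|, |b|). Assumption: two empty strings count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))                 # 3
print(round(levenshtein_similarity("MARTHA", "MARHTA"), 3))      # 0.667
```

Keeping only two rows of the table means memory grows with the shorter input, while running time still grows with the product of the two lengths.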

Jaro-Winkler — designed for short-string record linkage (matching names across databases). The base Jaro score counts matching characters within a sliding window and penalizes transpositions. Winkler’s extension adds a prefix bonus: strings sharing their first few characters score higher. “MARTHA” and “MARHTA” score 96% under Jaro-Winkler because they share a 3-character prefix; plain Jaro gives 94%. The algorithm is tuned for the case where you’re looking for typos in a name or ID field, not for comparing longer texts.
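A sketch of the Jaro and Jaro-Winkler scores described above. The prefix scaling factor of 0.1 and the 4-character prefix cap are the commonly used defaults and are assumptions here, since the page does not state which values the tool uses. Some variants only apply the prefix bonus above a Jaro threshold; this sketch always applies it. Function and parameter names are hypothetical.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: windowed character matches, penalised for transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters only count as matching if their positions differ by at most `window`.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as half a transposition each.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro plus a bonus for a shared prefix (assumed defaults: scale 0.1, cap 4)."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```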

Sørensen-Dice coefficient: set overlap on character bigrams. Extract all 2-character windows from each string, then count how many appear in both. The formula is $2 |A \cap B| / (|A| + |B|)$, where $A$ and $B$ are the two bigram sets. Good at detecting near-duplicates and works naturally on word-like inputs. “night” and “nacht” share exactly one bigram (“ht”) out of four in each, scoring $2 \cdot 1 / (4 + 4) = 25\%$. The score ignores where in the string a bigram occurs, so reordering whole words changes it very little, while reordering individual characters destroys the bigrams and lowers it sharply.
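The bigram overlap can be sketched as follows. This version works on sets of distinct bigrams, matching the "set overlap" description; counting repeated bigrams (a multiset) is a common alternative and gives slightly different scores on longer inputs. The function names and the handling of strings shorter than two characters are assumptions.

```python
def bigrams(text: str) -> set[str]:
    """All distinct 2-character windows in the string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """2 * |A ∩ B| / (|A| + |B|) over the two bigram sets."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        # Assumption: inputs too short to have bigrams only match if they are equal.
        return 1.0 if a == b else 0.0
    return 2.0 * len(grams_a & grams_b) / total


print(sorted(bigrams("night")))             # ['gh', 'ht', 'ig', 'ni']
print(dice_similarity("night", "nacht"))    # 0.25
```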

Cosine similarity — treats each text as a bag of words, counts word frequencies, and computes the cosine of the angle between the two frequency vectors. Lowercases and strips punctuation before counting. Order-independent: “cat and dog” and “dog and cat” score 100%. Word-level, not character-level — two strings with no shared words score 0% regardless of how similar they are letter-by-letter.
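A minimal bag-of-words sketch of the cosine score. The tokeniser here (lowercase, then keep runs of word characters via `\w+`) is an assumption standing in for whatever rule the tool applies; note that it splits contractions such as "it's" at the apostrophe. Function names are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count runs of word characters; punctuation is dropped."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two word-frequency vectors."""
    counts_a, counts_b = word_counts(a), word_counts(b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    squares_a = sum(c * c for c in counts_a.values())
    squares_b = sum(c * c for c in counts_b.values())
    if squares_a == 0 or squares_b == 0:
        return 0.0  # assumption: a text with no words matches nothing
    return dot / math.sqrt(squares_a * squares_b)


print(cosine_similarity("cat and dog", "dog and cat"))  # 1.0
print(cosine_similarity("MARTHA", "MARHTA"))            # 0.0
```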

Example: typo detection

“MARTHA” vs “MARHTA” (one character transposition in the middle):

  • Levenshtein: 67% (2 edits out of 6-char max)
  • Jaro-Winkler: 96% (shared MAR prefix boosts the score)
  • Dice: 40% (2 of 5 bigrams shared: MA and AR)
  • Cosine: 0% (treated as one unique word each — they don’t match)

Jaro-Winkler is the right algorithm for this case. Levenshtein understates how similar they are because the transposition costs 2 edits, not 1. Cosine is useless because the inputs are single “words” that don’t match.
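The Levenshtein and Jaro-Winkler figures for this pair can be checked by hand. The arithmetic below assumes the usual Winkler prefix scaling factor of 0.1, which the page does not state explicitly. Both strings have 6 characters, all 6 match within the window, the swapped T/H pair counts as one transposition, and the shared prefix “MAR” has length 3:

$$
\text{Levenshtein similarity} = 1 - \frac{2}{6} \approx 0.667
$$

$$
\text{Jaro} = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) = \frac{17}{18} \approx 0.944
$$

$$
\text{Jaro-Winkler} = \text{Jaro} + 3 \times 0.1 \times (1 - \text{Jaro}) = \frac{17}{18} + 0.3 \times \frac{1}{18} \approx 0.961
$$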

Example: paraphrase detection

“The cat sat on the mat” vs “A cat is sitting on the mat”:

  • Levenshtein: 59% (11 edits out of a 27-character maximum)
  • Jaro-Winkler: 78%
  • Dice: 68%
  • Cosine: 67% (shared: the, cat, on, mat; unshared: a, sat, is, sitting)

Cosine and Dice reflect the word overlap; Levenshtein understates the similarity because “sat” and “is sitting” differ character-by-character even though they express the same idea.

Example: reordering

“The cat sat on the mat” vs “The mat was sat on by the cat”:

  • Levenshtein: ~42% (lots of character-level changes)
  • Jaro-Winkler: 74%
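  • Dice: 82% (every bigram of the first sentence also occurs in the second)
  • Cosine: 89% (same word set, mostly same counts)

Cosine is the right algorithm here: the two sentences use almost the same words in different orders, and cosine treats them as nearly identical. Dice also holds up, because reordering whole words leaves the bigrams inside each word intact. Levenshtein and Jaro-Winkler are position-sensitive and see the reordering as a lot of edits.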
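The 89% cosine figure can be reproduced by hand. Writing the word counts of each sentence as a vector over the combined vocabulary (the, cat, sat, on, mat, was, by), after lowercasing:

$$
\mathbf{a} = (2, 1, 1, 1, 1, 0, 0), \qquad \mathbf{b} = (2, 1, 1, 1, 1, 1, 1)
$$

$$
\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \frac{4 + 1 + 1 + 1 + 1}{\sqrt{8} \, \sqrt{10}} = \frac{8}{\sqrt{80}} \approx 0.894
$$

The only thing keeping the score below 100% is the two extra words, “was” and “by”, in the second sentence.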

Example: completely different texts

“The weather is nice today” vs “Quantum entanglement preserves coherence”:

  • Levenshtein: ~10%
  • Jaro-Winkler: ~45% (coincidental character matches within window)
  • Dice: ~5%
  • Cosine: 0% (no shared words)

Cosine is the most discriminating here — it sees no shared vocabulary and returns zero. Jaro-Winkler’s window-based matching is misleading on long unrelated strings because it counts coincidental character proximity.

Picking the right algorithm

Short cheat sheet:

  • Matching names, IDs, short tags: Jaro-Winkler
  • Near-duplicate detection on words or phrases: Sørensen-Dice
  • Paraphrase detection on longer texts: Cosine
  • General-purpose character similarity baseline: Levenshtein

If you can’t decide, show all four — that’s what this tool does, and the disagreement between them is itself informative.

What this tool does not do

It doesn’t handle semantic similarity — two sentences with no shared words can mean the same thing (“it’s raining” vs “water is falling from the sky”), and none of these algorithms will notice. For semantic similarity you need an embedding model (sentence-BERT, OpenAI embeddings, etc.), which is a completely different class of tool.

It doesn’t compute longest common substring or produce a full diff — use the longest common substring finder for the shared-block length, or the text diff tool for the line-by-line breakdown.

It doesn’t support weighted comparison where some words matter more than others (TF-IDF). Cosine treats every word equally, which is fine for short texts but can be misleading for long documents where common stop words dominate the count.

Frequently asked questions

Which algorithm should I use?

Depends on what you're comparing. For short strings (names, tags, single words) use Jaro-Winkler: it's designed for record linkage and prioritises common prefixes. For near-duplicate detection on words or short sentences use Sørensen-Dice, which counts shared character bigrams. For longer documents use cosine similarity on word vectors, which is order-independent and weights each word by how often it occurs. Levenshtein (character edit distance) is a good general-purpose baseline but can be misleading for long inputs, because one added paragraph at the end inflates the distance in proportion to the paragraph's length.

Why four different algorithms?

Because 'similarity' isn't a single well-defined thing — different use cases care about different kinds of overlap. A tool that reports one number hides important information; a tool that reports four lets you see which kinds of similarity the two inputs share and which they don't. If Levenshtein says 60% but cosine says 95%, you're probably looking at two texts with similar vocabulary but different word order. If Dice says 80% but Jaro-Winkler says 30%, you're probably looking at two strings that share characters but disagree on their beginnings.

What does 100% mean for each algorithm?

Levenshtein 100% means zero edits needed: the strings are character-for-character identical. Jaro-Winkler 100% means the same. Dice 100% means the two strings contain exactly the same character bigrams, which usually but not always implies identity: 'abaca' and 'acaba' have the same four bigrams (ab, ba, ac, ca) yet differ. Cosine 100% means the word-count vectors are parallel, which in practice means the same words in the same proportions, possibly in a different ORDER. Two strings with the same words in different orders score cosine 100% but Levenshtein less.

What does 0% mean?

For Levenshtein: the edit distance equals the longer string's length, meaning every character in the longer string has to change to get to the shorter one. For Jaro-Winkler: no matching characters within the sliding window. For Dice: zero shared character bigrams. For cosine: no shared words in the two texts. Different algorithms can bottom out at different inputs — a character-swapped string scores 0 on cosine (different words) but non-zero on Jaro-Winkler (shared characters).

Does case or punctuation matter?

For Levenshtein, Jaro-Winkler, and Dice: yes, they compare raw characters. 'Hello' and 'hello' differ by one character (the H/h). For cosine: no, the word tokeniser lowercases everything and strips punctuation, so 'Hello, world!' and 'hello world' score 100%. If you want case-insensitive comparison on the character-level algorithms, lowercase your inputs before pasting them.
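If you are preparing inputs in a script rather than by hand, the lowercasing step might look like the sketch below. The function name is hypothetical, and stripping ASCII punctuation and collapsing whitespace go beyond the lowercasing suggested above; drop those two steps if punctuation is meaningful in your data.

```python
import string


def normalise(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.casefold()  # like lower(), but also folds cases such as German ß
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


print(normalise("Hello, world!"))  # hello world
```

Running both inputs through the same function before comparing them makes the character-level scores treat 'Hello' and 'hello' as identical.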