(PHP 4 >= 4.0.1, PHP 5)
levenshtein — Berechnet die Levenshtein-Distanz zwischen zwei Strings
Die Levenshtein-Distanz bezeichnet die minimale Anzahl von Zeichen, die Sie ersetzen, einfügen oder löschen müssen, um str1 in str2 umzuwandeln. Die Komplexität des Algorithmus ist 0(m*n), wobei n und m die Länge von str1 und str2 darstellen (deutlich besser, wenn man mit similar_text() vergleicht, was mit 0(max(n,m)**3) daher kommt, aber trotzdem immer noch teuer).
In ihrer einfachsten Form erhält die Funktion nur die zwei Strings als Parameter und berechnet nur die Anzahl der Einfügungen, Ersetzungen und Löschungen, die benötigt werden, um str1 in str2 zu überführen.
Eine zweite Variante verwendet drei zusätzliche Parameter, die die Kosten von Einfügungen, Ersetzungen und Löschungen festlegen. Dies ist eine allgemeingültigere und anpassungsfähigere Version als Variante eins, aber nicht so effizient.
Eine der Zeichenketten, für die die Levenshtein-Distanz zu berechnen ist.
Eine der Zeichenketten, für die die Levenshtein-Distanz zu berechnen ist.
Definiert die Kosten des Einfügens.
Definiert die Kosten des Ersetzens.
Definiert die Kosten des Löschens.
Die Funktion gibt die Levenshtein-Distanz zwischen den beiden Argument-Strings oder -1 zurück, sofern eines der Argumente länger als 255 Zeichen ist.
Beispiel #1 levenshtein()-Beispiel
<?php
// eingegebenes falsch geschriebenes Wort
$input = 'carrrot';
// Wörterarray als Vergleichsquelle
$words = array('apple','pineapple','banana','orange',
'radish','carrot','pea','bean','potato');
// noch keine kürzeste Distanz gefunden
$shortest = -1;
// durch die Wortliste gehen, um das ähnlichste Wort zu finden
foreach ($words as $word) {
// berechne die Distanz zwischen Inputwort und aktuellem Wort
$lev = levenshtein($input, $word);
// auf einen exakten Treffer prüfen
if ($lev == 0) {
// das nächste Wort ist das Wort selbst (exakter Treffer)
$closest = $word;
$shortest = 0;
// Schleife beenden, da wir einen exakten Treffer gefunden haben
break;
}
// Wenn die Distanz kleiner ist als die nächste gefundene kleinste Distanz
// ODER wenn ein nächstkleineres Wort noch nicht gefunden wurde
if ($lev <= $shortest || $shortest < 0) {
// setze den nächstliegenden Treffer und die kürzestes Distanz
$closest = $word;
$shortest = $lev;
}
}
echo "Eingegebenes Wort: $input\n";
if ($shortest == 0) {
echo "Exakter Treffer gefunden: $closest\n";
} else {
echo "Meinten Sie: $closest?\n";
}
?>
Das oben gezeigte Beispiel erzeugt folgende Ausgabe:
Eingegebenes Word: carrrot Meinten Sie: carrot?
A ported version of the memory efficient implementation by Sten Hjelmqvist at http://www.codeproject.com/KB/recipes/Levenshtein.aspx - works 1.5x faster than previous versions posted here (44s vs 63s on 2.5KB strings). If you don't care about UTF, you can substitute mb_substr for array-like access and get even faster (24s on the same test set).
I also added trimming of common beginning and end so you can save up to 99% more time — in case of small edits inside large texts.
<?php
function levenshteinDistance2($str1, $str2)
{
$len1 = mb_strlen($str1);
$len2 = mb_strlen($str2);
// strip common prefix
$i = 0;
do {
if(mb_substr($str1, $i, 1) != mb_substr($str2, $i, 1))
break;
$i++;
$len1--;
$len2--;
} while($len1 > 0 && $len2 > 0);
if($i > 0) {
$str1 = mb_substr($str1, $i);
$str2 = mb_substr($str2, $i);
}
// strip common suffix
$i = 0;
do {
if(mb_substr($str1, $len1-1, 1) != mb_substr($str2, $len2-1, 1))
break;
$i++;
$len1--;
$len2--;
} while($len1 > 0 && $len2 > 0);
if($i > 0) {
$str1 = mb_substr($str1, 0, $len1);
$str2 = mb_substr($str2, 0, $len2);
}
if ($len1 == 0)
return $len2;
if ($len2 == 0)
return $len1;
$v0 = range(0, $len1);
$v1 = array();
for ($i = 1; $i <= $len2; $i++) {
$v1[0] = $i;
$str2j = mb_substr($str2, $i - 1, 1);
for ($j = 1; $j <= $len1; $j++) {
$cost = (mb_substr($str1, $j - 1, 1) == $str2j) ? 0 : 1;
$m_min = $v0[$j] + 1;
$b = $v1[$j - 1] + 1;
$c = $v0[$j - 1] + $cost;
if ($b < $m_min)
$m_min = $b;
if ($c < $m_min)
$m_min = $c;
$v1[$j] = $m_min;
}
$vTmp = $v0;
$v0 = $v1;
$v1 = $vTmp;
}
return $v0[$len1];
}
?>
Once again, some testing (with thanks to Tiago) showed me that there was still an error in the code. It wasn't coming up with the right distance when a deletion was involved. This is because there was an error in the way that the distance array was instantiated. (Also, my thanks to the editor for making this easier on everyone involved.)
<?php
/*
* This function starts out with several checks in an attempt to save time.
* 1. The shorter string is always used as the "right-hand" string (as the size of the array is based on its length).
* 2. If the left string is empty, the length of the right is returned.
* 3. If the right string is empty, the length of the left is returned.
* 4. If the strings are equal, a zero-distance is returned.
* 5. If the left string is contained within the right string, the difference in length is returned.
* 6. If the right string is contained within the left string, the difference in length is returned.
* If none of the above conditions were met, the Levenshtein algorithm is used.
*/
function LevenshteinDistance($s1, $s2)
{
$nLeftLength = strlen($s1);
$nRightLength = strlen($s2);
if ($nLeftLength >= $nRightLength)
{
$sLeft = $s1;
$sRight = $s2;
} else {
$sLeft = $s2;
$sRight = $s1;
$nLeftLength += $nRightLength; // arithmetic swap of two values
$nRightLength = $nLeftLength - $nRightLength;
$nLeftLength -= $nRightLength;
}
if ($nLeftLength == 0)
return $nRightLength;
else if ($nRightLength == 0)
return $nLeftLength;
else if ($sLeft === $sRight)
return 0;
else if (($nLeftLength < $nRightLength) && (strpos($sRight, $sLeft) !== FALSE))
return $nRightLength - $nLeftLength;
else if (($nRightLength < $nLeftLength) && (strpos($sLeft, $sRight) !== FALSE))
return $nLeftLength - $nRightLength;
else {
$nsDistance = range(0, $nRightLength);
for ($nLeftPos = 1; $nLeftPos <= $nLeftLength; ++$nLeftPos)
{
$cLeft = $sLeft[$nLeftPos - 1];
$nDiagonal = $nLeftPos - 1;
$nsDistance[0] = $nLeftPos;
for ($nRightPos = 1; $nRightPos <= $nRightLength; ++$nRightPos)
{
$cRight = $sRight[$nRightPos - 1];
$nCost = ($cRight == $cLeft) ? 0 : 1;
$nNewDiagonal = $nsDistance[$nRightPos];
$nsDistance[$nRightPos] =
min($nsDistance[$nRightPos] + 1,
$nsDistance[$nRightPos - 1] + 1,
$nDiagonal + $nCost);
$nDiagonal = $nNewDiagonal;
}
}
return $nsDistance[$nRightLength];
}
}
?>
[EDITOR'S NOTE: original post and 2 corrections combined into 1 -- mgf]
Here is an implementation of the Levenshtein Distance calculation that only uses a one-dimensional array and doesn't have a limit to the string length. This implementation was inspired by maze generation algorithms that also use only one-dimensional arrays.
I have tested this function with two 532-character strings and it completed in 0.6-0.8 seconds.
<?php
/*
* This function starts out with several checks in an attempt to save time.
* 1. The shorter string is always used as the "right-hand" string (as the size of the array is based on its length).
* 2. If the left string is empty, the length of the right is returned.
* 3. If the right string is empty, the length of the left is returned.
* 4. If the strings are equal, a zero-distance is returned.
* 5. If the left string is contained within the right string, the difference in length is returned.
* 6. If the right string is contained within the left string, the difference in length is returned.
* If none of the above conditions were met, the Levenshtein algorithm is used.
*/
function LevenshteinDistance($s1, $s2)
{
$sLeft = (strlen($s1) > strlen($s2)) ? $s1 : $s2;
$sRight = (strlen($s1) > strlen($s2)) ? $s2 : $s1;
$nLeftLength = strlen($sLeft);
$nRightLength = strlen($sRight);
if ($nLeftLength == 0)
return $nRightLength;
else if ($nRightLength == 0)
return $nLeftLength;
else if ($sLeft === $sRight)
return 0;
else if (($nLeftLength < $nRightLength) && (strpos($sRight, $sLeft) !== FALSE))
return $nRightLength - $nLeftLength;
else if (($nRightLength < $nLeftLength) && (strpos($sLeft, $sRight) !== FALSE))
return $nLeftLength - $nRightLength;
else {
$nsDistance = range(1, $nRightLength + 1);
for ($nLeftPos = 1; $nLeftPos <= $nLeftLength; ++$nLeftPos)
{
$cLeft = $sLeft[$nLeftPos - 1];
$nDiagonal = $nLeftPos - 1;
$nsDistance[0] = $nLeftPos;
for ($nRightPos = 1; $nRightPos <= $nRightLength; ++$nRightPos)
{
$cRight = $sRight[$nRightPos - 1];
$nCost = ($cRight == $cLeft) ? 0 : 1;
$nNewDiagonal = $nsDistance[$nRightPos];
$nsDistance[$nRightPos] =
min($nsDistance[$nRightPos] + 1,
$nsDistance[$nRightPos - 1] + 1,
$nDiagonal + $nCost);
$nDiagonal = $nNewDiagonal;
}
}
return $nsDistance[$nRightLength];
}
}
?>
Simple levenshtein function without string length limit ...
<?php
function levenshtein2($str1, $str2, $cost_ins = null, $cost_rep = null, $cost_del = null) {
$d = array_fill(0, strlen($str1) + 1, array_fill(0, strlen($str2) + 1, 0));
$ret = 0;
for ($i = 1; $i < strlen($str1) + 1; $i++)
$d[$i][0] = $i;
for ($j = 1; $j < strlen($str2) + 1; $j++)
$d[0][$j] = $j;
for ($i = 1; $i < strlen($str1) + 1; $i++)
for ($j = 1; $j < strlen($str2) + 1; $j++) {
$c = 1;
if ($str1{$i - 1} == $str2{$j - 1})
$c = 0;
$d[$i][$j] = min($d[$i - 1][$j] + 1, $d[$i][$j - 1] + 1, $d[$i - 1][$j - 1] + $c);
$ret = $d[$i][$j];
}
return $ret;
}
?>
I have made a function that removes the length-limit of levenshtein function and ajust the result with similar_text:
<?php
function _similar($str1, $str2) {
$strlen1=strlen($str1);
$strlen2=strlen($str2);
$max=max($strlen1, $strlen2);
$splitSize=250;
if($max>$splitSize)
{
$lev=0;
for($cont=0;$cont<$max;$cont+=$splitSize)
{
if($strlen1<=$cont || $strlen2<=$cont)
{
$lev=$lev/($max/min($strlen1,$strlen2));
break;
}
$lev+=levenshtein(substr($str1,$cont,$splitSize), substr($str2,$cont,$splitSize));
}
}
else
$lev=levenshtein($str1, $str2);
$porcentage= -100*$lev/$max+100;
if($porcentage>75)//Ajustar con similar_text
similar_text($str1,$str2,$porcentage);
return $porcentage;
}
?>
Using PHP's example along with Patrick's comparison percentage function, I have come up with a function that returns the closest word from an array, and assigns the percentage to a referenced variable:
<?php
function closest_word($input, $words, &$percent = null) {
$shortest = -1;
foreach ($words as $word) {
$lev = levenshtein($input, $word);
if ($lev == 0) {
$closest = $word;
$shortest = 0;
break;
}
if ($lev <= $shortest || $shortest < 0) {
$closest = $word;
$shortest = $lev;
}
}
$percent = 1 - levenshtein($input, $closest) / max(strlen($input), strlen($closest));
return $closest;
}
?>
Usage:
<?php
$input = 'carrrot';
$words = array('apple','pineapple','banana','orange',
'radish','carrot','pea','bean','potato');
$percent = null;
$found = closest_word($input, $words, $percent);
printf('Closest word to "%s": %s (%s%% match)', $input, $found, round($percent * 100, 2));
?>
I found that lowercasing the array prior to comparing yields a better comparison when the case is not of importance, for example: comparing a user-inputted category to a list of existing categories.
I also found that when the percentage was above 75%, it was usually the match that I was looking for.
To find the comparison percentage of levenshtein, use this function:
<?php
function _levenshtein($str1, $str2) {
return 1-levenshtein($str1, $str2)/max(strlen($str1), strlen($str2));
}
echo _levenshtein('dogs', 'cats'); //.25
This function (along with levenshtein) is obsolete though: similar_text is 20% faster that the two.
?>
@ ivo_gelov at gmx dot net
Your implementation of Myers' algorithm is not entirely correct. The algorithm finds the end of a match, but you still need to find the start of a match. The easiest way to find the start of a match is to execute the algorithm again on the reversed string, starting at the j'th character. You can't just assume the start of the match is at j - m because you don't know whether the match is longer, shorter, or of equal length.
Just my 2 cents.
Here is a small function to implement a fuzzy search.
If someone is wondering about the mathematics inside, this information can be found in "A fast bit-vector algorithm for approximate string matching based on dynamic programming" by Gene Myers, May 27 1998
<?php
// This can search for PATTERN into HAYSTACK with MIST mistaken symbols in PATTERN
// OK_MATCH is a function ($Position, $Score)
// Position is zero-based index in HAYSTACK, where PATTERN is found,
// and Score is the number of mistaken symbols
function FuzzyLook($pattern,$haystack,$mist,$ok_match)
{
$m = strlen($pattern);
$n = strlen($haystack);
if($n==0 OR $m==0) return;
// Precompute Peq[Z]
for($i=0; $i<32; $i++) $Bit[$i] = (1 << $i);
for($i=0; $i<$m; $i++) $Peq[ ord(substr($pattern,$i,1)) ] |= $Bit[$i];
$Pv = 0xFFFFFFFF;
$Mv = 0;
$Score = $m;
for($j=0; $j<$n; $j++)
{
$Eq = $Peq[ ord(substr($haystack,$j,1))];
$Xv = $Eq | $Mv;
$Xh = ((($Eq & $Pv) + $Pv) ^ $Pv) | $Eq;
$Ph = $Mv | ~ ($Xh | $Pv);
$Mh = $Pv & $Xh;
if($Ph & (1 << ($m-1))) $Score += 1;
elseif($Mh & (1 << ($m-1))) $Score -= 1;
$Ph <<= 1;
$Pv = ($Mh << 1) | ~ ($Xv | $Ph);
$Mv = $Ph & $Xv;
if($Score <= $mist && $j>=$m-1) $ok_match($j-$m+1,$Score);
}
}
?>
I have found that levenshtein is actually case-sensitive (in PHP 4.4.2 at least).
<?php
$distance=levenshtein('hello','ELLO');
echo "$distance";
?>
Outputs: "5", instead of "1". If you are implementing a fuzzy search feature that makes use of levenshtein, you will probably need to find a way to work around this.
Here is a string resynch function:
<?php
// Trouve les operations a effectuer pour modifier $b en $a en exploitant leurs similitudes (Finds the operations required to change $b to $a)
// Identique a la fonction Resynch Compare de Hex Workshop
//
// Parametres:
// $a Premiere chaine (cible, target)
// $b Seconde chaine (source)
// $l Nombre d'octets devant correspondre pour etre consides comme un bloc similaire (number of matching bytes required)
// $s Distance maximale dans laquelle les blocs similaires sont cherches (search window)
//
// Retourne:
// Array
// Array
// [0] Operation: + Add , - Del , / Replace, = Match
// [1] Source offset
// [2] Source count
// [3] Target offset
// [4] Target count
//
function str_resynch($a,$b,$l=32,$s=2048) {
$r=array();
for($i=0,$c=strlen($a),$cc=strlen($b),$ii=0,$z=$s-1,$z2=($z<<1)+1; $i<$c; $i++) {
$d=$i-$z;
$d=($d<$ii)?substr($b,$ii,$z2-$ii+$d):substr($b,$d,$z2);
$p=strpos($d,$a{$i});
$n=0;
while ($p!==FALSE) {
$m=1;
$bi=$i;
$bp=$p;
$p+=$ii;
while ((++$i<$c) && (++$p<$cc)) {
if ($a{$i}!=$b{$p}) break;
$m++;
}
if ($m<$l) {
$i=$bi;
$n=$bp+1;
$p=@strpos($d,$a{$i},$n);
}
else {
$i--;
$r[]=array($bi,$bp+$ii,$m); // offset a, offset b, Count
$ii=$p;
break;
}
}
}
if (!count($r)) return ($cc)?array('/',0,$c,0,$cc):array(array('+',0,$c,0,0));
$o=array();
$bi=0;
$bp=0;
for($i=0,$m=count($r);$i<$m;$i++) {
if ($r[$i][0]!=$bi) {
if ($r[$i][1]!=$bp) {
// Replace
$o[]=array('/',$bi,$r[$i][0]-$bi,$bp,$r[$i][1]-$bp);
$bi=$r[$i][0];
$bp=$r[$i][1];
}
else {
// Insertion
$o[]=array('+',$bi,$r[$i][0]-$bi,$bp,0);
$bi=$r[$i][0];
}
}
elseif ($r[$i][1]!=$bp) {
// Delete
$o[]=array('-',$bi,0,$bp,$r[$i][1]-$bp);
$bp=$r[$i][1];
}
// Match
$o[]=array('=',$r[$i][0],$r[$i][2],$r[$i][1],$r[$i][2]);
$bi+=$r[$i][2];
$bp+=$r[$i][2];
}
if ($c!=$bi) {
if ($cc!=$bp) $o[]=array('/',$bi,$c-$bi,$bp,$cc-$bp);
else $o[]=array('+',$bi,$c-$bi,$bp,0);
}
elseif ($cc!=$bp) $o[]=array('-',$bi,0,$bp,$cc-$bp);
return $o;
}
?>
//levenshtein for arrays
function array_levenshtein($array1,$array2)
{ $aliases= array_flip(array_values(array_unique(array_merge($array1,$array2))));
if(count($aliases)>255) return -1;
$stringA=''; $stringB='';
foreach($array1 as $entry) $stringA.=chr($aliases[$entry]);
foreach($array2 as $entry) $stringB.=chr($aliases[$entry]);
return levenshtein($stringA,$stringB);
}
// e.g. use array_levenshtein to detect special expressions in unser-inputs
echo array_levenshtein(split(" ", "my name is xxx"), split(" ","my name is levenshtein"));
//output: 1
<?php
/*********************************************************************
* The below func, btlfsa, (better than levenstien for spelling apps)
* produces better results when comparing words like haert against
* haart and heart.
*
* For example here is the output of levenshtein compared to btlfsa
* when comparing 'haert' to 'herat, haart, heart, harte'
*
* btlfsa('haert','herat'); output is.. 3
* btlfsa('haert','haart'); output is.. 3
* btlfsa('haert','harte'); output is.. 3
* btlfsa('haert','heart'); output is.. 2
*
* levenshtein('haert','herat'); output is.. 2
* levenshtein('haert','haart'); output is.. 1
* levenshtein('haert','harte'); output is.. 2
* levenshtein('haert','heart'); output is.. 2
*
* In other words, if you used levenshtein, 'haart' would be the
* closest match to 'haert'. Where as, btlfsa sees that it should be
* 'heart'
*/
function btlfsa($word1,$word2)
{
$score = 0;
// For each char that is different add 2 to the score
// as this is a BIG difference
$remainder = preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word1)."]/i",'',$word2);
$remainder .= preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word2)."]/i",'',$word1);
$score = strlen($remainder)*2;
// Take the difference in string length and add it to the score
$w1_len = strlen($word1);
$w2_len = strlen($word2);
$score += $w1_len > $w2_len ? $w1_len - $w2_len : $w2_len - $w1_len;
// Calculate how many letters are in different locations
// And add it to the score i.e.
//
// h e a r t
// 1 2 3 4 5
//
// h a e r t a e = 2
// 1 2 3 4 5 1 2 3 4 5
//
$w1 = $w1_len > $w2_len ? $word1 : $word2;
$w2 = $w1_len > $w2_len ? $word2 : $word1;
for($i=0; $i < strlen($w1); $i++)
{
if ( !isset($w2[$i]) || $w1[$i] != $w2[$i] )
{
$score++;
}
}
return $score;
}
// *************************************************************
// Here is a full code example showing the difference
$misspelled = 'haert';
// Imagine that these are sample suggestions thrown back by soundex or metaphone..
$suggestions = array('herat', 'haart', 'heart', 'harte');
// Firstly order an array based on levenshtein
$levenshtein_ordered = array();
foreach ( $suggestions as $suggestion )
{
$levenshtein_ordered[$suggestion] = levenshtein($misspelled,$suggestion);
}
asort($levenshtein_ordered, SORT_NUMERIC );
print "<b>Suggestions as ordered by levenshtein...</b><ul><pre>";
print_r($levenshtein_ordered);
print "</pre></ul>";
// Secondly order an array based on btlfsa
$btlfsa_ordered = array();
foreach ( $suggestions as $suggestion )
{
$btlfsa_ordered[$suggestion] = btlfsa($misspelled,$suggestion);
}
asort($btlfsa_ordered, SORT_NUMERIC );
print "<b>Suggestions as ordered by btlfsa...</b><ul><pre>";
print_r($btlfsa_ordered);
print "</pre></ul>";
?>
I am using the Levenshtein distance to SORT my search results.
I have a search page for peoples names. I do a SOUNDEX() search on the name in mysql. MySQL SOUNDEX() will perform the "fuzzy" search for me.
I then calculate the Levenshtein distance between the search term and the actual name found by the SOUNDEX() search. This will give me a score on how close my results are to the search string.
I can the sort my results for display listing the closest results first.
<?php
// PHP CODE INCLUDING DB LOOKUPS HERE
usort($searchresults, "finallevenshteinsortfunction");
function finallevenshteinsortfunction($a, $b)
{
if(($a['levenshtein'] > $b['levenshtein']) || ( $a['levenshtein'] == $b['levenshtein'] && strnatcasecmp( $a['Last_Name'], $b['Last_Name']) >= 1) ){ return $a['levenshtein'];} // Ok... The levenstein is greater OR with the same levenshtein, the last name is alphanumerically first
elseif($a['levenshtein'] == $b['levenshtein']){ return '0';} // The levenstein matches
elseif($a['levenshtein'] < $b['levenshtein']){ return -$a['levenshtein'];}
else{die("<!-- a horrable death -->");}
}
?>
Some code for a basic fuzzy search.
<?php
$needle = "Youser Nein stitched tymes hand";
$haystack = "
Did you know that a stitch in time can save nine. Isn't that amazing! I think so
The user-supplied function has to return the loser, from the sand. Describe the cost for this particular operation.
Eweser may decide to use only some of the supplied nans. And pay attention. This isn't hard...
";
// explode into words
$hwords = preg_split("/[\s\W]+/", $haystack);
$nwords = preg_split("/[\s\W]+/", $needle);
echo "You searched for $needle<br>";
echo "I found...<br>";
foreach ($hwords as $hkey => $hayword) {
$hmp = metaphone ($hayword);
foreach ($nwords as $nkey => $needword) {
// First or last letters of needle and haystack have to match (case insensitive)
$nfirst = strtolower(substr($needword, 0, 1));
$nlast = strtolower(substr($needword, -1));
$hfirst = strtolower(substr($hayword, 0, 1));
$hlast = strtolower(substr($hayword, -1));
if (($hfirst == $nfirst) or ($hlast == $nlast)) {
$nmp = metaphone ($needword);
$distance = levenshtein ($hmp, $nmp);
// $distance = levenshtein ($hayword, $needword);
$n_len = strlen($nmp);
$per = round(($distance/$n_len)*1000);
if ($per < 335) {
// Highlight word in haystack
$haystack = str_replace($hayword, "<b>$hayword</b>", $haystack);
$haystack = str_replace("<b><b>", "<b>", $haystack);
$haystack = str_replace("</b></b>", "</b>", $haystack);
}
}
}
}
// echo the new haystack
echo $haystack;
// Returns ...
//
// You searched for Youser Nein stitched tymes hand
// I found...
// Did you know that a <b>stitch</b> in <b>time</b> can
// save <b>nine</b>. Isn't that amazing! I think so. The
// <b>user</b>-supplied function has to return the
// <b>loser</b>, from the sand. Describe the cost for this
// particular operation. Eweser may decide to use only some
// of the supplied nans. <b>And</b> pay attention. This
// isn't <b>hard</b>...
?>
Try combining this with metaphone() for a truly amazing fuzzy search function. Play with it a bit, the results can be plain scary (users thinking the computer is almost telepathic) when implemented properly. I wish spell checkers worked as well as the code I've written.
I would release my complete code if reasonable, but it's not, due to copyright issues. I just hope that somebody can learn from this little tip!
I wrote this function to have an "intelligent" comparison between data to be written in a DB
and already existent data. Not ony calculating distances but also balancing distances for
each field.
<?php
/*
This function calculate a balanced percentage distance between an array of strings
"$record" and a compared array "$compared", balanced through an array of
weights "$weight". The three arrays must have the same indices.
For an unbalanced distance, set all weights to 1.
The used formula is:
percentage distance = sum(field_levenshtein_distance * field_weight) / sum(record_field_length * field_weight) * 100
*/
function search_similar($record, $weights, $compared, $precision=2) {
$field_names = array_keys($record);
# "Weighted length" of $record and "weighted distance".
foreach ($field_names as $field_key) {
$record_weight += strlen($record[$field_key]) * $weights[$field_key];
$weighted_distance += levenshtein($record[$field_key],$compared[$field_key]) * $weights[$field_key];
}
# Building the result..
if ($record_weight) {
return round(($weighted_distance / $record_weight * 100),$precision);
} elseif ((strlen(implode("",$record)) == 0) && (strlen(implode("",$compared)) == 0)) { // empty records
return round(0,$precision);
} elseif (array_sum($weights) == 0) { // all weights == 0
return round(0,$precision);
} else {
return false;
}
/*
Be very careful distinguising 0 result and false result.
The function results 0 ('0.00' if $precision is 2 and so on) if:
- $record and $compared are equals (even if $record and $compared are empty);
- all weights are 0 (the meaning could be "no care about any field").
Conversely, the function results false if $record is empty, but the weights
are not all 0 and $compared is not empty. That cause a "division by 0" error.
I wrote this kind of check:
if ($rel_dist = search_similar(...)) {
print $rel_dist;
} elseif ($rel_dist == "0.00") { // supposing that $precision is 2
print $rel_dist;
} else {
print "infinite";
}
*/
}
?>
Regarding the post by fgilles on April 26th 2001, I suggest not to use levenshtein() function to test for over-uppercasing unless you've got plenty of time to waste in your host. ;) Anyhow, I think it's a useful feature, as I get really annoyed when reading whole messages in uppercase.
PHP's levenshtein() function can only handle up to 255 characters, which is not realistic for user input (only the first paragraph oh this post has 285 characters). If you choose to use a custom function able to handle more than 255 characters, efficiency is an important issue.
I use this function, specific for this case, but much faster:
function ucase_percent ($str) {
$str2 = strtolower ($str);
$l = strlen ($str);
$ucase = 0;
for ($i = 0; $i < $l; $i++) {
if ($str{$i} != $str2{$i}) {
$ucase++;
}
}
return $ucase / $l * 100.0;
}
I think 10% is enough for written English (maybe other languages like German, which use more capital letters, need more). With some sentencies in uppercase (everybody has the right to shout occasionally), 20% would be enough; so I use a threshold of 30%. When exceeded, I lowercase the whole message.
Hope you find it useful and it helps keeping the web free of ill-mannered people.
I am using this function to avoid duplicate information on my client's database.
After retrieving a series of rows and assigning the results to an array values, I loop it with foreach comparing its levenshtein() with the user supplied string.
It helps to avoid people re-registering "John Smith", "Jon Smith" or "Jon Smit".
Of course, I can't block the operation if the user really wants to, but a suggestion is displayed along the lines of: "There's a similar client with this name.", followed by the list of the similar strings.
for the unserious guys how might do dna sequence analysis
with php. Here's the function that can handle long strings.
to get an output of the matrix uncomment the
printf-section. handle with care iteration are approximately
2x2^strlen ;)
------------------------------------------------------------
function levdis($s,$t){
$n=strlen($s);
$m=strlen($t);
$matrix=array(range(0,$n+1),range(0,$m+1));
$ret=0;
if ($n==0){
return $m;
}
elseif ($m==0){
return $n;
}
for ($i=0;$i<=$n;$i++) {
$matrix[$i][0]=$i;
}
for ($j=0;$j<=$m;$j++) {
$matrix[0][$j]=$j;
}
for ($i=1;$i<=$n;$i++) {
for ($j=1;$j<=$m;$j++) {
if ($s[$i-1]==$t[$j-1]) {
$cost=0;
}else{
$cost=1;
}
$matrix[$i][$j]=min($matrix[$i-1][$j]+1,
$matrix[$i][$j-1]+1,
$matrix[$i-1][$j-1]+$cost);
}
}
/* for ($j=0;$j<=$m;$j++) {
for ($i=0;$i<=$n;$i++) {
printf(" %02d",$matrix[$i][$j]);
}
echo "\n";
}*/
return $matrix[$n][$m];
}
----------------------------------------------
At the time of this manual note the user defined thing
in levenshtein() is not implemented yet. I wanted something
like that, so I wrote my own function. Note that this
doesn't return levenshtein() difference, but instead
an array of operations to transform a string to another.
Please note that the difference finding part (resync)
may be extremely slow on long strings.
<?php
/* matchlen(): returns the length of matching
* substrings at beginning of $a and $b
*/
function matchlen(&$a, &$b)
{
$c=0;
$alen = strlen($a);
$blen = strlen($b);
$d = min($alen, $blen);
while($a[$c] == $b[$c] && $c < $d)
$c++;
return $c;
}
/* Returns a table describing
* the differences of $a and $b */
function calcdiffer($a, $b)
{
$alen = strlen($a);
$blen = strlen($b);
$aptr = 0;
$bptr = 0;
$ops = array();
while($aptr < $alen && $bptr < $blen)
{
$matchlen = matchlen(substr($a, $aptr), substr($b, $bptr));
if($matchlen)
{
$ops[] = array('=', substr($a, $aptr, $matchlen));
$aptr += $matchlen;
$bptr += $matchlen;
continue;
}
/* Difference found */
$bestlen=0;
$bestpos=array(0,0);
for($atmp = $aptr; $atmp < $alen; $atmp++)
{
for($btmp = $bptr; $btmp < $blen; $btmp++)
{
$matchlen = matchlen(substr($a, $atmp), substr($b, $btmp));
if($matchlen>$bestlen)
{
$bestlen=$matchlen;
$bestpos=array($atmp,$btmp);
}
if($matchlen >= $blen-$btmp)break;
}
}
if(!$bestlen)break;
$adifflen = $bestpos[0] - $aptr;
$bdifflen = $bestpos[1] - $bptr;
if($adifflen)
{
$ops[] = array('-', substr($a, $aptr, $adifflen));
$aptr += $adifflen;
}
if($bdifflen)
{
$ops[] = array('+', substr($b, $bptr, $bdifflen));
$bptr += $bdifflen;
}
$ops[] = array('=', substr($a, $aptr, $bestlen));
$aptr += $bestlen;
$bptr += $bestlen;
}
if($aptr < $alen)
{
/* b has too much stuff */
$ops[] = array('-', substr($a, $aptr));
}
if($bptr < $blen)
{
/* a has too little stuff */
$ops[] = array('+', substr($b, $bptr));
}
return $ops;
}
Example:
$tab = calcdiffer('Tämä on jonkinlainen testi',
'Tämä ei ole minkäänlainen testi.');
$ops = array('='=>'Ok', '-'=>'Remove', '+'=>'Add');
foreach($tab as $k)
echo $ops[$k[0]], " '", $k[1], "'\n";
Example output:
Ok 'Tämä '
Remove 'on jonki'
Add 'ei ole minkää'
Ok 'nlainen testi'
Add '.'
For spell checking applications, delay could be tolerable if you assume the typist got the first two or three chars of each word right. Then you'd only need to calc distances for a small segment of the dictionary. This is a compromise but one I think a lot of spell checkers make.
For an example of site search using this function look at the PHP manual search button on this page. It appears to be doing this for the PHP function list.
Exempla of use for a forum: users can't post messages too much uppercased
<?php
if ((strlen($subject)>10) and ( ( levenshtein ($subject, strtolower ($subject) / strlen ($subject) ) > .3 ) ){
$subject = strtolower($subject);
}
?>
well, think of it as a sort of typo counter
by the way we might introduce a feature similar to something that the jikes compiler for java does using this function:
every time you try to access an undefined funtion you will get a suggestion of a function that's named
similar to what you requested
e.g.:
Undefined function 'rest()'!
Where you looking for 'reset()' instead?
It's also useful if you want to make some sort of registration page and you want to make sure that people who register don't pick usernames that are very similar to their passwords.
This could come in handy if you were to implement a simple searchengine in PHP. It would allow for simple fuzzy searching.
One application of this is when you want to look for a similar match instead of an exact one. You can sort the results of checking the distances of a word to a dictionary and sort them to see which were the more similar ones. Of course it will be a quite resourse consuming task anyway.