PHP Manual: html_entity_decode - Converts all named HTML entities to their corresponding characters (function.html-entity-decode.html)

A service by Reinhard Neidl - Webprogrammierung.

html_entity_decode

(PHP 4 >= 4.3.0, PHP 5)

html_entity_decode - Converts all named HTML entities to their corresponding characters

Description

string html_entity_decode ( string $string [, int $quote_style = ENT_COMPAT [, string $charset ]] )

html_entity_decode() is the counterpart of htmlentities(): it converts all named HTML entities in string back to their corresponding characters.

Parameters

string

The input string.

quote_style

The optional second parameter quote_style determines what happens to 'single' and "double" quotes. You can use one of the three constants below; the default is ENT_COMPAT:

Available quote_style constants
Constant name   Description
ENT_COMPAT      Converts double quotes and leaves single quotes untouched.
ENT_QUOTES      Converts both double and single quotes.
ENT_NOQUOTES    Leaves both double and single quotes untouched.
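
As a rough illustration of the three constants, the sketch below decodes the same input under each setting. The sample string and the explicit 'UTF-8' charset argument are assumptions made for the example; the behaviour shown is that of recent PHP versions.

```php
<?php
// Sample input containing both quote styles as entities (illustrative only).
$encoded = '&quot;double&quot; and &#039;single&#039;';

// ENT_COMPAT: only double quotes are decoded.
$compat = html_entity_decode($encoded, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES: both double and single quotes are decoded.
$quotes = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// ENT_NOQUOTES: neither kind of quote is decoded.
$noquotes = html_entity_decode($encoded, ENT_NOQUOTES, 'UTF-8');

echo $compat, "\n", $quotes, "\n", $noquotes, "\n";
```

Passing the constant explicitly avoids depending on the default, which is not the same in every PHP version.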

charset

The ISO-8859-1 character set is used by default for the third parameter, charset. This parameter defines the character set used in the conversion.

The following character sets are supported in PHP 4.3.0 and later:

Supported character sets
Charset       Aliases                       Description
ISO-8859-1    ISO8859-1                     Western European, Latin-1
ISO-8859-15   ISO8859-15                    Western European, Latin-9. Adds the Euro sign and the French and Finnish letters missing in Latin-1 (ISO-8859-1).
UTF-8                                       ASCII-compatible multi-byte 8-bit Unicode.
cp866         ibm866, 866                   DOS-specific Cyrillic charset. Supported as of PHP 4.3.2.
cp1251        Windows-1251, win-1251, 1251  Windows-specific Cyrillic charset. Supported as of PHP 4.3.2.
cp1252        Windows-1252, 1252            Windows-specific charset for Western European languages.
KOI8-R        koi8-ru, koi8r                Russian. Supported as of PHP 4.3.2.
BIG5          950                           Traditional Chinese, mainly used in Taiwan.
GB2312        936                           Simplified Chinese, national standard character set.
BIG5-HKSCS                                  Big5 with Hong Kong extensions; Traditional Chinese.
Shift_JIS     SJIS, 932                     Japanese
EUC-JP        EUCJP                         Japanese

Note: Any other character sets are not implemented; ISO-8859-1 is used in their place.
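
To show how the charset argument affects the result, here is a minimal sketch that decodes one string into UTF-8 and into ISO-8859-1. The sample text is an assumption, and the behaviour shown (entities that the target charset cannot represent are left as they are) is that of recent PHP versions.

```php
<?php
$encoded = 'Price: 5 &euro; &amp; caf&eacute;';

// UTF-8 can represent every character here, so all entities are decoded.
$utf8 = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// ISO-8859-1 has no euro sign, so &euro; is left as it is;
// &eacute; becomes the single byte 0xE9.
$latin1 = html_entity_decode($encoded, ENT_QUOTES, 'ISO-8859-1');

// Hex dumps make the byte-level difference visible.
echo bin2hex($utf8), "\n", bin2hex($latin1), "\n";
```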

Return values

Returns the decoded string.

Changelog

Version   Description
5.0.0     Support for multi-byte character sets was added.

Examples

Example #1 Decoding named HTML entities

<?php
$orig = "I'll \"walk\" the <b>dog</b> now";

$a = htmlentities($orig);

$b = html_entity_decode($a);

echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now

echo $b; // I'll "walk" the <b>dog</b> now


// For users of PHP versions before 4.3.0, the following workaround helps.
// Note: it relies on the /e modifier, which was removed in PHP 7.
function unhtmlentities($string)
{
    // replace numeric entities
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
    // replace literal entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}

$c = unhtmlentities($a);

echo $c; // I'll "walk" the <b>dog</b> now

?>

Notes

Note:

You might wonder why trim(html_entity_decode('&nbsp;')); does not reduce the string to an empty string. The reason is that '&nbsp;' does not correspond to the character with ASCII code 32 (which trim() strips) but to the character with code 160 (0xa0) in the default ISO 8859-1 character set.
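
The following sketch shows the effect and two possible ways around it. The helper name trim_with_nbsp() is hypothetical, and the charset is passed explicitly so the result does not depend on the default of the running PHP version.

```php
<?php
// Decode explicitly to ISO-8859-1 and to UTF-8.
$latin1 = html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1'); // one byte: 0xA0
$utf8   = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');      // two bytes: 0xC2 0xA0

// Plain trim() leaves the non-breaking space in place.
$stillThere = trim($latin1);

// Hypothetical helper: trim ordinary whitespace plus U+00A0 from a UTF-8 string.
function trim_with_nbsp(string $s): string
{
    return (string) preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $s);
}

// For single-byte ISO-8859-1 data, add the 0xA0 byte to trim()'s character list.
$emptyLatin1 = trim($latin1, " \t\n\r\0\x0B\xA0");
$emptyUtf8   = trim_with_nbsp($utf8);
```

The byte-list variant is only safe for single-byte data: in UTF-8 the byte 0xA0 can be part of other characters, which is why the sketch uses a regular expression with the /u modifier there.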

See also


32 user contributions:
neurotic dot neu at gmail dot com
10.08.2010 21:25
This is a safe rawurldecode with utf8 detection:

<?php
function utf8_rawurldecode($raw_url_encoded){
    $enc = rawurldecode($raw_url_encoded);
    if (utf8_encode(utf8_decode($enc)) == $enc) {
        return rawurldecode($raw_url_encoded);
    } else {
        return utf8_encode(rawurldecode($raw_url_encoded));
    }
}
?>
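
utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, so here is a sketch of the same idea without them. All three function names are hypothetical, and the approach assumes, like the function above, that anything that is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
function looks_like_utf8(string $s): bool
{
    // With the /u modifier, preg_match() returns false for an invalid UTF-8 subject.
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8 by hand.
function latin1_to_utf8(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        // Bytes below 0x80 are ASCII; the rest become a two-byte sequence.
        $out .= ($b < 0x80) ? $s[$i] : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

// Decode the URL encoding, then convert only if the result is not already UTF-8.
function utf8_rawurldecode_sketch(string $raw): string
{
    $decoded = rawurldecode($raw);
    return looks_like_utf8($decoded) ? $decoded : latin1_to_utf8($decoded);
}
```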
Free at Key dot no
1.07.2010 14:51
Handy function to convert remaining HTML-entities into human readable chars (for entities which do not exist in target charset):

<?php
function cleanString($in,$offset=0)
{
    $out = trim($in);
    if (!empty($out))
    {
        $entity_start = strpos($out,'&',$offset);
        if ($entity_start === false)
        {
            // ideal
            return $out;
        }
        else
        {
            $entity_end = strpos($out,';',$entity_start);
            if ($entity_end === false)
            {
                 return $out;
            }
            // too long to be an entity
            else if ($entity_end > $entity_start+7)
            {
                // and on we go
                $out = cleanString($out,$entity_start+1);
            }
            // gotcha!
            else
            {
                $clean = substr($out,0,$entity_start);
                $subst = substr($out,$entity_start+1,1);
                // &scaron; => "s" / &#353; => "_"
                $clean .= ($subst != "#") ? $subst : "_";
                $clean .= substr($out,$entity_end+1);
                // and on we go
                $out = cleanString($clean,$entity_start+1);
            }
        }
    }
    return $out;
}
?>
Matt Robinson
6.09.2009 23:11
I wrote in a previous comment that html_entity_decode() only handled about 100 characters. That's not quite true; it only handles entities that exist in the output character set (the third argument). If you want to get ALL HTML entities, make sure you use ENT_QUOTES and set the third argument to 'UTF-8'.

If you don't want a UTF-8 string, you'll need to convert it afterward with something like utf8_decode(), iconv(), or mb_convert_encoding().

If you're producing XML, which doesn't recognise most HTML entities:

When producing a UTF-8 document (the default), then htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8') (because you only need to escape < and > and & unless you're printing inside the XML tags themselves).

Otherwise, either convert all the named entities to numeric ones, or declare the named entities in the document's DTD. The full list of 252 entities can be found in the HTML 4.01 Spec, or you can cut and paste the function from my site (http://inanimatt.com/php-convert-entities.php).
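
A minimal sketch of the UTF-8 case described above: decode every entity, then re-escape only what XML element content requires. The function name html_to_xml_text() is hypothetical; the nested call is the one quoted in the comment.

```php
<?php
// Hypothetical helper wrapping the two calls from the comment above.
function html_to_xml_text(string $html): string
{
    // 1. Turn every HTML entity into the real UTF-8 character.
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    // 2. Re-escape only &, < and > so the text is safe as XML element content.
    return htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');
}

echo html_to_xml_text('caf&eacute; &amp; &lt;b&gt; &quot;x&quot;'), "\n";
```

For text placed inside an attribute value, ENT_QUOTES would be the safer flag in the second step, since quotes have to be escaped there.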
marion at figmentthinking dot com
10.03.2009 14:11
I just ran into:
Bug #27626 html_entity_decode bug - cannot yet handle MBCS in html_entity_decode()!

The simple solution if you're still running PHP 4 is to wrap the html_entity_decode() call in utf8_encode().

<?php
$string = '&nbsp;';
$utf8_encode = utf8_encode(html_entity_decode($string));
?>

By default html_entity_decode() returns ISO-8859-1, which utf8_encode() then converts to UTF-8. utf8_decode() does the reverse:

http://us.php.net/manual/en/function.utf8-decode.php
"Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1"
jl dot garcia at gmail dot com
5.03.2009 0:33
I created this function to filter all the text that goes in or comes out of the database.

<?php
function filter_string($string, $nohtml='', $save='') {
    if (!empty($nohtml)) {
        $string = trim($string);
        if (!empty($save)) $string = htmlentities(trim($string), ENT_QUOTES, 'ISO-8859-15');
        else $string = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-15');
    }
    if (!empty($save)) $string = mysql_real_escape_string($string);
    else $string = stripslashes($string);
    return($string);
}
?>
Anonymous
3.10.2008 1:40
The previous post seems incorrect.  Even if PHP sets the charset, it can be overridden if HTML charsets are sent via META tags beforehand.
Anonymous
31.07.2008 7:01
You may want to specify the character set if you see unexpected behavior.  Here is an example.

# cat test.php
<?php
$str = '&#33;';
$quotes = html_entity_decode($str, ENT_QUOTES);
$noquotes = html_entity_decode($str, ENT_NOQUOTES);
$noquotesutf8 = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
echo "quotes='$quotes', noquotes='$noquotes', noquotesutf8='$noquotesutf8'\n";
?>

# php test.php
quotes='!', noquotes='&#33;', noquotesutf8='!'
kae at verens dot com
9.05.2008 15:11
the references to 'chr()' in the example unhtmlentities() function should be changed to unichr, using the example unichr() function described in the 'chr' reference (http://php.net/chr).

the reason for this is characters such as &#x20AC; which do not break down into an ASCII number (that's the Euro, by the way).
me at richardsnazell dot com
21.01.2008 13:19
I had a problem getting the 'TM' trademark symbol to display correctly in an email subject line. Using html_entity_decode() with different charsets didn't work, but directly replacing the entity with its single-byte equivalent (byte 153, the trademark sign in Windows-1252) did:

$subject = str_replace('&trade;', chr(153), $subject);
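
For comparison, a small sketch of what html_entity_decode() returns for &trade; on recent PHP versions when the charset is named explicitly. The choice of 'cp1252' is an assumption about which encoding the mail in question used.

```php
<?php
// Decoded to Windows-1252, the trademark sign is the single byte 0x99, i.e. chr(153).
$cp1252 = html_entity_decode('&trade;', ENT_QUOTES, 'cp1252');

// Decoded to UTF-8, the same character (U+2122) takes three bytes.
$utf8 = html_entity_decode('&trade;', ENT_QUOTES, 'UTF-8');

echo bin2hex($cp1252), ' ', bin2hex($utf8), "\n";
```

Whichever variant is used, the byte only displays correctly if the message declares the matching charset.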
Matt Robinson
22.10.2007 20:11
Bafflingly, html_entity_decode() only converts the 100 most common named entities, whereas the HTML 4.01 Recommendation lists over 250. This wrapper function converts all known named entities to numeric ones before handing over to the original html_entity_decode, and hopefully isn't too insufferably slow (am I right in thinking that making the conversion table static will prevent it being reinitialised on each call?)

Unfortunately it's just a little too long for this documentation. You can see the code at http://www.lazycat.org/software/html_entity_decode_full.phps
Hayley Watson
2.10.2007 0:15
To go further with Fabian's comment:

The XML specification (production 66) says that (decimal) numeric character references start with '&#', followed by one or more digits [0-9], and end with a ';' - just as the documented regular expression states. Hex references start with "&#x" and the allowed digits are [0-9a-fA-F].

And indeed, &#000000000000000000039; is a legitimate reference for an apostrophe (but don't tell Internet Explorer).

So Fabian's alteration to the expression is necessary. It's still insufficient, however, as chr() does not handle multibyte characters such as "&#8364;".
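
One way to address both points (leading zeros and code points above 255) is sketched below with preg_replace_callback() and a hand-written UTF-8 encoder. The function names codepoint_to_utf8() and decode_numeric_refs() are hypothetical, and only numeric references are handled, not named entities.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: decode decimal (&#39;) and hexadecimal (&#x27;) references.
function decode_numeric_refs(string $text): string
{
    return (string) preg_replace_callback(
        '~&#(?:x([0-9a-fA-F]+)|([0-9]+));~',
        function (array $m): string {
            // $m[1] holds hex digits, $m[2] decimal digits. Leading zeros are
            // harmless because hexdec() and the (int) cast use an explicit base.
            $cp = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
            // Leave out-of-range values and surrogates untouched.
            if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
                return $m[0];
            }
            return codepoint_to_utf8((int) $cp);
        },
        $text
    );
}

echo decode_numeric_refs('&#000000000000000000039; &#8364; &#x20AC;'), "\n";
```

Using a callback instead of the /e modifier also avoids evaluating matched text as PHP code.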
Hayley Watson
1.10.2007 23:54
Fabian's observation that chr(039) returns "a heart character" is explained by the fact that numeric literals that start with '0' are interpreted in base 8, which doesn't have a digit '9'. So 039==3 and hence chr(039) is equivalent to chr(3), NOT chr(39).
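
The base-8 rule can be sketched as follows. Since PHP 7 an invalid octal literal such as 039 is a parse error rather than being silently truncated, so the example uses the valid octal literal 047 instead.

```php
<?php
// An integer literal with a leading 0 is octal: 047 (octal) is 39 (decimal).
$fromOctalLiteral = 047;

// Digits captured from an entity are strings; an (int) cast reads them as
// decimal and ignores leading zeros.
$fromString = (int) '039';

echo $fromOctalLiteral, ' ', $fromString, ' ', chr($fromString), "\n";
```

This is why casting the matched digits is safer than pasting them into evaluated code, where they would be read as a literal.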
Fabian
28.09.2007 23:31
Actually I am not sure about the regex replacements that turn numeric entities back.
If you give &#039; to a browser, it shows a single quote; &#39; also turns into a single quote.

But if I do:
<?php
   chr(039);
?>
I do not get a single quote but a heart character (haven't seen it since DOS days :))
However
<?php
   chr(39);
?>
gives the correct result.
This makes the correct preg something like this

<?php
   $string = preg_replace('~&#x0*([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
   $string = preg_replace('~&#0*([0-9]+);~e', 'chr(\\1)', $string);
?>

The reason is also already found on preg_replace manual page:
http://de.php.net/manual/en/function.preg-replace.php#69478

039 is interpreted as octal
akniep at rayo dot info
13.07.2007 18:39
In answer to "laurynas dot butkus at gmail dot com" and "romans@void.lv" and their great code2utf function: I added the handling, mentioned before, for the entries in [128, 160[, which are not ASCII but are equal for all major western encodings like ISO8859-X and UTF-8.

Now, the following function should in fact convert any number (table-entry) into an UTF-8-character. Thus, the return-value  code2utf( <number> )  equals the character that is represented by the XML-entity  &#<number>;  (exceptions: #129, #141, #143, #144, #157).

To give an example, the function may be useful for creating a UTF-8-compatible html_entity_decode-function  or  determining the entry-position of UTF-8-characters in order to find the correct entity-replacement or similar.

    function code2utf($number)
    {
        if ($number < 0)
            return FALSE;
       
        if ($number < 128)
            return chr($number);
       
        // Removing / Replacing Windows Illegals Characters
        if ($number < 160)
        {
                if ($number==128) $number=8364;
            elseif ($number==129) $number=160; // (Rayo:) #129 using no relevant sign, thus, mapped to the saved-space #160
            elseif ($number==130) $number=8218;
            elseif ($number==131) $number=402;
            elseif ($number==132) $number=8222;
            elseif ($number==133) $number=8230;
            elseif ($number==134) $number=8224;
            elseif ($number==135) $number=8225;
            elseif ($number==136) $number=710;
            elseif ($number==137) $number=8240;
            elseif ($number==138) $number=352;
            elseif ($number==139) $number=8249;
            elseif ($number==140) $number=338;
            elseif ($number==141) $number=160; // (Rayo:) #141 using no relevant sign, thus, mapped to the saved-space #160
            elseif ($number==142) $number=381;
            elseif ($number==143) $number=160; // (Rayo:) #143 using no relevant sign, thus, mapped to the saved-space #160
            elseif ($number==144) $number=160; // (Rayo:) #144 using no relevant sign, thus, mapped to the saved-space #160
            elseif ($number==145) $number=8216;
            elseif ($number==146) $number=8217;
            elseif ($number==147) $number=8220;
            elseif ($number==148) $number=8221;
            elseif ($number==149) $number=8226;
            elseif ($number==150) $number=8211;
            elseif ($number==151) $number=8212;
            elseif ($number==152) $number=732;
            elseif ($number==153) $number=8482;
            elseif ($number==154) $number=353;
            elseif ($number==155) $number=8250;
            elseif ($number==156) $number=339;
            elseif ($number==157) $number=160; // (Rayo:) #157 using no relevant sign, thus, mapped to the saved-space #160
            elseif ($number==158) $number=382;
            elseif ($number==159) $number=376;
        } //if
       
        if ($number < 2048)
            return chr(($number >> 6) + 192) . chr(($number & 63) + 128);
        if ($number < 65536)
            return chr(($number >> 12) + 224) . chr((($number >> 6) & 63) + 128) . chr(($number & 63) + 128);
        if ($number < 2097152)
            return chr(($number >> 18) + 240) . chr((($number >> 12) & 63) + 128) . chr((($number >> 6) & 63) + 128) . chr(($number & 63) + 128);
       
       
        return FALSE;
    } //code2utf()
laurynas dot butkus at gmail dot com
15.05.2007 13:24
In PHP4 html_entity_decode() is not working well with UTF-8  spitting: "Warning: cannot yet handle MBCS in html_entity_decode()!".

This is a working solution combining several workarounds:

<?php
function html_entity_decode_utf8($string)
{
    static $trans_tbl;

    // replace numeric entities
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'code2utf(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'code2utf(\\1)', $string);

    // replace literal entities
    if (!isset($trans_tbl))
    {
        $trans_tbl = array();

        foreach (get_html_translation_table(HTML_ENTITIES) as $val=>$key)
            $trans_tbl[$key] = utf8_encode($val);
    }

    return strtr($string, $trans_tbl);
}

// Returns the utf string corresponding to the unicode value (from php.net, courtesy - romans@void.lv)
function code2utf($num)
{
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}
?>
teecee[(a)]teecee[pont]hu
13.05.2007 17:29
Hi!

The main problem when you try to unhtmlentities UTF-8 strings is that get_html_translation_table() gives back a non-UTF-8 conversion table. So the idea is to get the translation table and then translate the needed non-UTF-8 strings to UTF-8...

I have this code working; it is actually the code sent by 'daviscabral', just with an extra foreach in it ( http://hu.php.net/manual/en/function.htmlentities.php#68479 )

And the code is:
<?
function unhtmlentitiesUtf8($string) {
    // replace numeric entities
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
    // replace literal entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    // changing translation table to UTF-8
    foreach( $trans_tbl as $key => $value ) {
        $trans_tbl[$key] = iconv( 'ISO-8859-1', 'UTF-8', $value );
    }
    return strtr($string, $trans_tbl);
}
?>

If you need this in production code, I suggest putting $trans_tbl into a common include file; I think it should be faster. (Maybe the easiest way to do this is to write, after the translation: die(var_export($trans_tbl, true)); and copy & paste the source of the displayed text. And don't forget to check that the browser uses the UTF-8 code page! ;)
elektronaut gmx.net
10.01.2007 14:11
I made my own fix to allow numerical entities in utf8 in php4...

<?
    function utf8_replaceEntity($result){
        $value = (int)$result[1];
        $string = '';

        // plain ASCII needs no multi-byte encoding
        if ($value < 128) return chr($value);

        // number of UTF-8 bytes needed (the original estimate, round(pow($value,1/8)),
        // overshoots for some values such as 61607 and yields overlong sequences)
        $len = ($value < 2048) ? 2 : (($value < 65536) ? 3 : 4);

        for($i=$len;$i>0;$i--){
            $part = ($value & (255>>2)) | pow(2,7);
            if ( $i == 1 ) $part |= 255<<(8-$len);

            // keep only the low 8 bits for chr()
            $string = chr($part & 255) . $string;

            $value >>= 6;
        }

        return $string;
    }

    function utf8_html_entity_decode($string){
        return preg_replace_callback(
            '/&#([0-9]+);/u',
            'utf8_replaceEntity',
            $string
        );
    }

    $string = '&#8217;&#8216; &#8211; &#8220; &#8221;'
        .'&#61607; &#263; &#324; &#345;'
    ;
    $string = utf8_html_entity_decode($string);

    header('Content-Type: text/html; charset=UTF-8');
    echo '<li>'.$string;
?>
inco
28.12.2006 21:26
@ romekt:

iconv could not be implemented, so alternatively use utf8_decode and utf8_encode to solve the utf-8 / iso-8859-1 problem
jojo
4.11.2006 5:27
This decodes strings that were encoded by JavaScript's escape() function.
It is useful when multi-byte characters are used on the page.

javascript escape('aaああaa') ..... 'aa%u3042%u3042aa'
php  jsEscape_decode('aa%u3042%u3042aa')..'aaああaa'

<?
function jsEscape_decode($jsEscaped,$outCharCode='SJIS'){
    $arrMojis = explode("%u",$jsEscaped);
    for ($i = 1;$i < count($arrMojis);$i++){
        $c = substr($arrMojis[$i],0,4);
        $cc = mb_convert_encoding(pack('H*',$c),$outCharCode,'UTF-16');
        $arrMojis[$i] = substr_replace($arrMojis[$i],$cc,0,4);
    }
    return implode('',$arrMojis);
}
?>
romekt at CUTTHISgmail dot com
1.09.2006 23:15
here's a simple workaround for the UTF-8 support problem

$var=iconv("UTF-8","ISO-8859-1",$var);
$var=html_entity_decode($var, ENT_QUOTES, 'ISO-8859-1');
$var=iconv("ISO-8859-1","UTF-8",$var);
derernst at gmx dot ch
1.08.2006 12:09
Combining the suggestions by buraks78 at gmail dot com, gaui at gaui dot is, daniel at brightbyte dot de, and the version in PEAR_PHP_Compat, I come to the following, which should work in an UTF-8 environment, with PHP < or > 4.3:

<?php
function decode_entities($text, $quote_style = ENT_COMPAT) {
    if (function_exists('html_entity_decode')) {
        $text = html_entity_decode($text, $quote_style, 'ISO-8859-1'); // NOTE: UTF-8 does not work!
    }
    else {
        $trans_tbl = get_html_translation_table(HTML_ENTITIES, $quote_style);
        $trans_tbl = array_flip($trans_tbl);
        $text = strtr($text, $trans_tbl);
    }
    $text = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $text);
    $text = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $text);
    return $text;
}
?>

Note that I omitted the line
$trans_table['&#39;'] = "'";
as it would override the quote_style setting and thus lead to unexpected results for quote_styles ENT_NOQUOTES and ENT_COMPAT.
grvg (at) free (dot) fr
29.07.2006 18:44
Here are the ultimate functions to convert HTML entities to UTF-8:
The main function is htmlentities2utf8
Others are helper functions

function chr_utf8($code)
    {
        if ($code < 0) return false;
        elseif ($code < 128) return chr($code);
        elseif ($code < 160) // Remap illegal Windows characters
        {
            if ($code==128) $code=8364;
            elseif ($code==129) $code=160; // not affected
            elseif ($code==130) $code=8218;
            elseif ($code==131) $code=402;
            elseif ($code==132) $code=8222;
            elseif ($code==133) $code=8230;
            elseif ($code==134) $code=8224;
            elseif ($code==135) $code=8225;
            elseif ($code==136) $code=710;
            elseif ($code==137) $code=8240;
            elseif ($code==138) $code=352;
            elseif ($code==139) $code=8249;
            elseif ($code==140) $code=338;
            elseif ($code==141) $code=160; // not affected
            elseif ($code==142) $code=381;
            elseif ($code==143) $code=160; // not affected
            elseif ($code==144) $code=160; // not affected
            elseif ($code==145) $code=8216;
            elseif ($code==146) $code=8217;
            elseif ($code==147) $code=8220;
            elseif ($code==148) $code=8221;
            elseif ($code==149) $code=8226;
            elseif ($code==150) $code=8211;
            elseif ($code==151) $code=8212;
            elseif ($code==152) $code=732;
            elseif ($code==153) $code=8482;
            elseif ($code==154) $code=353;
            elseif ($code==155) $code=8250;
            elseif ($code==156) $code=339;
            elseif ($code==157) $code=160; // not affected
            elseif ($code==158) $code=382;
            elseif ($code==159) $code=376;
        }
        if ($code < 2048) return chr(192 | ($code >> 6)) . chr(128 | ($code & 63));
        elseif ($code < 65536) return chr(224 | ($code >> 12)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
        else return chr(240 | ($code >> 18)) . chr(128 | (($code >> 12) & 63)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
    }

    // Callback for preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $str);
    function html_entity_replace($matches)
    {
        if ($matches[2])
        {
            return chr_utf8(hexdec($matches[3]));
        } elseif ($matches[1])
        {
            return chr_utf8($matches[3]);
        }
        switch ($matches[3])
        {
            case "nbsp": return chr_utf8(160);
            case "iexcl": return chr_utf8(161);
            case "cent": return chr_utf8(162);
            case "pound": return chr_utf8(163);
            case "curren": return chr_utf8(164);
            case "yen": return chr_utf8(165);
            //... etc with all named HTML entities
        }
        return false;
    }
   
    function htmlentities2utf8 ($string) // because of the html_entity_decode() bug with UTF-8
    {
        $string = preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $string);
        return $string;
    }
hurricane at cyberworldz dot org
23.12.2005 5:33
I shortened the function replace_num_entity a bit to make it more understandable and clean. Maybe now someone sees the problem it possibly has... (as mentioned below)

<?php
function replace_num_entity($ord) {
    $ord = $ord[1];
    if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match)) $ord = hexdec($match[1]);
        else $ord = intval($ord);
    $no_bytes = 0;
    $byte = array();
    if ($ord < 128) return chr($ord);
    if ($ord < 2048) $no_bytes = 2;
        else if ($ord < 65536) $no_bytes = 3;
        else if ($ord < 1114112) $no_bytes = 4;
        else return;
    switch($no_bytes) {
        case 2: $prefix = array(31, 192); break;
        case 3: $prefix = array(15, 224); break;
        case 4: $prefix = array(7, 240);
    }
    for ($i=0; $i < $no_bytes; ++$i)
        $byte[$no_bytes-$i-1] = (($ord & (63 * pow(2,6*$i))) / pow(2,6*$i)) & 63 | 128;
    $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];
    $ret = '';
    for ($i=0; $i < $no_bytes; ++$i) $ret .= chr($byte[$i]);
    return $ret;
}
?>
loufoque
8.10.2005 22:15
If you want to decode NCRs to utf-8 use this function instead of chr().

function utf8_chr($code)
{
    if($code<128) return chr($code);
    else if($code<2048) return chr(($code>>6)+192).chr(($code&63)+128);
    else if($code<65536) return chr(($code>>12)+224).chr((($code>>6)&63)+128).chr(($code&63)+128);
    else if($code<2097152) return chr(($code>>18)+240).chr((($code>>12)&63)+128)
                                  .chr((($code>>6)&63)+128).chr(($code&63)+128);
}
emilianomartinezluque at yahoo dot com
26.09.2005 2:22
I've been using the great replace_num_entity function posted below. But there seem to be some problems with the 128 to 160 character range. For example, try:

<?php header("Content-type: text/html; charset=utf-8"); ?>
<html><body>
<?php
for($x=128; $x<161; $x++) {
      echo('&#' . $x . '; -- ' . preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', '&#' . $x . ';') . '<br />');
}
?>
</body></html>

I really don't know the reason for this (since according to the UTF-8 specs the function should have worked) but I wrote a modified version of the function to address this. Hope it helps.

function replace_num_entity($ord)
   {
       $ord = $ord[1];
       if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
       {
           $ord = hexdec($match[1]);
       }
       else
       {
           $ord = intval($ord);
       }
     
       $no_bytes = 0;
       $byte = array();

        if($ord == 128) {
            return chr(226).chr(130).chr(172);
        } elseif($ord == 129) {
            return chr(239).chr(191).chr(189);
        } elseif($ord == 130) {
            return chr(226).chr(128).chr(154);
        } elseif($ord == 131) {
            return chr(198).chr(146);
        } elseif($ord == 132) {
            return chr(226).chr(128).chr(158);
        } elseif($ord == 133) {
            return chr(226).chr(128).chr(166);
        } elseif($ord == 134) {
            return chr(226).chr(128).chr(160);
        } elseif($ord == 135) {
            return chr(226).chr(128).chr(161);
        } elseif($ord == 136) {
            return chr(203).chr(134);
        } elseif($ord == 137) {
            return chr(226).chr(128).chr(176);
        } elseif($ord == 138) {
            return chr(197).chr(160);
        } elseif($ord == 139) {
            return chr(226).chr(128).chr(185);
        } elseif($ord == 140) {
            return chr(197).chr(146);
        } elseif($ord == 141) {
            return chr(239).chr(191).chr(189);
        } elseif($ord == 142) {
            return chr(197).chr(189);
        } elseif($ord == 143) {
            return chr(239).chr(191).chr(189);
        } elseif($ord == 144) {
            return chr(239).chr(191).chr(189);
        } elseif($ord == 145) {
            return chr(226).chr(128).chr(152);
        } elseif($ord == 146) {
            return chr(226).chr(128).chr(153);
        } elseif($ord == 147) {
            return chr(226).chr(128).chr(156);
        } elseif($ord == 148) {
            return chr(226).chr(128).chr(157);
        } elseif($ord == 149) {
            return chr(226).chr(128).chr(162);
        } elseif($ord == 150) {
            return chr(226).chr(128).chr(147);
        } elseif($ord == 151) {
            return chr(226).chr(128).chr(148);
        } elseif($ord == 152) {
            return chr(203).chr(156);
        } elseif($ord == 153) {
            return chr(226).chr(132).chr(162);
        } elseif($ord == 154) {
            return chr(197).chr(161);
        } elseif($ord == 155) {
            return chr(226).chr(128).chr(186);
        } elseif($ord == 156) {
            return chr(197).chr(147);
        } elseif($ord == 157) {
            return chr(239).chr(191).chr(189);
        } elseif($ord == 158) {
            return chr(197).chr(190);
        } elseif($ord == 159) {
            return chr(197).chr(184);
        } elseif($ord == 160) {
            return chr(194).chr(160);
        }

       if ($ord < 128)
       {
           return chr($ord);
       }
       elseif ($ord < 2048)
       {
           $no_bytes = 2;
       }
       elseif ($ord < 65536)
       {
           $no_bytes = 3;
       }
       elseif ($ord < 1114112)
       {
           $no_bytes = 4;
       }
       else
       {
           return;
       }

       switch($no_bytes)
       {
           case 2:
           {
               $prefix = array(31, 192);
               break;
           }
           case 3:
           {
               $prefix = array(15, 224);
               break;
           }
           case 4:
           {
               $prefix = array(7, 240);
           }
       }

       for ($i = 0; $i < $no_bytes; $i++)
       {
           $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
       }

       $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

       $ret = '';
       for ($i = 0; $i < $no_bytes; $i++)
       {
           $ret .= chr($byte[$i]);
       }

       return $ret;
   }
florianborn (at) yahoo (dot) de
20.07.2005 12:43
Note that

<?php

echo urlencode(html_entity_decode("&nbsp;"));

?>

will output "%A0" instead of "+".
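
A short sketch of why this happens and one possible way to get a "+" instead. The sample string 'a&nbsp;b' is an assumption, and the charsets are passed explicitly because the default differs between PHP versions.

```php
<?php
// Decoded to ISO-8859-1, &nbsp; is the single byte 0xA0.
$latin1 = urlencode(html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1'));

// Decoded to UTF-8, it is the two bytes 0xC2 0xA0.
$utf8 = urlencode(html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8'));

// If a plain "+" is wanted, replace the entity with a regular space first.
$plus = urlencode(html_entity_decode(str_replace('&nbsp;', ' ', 'a&nbsp;b'), ENT_QUOTES, 'UTF-8'));

echo $latin1, ' ', $utf8, ' ', $plus, "\n";
```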
gaui at gaui dot is
5.07.2005 2:15
if( !function_exists( 'html_entity_decode' ) )
{
    function html_entity_decode( $given_html, $quote_style = ENT_QUOTES ) {
        $trans_table = array_flip(get_html_translation_table( HTML_SPECIALCHARS, $quote_style ));
        $trans_table['&#39;'] = "'";
        return ( strtr( $given_html, $trans_table ) );
       }
}
marius (at) hot (dot) ee
8.04.2005 15:40
To convert html entities into unicode characters, use the following:

        $trans_tbl = get_html_translation_table(HTML_ENTITIES);
        foreach($trans_tbl as $k => $v)
        {
            $ttr[$v] = utf8_encode($k);
        }
   
        $text = strtr($text, $ttr);
php dot net at c dash ovidiu dot tk
18.03.2005 9:37
Quick & dirty code that translates numeric entities to UTF-8.

<?php

    function replace_num_entity($ord)
    {
        $ord = $ord[1];
        if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
        {
            $ord = hexdec($match[1]);
        }
        else
        {
            $ord = intval($ord);
        }

        $no_bytes = 0;
        $byte = array();

        if ($ord < 128)
        {
            return chr($ord);
        }
        elseif ($ord < 2048)
        {
            $no_bytes = 2;
        }
        elseif ($ord < 65536)
        {
            $no_bytes = 3;
        }
        elseif ($ord < 1114112)
        {
            $no_bytes = 4;
        }
        else
        {
            return;
        }

        switch($no_bytes)
        {
            case 2:
            {
                $prefix = array(31, 192);
                break;
            }
            case 3:
            {
                $prefix = array(15, 224);
                break;
            }
            case 4:
            {
                $prefix = array(7, 240);
            }
        }

        for ($i = 0; $i < $no_bytes; $i++)
        {
            $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
        }

        $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

        $ret = '';
        for ($i = 0; $i < $no_bytes; $i++)
        {
            $ret .= chr($byte[$i]);
        }

        return $ret;
    }

    $test = 'This is a &#269;&#x5d0; test&#39;';

    echo $test . "<br />\n";
    echo preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', $test);

?>
Silvan
29.01.2005 4:33
Passing NULL or FALSE as a string will generate a '500 Internal Server Error' (or break the script when inside a function).

So always test your string first before passing it to html_entity_decode().
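
A minimal sketch of such a check. The wrapper name decode_if_string() and its choice to return an empty string for non-string input are assumptions, not part of the comment above; on recent PHP versions, passing null to html_entity_decode() is deprecated rather than fatal.

```php
<?php
// Hypothetical wrapper: only strings are decoded, anything else yields ''.
function decode_if_string($value, int $flags = ENT_QUOTES, string $charset = 'UTF-8'): string
{
    if (!is_string($value)) {
        return '';
    }
    return html_entity_decode($value, $flags, $charset);
}

echo decode_if_string('&lt;p&gt;'), "\n";
```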
daniel at brightbyte dot de
14.11.2004 3:12
This function seems to have two limitations (at least in PHP 4.3.8):

a) it does not work with multibyte character encodings, such as UTF-8
b) it does not decode numeric entity references

a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, then converting to UTF-8 again. But that's quite ugly and destroys all characters not present in Latin-1.

b) can be solved rather nicely using the following code:

<?php
function decode_entities($text) {
    $text= html_entity_decode($text,ENT_QUOTES,"ISO-8859-1"); #NOTE: UTF-8 does not work!
    $text= preg_replace('/&#(\d+);/me',"chr(\\1)",$text); #decimal notation
    $text= preg_replace('/&#x([a-f0-9]+);/mei',"chr(0x\\1)",$text);  #hex notation
    return $text;
}
?>

HTH
aidan at php dot net
14.09.2004 9:57
This functionality is now implemented in the PEAR package PHP_Compat.

More information about using this function without upgrading your version of PHP can be found on the below link:

http://pear.php.net/package/PHP_Compat



The PHP manual text and comments are covered by the Creative Commons Attribution 3.0 License © the PHP Documentation Group