PHP documentation: Convert character encoding - function.mb-convert-encoding.html

25 user contributions:

gullevek at gullevek dot org
25.08.2010 9:27


If you want to convert Japanese text to ISO-2022-JP, it is highly recommended to use ISO-2022-JP-MS as the target encoding instead. It includes the extended character set and avoids ? in the output. For example, the often used "1 in a circle" ① is then converted correctly.
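
A minimal sketch of that advice, assuming UTF-8 input and a loaded mbstring extension. The helper name `to_iso2022jp_ms` is made up for this example. It converts a string containing ① to ISO-2022-JP-MS and back to check that the character survives.

```php
<?php
// Hypothetical helper: convert UTF-8 text to ISO-2022-JP-MS.
function to_iso2022jp_ms(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-2022-JP-MS', 'UTF-8');
}

$original = "No. \u{2460}";   // U+2460 is the circled digit one
$encoded  = to_iso2022jp_ms($original);
$restored = mb_convert_encoding($encoded, 'UTF-8', 'ISO-2022-JP-MS');

// If the character survived, the round trip gives back the original string.
var_dump($restored === $original);
```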

regrunge at hotmail dot it
14.05.2010 17:00


I was trying to find the charset of a Norwegian text file (with a lot of ø, æ, å) written on a Mac. I found it this way:

<?php
$text = "A strange string to pass, maybe with some ø, æ, å characters.";

// Interpret the bytes under every encoding mbstring knows and print each result as UTF-8
foreach (mb_list_encodings() as $chr) {
    echo mb_convert_encoding($text, 'UTF-8', $chr) . " : " . $chr . "<br>";
}
?>

The line that looks right tells you the encoding the file was written in.

Hope this helps someone.

Daniel Trebbien
23.07.2009 20:25


Note that `mb_convert_encoding($val, 'HTML-ENTITIES')` does not escape '\'', '"', '<', '>', or '&'.
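
If both kinds of escaping are wanted, one possible approach is to escape the HTML special characters first and then turn the remaining non-ASCII characters into numeric entities. This is only a sketch: the helper name `to_html_ascii` is invented here, UTF-8 input is assumed, and it uses `mb_encode_numericentity()` rather than the 'HTML-ENTITIES' pseudo-encoding (which, as far as I know, newer PHP releases deprecate).

```php
<?php
// Hypothetical helper: escape HTML special characters, then turn non-ASCII
// characters into numeric entities. Assumes $text is valid UTF-8.
function to_html_ascii(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    // Map every code point from U+0080 upward to &#NNN; (offset 0, mask 0x1FFFFF).
    return mb_encode_numericentity($escaped, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

echo to_html_ascii("<b>caf\u{e9}</b> & 'tea'"), "\n";
// prints: &lt;b&gt;caf&#233;&lt;/b&gt; &amp; &#039;tea&#039;
```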

me at gsnedders dot com
19.06.2009 0:06


It appears that when dealing with an unknown "from encoding" the function will both throw an E_WARNING and proceed to convert the string from ISO-8859-1 to the "to encoding".
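
To avoid relying on what happens with an unknown encoding name, the name can be checked before converting. The sketch below is one way to do that. `is_known_encoding` and `convert_checked` are hypothetical names, and returning null for an unknown name is just a choice made for this example.

```php
<?php
// Hypothetical helper: true if $name is an encoding name or alias known to mbstring.
function is_known_encoding(string $name): bool
{
    foreach (mb_list_encodings() as $encoding) {
        if (strcasecmp($encoding, $name) === 0) {
            return true;
        }
        foreach (mb_encoding_aliases($encoding) as $alias) {
            if (strcasecmp($alias, $name) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Hypothetical wrapper: convert only when both encoding names are known.
function convert_checked(string $text, string $to, string $from): ?string
{
    if (!is_known_encoding($to) || !is_known_encoding($from)) {
        return null;
    }
    return mb_convert_encoding($text, $to, $from);
}

var_dump(convert_checked("abc", "UTF-8", "NO-SUCH-ENCODING"));
```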

alexandrefelipemuller at gmail dot com
18.02.2009 21:06


I used this function instead of mb_convert_encoding because mbstring wasn't enabled on my commercial server. It only supports UTF-7 (IMAP), UTF-8 and ISO-8859-1:

<?php
function my_convert_encoding($string, $to, $from)
{
    // Convert the input string to ISO-8859-1
    if ($from == "UTF-8") {
        $iso_string = utf8_decode($string);
    } elseif ($from == "UTF7-IMAP") {
        $iso_string = imap_utf7_decode($string);
    } else {
        $iso_string = $string;
    }

    // Convert the ISO-8859-1 string to the target encoding
    if ($to == "UTF-8") {
        return utf8_encode($iso_string);
    } elseif ($to == "UTF7-IMAP") {
        return imap_utf7_encode($iso_string);
    }
    return $iso_string;
}
?>

chzhang at gmail dot com
5.01.2009 9:34


Instead of ini_set(), you can try this:



mb_substitute_character("none");
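
A small sketch of the effect, assuming UTF-8 input: with the substitute character set to "none", a character that has no equivalent in the target encoding is dropped instead of being replaced by ?. The previous setting is saved and restored so the rest of the script is not affected.

```php
<?php
// Remember the current setting so it can be restored afterwards.
$previous = mb_substitute_character();
mb_substitute_character("none");

// The euro sign (U+20AC) has no ISO-8859-1 equivalent, so it is dropped.
$latin1 = mb_convert_encoding("a\u{20ac}b", "ISO-8859-1", "UTF-8");
var_dump($latin1);

mb_substitute_character($previous);
```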

francois at bonzon point com
11.11.2008 2:05


aaron, to discard unsupported characters instead of printing a ?, you might as well simply set the configuration directive:



mbstring.substitute_character = "none"



in your php.ini. Be sure to include the quotes around none. Or at run-time with



<?php

ini_set('mbstring.substitute_character', "none");

?>

aaron at aarongough dot com
7.11.2008 17:24


My solution below was slightly incorrect, so here is the correct version (I posted at the end of a long day, never a good idea!)



Again, this is a quick and dirty solution to stop mb_convert_encoding from filling your string with question marks whenever it encounters an illegal character for the target encoding. 



<?php
function convert_to($source, $target_encoding)
{
    // detect the character encoding of the incoming file
    $encoding = mb_detect_encoding($source, "auto");

    // escape all of the question marks so we can remove artifacts from
    // the unicode conversion process
    $target = str_replace("?", "[question_mark]", $source);

    // convert the string to the target encoding
    $target = mb_convert_encoding($target, $target_encoding, $encoding);

    // remove any question marks that have been introduced because of illegal characters
    $target = str_replace("?", "", $target);

    // replace the token string "[question_mark]" with the symbol "?"
    $target = str_replace("[question_mark]", "?", $target);

    return $target;
}
?>



Hope this helps someone! (Admins should feel free to delete my previous, incorrect, post for clarity)

-A

Edward
16.09.2008 12:54


If mb_convert_encoding doesn't work for you, and iconv gives you a headache, you might be interested in this free class I found. It can convert almost any charset to almost any other charset. I think it's wonderful and I wish I had found it earlier. It would have saved me tons of headache.



I use it as a fail-safe, in case mb_convert_encoding is not installed. Download it from http://mikolajj.republika.pl/



This is not my own library, so technically it's not spamming, right? ;)



Hope this helps.

StigC
14.08.2008 0:38


For the php-noobs (like me) - working with flash and php.



Here's a simple snippet of code that worked great for me, getting php to show special Danish characters, from a Flash email form:



<?php
// Convert the name from UTF-8 to ISO-8859-1
$escName = mb_convert_encoding($_POST["Name"], "ISO-8859-1", "UTF-8");

// Convert the message from UTF-8 to ISO-8859-1
$escMessage = mb_convert_encoding($_POST["Message"], "ISO-8859-1", "UTF-8");

// Headers.. and so on...
?>

nospam at nihonbunka dot com
16.05.2008 3:51


rodrigo at bb2 dot co dot jp wrote that iconv works better than mb_convert_encoding. I find that when converting from UTF-8 to Shift_JIS,

$conv_str = mb_convert_encoding($str, $toCS, $fromCS);

works, while

$conv_str = iconv($fromCS, $toCS.'//IGNORE', $str);

removes tildes from $str.

katzlbtjunk at hotmail dot com
25.01.2008 13:36


Clean a string for use as a filename by simply replacing all unwanted characters with an underscore (converting to ASCII reduces it to 7-bit first). It removes slightly more characters than necessary. Hope it's useful.



$fileName = 'Test:!"$%&/()=ÖÄÜöäü<<';

echo strtr(mb_convert_encoding($fileName,'ASCII'), 

    ' ,;:?*#!§$%&/(){}<>=`´|\\\'"', 

    '____________________________');
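
An allow-list is an alternative to listing the unwanted characters. The sketch below keeps only ASCII letters, digits, dot, underscore and hyphen, and replaces everything else with an underscore. The function name `sanitize_filename` and the choice of allowed characters are assumptions for this example, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: replace every character outside a small ASCII allow-list
// with an underscore. The /u modifier makes the pattern work per UTF-8 character.
function sanitize_filename(string $name): string
{
    $clean = preg_replace('/[^A-Za-z0-9._-]/u', '_', $name);
    // preg_replace() returns null on invalid UTF-8; fall back to an empty string.
    return $clean === null ? '' : $clean;
}

echo sanitize_filename("Test:!\u{d6}.txt"), "\n";   // prints: Test___.txt
```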

rodrigo at bb2 dot co dot jp
15.01.2008 12:47


For those who can't use mb_convert_encoding() to convert from one charset to another because of an older PHP version, try iconv().

I had this problem converting to a Japanese charset:

$txt=mb_convert_encoding($txt,'SJIS',$this->encode);

And I could fix it by using this:

$txt = iconv('UTF-8', 'SJIS', $txt);

Maybe it's helpful for someone else! ;)

mightye at gmail dot com
13.11.2007 18:24


To petruzanauticoyahoo?com!ar



If you don't specify a source encoding, then it assumes the internal (default) encoding.  ñ is a multi-byte character whose bytes in your configuration default (often iso-8859-1) would actually mean Ã±.  mb_convert_encoding() is upgrading those characters to their multi-byte equivalents within UTF-8.



Try this instead:

<?php

print mb_convert_encoding( "ñ", "UTF-8", "UTF-8" );

?>

Of course this function does no work (for the most part - it can actually be used to strip characters which are not valid for UTF-8).
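
The double-encoding effect and the validity check can be tried out with a few lines. This is only an illustration: it uses `mb_check_encoding()` to test whether a byte string is valid UTF-8, and then shows what happens when UTF-8 bytes are converted as if they were ISO-8859-1.

```php
<?php
$valid   = "\u{f1}";   // ñ as UTF-8 (bytes C3 B1)
$invalid = "a\xFFb";   // the byte 0xFF can never appear in UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));     // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8'));   // bool(false)

// Treating the UTF-8 bytes as ISO-8859-1 converts each byte separately,
// which yields the two characters U+00C3 and U+00B1.
$double = mb_convert_encoding($valid, 'UTF-8', 'ISO-8859-1');
var_dump($double === "\u{c3}\u{b1}");             // bool(true)
```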

volker at machon dot biz
25.09.2007 6:05


Hey guys. For everybody looking for a function that converts an ISO string to UTF-8 or a UTF-8 string to ISO, here's your solution:



public function encodeToUtf8($string) {

     return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));

}



public function encodeToIso($string) {

     return mb_convert_encoding($string, "ISO-8859-1", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));

}



For me these functions are working fine. Give it a try
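
Detection between single-byte encodings is largely a guess, because almost any byte sequence can be read as ISO-8859-1. A simpler variant, sketched below under the assumption that anything which is not valid UTF-8 is ISO-8859-1, checks validity instead of detecting. The function name `to_utf8` is made up for this example.

```php
<?php
// Hypothetical helper; assumes anything that is not valid UTF-8 is ISO-8859-1.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;   // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

var_dump(to_utf8("\xE9") === "\u{e9}");     // ISO-8859-1 é becomes UTF-8 é
var_dump(to_utf8("\u{e9}") === "\u{e9}");   // UTF-8 input is not converted twice
```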

aofg
22.08.2007 3:49


When converting Japanese strings to ISO-2022-JP or JIS on PHP >= 5.2.1, you can use "ISO-2022-JP-MS" instead.

Kishu-Izon (platform-dependent) characters are converted correctly with this encoding, just as with eucJP-win or SJIS-win.

David Hull
20.12.2006 19:52


As an alternative to Johannes's suggestion for converting strings from other character sets to a 7bit representation while not just deleting latin diacritics, you might try this:



<?php

$text = iconv($from_enc, 'US-ASCII//TRANSLIT', $text);

?>



The only disadvantage is that it does not convert "ä" to "ae", but it handles punctuation and other special characters better.

-- 

David

phpdoc at jeudi dot de
5.09.2006 15:46


I'd like to share some code to convert Latin diacritics to their traditional 7bit representation, like, for example,

- à,ç,é,î,... to a,c,e,i,...
- ß to ss
- ä,Ä,... to ae,Ae,...
- ë,... to e,...

(mb_convert "7bit" would simply delete any offending characters).

I might have missed your country's typographic conventions; correct me if so.

<?php
/**
 * @param string $text     line of encoded text
 * @param string $from_enc encoding of $text, e.g. UTF-8, ISO-8859-1
 *
 * @return string 7bit representation
 */
function to7bit($text, $from_enc) {
    $text = mb_convert_encoding($text, 'HTML-ENTITIES', $from_enc);
    $text = preg_replace(
        array('/&szlig;/', '/&(..)lig;/',
              '/&([aouAOU])uml;/', '/&(.)[^;]*;/'),
        array('ss', "$1", "$1".'e', "$1"),
        $text);
    return $text;
}
?>
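
The same idea can also be sketched without the entity detour, using an explicit replacement table and `strtr()`. The table below follows German conventions and is deliberately incomplete. The name `to_ascii_de` is hypothetical, and both the input and this source file are assumed to be UTF-8.

```php
<?php
// Hypothetical helper using German conventions; the map is an example, not complete.
function to_ascii_de(string $utf8): string
{
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        'ß' => 'ss', 'é' => 'e',  'à' => 'a',
        'ç' => 'c',  'î' => 'i',  'ë' => 'e',
    ];
    // With an array argument, strtr() replaces whole substrings, so
    // multi-byte UTF-8 characters are handled as units.
    return strtr($utf8, $map);
}

echo to_ascii_de("Straße über"), "\n";   // prints: Strasse ueber
```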



Enjoy :-)

Johannes

mac.com@nemo
8.07.2006 16:38


For those wanting to convert from $set to MacRoman, use iconv():



<?php
$string = iconv('UTF-8', 'macintosh', $string);
?>



('macintosh' is the IANA name for the MacRoman character set.)

eion at bigfoot dot com
21.02.2006 1:54


Many people below talk about using

<?php
    mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');
?>

to convert non-ASCII text into HTML-readable entities. Because my web server was out of my control, I was unable to set the database character set, and whenever PHP made a copy of my $s variable that it had pulled out of the database, it would automatically convert it to nasty latin1 and not leave it in its beautiful UTF-8 glory.

So [insert Korean characters here] turned into ?????.

I found myself needing to pass by reference (call-time pass-by-reference is of course deprecated/nonexistent in recent versions of PHP), so instead of

<?php
    mb_convert_encoding(&$s, 'HTML-ENTITIES', 'UTF-8');
?>

which worked perfectly until I upgraded, I had to use

<?php
    call_user_func_array('mb_convert_encoding', array(&$s, 'HTML-ENTITIES', 'UTF-8'));
?>

Hope it helps someone else out.

Tom Class
11.11.2005 16:35


Why did you use the PHP HTML encode functions? mbstring has its own encoding for this, which is (as far as I tested it) much more useful:



HTML-ENTITIES



Example:



$text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");

Stephan van der Feest
9.09.2005 13:47


To add to the Flash conversion comment below, here's how I convert back from what I've stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field:



function htmltoflash($htmlstr)

{

  return str_replace("&lt;br /&gt;","\n",

    str_replace("<","&lt;",

      str_replace(">","&gt;",

        mb_convert_encoding(html_entity_decode($htmlstr),

        "UTF-8","ISO-8859-1"))));

}

Stephan van der Feest
9.09.2005 12:50


Here's a tip for anyone using Flash and PHP for storing HTML output submitted from a Flash text field in a database or whatever.



Flash submits its HTML special characters in UTF-8, so you can use the following function to convert those into HTML entity characters:



function utf8html($utf8str)

{

  return htmlentities(mb_convert_encoding($utf8str,"ISO-8859-1","UTF-8"));

}
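
As a possible variant, `htmlentities()` can be told that its input is UTF-8, which makes the intermediate conversion to ISO-8859-1 unnecessary. This is a sketch, not the author's method: `utf8_to_entities` is a hypothetical name, and `ENT_QUOTES` is chosen here so that both quote characters are escaped as well.

```php
<?php
// Hypothetical variant of utf8html(): let htmlentities() read the UTF-8 input directly.
function utf8_to_entities(string $utf8): string
{
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8');
}

echo utf8_to_entities("caf\u{e9} <b>"), "\n";   // prints: caf&eacute; &lt;b&gt;
```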

jamespilcher1 - hotmail
2.02.2004 4:55


Be careful when converting from ISO-8859-1 to UTF-8.

Even if you explicitly specify the character encoding of a page as ISO-8859-1 (via headers and strict XML definitions), Windows 2000 will ignore that and interpret the page in whatever character set it has natively installed.

For example, I wrote char #128 into a page with the character encoding ISO-8859-1, and it displayed in Internet Explorer (and Mozilla) as a euro symbol.

It should have displayed a box, denoting that char #128 has no printable character assigned in ISO-8859-1. The problem was that it was being displayed in "Windows: Western Europe" (my native character set).

This led to confusion when I tried to convert this euro to UTF-8 via mb_convert_encoding().

IE displays UTF-8 correctly, and because PHP correctly converted #128 into a box in UTF-8, IE would show a box.

So all I saw was mb_convert_encoding() converting a euro symbol into a box. It took me a long time to figure out what was going on.
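
The difference can be made visible by converting the single byte 0x80 from both encodings and looking at the resulting UTF-8 bytes. The sketch assumes that the "Windows: Western Europe" character set mentioned above corresponds to what mbstring calls Windows-1252.

```php
<?php
$byte = "\x80";

// As Windows-1252, byte 0x80 is the euro sign U+20AC (UTF-8: E2 82 AC).
$fromWindows = bin2hex(mb_convert_encoding($byte, 'UTF-8', 'Windows-1252'));

// As ISO-8859-1, byte 0x80 is the control character U+0080 (UTF-8: C2 80).
$fromLatin1 = bin2hex(mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'));

echo $fromWindows, "\n";   // prints: e282ac
echo $fromLatin1, "\n";    // prints: c280
```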

lanka at eurocom dot od dot ua
7.02.2003 17:03


Another example of recoding without the mbstring extension enabled.

(Russian koi->win. If the input is already in the win encoding, the function recode() returns the string unchanged.)



<?php
  // Returns 0 for win, 1 for koi
  function detect_encoding($str) {
    $win = 0;
    $koi = 0;

    for ($i = 0; $i < strlen($str); $i++) {
      $c = ord($str[$i]);
      // lowercase Cyrillic letters occupy 224..255 in win and 192..223 in koi
      if ($c >= 224 && $c <= 255) $win++;
      if ($c >= 192 && $c <= 223) $koi++;
    }

    return ($win < $koi) ? 1 : 0;
  }

  // recodes koi to win
  function koi_to_win($string) {

    $kw = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,  184, 185, 186, 187, 188, 189, 190, 191, 254, 224, 225, 246, 228, 229, 244, 227, 245, 232, 233, 234, 235, 236, 237, 238, 239, 255, 240, 241, 242, 243, 230, 226, 252, 251, 231, 248, 253, 249, 247, 250, 222, 192, 193, 214, 196, 197, 212, 195, 213, 200, 201, 202, 203, 204, 205, 206, 207, 223, 208, 209, 210, 211, 198, 194, 220, 219, 199, 216, 221, 217, 215, 218);

    $wk = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,  184, 185, 186, 187, 188, 189, 190, 191, 225, 226, 247, 231, 228, 229, 246, 250, 233, 234, 235, 236, 237, 238, 239, 240, 242,  243, 244, 245, 230, 232, 227, 254, 251, 253, 255, 249, 248, 252, 224, 241, 193, 194, 215, 199, 196, 197, 214, 218, 201, 202, 203, 204, 205, 206, 207, 208, 210, 211, 212, 213, 198, 200, 195, 222, 219, 221, 223, 217, 216, 220, 192, 209);



    // map every byte from 128 upward through the koi-to-win table;
    // a for loop also copes with an empty string
    $end = strlen($string);
    for ($pos = 0; $pos < $end; $pos++) {
      $c = ord($string[$pos]);
      if ($c >= 128) {
        $string[$pos] = chr($kw[$c - 128]);
      }
    }

    return $string;
  }

  function recode($str) {
    $enc = detect_encoding($str);
    if ($enc == 1) {
      $str = koi_to_win($str);
    }

    return $str;
  }
?>
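
For comparison, where mbstring is available the same recoding can be left to mb_convert_encoding(). The sketch below assumes that "koi" means KOI8-R and "win" means Windows-1251. The helper name `koi8r_to_win1251` is invented for this example.

```php
<?php
// Hypothetical helper: recode KOI8-R bytes to Windows-1251 via mbstring.
function koi8r_to_win1251(string $koi): string
{
    return mb_convert_encoding($koi, 'Windows-1251', 'KOI8-R');
}

// 0xC1 is the lowercase Cyrillic letter a in KOI8-R; Windows-1251 stores it as 0xE0.
echo bin2hex(koi8r_to_win1251("\xC1")), "\n";   // prints: e0
```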

A service by Reinhard Neidl - Webprogrammierung.
