Going UTF-8 (utf8) with PHP & MySQL

UTF-8 is a character encoding standard which supports characters for (nearly) all the languages in the world. Older standards, such as US-ASCII and ISO-8859-1, contain only characters for English (US-ASCII) and Western European languages (ISO-8859-1).

There are a lot of good reasons to use UTF-8, especially if your app will (or may eventually need to) support international users. If you're developing in PHP, there are also a few good reasons not to use UTF-8 right now… at least not until PHP6 comes out. PHP5 doesn't natively support multibyte characters (which UTF-8 uses), so you may have to do some special handling.

If you want to use UTF-8, here is a quick guide to upgrading your LAMP application. First & foremost, you want to ensure that your entire application stack is using UTF-8. That means you're serving pages to the browser in UTF-8, the browser is sending data back in UTF-8, and you're storing data in your database in UTF-8. If some portion of the app stack isn't passing data in UTF-8, then characters in the data stream will be mangled or lost.

Getting the Browser to use UTF-8

You'll want to make sure to tell the browser that you're sending data as UTF-8, and that it should send data back as UTF-8. To do that, you should put in an HTTP header on every page:

header('Content-Type: text/html; charset=utf-8');

and also include a Content-Type meta tag in your actual HTML document:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

Note that some folks recommend that you add an 'accept-charset' attribute to your <form> tags:

<form action="yourposturl" method="post" accept-charset="UTF-8">

but bobince at StackOverflow says that IE doesn't support that, and so it's not a good idea to use it.

PHP - Using the mb_* functions

PHP4/5 treats every string as a sequence of bytes, rather than a sequence of characters. If one char = one byte, that's fine. A function like strlen() will simply count the number of bytes in the string, and return that as the number of characters in the string. So strlen('ab') is 2, as you'd expect.

The trouble comes with UTF-8, which uses anywhere between one and four bytes to represent a single character. Now, one char != one byte… so strlen('汉语') is 6, even though there are only two characters there. Each of those Chinese characters takes up 3 bytes.

To combat this, PHP introduced the mbstring extension, which contains functions to process multibyte strings. For example, if you do: mb_strlen('汉语', 'utf-8'), you get the expected length of 2.

Therefore, the general recommendation is to go through all of your code and replace any standard string function with the mb_* equivalent. strlen() becomes mb_strlen(), split() becomes mb_split() and so on. Note that you don't have to pass the encoding with each call; you can use mb_internal_encoding('utf8'); before calling any string functions to set the encoding that they should all use.
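To make the difference concrete, here is a minimal sketch comparing the byte-based functions with their mb_* counterparts. It assumes the script file itself is saved as UTF-8 and that the mbstring extension is loaded; the variable names are just for illustration.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
mb_internal_encoding('UTF-8');

$text = '汉语';

$byteCount = strlen($text);     // counts bytes
$charCount = mb_strlen($text);  // counts characters, using the internal encoding

// Byte-based substr() can cut a character in half; mb_substr() works on characters.
$firstChar  = mb_substr($text, 0, 1);
$brokenChar = substr($text, 0, 1);

printf("bytes: %d, characters: %d\n", $byteCount, $charCount);  // bytes: 6, characters: 2
printf("first character: %s\n", $firstChar);                    // first character: 汉
var_dump(mb_check_encoding($brokenChar, 'UTF-8'));              // bool(false)
```

The last line shows why mixing the two families is risky: the single byte returned by substr() is no longer valid UTF-8 on its own.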

Other Options in PHP

Replacing all of your string functions is a ton of work, so do you have to? Yes and no:

PHP includes a function overloading feature, which will force PHP to use the mb_* functions whenever you call their non-mb equivalents. Set mbstring.func_overload=7 in php.ini to use it. Problem is, as of this writing in September 2009, this feature is not well tested, and may lead to 'undefined behavior'. General consensus seems to be not to use this feature.

You could also just not do anything, and let PHP treat multibyte strings as a sequence of individual bytes. Depending on what you're doing, this may not be as dumb as it sounds: if you only expect to be passing multibyte data back and forth between a webpage and a database, you'll probably be OK. The PHP WACT site has a great summary on what can break if you use standard PHP string functions with multibyte UTF-8 data. In practice, the biggest issues will likely be around data validation: for example, if you require a string to be no more than 20 characters, and you use the strlen() function to check for this, then a seven-character Chinese string will come back as invalid (7 chars * 3 bytes = 21 bytes).
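The validation problem can be sketched as follows. The helper name within_char_limit() is made up for this example, and the code assumes the file is saved as UTF-8 with mbstring available.

```php
<?php
// Hypothetical helper: true when $value has at most $maxChars characters.
function within_char_limit($value, $maxChars)
{
    return mb_strlen($value, 'UTF-8') <= $maxChars;
}

$name = '汉语汉语汉语汉';  // seven characters, 21 bytes in UTF-8

var_dump(strlen($name) <= 20);           // bool(false): the byte count is 21
var_dump(within_char_limit($name, 20));  // bool(true): the character count is 7
```

If validation is the only place where you count characters, swapping in mb_strlen() there may be enough, without touching the rest of your string handling.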

Regular Expressions & UTF-8

PHP has two types of regular expressions: the POSIX-compliant ereg_* functions, and the Perl-compatible preg_* functions. The ereg_* functions all have multibyte equivalents, e.g. mb_ereg_*. The preg_* functions don't, but you can pass the /u modifier to force them to parse strings in UTF-8 mode. See the Regular Expressions in PHP page for more info.
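As a rough illustration of what the /u modifier changes, the sketch below runs the same pattern with and without it. It assumes the file is saved as UTF-8; the variable names are only for the example.

```php
<?php
$text = '汉语';

// Without /u, "." matches a single byte, so the pattern matches six times.
$withoutU = preg_match_all('/./', $text, $m);

// With /u, "." matches a whole UTF-8 character, so it matches twice.
$withU = preg_match_all('/./u', $text, $m);

// /u also enables Unicode properties such as \p{Han} (Chinese characters).
$isHan = preg_match('/^\p{Han}+$/u', $text);

// With /u, a subject that is not valid UTF-8 makes preg_match() return false.
$invalid = preg_match('/./u', "\xE6");

var_dump($withoutU, $withU, $isHan, $invalid);  // int(6) int(2) int(1) bool(false)
```

That last case is worth checking for in real code: a false return means the input was rejected, which is different from a 0 for "no match".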

PHP 6

In theory, PHP6 will natively support multibyte strings. It is introducing a new string type: you'll be able to have binary strings (like in PHP4/5), and you'll also be able to have multibyte character based strings, which will let all of the standard string functions work properly. As of this writing in September 2009, PHP6 is still under development with no firm timeframe for release.

MySQL

There are two things that you need to worry about when dealing with MySQL & UTF-8.

Setting the connection charset

First, you need to make sure that you set the character set of your connection to be utf8. The exact mechanics of doing that depend on the connection method you're using; here are a few of the common ones:

mysql_set_charset('utf8'); // mysql extension
mysqli_set_charset($link, 'utf8'); // mysqli extension ($link is your open mysqli connection)
$dbAdapterMySQLi->getConnection()->set_charset("utf8"); // Zend DB MySQLi

NOTE: it is VERY important that you use the built-in set_charset() call to change the character set. Many sites recommend simply making a query to the database with “SET NAMES utf8”. The problem with that is that the MySQL extension doesn't know you're passing data as UTF-8, and that means mysql_real_escape_string() will be escaping data using the default latin1 character set. That could open your app up to weird behavior and possibly an SQL-injection vulnerability; see this StackOverflow post for more details.
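Here is one way this could look with the mysqli extension. This is only a sketch: use_utf8_connection() is a hypothetical helper, and the host, credentials and database name in the usage comment are placeholders.

```php
<?php
// Hypothetical helper; $db is assumed to be an open mysqli connection.
function use_utf8_connection($db)
{
    // set_charset() informs the client library as well as the server session,
    // so real_escape_string() escapes according to utf8 afterwards.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not switch the connection to utf8');
    }
    return $db;
}

// Usage (all connection details below are placeholders):
// $db   = use_utf8_connection(new mysqli('localhost', 'user', 'secret', 'my_db'));
// $safe = $db->real_escape_string($_POST['comment']);
```

Calling the helper right after connecting, and before any escaping or querying, keeps the charset setup in one place.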

Creating utf8 tables

Second, you need to make sure your database and database tables are using the utf8 charset. The easiest way to do this is to specify the charset when you create the database and tables:

CREATE DATABASE `my_db` CHARACTER SET = utf8 COLLATE = utf8_general_ci;
CREATE TABLE `my_table` ([TABLE spec]) CHARACTER SET = utf8 COLLATE = utf8_general_ci;

Converting from latin1 to utf8

What if you already have an existing application, and it uses a database with a latin1 charset (ISO 8859-1)? You'll have to convert your database to use utf8. The only way I've found to reliably do this is outlined below; unfortunately it requires taking the DB offline during the conversion. There may be better ways, but nothing I've tried has worked as well as this. As always, make yourself about six backup copies first, just in case!

Dump the database:

mysqldump --default_character_set=latin1 -u root -p my_db > my_db.sql

Delete (drop) the database from the DB server.
Use iconv to convert any latin1 (iso-8859-1) characters to utf8 characters:
```
iconv -f iso-8859-1 -t utf8 my_db.sql > my_db-utf8.sql
```
Use sed to replace any mentions of the latin1 character set with the utf8 character set:
```
sed 's/latin1/utf8/g' < my_db-utf8.sql > my_db-utf8-final.sql
```
Create a new database with the proper UTF-8 character set and collation:
```
CREATE DATABASE `my_db` CHARACTER SET = utf8 COLLATE = utf8_general_ci;
```

Reload your data into the new database:

mysql -u root -p my_db < my_db-utf8-final.sql

UPDATE (6 April 2010): This technique does not appear to properly translate MySQL BLOB (binary) data. mysqldump dumps the blob as a binary string, which is represented in the dumpfile as a series of characters with a latin1 encoding. When iconv comes through, it changes the binary representation of those characters into the UTF-8 encoding, which corrupts the underlying binary data.
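The corruption is easy to reproduce on a small scale. The sketch below uses mb_convert_encoding() as a stand-in for the command-line iconv step (an assumption made for the sake of a self-contained example), and the first four bytes of a PNG file as stand-in BLOB content.

```php
<?php
// The first four bytes of a PNG file, standing in for BLOB content.
$blob = "\x89PNG";

// Roughly what the iconv step does to the dump file: every byte is read as a
// latin1 character and rewritten in UTF-8.
$converted = mb_convert_encoding($blob, 'UTF-8', 'ISO-8859-1');

echo bin2hex($blob), "\n";       // 89504e47
echo bin2hex($converted), "\n";  // c289504e47
```

The byte 0x89 has turned into the two bytes 0xC2 0x89, so the data is one byte longer and no longer a valid PNG header. Any byte of 0x80 or above in a blob gets expanded the same way.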

The only solution I've found so far is to manually copy the binary data afterwards from the old DB to the new DB, using a multi-database UPDATE query like this:

UPDATE new_db_utf8.mytable SET
    new_db_utf8.mytable.blob_col = (
        SELECT old_db_latin1.mytable.blob_col
        FROM old_db_latin1.mytable
        WHERE old_db_latin1.mytable.id = new_db_utf8.mytable.id
    );