Having spent the last 5 months supporting a multi-language, multi-site project, I've learned quite a bit about multibyte-safe code, so I thought I'd share some key highlights.
In Drupal, everything is UTF-8. Therefore, whenever possible, use the Drupal API rather than PHP's native string functions, because the API handles the majority of UTF-8 issues for you.
A few key functions are:
- converts a string to UTF-8
- safe truncation of multibyte strings. If you split a string at an arbitrary byte offset, you can end up with invalid UTF-8 or unexpected data by cutting a multibyte character in half. This gets particularly interesting if the remnant happens to be a control character.
- This gives you the site's default language. I've found this particularly useful for setting LC_COLLATE for use with strcoll(). By the way, in a multilingual world, strcoll() versus strcmp() can matter a lot!
- Although this is intended as an internal function, it fixes something we see fairly often with imported data: the correct letter is displayed, but in the wrong case.
- the translation function. It's not UTF-8 specific, but it's listed here as a reminder that it's important to use it with text strings, including for comparison.
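To make the truncation problem concrete, here is a minimal sketch in plain PHP rather than the Drupal API. It assumes the mbstring extension is available, and the sample string and the 3-byte limit are made up for illustration.

```php
<?php
// "naïve café" as UTF-8; "ï" and "é" each occupy two bytes.
$text = "na\u{00EF}ve caf\u{00E9}";

// substr() counts bytes, so a 3-byte cut keeps only the first byte of "ï".
$byteCut = substr($text, 0, 3);

// mb_strcut() also works with a byte budget, but it will not end
// in the middle of a multibyte character.
$safeByteCut = mb_strcut($text, 0, 3, 'UTF-8');

// mb_substr() counts characters rather than bytes.
$charCut = mb_substr($text, 0, 3, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8'));     // bool(false): broken sequence
var_dump(mb_check_encoding($safeByteCut, 'UTF-8')); // bool(true)
var_dump($charCut === "na\u{00EF}");                // bool(true): three whole characters
```

The byte-based cut is the one that bites when a database column or an external system has a length limit in bytes, because that is where plain substr() looks like the obvious tool.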
If you can't do what you need to using the Drupal API, there is an excellent short article, Using Multi-Byte Character Sets in PHP (Unicode, UTF-8, etc), that lists the dangerous PHP functions along with suggestions for safe use or replacements. This article should be required reading for all PHP programmers. (I want this on a t-shirt!)
Finally, be aware of how hard it is to determine the character set and encoding of imported data. Because Unicode includes the Latin-1 character set, which in turn includes the ASCII character set, it is impossible to tell which encoding was used for certain strings (e.g., a string containing only ASCII characters is byte-for-byte identical in ASCII, Latin-1, and UTF-8). However, if you encode something that has already been UTF-8 encoded, any non-ASCII characters get encoded a second time and come out garbled. There's no easy answer here unless you can restrict yourself to content that you know has already been UTF-8 encoded. This is almost exclusively a problem when importing data from unknown sources.
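Here is a rough sketch of what double encoding looks like and one way to guard against it. The helper name to_utf8_once() is hypothetical, and it assumes that anything that is not valid UTF-8 is Latin-1, which unknown sources will not always honor. It also relies on the mbstring extension.

```php
<?php
// Hypothetical helper: convert to UTF-8 only if the input is not already valid UTF-8.
// Assumption: any non-UTF-8 input is ISO-8859-1 (Latin-1).
function to_utf8_once(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8 (this includes pure ASCII)
    }
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

$utf8   = "caf\u{00E9}"; // bytes: 63 61 66 C3 A9
$latin1 = "caf\xE9";     // the same word encoded as Latin-1

// Converting text that is already UTF-8 treats C3 and A9 as two Latin-1 characters.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($double), "\n";                // 636166c383c2a9 (displays as "cafÃ©")
var_dump(to_utf8_once($utf8) === $utf8);    // bool(true): left untouched
var_dump(to_utf8_once($latin1) === $utf8);  // bool(true): converted exactly once
var_dump(to_utf8_once("cafe") === "cafe");  // bool(true): ASCII is unaffected
```

This check is only a heuristic. Some Latin-1 byte sequences also happen to be valid UTF-8, so a validity test can still guess wrong, which is exactly why knowing the source encoding beats detecting it.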
If you have any good resources along these lines, please share them in the comments!