Trouble with Drupal, MySQL and non UTF-8 formats

I just figured out a fix for an annoyingly circuitous, tortuous, and baffling MySQL problem. I am not yet an expert of back-end system configuration, but I can tool around a bit (i.e. I know where the my.cnf, php.ini, etc. files live and when (mostly) to edit them). With that...

On our multi-lingual, multi-site Drupal platform we were recently dealing with what looked to be formatting errors in character encoding. We have a process by which the site admins will upload information in their local language (and therefore character set). We recently added our first fully Cyrillic site for the Russian region. We were finding that the encoding for the string

Шпионка

was being stored (and thus displayed on the site) as

Шпионка


I was able to pinpoint that the issue was occurring at the point when the information was being stored into MySQL.

Then I went through a plethora of possible solutions failed attempts. Some of my thwarted efforts included:

  • Saving the input files in multiple different character encodings
  • Changing the collation values for the affected tables
  • PHP functions and libraries that deal with character formatting. Just a few of those attempted: mb_convert_encoding, utf8_encode, iconv

This included chasing all of the additional tips and tricks posted by other users on the respective manual pages.

I even tried downloading additional text editors, as I found a dis/encouraging post, that my current one, TextMate, didn't yet support the text encoding I (thought) was looking for.


After much reading, googling and numerous head vs. desk showdowns, I finally stumbled along this Drupal.org post. It is a little outdated (Drupal 4.7 was the stable release at the time I think); but with some updates for the currently supported Drupal functions I was able to add the following call right before I insert the data into MySQL:

// make sure format is in UTF format
db_query('SET NAMES "UTF8"');
db_query("SET collation_connection='utf8_general_ci'");
db_query("SET collation_server='utf8_general_ci'");
db_query("SET character_set_client='utf8'");
db_query("SET character_set_connection='utf8'");
db_query("SET character_set_results='utf8'");
db_query("SET character_set_server='utf8'");
// Insert!
drupal_write_record('your_table_name', $insert_var);

(I tried placing these calls into my.cnf, but was unable to get mysqld to restart)

Some additional info that I found useful from this post was in response #18, from which I grabbed a MySQL command-line call that I used to verify that my tables affected were set to UTF8 format:

ALTER TABLE my_table CONVERT TO CHARACTER SET 'utf8'

Also in my search to transcode the gibberish strings that were being stored in MySQL, I found a nifty Cyrillic gibberish translator. It also gives you insight as to what format your input code *may* be transcoded from.


Lesson to take home: ALWAYS use UTF8 encoding!

I am aware that these suggestions may not work for everyone in their respective environments; but I felt obliged to share my experience, if only to save some time with suggestions not to try. :)

Filed under:

Ready to get started?