MySQL - utf8 is not really utf8

Turns out that utf8 is not really utf8. I was using Drupal 7 and MySQL and started to encounter a strange error when saving some captured user input. This was the reported error symptom

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x98\x8D H...' for column 'field_raw_content_value' at row 1

 

A quick Google search led me (incorrectly) down the path of trying to "clean" my data. My assumption was that the end user was somehow providing corrupt utf8 data. A few sample hits are:

In order to decide the best means to clean my corrupted data, I took a closer look at it. And discovered it was in fact valid utf8 data. Oops.Something was terribly wrong with my world.Why was Drupal and MySQL complaining about valid data?

Much more research revealed this gem:

which documented the real problem:

The reason of this error is

The MYSQL's character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character ...

So MySQL was lying to me. utf8 is not really utf8. utf8 means "utf8 lite". MySQL only supports a subset of utf8 if you use "utf8". What you really need is the new and improved utf8mb4. Sounds a bit like some  shifty marketing speak. Seems like they had a bug and packaged the fix as a new feature?

No big deal, my current version of MySQL supported the real utf8mb4. All I had to do was use utf8mb4 instead of utf8.

This handy link https://mathiasbynens.be/notes/mysql-utf8mb4 provided all the answers on how to really support utf8 in MySQL. For an existing database. That is not Drupal. I was trying to spin up a fresh Drupal install and hence needed to adjust things a bit.

My first utf8mb4 attempt was to alter all of the system variables in my MySQL server via phpMyAdmin. Settings changed and verified via the command line

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

Time to recreate the Drupal instance == total fail. Everything is still utf8. Seems MySQL defaults were overriding my current settings?

Next step was to alter the MySQL my.ini file to explicitly specify each of the settings. Restart MySQL and verify everything looks okay. Recreate the Drupal instance again == total fail. Still utf8. No utf8mb4 to be found.

MySQL was ready and willing. Drupal was not co-operating. Time to dig into Drupal. Which eventually revealed Drupal MySQL does not support full UTF-8 a 4 year old thread that spoke to Drupal 8. More digging revealed this still open thread Drupal 7 MySQL does not support full utf8. Bottom line: Drupal 7 does not support full utf8 on a fresh install.

I confirmed this by hacking core and modifying utf8 to utf8mb4 for table creation. A fresh install failed miserably.

So Drupal 7 currently does not support full utf8. Your choices are use Drupal 8 or restrict your Drupal 7 uft8 data to be utf8 lite data.

A workaround that cleans (corrupts) your data can be found here: https://www.drupal.org/project/strip_utf8mb4