Fixing double-encoding by converting into Blob

You will now need to verify that your database does not contain double-encoded data. What does that mean?

It means a situation in which your database fields have Latin1 encoding or collation. However, the data inside those fields is actually UTF-8 data. This does not manifest itself too obviously, since MySQL can trans-encode on the fly to match data given or expected.

UTF-8 data was entered into the database because data is usually entered thru a browser-based WYSIWYG editor. If the page HTML HEAD META specifies (a common setting of):

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Then data passed to the editor will be UTF-8 data. The editor isn't very smart, and just writes UTF-8 data into the database field.

 

What is the difference between ASCII, Latin1, and UTF-8? For the first 8 bits of values, containing English alphabet characters only; nothing. Trouble comes if you wrote a single non-English character. These could not be actually stored in a Latin1-only encoding. In UTF-8, however, they are stored as a variable width (of bits) character. UTF-8 is a variable-width version of 32-bit Unicode. In UTF-8, ASCII and Latin characters are stored as bit values equivalent to ASCII encoding, with 8 bits. For non-Latin characters, two or more sets of 8 bits are used.

 

While for Latin characters the problem of UTF-8 data stored in a Latin1 encoded and / or collated table is not very apparent, it manifests itself usually as the character  sprinkled in the page.

To verify first that it is not just the browser (or the page meta head specification), check that the browser encoding is set to Unicode (UTF-8).

If it is, you DO have a double-encoding problem.

Notice that other websites naively show ways to batch delete that character from the database. That is not a solution, but a patch fix. This patch fix does not repair the double-encoding issue, and it does not detect other strange characters in the database. Additionally, it will do nothing if your website has actual data in non-Latin characters that you wish to recover.

 

For the remainder of this article, I will expand upon the excellent article, "Getting out of MySQL Character Set Hell". However, the article has a few shortcomings, omissions, and the like. It is also paced too quickly for a novice who has little or no experience with MySQL. I will attempt to rectify these shortcomings.

Like the article suggests, it is only possible to detect double-encoding by comparing the character byte width of the content of a database field.

If you are using SSH, you can login into a MySQL server with a command similar to the one you have seen for MySQL dump and import. Just omit the redirect and the filename. For example, for a remotely-hosted (not on localhost) MySQL, the command becomes:

mysql -u USERNAME -h HOSTNAME --password=dbpassword DATABASENAME

If your connection details are correct, your prompt will change to

mysql>

Indicating that you are no longer talking to the Bash shell, but are talking to MySQL server instead. If you wish to quit MySQL at any time, type in

exit

 

You should never work on the original database and / or on a live (production) server. Create a MySQL dump, and import it into a second database. Work to fix the second database. If you like your results, you can just point your CMS to the fixed database, and the original one will be untouched.

Create backups of all databases which you will be working with before the procedure.

 

You should also open up PHPMyAdmin.

I don't have a script for the following process, so it is labor-intensive and mistake-prone.

We need to make a list of all tables in the database which contain large blocks of textual data. Such data is that which is entered thru a WYSIWYG editor. This would usually be database tables for page content, page template, CSS associated with the template, website comments, blog entries, etc. This does require some understanding of CMS operation and organization, but is not so hard to do visually. Simply click on one particular table on the left frame of the PHPMyAdmin window, and on the right frame, click on the Browse tab.

MySQL double-encoded UTF-8 in Latin1 encoded

We are looking for fields containing textual data, and that which contains either strange characters or the  character.

Go thru the database tables, noting the database table names and fields which contain textual data, writing the names down on a sheet of paper.

We can confirm our guess of double-encoding going on by running a MySQL query:

select count(*) from dbprefix_tablename where LENGTH(field) !=CHAR_LENGTH(field);

Where field in the illustration above would be either content_name or menu_text.

If this query returns a zero, then the field in the table specified does not have double-encoded data. Move on.

 

We will now create a temporary table, and place messed-up data in this temporary table.

First, we need to ensure that the table has a PRIMARY KEY. Click on the Structure tab.

Under the list of fields, there's a list of Indexes. Make sure there is an entry for a PRIMARY KEY.

MySQL PHPMyAdmin PRIMARY KEY

If there isn't one, we need to create one with the query:

ALTER TABLE `dbprefix_tablename` ADD `KEY_FIELD_NAME` INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT AFTER `last_field_name`;

Now we can create the temporary table.

create table temptable (select * from dbprefix_tablename where LENGTH(field) != CHAR_LENGTH(field));

View the resulting temporary table you have just created to see what the double-encoded data looks like. Note: to refresh the list of tables in the left frame of the PHPMyAdmin window and have the new table you created appear in the list, click in the left frame area, and refresh the browser.

As an example, here's a tidbit of data from one of my old tables:

content_id type prop_name content

59 string content_en <p>Sub-pages:</p>

<p></p>

<p>Â </...

Notice the A character at the end of that cell that I am trying to get rid of!

 

We will now fix the double-encoding. For details of what these commands do, refer to the article referenced above.

alter table temptable modify temptable.field_name text character set latin1;
alter table temptable modify temptable.field_name blob;
alter table temptable modify temptable.field_name text character set utf8;

This will fix double-encoding of a single field. If there are more than one (content_name and menu_text in the example above), then these three queries need to be repeated for each field to be fixed.

Now run the command:
delete from temptable where LENGTH(content) = CHAR_LENGTH(content);

 

View the resulting table. It should have the  character gone. Non-Latin characters should be converted to meaningful UTF-8 characters, if they actually were writing in a non-English alphabet.

Here is an example of conversion of the tidbit listed above:

content_id type prop_name content

59 string content_en "<p>Sub-pages:</p>

<p></p>

<p> </p…"

 

And the data shown on the illustration above was converted to its proper form of UTF-8 formatted Cyrillic.



We can now overwrite the original table data with the fixed contents of the temporary table.

replace into dbprefix_tablename (select * from temptable);

For this to work, both the old (original) table and the temporary table must have the same PRIMARY KEY.

Note: replace command does two things. First, it deletes the original database record, and then it creates a new record with the data given. From my research, some people suggest that this might cause some problems. There's an Update alternative which does not delete records.

If you had to create a PRIMARY KEY for the original table like I have suggested, you can drop the field containing the PRIMARY KEY now.

After you are done going thru every relevant table, create a second backup.

There is another use for the second backup you have just created. Open it in a UTF-8, no BOM editor like Notepad++. Search for  and see if you have missed tables and fields.

 

Now, inspect your database or your website for any other strange characters present. If there are none, then YOU ARE DONE CONVERTING YOUR DATABASE! Hurray!

Be cheerful! It only took me three weeks to figure all of this out! No you can read the summary and do the fix in a few hours.

 

 

I have prepared a Website Management Commands Master Reference which can aid you in this process.



˅˅˅ Additional valuable information is available at one of the links below: ˅˅˅

 

Did you like the article? Let Google Search know by clicking this button: . Please link to content that you find useful on this website on your own website, forum or blog! You can also comment on this page below, or contact me to ask a question or suggest a topic for me to research. There is a user-editable Wiki available on my website, as well as a Forum that you can contribute to. Site Map.

Page last modified 06-Jan-13 21:02:26 EST
Comments on this page:


Add a comment to this page
Comment Title:
Your Name:
Your Email Address:
Notify me of new comments to this item:
Additional Comments:
This is a captcha-picture. It is used to prevent mass-access by robots. (see: www.captcha.net)