You are viewing our old blog site. For latest posts, please visit us at the new space. Follow our publication there to stay updated with tech articles, tutorials, events & more.

How we converted Latin1 encoding to UTF8 in Mysql DB

0.00 avg. rating (0% score) - 0 votes

The problem statement of converting latin 1 encoding to UTF-8 encoding in mysql  database is something similar to upgrading your linguistic skills from english to any other language in the world.

Sounds impractical!!

Correct, you may find it difficult to learn all the languages in the world as it is practically impossible to learn the rules of so many languages.

But machines can do it easily. You just need to feed it with character set of that language and rules i.e. in technical term collation.

We at naukrigulf deal with middle east countries so our system needs to understand ARABIC along with English. We had this support since inception. But implementation was not so clean and efficient. We used to inform our database at runtime to understand the arabic(default english).

How we used to do it dynamically?

By firing the below query during connection with MYSQL.

SET names utf8.

What was the problem?

  • With the site of our sort which is serving a huge user base(in millions), the cost of firing the above query was too much. QPS on database servers was too high(in 4 digit figure).
  • Extra headache of writing a query on each connection which leads to more code maintenance.
  • Code was not clean

Solution:

We can avoid the above problems  if our mysql server are configured to understand UTF-8 rather than latin1(default).  To configure mysql to understand UTF-8 required to change mysql config and altering whole database to change character encoding, which is relatively tedious task as compared to change settings. This would save us from firing query at run time during each connection.

But the risk of altering the whole database and losing data was too high. A deeper understanding on how these encoding system works was required to mitigate any risk that may corrupt data.

Let’s understand encoding of latin1 and utf-8 .

Latin1:

It is a 8 bit encoding system which has 255 character sets.

Its collation is latin1_general_ci. Any character in this encoding is represented by combination of 0’s and 1’s. with 8 bits set.

utf-8:

It is a multi byte encoding system where each byte is stuffed with some meta information that only utf-8 systems understand.

First Byte : Characters till 127 have information in lower 7 bits and 8th bit is 0(not an information bit) and understand by UTF-8 .

Two Bytes : Any character having ASCII value above 128 till 255 is represented by two bytes with some higher bits of each byte reserved for meta information.

But Latin1 representation of the above character range is done with using 8 bits of  only one Byte.

For more info on how multi bytes (for characters having ASCII values greater than 127)are encoded refer the table attached at the end.

So it is clear from above that these encoding are different.  Can you infer now what was the impacted character set?

Any utf8 character till 7 bits has same encoding like latin(8 bit encoding) and anything above that are the characters that were creating impact.

Above study was for us to understand and mitigate the risk.

How to switch to UTF8 encoding from latin1 in mysql:

  • Identify all latin tables with below query.

SELECT table_name, table_SCHEMA, table_collation,    (DATA_LENGTH+INDEX_LENGTH)/1024/1024/1024 as size FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = ‘BASE TABLE’ and TABLE_COLLATION like ‘%latin1%’ order by (DATA_LENGTH+INDEX_LENGTH)/1024/1024/1024 desc;

  • Change mysql settings to deal with utf8.

[client]

default-character-set=utf8

[mysql]

default-character-set=utf8

[mysqld]

character-set-server=utf8

collation-server=utf8_unicode_ci

skip-character-set-client-handshake

  • Convert the old latin datables to utf-8 by changing character set and collation of each table and database selected above with the below query:

ALTER DATABASE <database> CHARACTER SET utf8 COLLATE utf8_general_ci

ALTER TABLE <database>.<table> CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci
<database>  is database name

<table> is table name

Challenge we faced during table alteration:

InnoDB tables with foreign keys were giving mysql error 1025 while firing above alter table query.

This can be solved by first dropping off the foreign keys from all main and reference tables and then altering tables for character set and collation with above query and then create foreign keys again on the tables.

For detailed implementation refer the post http://stackoverflow.com/questions/29845224/how-to-change-character-set-in-an-innodb-table-with-foreign-keys

Impact of the above activity:

Pros:

  • Reduction of QPS by 30%. And number of queries corresponding to this in absolute terms is in lakhs(10.9 lakhs) for an hour.
  • Clean code.
  • Less maintenance.

Cons:

  • Increase in db size due to change in structure of database. But that was inevitable as utf-8 use extra bits to store meta information.

It looks very simple to convert your DB. But to mention we did this activity on multiple live and slave servers with zero down time.

That required a great team effort ,a night and a constant vigil.

Come join us and help us make a world class multi lingual job search platform.

Extra Bites:

Below is the chart that UTF-8 encoding follows to encode characters.  ‘X’ char represents the bit of ASCII value corresponding to characters. 0’s and 1’s represent the meta info stuffed by utf in bytes.

 

Bits of code point First code point (in hexadecimal) Last code point (in hexadecimal) Bytes in sequence Byte 1 Byte 2 Byte 3 Byte 4 Byte 5 Byte 6
7 U+0000 U+007F (127) 1 0xxxxxxx
11 U+0080 U+07FF 2 110xxxxx 10xxxxxx
16 U+0800 U+FFFF 3 1110xxxx 10xxxxxx 10xxxxxx
21 U+10000 U+1FFFFF 4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
26 U+200000 U+3FFFFFF 5 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
31 U+4000000 U+7FFFFFFF 6 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Source: https://en.wikipedia.org/wiki/UTF-8