You are viewing our old blog site. For latest posts, please visit us at the new space. Follow our publication there to stay updated with tech articles, tutorials, events & more.

MySQL Data Sharding

0.00 avg. rating (0% score) - 0 votes
Over the time with increased traffic and user base, data increases enormously. Large number of operations on Huge MySQL table becomes a bottleneck for an application.
At Naukri, there was a case where a single myISAM MySQL table had more than 46 million records. Table size on disk was more than 250GB. Operations (select/insert/update/delete) on that table had become a bottleneck due to table locks and having to update large column indexes. With increasing traffic, scalability had become a serious concern.
Another major risk involved in having a single large table is data corruption. Once corrupted, restoration of the entire table would be time consuming and can hamper the application usage for long duration.
Solutions considered
First tried converting table’s storage engine to InnoDB to reduce locking from table level to row level. With this, application can be scaled with higher concurrency. But InnoDB engine has its own limitations. First the table size on disk increased three folds which needed additional hard disk. Second, column index size were still huge, hence table operations (insert/update/delete) were still a bottleneck.
Due to these reasons, horizontal sharding of the main table into smaller chunks was opted.
What is Horizontal Partitioning?
Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
Some of the advantages of Sharding are :
  • Reduction in total number of tuples
  • Reduction in index sizes
  • Split the tables over multiple database servers
Sharding high level details
First challenge was to have a consensus over the shard key basis which table will be split while keeping shards sizes even and data consistent across applications. There were three prime candidates.
  1. Primary key of source database
  2. Primary key of the table to be sharded
  3. Unique id associated with each records across applications
Since the associated primary keys were not consistent across the databases and applications, unique id was chosen for sharding.
Next concern was to decide the shard size, too small a shard will result in a number of tables and may become unmanageable whereas too large a shard may restrict achieving complete benefits of sharding either in terms of data size or traffic handling.
Another factor was, should new entries go into existing shards or incrementally create new shards for upcoming entries (keeping the number of records in a shard fixed). We opted for latter.
Challenges and issues faced
Prime concern while sharding the table was to avoid any impact on the live traffic as well as handle live insert/update/delete on sharded tables. To ensure this, created another parallel database setup for sharding. Handled live insert/update/delete operation on sharded tables through trigger on main table.
Another challenge was to refactor existing code base to work with sharded tables.
Some more Use Cases
Comments (Fixed Shards)
The dataset containing the Comments posted by recruiters has been sharded into 100 shards in Resdex. This is done by sharding the dataset on the basis of Company Ids.
Similar CVs (Dynamic Shards)
Problem : Due to the complexity of the feature, there was many to many mapping between different usernames or tuples. This caused various issues in porting of data and data retrieval to show on different pages of Resdex.
New Architecture Design:
The dataset containing the Similar CVs corresponding to a certain username if first collected in one table. This is then copied to temporary sharded tables. Then the data is moved to the corresponding shards, with the limit of 10 lac UserId Range per shard. A new shard is then dynamically added once the dataset is greater than the shard limit.
  • Sharding of Mysql table resulted in reduced data corruption risk, hence reduced time to restore data in case one of the shard gets corrupted.
  • Select/Insert/update/delete operations became fast due to small sized tables and indexes.
  • Sharded tables offer massive scalability and performance improvements as it can handle higher concurrency collectively with fast select/insert/update/delete.
What next?
Considering the present solution and its advantages, we will be able to deal with enormous traffic with existing infrastructure and no further changes will be required in near future. Still we are open for any change better solution and looking for rock star people who can help us scale Naukri to next level.
Posted in General