Scaling a database and why Sharding is Crucial.
Scaling a database is hard and one of the ways of doing it is via Sharding. However, sharding is a complex process. Some companies such as Discord, Notion, and Quora use sharding to scale their databases.
Sharding is a database architecture pattern that evolves from horizontal partitioning. The rows of one table are separated into multiple tables called partitions. The main aim is to reduce the data that needs to be searched for a query. Sharding is of two types:
Logical.
Horizontal partitions of data sitting on the same DB instance or server.
Physical.
Each shard is hosted on a different node or server instance.
One physical shard contains one or more logical shards.
Below is an image illustrating this.
Sharding strategies.
There are multiple sharding strategies. Below are some.
- Key or Hash-based sharding.
Data is partitioned based on a hash function applied to one or more columns in the table. The hash function can use the primary key or a group of columns. This is a great choice if the key is monotonically changing and you want even distribution. Below is what it looks like:
- Range-based sharding.
Partitioning is based on a specific range of values in a particular column. Example - Sharding products table based on price band. This is perhaps the easiest to implement.
Below is what it looks like:
- Directory-based sharding.
This is sharding based on lookUp table. Think of this table as a directory or address book that tells you which shard contains a particular record. For instance, the sharding could be based on the field like the Location of the Customer.
Below is what it looks like:
Benefits of sharding.
Helps achieve Horizontal scaling.
Reduces the Query response times.
Increases the reliability.
Drawbacks of Sharding.
It increases the complexity of data storage.
Shards can get unbalanced over time.
No joining of data across multiple shards.
Instances of When should Shards be used:
I pointed out initially that Sharding is a complex business process and shouldn't be used as the first option when other similar and cheaper alternatives exist. It should only be considered when all other options such as Replication, Indexing, and partitioning are not plausible.
Thank you for your time and see you in the next.
Special credits to @progressivecod on X(Formerly Twitter).