The most significant difference between natural sorting and your typical sorting is the order of a string when there are digits involved. For instance, the string “something2” will be placed before the string “something10” under normal sorting, but 10 is more than 2 for that matter.
This default sorting algorithm causes sorting problems with Datatables and we will see 2 ways to solve it, using a natural sort plugin and the data-sort or data-order attribute.
Natural Sort Plugin In Datatables
The documentation for this plugin lives here. This is how it is implemented in my Rails projects that are running on Turbolinks.
First, install the datatables plugin package via the command below:
yarn add datatables.net-plugins
There is quite a bit of complexity in implementing this. The explanation can be found in this article and is largely due to the need to adapt to a turbolinks driven environment.
The key implementation takes place in line 9 and 10. The natural-nohtml type is specified to strip any html during sorting, while the _all value for the target key means to apply natural sort as the default sorting for all columns. More information and configuration options can be found in its documentation.
data-sort Or data-order Attribute
We can use the data-sort or data-order attribute of the table cell to indicate to Datatables to use these value to do the sorting instead of the values in the cell.
This will place the string in the correct numerical order. However, you would have to do the heavy lifting of padding the digits with the appropriate number of 0s.
It is a more useful feature if you want to sort values with a vastly different display value that its actual value that can mess up the sorting order, like date.
<td data-order="1332979200">Thu 29th Mar 12</td>
<td data-order="1354406400">Sun 2nd Dec 12</td>
Under normal sorting, the second <td> element will be placed above the first because ‘S’ comes before ‘T’. However, when we dictate the sort order to be using their epoch timestamp, the story will be different.
I used to work in a web shop / app agency and now as a full stack app and web developer. My work’s focus has always been about churning out application fast and furious. There is not much investment, budget and love to shower upon optimization, security and, in particular, scaling concerns.
With some free time on my hand, I decided to look at the topic that has been bothering me since my first ever Ruby On Rails application: database scaling.
The methods to scale a database can be split into 3 categories.
Application and code
Database design and schema
11. Obliterate N + 1 Queries
This is a common problem that can overwork the database with unnecessary number of queries. It generally lacked the use of JOINs in the queries to prefetch data that will be used eventually in the same request.
Common problem, common solution. Use JOINs to eager load the data beforehand to spare your server multiple trips to the database.
As an application level optimization on the code base, it should be a basic practice for backend developers.
I myself, however, do not practise this all the time. As a word of justice to explain myself and probably a fair number of my fellow confrère, we don’t optimize right away because we have clients who don’t really know what they want and the project requirements were not that all that clear to start with. It just does not make financial sense to spend resources optimizing the project.
From a developer’s point of view and experience building up applications from scratch, the priority is functionality, low budget and speed. Optimization is a bonus we do not paid
10. Optimize Data Type
Data type optimization is an optimization done on the database schema.
Optimize the space needed for a column. An email should not need the full VARCHAR(255) of a typical string data type for example.
Optimizing the space that each column in each table takes will reduce the time taken for the database to get the required data as it will “traverse less bytes”.
Normalization is an optimization done on the database design and schema.
Split out common but less accessed data into separate tables so that there is less computation required when reading the key tables. This is a form of enumeration at the database level. It keeps reading data efficient and thus reduces the load on the database.
However, be careful not to over normalized your tables or it will require a lot of unnecessary JOINs that can quickly bloat up computational needs. An example that I have came across is the normalization of ‘days’ into a table of permanently 7 rows and 2 columns.
Indexing is an optimization done in the database design.
Indexing allows the database to look through a mapping table to find the required row of data in the corresponding table rather than the whole table of data itself.
A mapping table is much lightweight, hence it reduces the time needed to get to a data, freeing up more resources for the database to handle more incoming requests.
Think of it as the table of content in a lengthy web page, book or catalog of grocery. You would do a “control F” to look for the information you want via the table of content rather than read from start to finish until you get to your data. That is indexing essentially.
7. Database Views
Database views is an optimization done on both application and database designlevel.
Database views are stored queries in the database. They can store temporary results and have an index attached to them.
The advantage of using database views over only indexing your tables is that the database now only has to go through the filtered results from the SQL query in the database view, as compared to an index which consist of all results without the filtered.
For example, if a database view has a query that only looks at records for this year, then your database will only be searching the records for only this year. Using only an index, it will have to search through the records since the start of time till this year, which is a lot less efficient. This stackoverflow answer answers it better.
Caching is an optimization done on the database infrastructure level.
Some of the information we display on our websites and apps are derived data from our database. Derived data are raw data from the database that are computed within your application based on business logic. Some examples I can think of are tabulating the total spent by users from an online shop which involves calculating the individual prices of each item, the quantity bought, discounts and miscellaneous fees like taxes and shipping.
These data would not change, given the same raw data and same computing algorithm. So rather than running through the same requests to get the same raw data from the database, and going through the same algorithm, it may make sense to cache it. We typically use cache servers, are not part of the typical databases and are add ons to the infrastructure, to handle this.
Cache servers like Redis stores data in RAM and not on the hard disk like typical databases. The significance of this distinction is that these memory on RAM can be accessed much quickly than those on the hard disk. It is also this exact reason that RAM is much more expensive than memory in the hard disk, thus destroying any idea you might have about using purely cache as the database.
This gives you much to think about what data should be cached and what should not. The art of using cache efficiently to break up the bottlenecks in your application requires much experience and experiment.
You also have to use them wisely because most cache have a limit on how much data you can store in it. Redis, for example, as a key-value store, at the very basic level, has a limit of 512MB for each value.
On top of that, you need to be smart about when and how to auto expire and explicitly expire cached data so that they show the latest data according to your application needs. For example, when there is a change in the raw data in your database, the computation has to be done again since we are talking about new and different input values.
5. Read Replicas
The use of read replicas is an optimization done on the database infrastructure level.
Read replicas involves spinning up more database copies of the master database to handle read loads. This spreads the load up, leaving mainly the write requests to the master database.
Some read requests that required strong consistency still need to go through the master database. This is due to the latency of data propagation from the master database to these read replicas when there are new changes made to the master database.
If your application is write intensive, this may not be the best tool for the job and it will achieve little improvement in performance.
4. Vertical Scaling
Vertical scaling is an optimization done on the database infrastructure level.
This is ultimately a mere stop gap solution as there is a limit on how far this can take us. It is also a costly upgrade for a non future proof solution.
I perceive its main advantage as simply buying us time to prepare for the next level of scaling.
3. Vertical Partitioning
Vertical partitioning is an optimization done in the database infrastructure level.
Disclaimer: I have never experienced doing this, but I believe this is what vertical partitioning is theoretically about and loved to be pointed out if I am wrong about it.
This step is slightly different from what most people perceive of sharding, which is more commonly horizontal sharding that we will cover later. Vertical partitioning is a form of sharding that is easier to implement. I deem it the appetizer for sharding.
It involves splitting columns and even tables into a separate databases or “shards”. This reduces the data in the main database and thus its computing load. It also spreads the traffic, in particular write requests that replicas are not able to solve, to other shards.
Each shard itself can have its own read replica clusters to further reduce the distribute the load.
However, this complexity will seep into the application level as now your application needs to know which database to connect to to write or read whatever data.
2. Hybrid Databases
Vertical partitioning is an optimization done on the database infrastructure level.
Disclaimer: I have never experienced doing this, but I believe this is what vertical partitioning is theoretically and loved to be pointed out if I am wrong about it.
This is a follow up on vertical partitioning. We can use new and more appropriate technologies to the new shards that can manage that part of the application better.
For example, we can use NoSQL databases to handle the historical coordinates of vehicles for a location tracking module. The requirements of this module is places itself more towards the availability and partition tolerance in the CAP theorem. There is no imperative need for ACID properties to be upheld in database transactions, and eventual consistency under the principles of BASE is sufficient for this module, on a very general level. This allow us to utilize the scaling capabilities of the NoSQL database, at the expense of consistency in the data, which is something we can deal with.
That said, not all NoSQL databases are made equal. They do not all sacrifice consistency for availability and scaling. There are many flavors of NoSQL that will fit different requirement of your module and it is all about finding the correct tool for the correct job.
Another example of using hybrid databases is when you have a highly analytical application. We can partition the tables involved in the data computation into a shard, and perform an ETL process to store the data in a more efficient structure in databases that are more appropriate for analytical functions, like Amazon Redshift and Google’s BigQuery. It takes away the computational load from your application and database.
The advantage of this partitioning allow you to scale only the bottlenecks of your application in the most cost productive manner.
An example of this vertical partitioning is done by Airbnb for their chat module. They identified it as a bottle neck in their application and acted accordingly to it. They did not use a different technology for that partition in this case.
Sharding, or horizontal partitioning, is an optimization done on the database infrastructure level.
Disclaimer: I have never experienced doing this, but I believe this is what horizontal partitioning is theoretically and loved to be pointed out if I am wrong about it.
Eventually, some of your tables will have so much data that there is a need to split the rows in the tables into different shards. For example, the first 10 million rows will be, in the same table, moved to a shard located in USA, the next 10 million will be, in the same table, placed in another shard located in Germany, and so on.
Usually at this point of time, you will have a handful of clusters of vertical shards. Horizontally sharding each of these clusters will not be manageable. I believe it is a complex mess to be handling this.
This also bring about new problems like cross shard latency at a global scale and application complexity to route the data to the correct shard. Add in the requirement for data recovery it is time to update your resume, as Mr. Sugu Sougoumarane mentions below, in his talk about the Vitess tool, which will bring me to the next point on this tool as a solution to sharding.
Before you carry out sharding, even for vertical partitions, you may want to consider Vitess. It is a database clustering management system for horizontal scaling to save you the complexity of handling that yourself as you scale so that you can spend your resources on the improving the application itself, which is what ultimately matters.
If I ever get to the point of having to do sharding, at least this will be the first tool that I will research and study more about to tackle the problem.