Disclaimer: note that I have mostly redacted the product's internal details. In places I have also used simplified explanations of MySQL indexes, especially where detailed explanations already exist elsewhere on the web. I have provided links where appropriate.
In the rest of this part 1 post I first give a short overview of how we do certain things at Productive. Then I briefly explain what deadlocks are, followed by the approach we took to fix them. In part 2, which will follow, I will show a different approach we took to resolve deadlocks in our end-to-end environment.
Asynchronous jobs in Productive
At Productive we use Sidekiq for scheduling asynchronous jobs. Some of these jobs send emails to users when they do not track their time. Others are batch jobs which need to update financial data, copy certain objects in a transaction-safe manner, and so on. A few months ago we started noticing a surge of deadlocks in our application. In particular, these deadlocks started appearing in batched Sidekiq jobs which have to process a large number of relatively heavy database transactions.
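As an illustration of the kind of job involved, here is a minimal sketch of a batched Sidekiq job that updates records inside a single transaction. This is not Productive's actual code; the RecalculateFinancialsJob class, the Budget model and its columns are placeholders.

```ruby
# Sketch only: a batched job that processes many rows in one transaction.
# Class, model and column names are placeholders, not our real schema.
class RecalculateFinancialsJob
  include Sidekiq::Job

  def perform(budget_ids)
    ActiveRecord::Base.transaction do
      Budget.where(id: budget_ids).find_each do |budget|
        budget.update!(recalculated_at: Time.current)
      end
    end
  end
end
```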
What are deadlocks?
Database deadlocks occur when two transactions cannot proceed because each holds a lock that the other one needs. There is nothing inherently wrong with deadlocks, as long as they do not impact your database to the point where certain transactions cannot be processed at all. You do, however, need a mechanism in place to retry deadlocked transactions. It is generally sufficient to retry only the transaction that received the deadlock error, as MySQL rolls back that transaction (the deadlock victim) to release its locks, allowing the other transaction to continue.
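To make the retry mechanism concrete, here is a minimal sketch (not our production code) of retrying a deadlocked transaction in a Rails application. ActiveRecord raises ActiveRecord::Deadlocked when MySQL reports a deadlock (error 1213); the TimeEntry model in the usage example is a placeholder.

```ruby
# Sketch: re-run a transaction a few times if it is picked as the deadlock
# victim. ActiveRecord::Deadlocked is raised for MySQL error 1213.
def with_deadlock_retry(attempts: 3)
  tries = 0
  begin
    ActiveRecord::Base.transaction { yield }
  rescue ActiveRecord::Deadlocked
    tries += 1
    retry if tries < attempts
    raise
  end
end

# Usage (TimeEntry is a placeholder model):
# with_deadlock_retry { TimeEntry.lock.find(42).update!(approved: true) }
```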
Finding the root cause of deadlocks in production is difficult. Deadlocks are hard to reproduce, as they often depend on the throughput of your transactions at any given moment. Enabling extra logging specifically for deadlocks is also tricky, as it produces very large logs. And even when you find the transactions and the locks they hold, an appropriate course of action for mitigation is almost never immediately clear.
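One example of such logging is InnoDB's innodb_print_all_deadlocks option, which writes every detected deadlock to the MySQL error log; with a high deadlock rate that log grows very quickly. A sketch of enabling it from a Rails console (it is a server-wide setting and needs the appropriate privileges):

```ruby
# Log every detected deadlock to the MySQL error log (server-wide setting).
# With many deadlocks the error log can grow very quickly.
ActiveRecord::Base.connection.execute("SET GLOBAL innodb_print_all_deadlocks = ON")
```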
The experiment
Prep work
However, as the number of deadlocks in our application kept increasing month on month, we had to do something to prevent more serious issues. We started by setting up a separate temporary environment which was a very close copy of our production environment. We also made some code changes to the product so we could run "simulations" of the problematic Sidekiq job more easily, and hooked the environment up to our monitoring system so that we could capture logs. As soon as we started mass-running our jobs we began noticing deadlocks.
Deadlock detection
There are several ways to detect deadlocks; we used two. The first is SHOW ENGINE INNODB STATUS, which displays operational information about the InnoDB storage engine, including the most recently detected deadlock.
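As a sketch, this is how you can pull that status from a Rails console (any MySQL client can run the same statement); the part relevant to deadlocks is the LATEST DETECTED DEADLOCK section of the output:

```ruby
# Fetch the InnoDB status. SHOW ENGINE INNODB STATUS returns a single row
# with Type, Name and Status columns; Status is one large text blob that
# contains a "LATEST DETECTED DEADLOCK" section.
status = ActiveRecord::Base.connection.select_one("SHOW ENGINE INNODB STATUS")
puts status["Status"]
```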
The statement only shows the single latest detected deadlock. It prints details about the table on which the deadlock occurred, as well as the records that were locked and the type of locks used; this information is reported for both transactions that participated in the deadlock.
The second place to find out about deadlocks is the performance_schema database, which keeps records of all granted database locks. Its data_locks table shows all currently granted locks; the data in this table is "live", meaning locks appear and disappear as statements get executed.
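Here is a minimal sketch of querying that table from a Rails console; the column names come from MySQL 8.0's performance_schema and there is nothing application-specific in it:

```ruby
# List the currently granted (and waiting) locks. LOCK_DATA shows the locked
# index record, INDEX_NAME shows which index the lock was taken on.
locks = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT OBJECT_NAME, INDEX_NAME, LOCK_TYPE, LOCK_MODE, LOCK_STATUS, LOCK_DATA
  FROM performance_schema.data_locks
  ORDER BY OBJECT_NAME, INDEX_NAME
SQL

locks.each do |row|
  puts row.values_at("OBJECT_NAME", "INDEX_NAME", "LOCK_MODE", "LOCK_DATA").join(" | ")
end
```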
From that output you can figure out whether some SQL statements lock too many rows, as this can sometimes lead to deadlocks (at least in our case it did).
Analysing the deadlocks
Analysing the statements involved in deadlocks and looking at the locks they were holding didn't really help us pinpoint the chunks of code responsible for executing them. And even if it had, we would still have needed a way to ensure that all transactions execute statements in the same order to avoid deadlocks.
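For completeness, a common way to enforce such ordering (independent of the index changes described below) is to always lock rows in a deterministic order, for example by primary key. A hedged sketch with a placeholder Budget model:

```ruby
# Deadlock avoidance by consistent lock order: every transaction locks the
# affected rows sorted by id, so two transactions cannot wait on each other
# in opposite order. Budget and recalculated_at are placeholder names.
def update_budgets!(budget_ids)
  ActiveRecord::Base.transaction do
    Budget.where(id: budget_ids).order(:id).lock.each do |budget|
      budget.update!(recalculated_at: Time.current)
    end
  end
end
```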
Looking at the output of SHOW ENGINE INNODB STATUS we found out which indexes were being used and on which tables. The output is fairly straightforward and merely reveals which SQL transactions were involved in a deadlock, which index was used and what type of lock was taken. Every new deadlock overwrites the previous one in the status output, so by refreshing the status you can analyse the deadlocks as they occur.
From the output we observed that MySQL would pick two different indexes on the same table at different times. In both cases the index covered a single column (a simple index). It's worth noting that depending on which index is used, different records end up being locked; this is part of how MySQL ensures data integrity.
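If you want to check which of the candidate indexes MySQL picks for a given statement, you can ask the optimiser directly. The sketch below uses placeholder model, table and column names (Entry, entries, account_id, status), not our real schema:

```ruby
# The `key` column of the EXPLAIN output shows the index MySQL chose; the
# choice can change over time as table statistics change.
puts Entry.where(account_id: 1, status: 2).explain

# The same thing with raw SQL:
ActiveRecord::Base.connection.select_all(
  "EXPLAIN SELECT * FROM entries WHERE account_id = 1 AND status = 2"
).each { |row| p row.slice("table", "possible_keys", "key", "rows") }
```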
Knowing the indexes and type of locks that were causing the deadlocks helped us to further analyse performance_schema, in particular its data_locks table. This table holds live information about all locks being granted. Inside we found the offending locks and what index records were being locked by them (note that in MySQL it's not rows that are being locked, but rather index records).
Index optimisation
Seeing the table helped us realise that by using these two indexes individually, MySQL had to lock wider ranges of index records than was really necessary. For example, our job needed to update one record D, but MySQL locked all records between B and F, as that was the smallest range of index records it could lock. Because more rows than needed were being locked, and the throughput of our Sidekiq jobs was high, this led to a high number of deadlocks.
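To illustrate the point, here is a hedged sketch with placeholder names (entries, status, external_id and amount are not our real schema): when only a single-column index is available for the WHERE condition, MySQL locks every index record it scans, not just the one row that ends up being changed.

```ruby
# Hypothetical illustration, placeholder table/column names. With only a
# simple index on `status`, this UPDATE takes locks on the whole range of
# index records it scans, not only on the one row it actually changes.
ActiveRecord::Base.transaction do
  ActiveRecord::Base.connection.execute(<<~SQL)
    UPDATE entries
    SET amount = amount + 1
    WHERE status = 3 AND external_id = 'D'
  SQL

  # While this transaction is open, performance_schema.data_locks will list
  # X locks on the neighbouring index records as well (the "B to F" range
  # from the example above).
end
```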
As a result of our analysis we concluded that we needed to optimise these two indexes. Since the problematic MySQL statements were using both indexed columns in their WHERE conditions, we decided to drop the two simple indexes and instead opted for a composite index containing both columns. Our hypothesis was that fewer rows would be locked thanks to more fine-grained index records: combining the two columns narrows down the search MySQL has to do to find the records that need to be altered, and consequently fewer rows get locked. By executing the experiment in our testing environment we reduced deadlocks from 50.000 to 0.
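In Rails terms, the change looked roughly like the following migration. The table, column and class names are placeholders rather than our real schema:

```ruby
# Sketch of the index change (placeholder names): drop the two single-column
# indexes and replace them with one composite index covering both columns
# used in the WHERE conditions.
class ReplaceSimpleIndexesWithCompositeIndex < ActiveRecord::Migration[7.0]
  def change
    remove_index :entries, :account_id
    remove_index :entries, :status
    add_index :entries, [:account_id, :status]
  end
end
```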
The second experiment
New deadlocks
After re-running the experiment with our new composite index, we started noticing deadlocks again, but on a different table. The symptoms were very similar to the ones before, so we took a similar approach: we looked at the InnoDB status and the locks, decided to drop the simple indexes, and replaced them with a composite index. Repeating the experiment, we found that the number of deadlocks actually doubled (from about 300 to 600) with the new composite index, so we decided not to keep it on this table.
The increased number of deadlocks in the second experiment was confusing, so we decided to dig deeper. It wasn't until we looked at our database schema that we noticed a BEFORE INSERT database trigger. Inside it there was a statement whose WHERE condition filtered on two columns, and one of these columns held a pattern whose typical value for most of our clients was "{n}".
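If you suspect a trigger is hiding behind your statements, MySQL can list the triggers defined on a table. A small sketch with a placeholder table name:

```ruby
# List triggers defined on a given table (placeholder table name).
triggers = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT TRIGGER_NAME, ACTION_TIMING, EVENT_MANIPULATION, ACTION_STATEMENT
  FROM information_schema.TRIGGERS
  WHERE EVENT_OBJECT_TABLE = 'entries'
SQL

triggers.each do |row|
  puts "#{row['ACTION_TIMING']} #{row['EVENT_MANIPULATION']}: #{row['TRIGGER_NAME']}"
end
```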
Trying composite indexes again
Unlike in our first experiment, where we reduced 50.000 deadlocks to 0 by adding a composite index, the same approach didn't help us here. The problem was the lack of variability in the pattern column (more than 98% of the values were {n}), so MySQL was unable to narrow down the search and lock fewer records. As a result, MySQL had to lock almost the entire table, and deadlocks were inevitable. We couldn't find other columns to form a more suitable index, nor could we change the trigger. But as the number of deadlocks on this second table was relatively low, we decided not to make any changes.
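A quick way to spot this kind of problem is to look at the value distribution of the column (or the index's cardinality estimate). A sketch with placeholder table and column names:

```ruby
# Placeholder table/column names. If one value dominates the column, an index
# on it cannot narrow the range of index records MySQL has to lock.
distribution = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT pattern, COUNT(*) AS cnt
  FROM number_sequences
  GROUP BY pattern
  ORDER BY cnt DESC
  LIMIT 5
SQL

distribution.each { |row| puts "#{row['pattern']}: #{row['cnt']}" }

# SHOW INDEX FROM number_sequences also reports a Cardinality estimate per
# index, which tells a similar story.
```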
Conclusion
In the end we decided to live with some deadlocks. It turned out that in production we end up with around 50 deadlocks when the particular jobs are at their peak, but often far fewer than that. As a matter of fact, we've been monitoring deadlocks for the last couple of months and, week on week, we get about 16 of them at most. However, it's also worth mentioning that optimising indexes might not always be a suitable solution for deadlocks; as shown above, it has to be judged case by case.
What is also important is that having deadlocks is not necessarily a bad thing, as long as you have a strategy to retry transactions and the database can operate normally. All in all, three months in and counting, more than 50.000 deadlocks were resolved by a simple index modification.
To conclude, we found a way to address deadlocks beyond what the official MySQL documentation suggests. Optimising indexes was a cheap but effective way to reduce deadlocks. That said, adding indexes judiciously in the first place pays off even more than having to remove them later because of the issues they cause.