

MySQL for Statistics – Old Faithful

July 5th, 2011 by Sean Treadway

MySQL turns out to be a good Swiss Army Knife for persistence, if used wisely. Understanding disk access patterns driven by your storage engine is key. Choosing a read or write optimized disk layout will get you very far. We chose a read-optimized disk layout using InnoDB and MySQL for statistics.

While our wheels were spinning trying to find out why our statistics storage patterns were causing MongoDB to thrash our disks, we started looking for an emergency alternative with the technology that we already had: MySQL+InnoDB.

What we knew:

  • We need to persist a log entry for each play
  • The play events are coming in sequentially ordered by time

What we needed:

  • Counts for 5 different dimensions of the plays: by referrer, by listener, by 3rd-party application, by country, and the total count
  • Counts for 3 different time ranges – all time, the currently displayed period, and the previous period (to calculate the percentage change)
  • A time series of the totals for a single dimension over the displayed period

What we had so far (don’t do this at home!):

  • A single log table called plays in our online database with the below schema
  • A secondary index on each dimension used by the statistics pages to build aggregates using the count aggregate function

We posed the question, “what is the minimum we could do with our current InnoDB log table so that it can service our read load?”

CREATE TABLE plays (
  id int(11) NOT NULL AUTO_INCREMENT,
  created_at datetime DEFAULT NULL,
  track_id int(11) DEFAULT NULL,
  user_id int(11) DEFAULT NULL,
  referer text COLLATE utf8_unicode_ci,
  country varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  api_consumer_key varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (id),
  KEY index_plays_on_track_id (track_id),
  KEY index_plays_on_user_id (user_id),
  KEY index_plays_on_created_at (created_at),
  KEY index_plays_on_referer (referer(50))
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

The impending doom of letting this table continue to grow was also approaching fast. It was ticking along at multiple billions of rows, consumed hundreds of gigabytes of data, had no clear shard key, and its indexes added another 80% on top of the data, filling too much storage space on our OLTP databases.

To answer the question “what can we do to serve stats within a second?”, we first had to find the root cause of “why aren’t we serving our current read load?”

Still bruised from getting whacked by the reality stick in our previous experiences, our first step was to trace the entire operation of a typical query from request to response, paying close attention to where IO operations could occur and assuming that they always would. We no longer assumed that our dataset would reside in memory.

Our typical query looked something like this:

select track_id, country, count(*) as total
from plays
where track_id = 123
and created_at between now() - interval 7 day and now()
group by track_id, country

This was performed over three date intervals: this week, last week, and all time. For the other dimensions, we’d group by the additional column, such as country.
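The shape of this aggregation can be reproduced end to end with Python’s sqlite3 as a stand-in for MySQL (an assumption purely for illustration; the column names follow the schema above, while the track ID, dates, and play rows are made up):

```python
import sqlite3
from datetime import datetime, timedelta

# In-memory stand-in for the plays table (sqlite3 instead of MySQL/InnoDB)
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE plays (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT,
    track_id INTEGER,
    country TEXT)""")

now = datetime(2011, 7, 5)
# Made-up plays: three inside the 7-day window, one 30 days old
rows = [((now - timedelta(days=d)).isoformat(" "), 123, c)
        for d, c in [(1, "DE"), (2, "DE"), (3, "US"), (30, "US")]]
db.executemany(
    "INSERT INTO plays (created_at, track_id, country) VALUES (?, ?, ?)", rows)

# Equivalent of: created_at between now() - interval 7 day and now()
week_ago = (now - timedelta(days=7)).isoformat(" ")
totals = db.execute("""
    SELECT track_id, country, COUNT(*) AS total
    FROM plays
    WHERE track_id = ? AND created_at BETWEEN ? AND ?
    GROUP BY track_id, country
    ORDER BY country""", (123, week_ago, now.isoformat(" "))).fetchall()
print(totals)  # the 30-day-old play falls outside the window
```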

The first thing the query hits is the query planner. An EXPLAIN of the query against a 1-row table reveals this:

select_type: SIMPLE
table: plays
type: ref
possible_keys: index_plays_on_track_id,index_plays_on_created_at
key: index_plays_on_track_id
key_len: 5
ref: const
rows: 1
Extra: Using where; Using temporary; Using filesort

Ah, nice and simple: traverse the track_id index, then load each row, fill a temporary table with the aggregate fields track_id and country, then split that temporary table into chunked buffers, sort each buffer (potentially writing it to disk), and finally perform a merge sort over all the temporary buffers.
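The chunked sort-and-merge at the end of that pipeline can be sketched in a few lines of Python (a simplification with an arbitrary buffer size; a real filesort spills its sort buffers to temporary files when they don’t fit in memory):

```python
import heapq

def filesort(rows, buffer_size=4):
    """Sort rows the way a filesort does: sort fixed-size chunks
    independently (each chunk models one sort buffer, possibly written
    to disk), then k-way merge the sorted runs."""
    runs = []
    for i in range(0, len(rows), buffer_size):
        runs.append(sorted(rows[i:i + buffer_size]))  # one sort buffer
    return list(heapq.merge(*runs))                   # merge all runs

aggregates = [("US", 5), ("DE", 3), ("FR", 7), ("SE", 1), ("GB", 2)]
print(filesort(aggregates))
```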

It looks so obvious now because we have spent the effort to read the MySQL source and documentation, walking through every step and the tuning variables that affect each one. We don’t currently have a DBA on staff to ask, so thankfully we could teach ourselves from the online manual and source code.

Taking a little detour, it’s worth explaining how we interpret the values in the Extra column, one at a time. In reverse:

Using filesort

A filesort doesn’t necessarily mean it goes to files. This article by Sergey Petrunia gives a very good overview of what a filesort actually does, and puts in clear terms how to optimize MySQL ORDER BY clauses.

We were obviously not hitting any group by optimizations with this query.

Using temporary

This is a byproduct of ORDER BY, GROUP BY or SELECT DISTINCT, as described in “How MySQL Uses Internal Temporary Tables”. MySQL must use a temporary table to satisfy these kinds of queries. If monitoring the Created_tmp_disk_tables status variable shows many temporary tables spilling to disk, increase the tmp_table_size and max_heap_table_size variables to prevent “swapping” on your temporary partition.

Using where

This means that the condition of the query cannot be fulfilled solely from the index. The row data must be fetched so that the condition “created_at between ? and ?” can be tested. The number of rows in the rows column will indicate how many rows MySQL guesses it needs to pull out from the storage engine.

How the rows are fetched is specific to the storage engine, and this is where we learned our most important lesson about InnoDB: everything is a B-tree. The InnoDB table itself is a clustered index on the primary key, with the leaves containing the row data. The secondary indexes (the ones you add to your table) are organized by the indexed columns, with their leaves containing the primary key of the row. Reference: the MySQL manual on how MySQL uses indexes.

When you see “Using where”, you are making 2 index lookups: one on the secondary index to find the primary key, and one on the clustered index (the table itself) to find the row data.
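sqlite3 organizes its tables the same way (a clustered B-tree on the rowid, with secondary indexes storing rowids in their leaves), so it makes a convenient stand-in for seeing the difference between a secondary-index lookup that must still visit the table and one that doesn’t (the table and index names here are made up for the demo):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY makes id the clustered rowid key, as in InnoDB
db.execute("""CREATE TABLE plays
    (id INTEGER PRIMARY KEY, track_id INT, user_id INT, country TEXT)""")
db.execute("CREATE INDEX idx_track ON plays (track_id)")
db.execute("CREATE INDEX idx_track_country ON plays (track_id, country)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports which index serves the query
    return " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql))

# user_id is in no secondary index: index lookup + clustered-index lookup
narrow = plan("SELECT user_id FROM plays WHERE track_id = 1")
# country is in idx_track_country: the index alone covers the query
covered = plan("SELECT country FROM plays WHERE track_id = 1")
print(narrow)   # SEARCH ... USING INDEX ...
print(covered)  # SEARCH ... USING COVERING INDEX ...
```

When the plan says “COVERING INDEX”, the second lookup on the clustered index disappears entirely.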

What this means for the plays table is a classic example of how easy it is to optimize for writes or reads, but not both. The plays table is optimized for writes: as plays come in, the surrogate ID gets incremented and each row settles cosily next to the warm data on the right side of the B-tree. A read comes in and performs a quick traversal down a secondary index. What it gets back is a list of primary keys that can range anywhere from 0 to max(id). Looking up those primary keys turns a sequential secondary-index traversal into effectively multiple random lookups on the clustered index: traverse a primary key – load a database page, traverse the next primary key – load another page, seek to the next – load another page, and so on…

In the worst case, we incur 1 seek for every statistic. At about 10ms per seek, and with some tracks having many millions of plays, we effectively try to load almost every page of the table. Now we have the answer to why we’re not able to perform.
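To put numbers on it (the 10ms seek time is from the text above; the play count is a hypothetical popular track):

```python
# Worst case: one random page read (one disk seek) per play counted
seek_ms = 10                  # ~10 ms per seek, as above
plays = 5_000_000             # hypothetical track with 5 million plays
total_seconds = plays * seek_ms / 1000
print(total_seconds / 3600)   # roughly 14 hours of pure seek time
```

Even being off by an order of magnitude, no amount of query tuning makes that answer come back within a second.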

How to fix

What we now know:

  • We have effectively random reads on the full plays table using secondary indexes

What we needed:

  • Sequential reads, by not fetching the row data through our AUTO_INCREMENT primary key

Covering indexes

[...]

