Emulating Window Functions in MySQL 5.7
One of MySQL 8’s biggest improvements is the support of window functions. As I always said in conferences, there’s SQL before window functions and SQL after window functions. Once you start using them, you’ll use them everywhere. Some of you poor souls are unfortunate enough to be stuck on MySQL 5.7, either of your own … Continue reading Emulating Window Functions in MySQL 5.7 →
Emulating Window Functions in MySQL 5.7
Posted on January 20, 2023May 24, 2024 by lukaseder
One of MySQL 8’s biggest improvements is the support of window functions. As I always said in conferences, there’s SQL before window functions and SQL after window functions. Once you start using them, you’ll use them everywhere.
Some of you poor souls are unfortunate enough to be stuck on MySQL 5.7, either of your own choosing, or because you’re using a clone / fork that is still 5.7 compatible. While for most people, this blog post is just for your amusement, or nostalgia, for some of you this post will be quite useful.
Using local variables
A lot of Stack Overflow questions or blog posts out there show the same old trick using local variables. In a procedural context, local variables make perfect sense. For example, this statement batch.
SET @c = (SELECT COUNT(*) FROM information_schema.tables);
-- More processing
-- Return the result:
SELECT @c;
A bit hairier is the fact that these local variables can be declared within a query, and incremented procedurally within a query:
SELECT
-- Use and increment your variable in SELECT
@rn := @rn + 1
FROM
SELECT 3 AS a UNION ALL
SELECT 4 AS a
) AS t,
-- Declare your variable in FROM
(SELECT @rn := 0) r
ORDER BY a;
And boom, you have a ROW_NUMBER() OVER (ORDER BY a) window function! The result being:
|a |@rn := @rn + 1|
|---|--------------|
|3 |1 |
|4 |2 |
This works quite incidentally, because the expression incrementing the row number “happens to” be evaluated in the desired order, row by row, because of the query’s ORDER BY a clause. Revert it:
SELECT
a, @rn := @rn + 1
FROM (
SELECT 3 AS a UNION ALL
SELECT 4 AS a
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;
And you still get the desired result:
|a |@rn := @rn + 1|
|---|--------------|
|4 |1 |
|3 |2 |
This is really hairy, because it violates the idea of SQL’s logical order of operations, which most RDBMS agree upon. It assumes ORDER BY “happens before” SELECT, just because the optimiser chooses to do things this way. You can tamper with the optimiser and break the “feature” easily, e.g. by adding DISTINCT:
SELECT DISTINCT
a, @rn := @rn + 1
FROM (
SELECT 3 AS a UNION ALL
SELECT 4 AS a
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;
Now the result is no longer what we wanted (how could it possibly be?):
|a |@rn := @rn + 1|
|---|--------------|
|4 |2 |
|3 |1 |
The reason is that DISTINCT is typically implemented using a sort or a hashmap, both will not preserve any ordering, and according to the aforementioned logical order of operations, this is perfectly fine, because ORDER BY is supposed to “happen after” SELECT and after DISTINCT, at least logically.
But if you’re careful, and cover everything with enough tests, you could still use this trick. After all, being stuck with MySQL 5.7 is already painful enough, so why not treat yourself to an “almost window function”.
Note: Just to indicate how much of a bad idea depending on this incidental feature is, MySQL 8.x now issues a deprecation warning:
*Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: ‘SET variable=expression, …’, or ‘SELECT expression(s) INTO variables(s)’.*
The main reason I’ve seen this syntax being used on Stack Overflow so far is to emulate ROW_NUMBER, so, I’d say, good riddance (now that MySQL 8 has window function support)
PARTITION BY using ORDER BY
What I haven’t seen much on Stack Overflow or in blogs, is PARTITION BY support. Most solutions I’ve seen use ORDER BY to implement partitioning, which is fine. For example:
SELECT
a, b,
ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
IF (
@prev = a,
@rn := @rn + 1,
CASE WHEN (@prev := a) IS NOT NULL OR TRUE THEN @rn := 1 END
) AS rn2
FROM (
SELECT 1 AS a, 3 AS b UNION ALL
SELECT 2 AS a, 4 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := 0, @prev := NULL) r
ORDER BY a, b DESC;
Producing:
|a |b |rn1|rn2|
|---|---|---|---|
|1 |5 |1 |1 |
|1 |3 |2 |2 |
|2 |6 |1 |1 |
|2 |4 |2 |2 |
A few notes:
- The desired
PARTITION BYandORDER BYclauses both have to be reflected in the top level query. If you only wanted toORDER BY b DESC, notORDER BY aas well, tough luck. (If you want to play around with this, try removing theROW_NUMBER()function, which also orders stuff bya, implicitly)
- I’ve tried to put all the variable assignment logic into a single expression in order to avoid any extra columns being generated. This makes the expression a bit more ugly than it needed to be.
PARTITION BY using JSON
A more robust, but perhaps slower approach to emulating PARTITION BY would be to maintain a JSON object that keeps track of each partition key’s ROW_NUMBER(), because why not?
Behold this beauty:
SELECT
a, b,
ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
json_extract(
@rn := json_set(
@rn, @path := concat('$."', a, '"'),
(coalesce(json_extract(@rn, @path), 0) + 1)
@path
) AS rn2,
@rn AS debug -- Added for debugging purposes only
FROM (
SELECT 1 AS a, 3 AS b UNION ALL
SELECT 2 AS a, 4 AS b UNION ALL
SELECT 1 AS a, 5 AS b UNION ALL
SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;
Check out the results:
|a |b |rn1|rn2|debug |
|---|---|---|---|--------------------|
|2 |6 |1 |1.0|{"1": 2.0, "2": 1.0}|
|1 |5 |1 |1.0|{"1": 1.0} |
|2 |4 |2 |2.0|{"1": 2.0, "2": 2.0}|
|1 |3 |2 |2.0|{"1": 2.0} |
You can try this on MySQL 5.7 (removing the ROW_NUMBER(), of course), and you’ll see this works perfectly fine! How does it work?
- We start with an empty object
{}in theFROMclause.
- On every row that is incidentally ordered by the
ORDER BY b DESCclause, we’ll extract the row number value for the partition keyPARTITION BY a. This is done with a dynamically created JSON path expressionconcat('$."', a, '"'). For example:$."1"or$."2".
- At first, this value is
NULL, of course, so we turn it to zero withCOALESCE(<expr>, 0).
- We add
1to it
- Then we
JSON_SETthe value back into the object, assigning the result back to@rn.
- Then, we re-extract the value we’ve just calculated
This could be simplified a bit if it wasn’t just a single expression, but since I’m thinking about implementing this emulation in jOOQ (see #14529), I wanted to do the exercise of keeping the projection unchanged (imagine, the jOOQ user writes ROW_NUMBER() with jOOQ, and wants this to “just work”).
Caveats:
- If the
PARTITION BYclause has multiple expressions, then the composite value would have to be used as a key, e.g. using some “impossible” concatenation token (a token that can’t appear in the data set), or a hash value (risking collisions, of course), or an additional lookup, making things quite complicated.
- The
concat('$."', a, '"')expression doesn’t properly quoteayet, in case it contains double quotes.
[...]