The Performance of Various To-Many Nesting Algorithms
It’s been a while since jOOQ 3.15 has been released with its revolutionary standard SQL MULTISET emulation feature. A thing that has been long overdue and which I promised on twitter a few times is to run a few benchmarks comparing the performance of various approaches to nesting to-many relationships with jOOQ. This article will … Continue reading The Performance of Various To-Many Nesting Algorithms →
The Performance of Various To-Many Nesting Algorithms
Posted on June 9, 2022June 11, 2022 by lukaseder
It’s been a while since jOOQ 3.15 has been released with its revolutionary standard SQL MULTISET emulation feature. A thing that has been long overdue and which I promised on twitter a few times is to run a few benchmarks comparing the performance of various approaches to nesting to-many relationships with jOOQ.
This article will show, that in some real world scenarios on modest data set sizes, jOOQ’s MULTISET emulation performs about as well as
- manually running single join queries and manually deduplicating results
- manually running multiple queries per nest level and matching results in the client
In contrast, all of the above perform much better than the dreaded N+1 “approach” (or rather, accident), all the while being much more readable and maintainable.
The conclusion is:
- For jOOQ users to freely use
MULTISETwhenever smallish data sets are used (i.e. a nested loop join would be OK, too) - For jOOQ users to use
MULTISETcarefully where large-ish data sets are used (i.e. a hash join or merge join would be better, e.g. in reports) - For ORM vendors to prefer the multiple queries per nest level approach in case they’re in full control of their SQL to materialise predefined object graphs
Benchmark idea
As always, we’re querying the famous Sakila database. There are two types of queries that I’ve tested in this benchmark.
A query that double-nests child collections (DN = DoubleNesting)
The result will be of the form:
record DNCategory (String name) {}
record DNFilm (long id, String title, List<DNCategory> categories) {}
record DNName (String firstName, String lastName) {}
record DNActor (long id, DNName name, List<DNFilm> films) {}
So, the result will be actors and their films and their categories per film. If a single join is being executed, this should cause a lot of duplication in the data (although regrettably, in our test data set, each film only has a single category)
A query that nests two child collections in a single parent (MCC = Multiple Child Collections)
The result will be of the form:
record MCCName (String firstName, String lastName) {}
record MCCActor (long id, MCCName name) {}
record MCCCategory (String name) {}
record MCCFilm (
long id,
String title,
List<MCCActor> actors,
List<MCCCategory> categories
) {}
So, the result will be films and their actors as well as their categories. This is hard to deduplicate with a single join, because of the cartesian product between ACTOR × CATEGORY. But other approaches with multiple queries still work, as well as MULTISET, of course, which will be the most convenient option
Data set size
In addition to the above distinction of use-cases, the benchmark will also try to pull in either:
- The entire data set (we have 1000
FILMentries as well as 200ACTORentries, so not a huge data set), where hash joins tend to be better - Only the subset for either
ACTOR_ID = 1orFILM_ID = 1, respectively, where nested loop joins tend to be better
The expectation here is that a JOIN tends to perform better on larger result sets, because the RDBMS will prefer a hash join algorithm. It’s unlikely the MULTISET emulation can be transformed into a hash join or merge join, given that it uses JSON_ARRAYAGG which might be difficult to transform into something different, which is still equivalent.
Benchmark tests
The following things will be benchmarked for each combination of the above matrix:
- A single
MULTISETquery with its 3 available emulations usingXML(where available),JSON,JSONB - A single
JOINquery that creates a cartesian product between parent and children - An approach that runs 2 queries fetching all the necessary data into client memory, and performs the nesting in the client, thereafter
- A naive N+1 “client side nested loop join,” which is terrible but not unlikely to happen in real world client code, either with jOOQ (less likely, but still possible), or with a lazy loading ORM (more likely, because “accidental”)
The full benchmark logic will be posted at the end of this article.
1. Single MULTISET query (DN)
The query looks like this:
return state.ctx.select(
ACTOR.ACTOR_ID,
// Nested record for the actor name
row(
ACTOR.FIRST_NAME,
ACTOR.LAST_NAME
).mapping(DNName::new),
// First level nested collection for films per actor
multiset(
select(
FILM_ACTOR.FILM_ID,
FILM_ACTOR.film().TITLE,
// Second level nested collection for categories per film
multiset(
select(FILM_CATEGORY.category().NAME)
.from(FILM_CATEGORY)
.where(FILM_CATEGORY.FILM_ID.eq(FILM_ACTOR.FILM_ID))
).convertFrom(r -> r.map(mapping(DNCategory::new)))
.from(FILM_ACTOR)
.where(FILM_ACTOR.ACTOR_ID.eq(ACTOR.ACTOR_ID))
).convertFrom(r -> r.map(mapping(DNFilm::new))))
.from(ACTOR)
// Either fetch all data or filter ACTOR_ID = 1
.where(state.filter ? ACTOR.ACTOR_ID.eq(1L) : noCondition())
.fetch(mapping(DNActor::new));
To learn more about the specific MULTISET syntax and the ad-hoc conversion feature, please refer to earlier blog posts explaining the details. The same is true for the implicit JOIN feature, which I’ll be using in this post to keep SQL a bit more terse.
2. Single JOIN query (DN)
We can do everything with a single join as well. In this example, I’m using a functional style to transform the flat result into the doubly nested collection in a type safe way. It’s a bit quirky, perhaps there are better ways to do this with non-JDK APIs. Since I wouldn’t expect this to be performance relevant, I think it’s good enough:
// The query is straightforward. Just join everything from
// ACTOR -> FILM -> CATEGORY via the relationship tables
return state.ctx.select(
FILM_ACTOR.ACTOR_ID,
FILM_ACTOR.actor().FIRST_NAME,
FILM_ACTOR.actor().LAST_NAME,
FILM_ACTOR.FILM_ID,
FILM_ACTOR.film().TITLE,
FILM_CATEGORY.category().NAME)
.from(FILM_ACTOR)
.join(FILM_CATEGORY).on(FILM_ACTOR.FILM_ID.eq(FILM_CATEGORY.FILM_ID))
.where(state.filter ? FILM_ACTOR.ACTOR_ID.eq(1L) : noCondition())
// Now comes the tricky part. We first use JDK Collectors to group
// results by ACTOR
.collect(groupingBy(
r -> new DNActor(
r.value1(),
new DNName(r.value2(), r.value3()),
// dummy FILM list, we can't easily collect them here, yet
null
// For each actor, produce a list of FILM, again with a dummy
// CATEGORY list as we can't collect them here yet
groupingBy(r -> new DNFilm(r.value4(), r.value5(), null))
// Set<Entry<DNActor, Map<DNFilm, List<Record6<...>>>>>
.entrySet()
.stream()
// Re-map the DNActor record into itself, but this time, add the
// nested DNFilm list.
.map(a -> new DNActor(
a.getKey().id(),
a.getKey().name(),
a.getValue()
.entrySet()
.stream()
.map(f -> new DNFilm(
f.getKey().id(),
f.getKey().title(),
f.getValue().stream().map(
c -> new DNCategory(c.value6())
).toList()
.toList()
.toList();
Possibly, this example could be improved to avoid the dummy collection placeholders in the first collect() call, although that would probably require additional record types or structural tuple types like those from jOOλ. I kept it “simple” for this example, though I’ll take your suggestions in the comments.
3. Two queries merged in memory (DN)
[...]