An Efficient Way to Check for Existence of Multiple Values in SQL
In a previous blog post, we’ve advertised the use of SQL EXISTS rather than COUNT(*) to check for existence of a value in SQL. I.e. to check if in the Sakila database, actors called WAHLBERG have played in any films, instead of: Do this: (Depending on your dialect you may require a FROM DUAL clause, … Continue reading An Efficient Way to Check for Existence of Multiple Values in SQL →
An Efficient Way to Check for Existence of Multiple Values in SQL
Posted on February 16, 2024 by lukaseder
In a previous blog post, we’ve advertised the use of SQL EXISTS rather than COUNT(*) to check for existence of a value in SQL.
I.e. to check if in the Sakila database, actors called WAHLBERG have played in any films, instead of:
SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
Do this:
SELECT EXISTS (
SELECT 1 FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
(Depending on your dialect you may require a FROM DUAL clause, or a CASE expression if BOOLEAN types aren’t supported).
Check for multiple rows
But what if you want to check if there are at least 2 (or N) rows? In that case, you cannot use EXISTS, but have to revert to using COUNT(). However, instead of just counting all* matches, why not add a LIMIT clause as well? So, if you want to check if actors called WAHLBERG have played in at least 2 films, instead of this:
SELECT (
SELECT count(*)
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
) >= 2
Write this:
SELECT (
SELECT count(*)
FROM (
SELECT *
FROM actor a
JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG'
LIMIT 2
) t
) >= 2
In other words:
- Run the join query with a
LIMIT 2in a derived table
- Then
COUNT(*)the rows (at most 2) from that derived table
- Finally, check if the count is high enough
Does it matter?
In principle, the optimiser could have figured this out itself, especially because we used a constant to compare the COUNT(*) value with. But did it really apply the transformation?
Let’s check execution plans and benchmark the query on various RDBMS.
PostgreSQL 15
No LIMIT
Result (cost=14.70..14.71 rows=1 width=1) (actual time=0.039..0.039 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Aggregate (cost=14.69..14.70 rows=1 width=8) (actual time=0.037..0.037 rows=1 loops=1)
-> Nested Loop (cost=0.28..14.55 rows=55 width=0) (actual time=0.009..0.032 rows=56 loops=1)
-> Seq Scan on actor a (cost=0.00..4.50 rows=2 width=4) (actual time=0.006..0.018 rows=2 loops=1)
Filter: ((last_name)::text = 'WAHLBERG'::text)
Rows Removed by Filter: 198
-> Index Only Scan using film_actor_pkey on film_actor fa (cost=0.28..4.75 rows=27 width=4) (actual time=0.003..0.005 rows=28 loops=2)
Index Cond: (actor_id = a.actor_id)
Heap Fetches: 0
With LIMIT
Result (cost=0.84..0.85 rows=1 width=1) (actual time=0.023..0.024 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Aggregate (cost=0.83..0.84 rows=1 width=8) (actual time=0.021..0.022 rows=1 loops=1)
-> Limit (cost=0.28..0.80 rows=2 width=240) (actual time=0.016..0.018 rows=2 loops=1)
-> Nested Loop (cost=0.28..14.55 rows=55 width=240) (actual time=0.015..0.016 rows=2 loops=1)
-> Seq Scan on actor a (cost=0.00..4.50 rows=2 width=4) (actual time=0.008..0.008 rows=1 loops=1)
Filter: ((last_name)::text = 'WAHLBERG'::text)
Rows Removed by Filter: 1
-> Index Only Scan using film_actor_pkey on film_actor fa (cost=0.28..4.75 rows=27 width=4) (actual time=0.005..0.005 rows=2 loops=1)
Index Cond: (actor_id = a.actor_id)
Heap Fetches: 0
To understand the difference, focus on these rows:
Before:
Nested Loop (cost=0.28..14.55 rows=55 width=0) (actual time=0.009..0.032 rows=56 loops=1)
After:
Nested Loop (cost=0.28..14.55 rows=55 width=240) (actual time=0.015..0.016 rows=2 loops=1)
In both cases, the estimated number of rows produced by the join is 55 (i.e. all WAHLBERGs are expected to have played in a total of 55 films according to statistics).
But int he second execution the actual rows value is much lower, because we only needed 2 rows before we could stop execution of the operation, because of the LIMIT above.
Benchmark results:
Using our recommended SQL benchmarking technique that compares running two queries many times (5 runs x 2000 executions in this case) on the same instance directly from within the RDBMS using procedural languages (to avoid network latency, etc.), we get these results:
RUN 1, Statement 1: 2.61927
RUN 1, Statement 2: 1.01506
RUN 2, Statement 1: 2.47193
RUN 2, Statement 2: 1.00614
RUN 3, Statement 1: 2.63533
RUN 3, Statement 2: 1.14282
RUN 4, Statement 1: 2.55228
RUN 4, Statement 2: 1.00000 -- Fastest run is 1
RUN 5, Statement 1: 2.53801
RUN 5, Statement 2: 1.02363
The fastest run is 1 units of time, slower runs run in multiples of that time. The complete COUNT(*) query is consistently and significantly slower than the LIMIT query.
Both the plans and benchmark results speak for themselves.
Oracle 23c
With Oracle 23c, we can finally use BOOLEAN types and omit DUAL, yay!
No FETCH FIRST:
SQL_ID 40yy0tskvs1zw, child number 0
-------------------------------------
SELECT /*+GATHER_PLAN_STATISTICS*/ ( SELECT count(*)
FROM actor a JOIN film_actor fa USING (actor_id)
WHERE a.last_name = 'WAHLBERG' ) >= 2
Plan hash value: 2539243977
---------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 0 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 6 |
| 2 | NESTED LOOPS | | 1 | 55 | 56 |00:00:00.01 | 6 |
| 3 | TABLE ACCESS BY INDEX ROWID BATCHED| ACTOR | 1 | 2 | 2 |00:00:00.01 | 2 |
|* 4 | INDEX RANGE SCAN | IDX_ACTOR_LAST_NAME | 1 | 2 | 2 |00:00:00.01 | 1 |
|* 5 | INDEX RANGE SCAN | IDX_FK_FILM_ACTOR_ACTOR | 2 | 27 | 56 |00:00:00.01 | 4 |
| 6 | FAST DUAL | | 1 | 1 | 1 |00:00:00.01 | 0 |
---------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("A"."LAST_NAME"='WAHLBERG')
5 - access("A"."ACTOR_ID"="FA"."ACTOR_ID")
With FETCH FIRST:
SQL_ID f88t1r0avnr7b, child number 0
-------------------------------------
SELECT /*+GATHER_PLAN_STATISTICS*/( SELECT count(*)
from ( select * FROM actor a JOIN
film_actor fa USING (actor_id) WHERE a.last_name =
'WAHLBERG' FETCH FIRST 2 ROWS ONLY ) t )
>= 2
Plan hash value: 4019277616
[...]
---
*[Original source](https://blog.jooq.org/an-efficient-way-to-check-for-existence-of-multiple-values-in-sql/)*