Optimizing `WHERE x IN` query

From: Omar Roth <omarroth(at)hotmail(dot)com>
To: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Optimizing `WHERE x IN` query
Date: 2019-07-07 13:43:14
Message-ID: DM5PR18MB1051CD296F9F94AE0D67FD84DFF70@DM5PR18MB1051.namprd18.prod.outlook.com
Lists: pgsql-performance

Hi!

I am attempting to replicate YouTube's subscription feed. Each user has a list
of channel IDs (as text[]) that they are subscribed to. The `users` table
looks like:

```
=# \d users
                     Table "public.users"
      Column       |  Type   | Collation | Nullable | Default
-------------------+---------+-----------+----------+---------
 email             | text    |           | not null |
 subscriptions     | text[]  |           |          |
 feed_needs_update | boolean |           |          |
Indexes:
    "users_pkey" PRIMARY KEY, btree (email)
    "email_unique_idx" UNIQUE, btree (lower(email))
    "users_email_key" UNIQUE CONSTRAINT, btree (email)
```

For storing videos, I have a table `channel_videos` from which I want to
select all videos where the video's `ucid` (channel ID) is in the user's
subscriptions. `channel_videos` looks like:

```
=# \d channel_videos
                      Table "public.channel_videos"
   Column   |           Type           | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
 id         | text                     |           | not null |
 title      | text                     |           |          |
 published  | timestamp with time zone |           |          |
 updated    | timestamp with time zone |           |          |
 ucid       | text                     |           |          |
 author     | text                     |           |          |
 views      | bigint                   |           |          |
Indexes:
    "channel_videos_id_key" UNIQUE CONSTRAINT, btree (id)
    "channel_videos_published_idx" btree (published)
    "channel_videos_ucid_idx" btree (ucid)
```

In case it might help with indexing: a UCID always begins with `UC`, followed
by 22 random characters from `[a-zA-Z0-9_-]` (URL-safe Base64).
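
Since every feed lookup filters on `ucid` with equality and sorts on
`published`, I've been wondering whether a composite index would help. Just a
sketch, not something I've tried yet, and the index name is made up:

```sql
-- Hypothetical covering index for the feed query: equality column first,
-- then the sort column in the order the feed uses. Untested.
CREATE INDEX CONCURRENTLY channel_videos_ucid_published_idx
    ON channel_videos (ucid, published DESC);
```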

Currently, the query I'm using to generate a user's feed is:

```
SELECT * FROM channel_videos WHERE ucid IN (SELECT unnest(subscriptions) FROM
users WHERE email = $1) ORDER BY published DESC;
```
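
For comparison, the same feed can be written as an explicit lateral join
instead of `IN`. This is only an equivalent sketch (assuming `subscriptions`
contains no duplicate IDs, since unlike `IN` a plain join would repeat rows
for duplicates); I haven't measured whether the planner treats it differently:

```sql
SELECT cv.*
FROM users u
CROSS JOIN LATERAL unnest(u.subscriptions) AS s(ucid)  -- one row per subscribed channel
JOIN channel_videos cv ON cv.ucid = s.ucid
WHERE u.email = $1
ORDER BY cv.published DESC;
```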

This works great when `subscriptions` and `channel_videos` are both fairly
small. Unfortunately, `channel_videos` now contains roughly 28M rows, and
generating a feed for a user with 177 subscriptions takes around 40-70
seconds. Here's the relevant `EXPLAIN (ANALYZE, BUFFERS)` output:

```
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM channel_videos WHERE ucid IN
   (SELECT unnest(subscriptions) FROM users WHERE email = 'omarroth')
   ORDER BY published DESC;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=478207.59..478506.73 rows=119656 width=142) (actual time=68599.562..68613.824 rows=46456 loops=1)
   Sort Key: channel_videos.published DESC
   Sort Method: external merge  Disk: 6352kB
   Buffers: shared hit=456 read=13454 dirtied=13 written=456, temp read=794 written=797
   ->  Nested Loop  (cost=55.94..459526.49 rows=119656 width=142) (actual time=0.555..68531.550 rows=46456 loops=1)
         Buffers: shared hit=453 read=13454 dirtied=13 written=456
         ->  HashAggregate  (cost=10.18..11.18 rows=100 width=32) (actual time=0.417..0.850 rows=177 loops=1)
               Group Key: unnest(users.subscriptions)
               Buffers: shared hit=39 read=7
               ->  ProjectSet  (cost=0.41..8.93 rows=100 width=32) (actual time=0.305..0.363 rows=177 loops=1)
                     Buffers: shared hit=39 read=7
                     ->  Index Scan using users_email_key on users  (cost=0.41..8.43 rows=1 width=127) (actual time=0.049..0.087 rows=1 loops=1)
                           Index Cond: (email = 'omarroth'::text)
                           Buffers: shared hit=5 read=4
         ->  Bitmap Heap Scan on channel_videos  (cost=45.76..4583.18 rows=1197 width=142) (actual time=28.835..387.068 rows=262 loops=177)
               Recheck Cond: (ucid = (unnest(users.subscriptions)))
               Heap Blocks: exact=12808
               Buffers: shared hit=414 read=13447 dirtied=13 written=456
               ->  Bitmap Index Scan on channel_videos_ucid_idx  (cost=0.00..45.46 rows=1197 width=0) (actual time=22.255..22.255 rows=263 loops=177)
                     Index Cond: (ucid = (unnest(users.subscriptions)))
                     Buffers: shared hit=291 read=762 written=27
 Planning Time: 1.463 ms
 Execution Time: 68619.316 ms
(23 rows)

```
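
One thing that stands out in the plan is `Sort Method: external merge  Disk:
6352kB`, i.e. the sort spills to disk. As a session-level experiment (the
value is an arbitrary guess, not a recommendation, and I'd expect most of the
time to still go to the bitmap heap scans), `work_mem` could be raised before
re-running the query:

```sql
-- Session-level experiment only; 32MB is an arbitrary guess.
-- The sort above spilled roughly 6MB to disk.
SET work_mem = '32MB';

SELECT * FROM channel_videos
WHERE ucid IN (SELECT unnest(subscriptions) FROM users WHERE email = $1)
ORDER BY published DESC;
```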

Since a feed is expected to stay fairly consistent with what would appear on
YouTube, `channel_videos` receives fairly frequent UPDATEs and INSERTs and is
vacuumed fairly often:

```
=# SELECT * FROM pg_stat_user_tables WHERE relname = 'channel_videos';
-[ RECORD 1 ]-------+------------------------------
relid               | 16444
schemaname          | public
relname             | channel_videos
seq_scan            | 3
seq_tup_read        | 6346042
idx_scan            | 28294649
idx_tup_fetch       | 6605201260
n_tup_ins           | 218485
n_tup_upd           | 13280741
n_tup_del           | 2413
n_tup_hot_upd       | 11241541
n_live_tup          | 24601363
n_dead_tup          | 585451
n_mod_since_analyze | 763200
last_vacuum         | 2019-07-05 17:57:21.34215+00
last_autovacuum     |
last_analyze        | 2019-07-05 17:59:21.013308+00
last_autoanalyze    | 2019-07-04 14:54:02.845591+00
vacuum_count        | 4
autovacuum_count    | 0
analyze_count       | 4
autoanalyze_count   | 2
```

The machine running Postgres has 4 vCPUs, 8GB of RAM (which I expect is the
problem), and a 160GB SSD. `uname -srv`:

```
# uname -srv
Linux 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019
```

I am running Postgres 11.4; `SELECT version()`:

```
PostgreSQL 11.4 (Ubuntu 11.4-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled
by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit
(1 row)
```
