much slower query in production

From: Guillaume Cottenceau <gc(at)mnc(dot)ch>
To: pgsql-performance(at)postgresql(dot)org
Subject: much slower query in production
Date: 2020-02-26 16:17:21
Message-ID: 878skpnx0u.fsf@mnc.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Dear all,

I am facing a much, much slower query in production than on my
development computer using a restored production backup, and I
don't understand why nor I see what I could do to speedup the
query on production :/

Short data model explanation: one table stores tickets, one table
stores multicards; tickets can belong to a multicard through a
many-to-one relationship, or be independant from multicards. The
query I am interested in, will be used to update a new column in
the multicards table, containing the count of related tickets.

On production:

# EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) from multicards;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on multicards (cost=0.00..1455177.30 rows=204548 width=12) (actual time=0.178..1694987.355 rows=204548 loops=1)
SubPlan 1
-> Aggregate (cost=7.07..7.08 rows=1 width=8) (actual time=8.283..8.283 rows=1 loops=204548)
-> Index Only Scan using tickets_multicard_uid on tickets (cost=0.43..7.05 rows=9 width=0) (actual time=1.350..8.280 rows=6 loops=204548)
Index Cond: (multicard_uid = multicards.uid)
Heap Fetches: 1174940
Planning Time: 1.220 ms
Execution Time: 1695029.673 ms

iostat in the middle of this execution:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 479548 143652 79412 5609292 0 0 1986 116 999 1115 2 2 71 25
3 1 479548 134312 79412 5614564 0 0 2386 128 2832 3973 20 2 49 28
0 1 479548 142520 79428 5617708 0 0 1584 232 3440 3716 11 3 58 29
0 1 479548 161184 79024 5597756 0 0 1922 144 1249 1562 1 2 70 27
0 1 479548 161244 79048 5600804 0 0 1556 117 2138 3035 6 2 68 25
0 2 479548 158384 79048 5604008 0 0 1388 402 2970 4320 6 2 66 27

On my development computer:

# EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) from multicards;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on multicards (cost=0.00..538974.69 rows=204548 width=12) (actual time=0.055..451.691 rows=204548 loops=1)
SubPlan 1
-> Aggregate (cost=2.57..2.58 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=204548)
-> Index Only Scan using tickets_multicard_uid on tickets (cost=0.43..2.56 rows=7 width=0) (actual time=0.001..0.001 rows=6 loops=204548)
Index Cond: (multicard_uid = multicards.uid)
Heap Fetches: 0
Planning Time: 0.296 ms
Execution Time: 456.677 ms

The execution time ratio is a huge 3700. I guess the Heap Fetches
difference is the most meaningful here; my understanding would
be, that the index would easily fit in the shared_buffers after a
few subselects, as configuration is 2GB shared_buffers and here's
the index on disk size:

# SELECT relname as index, reltuples as "rows estimate", pg_size_pretty(pg_table_size(quote_ident(relname))) as "on disk size" FROM pg_class, pg_namespace WHERE pg_namespace.oid = pg_class.relnamespace AND relkind = 'i' AND nspname = 'public' AND relname = 'tickets_multicard_uid';
index | rows estimate | on disk size
-----------------------+---------------+--------------
tickets_multicard_uid | 7.2136e+06 | 161 MB

Though it's not too clear for me what "heap fetches" are. It
seems it might be the actual table data fetches (e.g. not an
index fetch), but I don't really know why so many of them are
needed here, and how to reduce that (if that's the explanation
for the longer timing).

It's true that production has constant activity, but not so much
of it, load avg is typically about 0.5, most queries run very
quickly, pg_stat_activity frequently reports no currently running
query; here's iostat during normal production activity, for
reference:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 480712 151112 10144 5664784 1 1 310 69 0 0 9 1 88 3
0 0 480712 151312 10152 5664876 0 0 0 248 4911 7072 9 3 87 0
0 0 480720 165488 10072 5642564 0 4 76 210 2019 2685 6 1 92 1
0 0 480720 165332 10088 5642508 0 0 0 221 3535 17545 25 4 70 1
0 0 480720 144772 10108 5643324 0 0 84 378 3833 5096 11 2 80 7
0 0 480720 143300 10116 5644144 0 0 42 298 3446 4784 6 1 92 1
0 0 480720 143300 10136 5644256 0 0 10 2340 1073 1496 1 1 96 2

Here's also a second susequent run on production for comparison:

# EXPLAIN ANALYZE SELECT transaction_uid, (SELECT COUNT(*) FROM tickets WHERE multicard_uid = multicards.uid) from multicards;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on multicards (cost=0.00..1455177.30 rows=204548 width=12) (actual time=0.101..834176.389 rows=204548 loops=1)
SubPlan 1
-> Aggregate (cost=7.07..7.08 rows=1 width=8) (actual time=4.075..4.075 rows=1 loops=204548)
-> Index Only Scan using tickets_multicard_uid on tickets (cost=0.43..7.05 rows=9 width=0) (actual time=0.624..4.072 rows=6 loops=204548)
Index Cond: (multicard_uid = multicards.uid)
Heap Fetches: 1174941
Planning Time: 0.273 ms
Execution Time: 834209.323 ms

Heap fetches still almost the same, albeit timing divided by two.

Both environments are running postgresql 11.5, with 2GB
shared_buffers. Differences I can think of: production is using
ext4 on drbd on SATA and linux 3.2, dev is using ext4 (no drbd)
on SSD and linux 4.15. I can't believe SSD would explain the
difference alone? If positive, then I know what we should do on
production..

Thanks for any hints/help!

--
Guillaume Cottenceau

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Justin Pryzby 2020-02-26 16:28:09 Re: much slower query in production
Previous Message Thomas Munro 2020-02-26 01:14:47 Re: LDAP with TLS is taking more time in Postgresql 11.5