Re: [BUGS] BUG #11500: PRIMARY KEY index not being used

From: Marko Tiikkaja <marko(at)joh(dot)to>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #11500: PRIMARY KEY index not being used
Date: 2025-10-10 14:52:59
Message-ID: CAL9smLD_nZZLmFuYVnfBg268OKYPvNwoMn0O_Nh11nM8ms9LOA@mail.gmail.com
Lists: pgsql-bugs

On Mon, Oct 6, 2025 at 2:35 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> On Mon, 6 Oct 2025 at 23:39, Marko Tiikkaja <marko(at)joh(dot)to> wrote:
> > On the production server the query still consistently took ~100ms even
> > after all that maintenance had been done. Compared to <0.05ms when
> > going through the primary key.
>
> This part surprises me. Is this on a standby server or primary?

It's a primary.

> Are
> you certain that no rows are matching the state = 'WAIT_EVENT' qual?

There are, around 80k rows, but it fluctuates quite a bit.

I spent some time working on my test program to gather more data, so
I'll give this one more go. All of these are from production, so the
schema is slightly different.
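For reference, the rough shape of the setup is something like this (a sketch only: the real schema differs, the index key column is a placeholder since the plans below don't reveal it, and the partial-index predicate is my reading of "WAIT_EVENT (orderstatusid = 3)"):

```sql
-- Partial index covering only rows in the WAIT_EVENT state.
-- "queued_at" is a placeholder; the actual key column is different.
CREATE INDEX index_orders_wait_event ON orders (queued_at)
  WHERE orderstatusid = 3;

-- The query in question: look up one order, but only if it's waiting.
SELECT orderid
FROM orders
WHERE orderid = 7000000000
  AND orderstatusid = 3;
```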

Here's what the plan looks like when things are good:

Index Scan using orders_pkey on orders  (cost=0.58..2.20 rows=1 width=4) (actual time=0.172..0.172 rows=0 loops=1)
  Index Cond: (orderid = '7000000000'::bigint)
  Filter: (orderstatusid = 3)
  Buffers: shared hit=3 read=1
Planning:
  Buffers: shared hit=1
Planning Time: 0.140 ms
Execution Time: 0.178 ms

Then sometimes auto-analyze just doesn't sample enough pages containing
WAIT_EVENT (orderstatusid = 3) rows, so the plan changes into this:

Index Scan using index_orders_wait_event on orders  (cost=0.50..2.09 rows=1 width=4) (actual time=9512.921..9512.921 rows=0 loops=1)
  Filter: (orderid = '7000000000'::bigint)
  Rows Removed by Filter: 88520
  Buffers: shared hit=161591 read=6066 dirtied=139 written=125
Planning:
  Buffers: shared hit=1
Planning Time: 0.139 ms
Execution Time: 9512.934 ms

It's only this slow for the first execution, then the following
executions take between 50ms and 100ms. My program isn't intelligent
enough to catch it, but if I reproduce an equivalent manually, it
looks like:

Index Scan using index_orders_wait_event on orders (actual time=89.004..89.004 rows=0 loops=1)
  Filter: ((orderid + 1) = '7000000000'::bigint)
  Rows Removed by Filter: 83280
  Buffers: shared hit=91856
Planning:
  Buffers: shared hit=1
Planning Time: 0.198 ms
Execution Time: 89.014 ms
(8 rows)

So still multiple orders of magnitude slower than the PRIMARY KEY plan.

Some time later another auto-analyze happens to see enough of these
(orderstatusid = 3) rows, and the plan changes back to the PRIMARY KEY
one.
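One thing I've considered to make the statistics less of a coin flip is bumping the per-column statistics target so ANALYZE samples more pages (a sketch; 1000 is an arbitrary value, and I haven't verified it helps here):

```sql
-- Sample more rows for this column during ANALYZE.
ALTER TABLE orders ALTER COLUMN orderstatusid SET STATISTICS 1000;
ANALYZE orders;
```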

The table is around 600 GB in size, so having a VACUUM running through
the table 24/7 to combat this isn't feasible. But it doesn't look
like that would help much anyway, as the partial index stays at around
300 MB even right after an autovacuum. If there were a VACUUM
INDEX, that could probably run 24/7, but as far as I know there
isn't a way to vacuum just one index like that. REINDEXing now and
then, or trying to get more aggressive micro-vacuuming on the index by
running SELECTs that scan through it, could also work; not sure.
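For the periodic-REINDEX idea, something like this should avoid blocking writes while rebuilding just the one bloated index (REINDEX CONCURRENTLY exists since PostgreSQL 12; whether this actually keeps the plan stable here is untested):

```sql
-- Rebuild only the bloated partial index, without taking
-- locks that block concurrent reads or writes.
REINDEX INDEX CONCURRENTLY index_orders_wait_event;
```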

In short, if:

1) there was no index bloat
2) the statistics were perfectly accurate for the pre-existing data
3) the distribution of data couldn't change

one could save up to 0.178 ms on the first execution, or around 0.03
ms on subsequent ones. But all three of those are risks, and
sometimes we end up losing 9512.934 ms on the first execution, or
around 89 ms on subsequent ones.

If there's more info I can provide without too much effort, I'll be happy to.

.m
