Re: row estimate for partial index

From: Harmen <harmen(at)lijzij(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: row estimate for partial index
Date: 2023-01-16 08:55:23
Message-ID: 20230116085523.GK91568@arp.lijzij.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> Harmen <harmen(at)lijzij(dot)de> writes:
> > Works well enough. However, we now have an org_id which has > 10% of the rows,
> > but only a handful rows where "deleted is null" matches (so the org has a lot
> > of "deleted" contacts). The planner doesn't like this and it falls back to a
> > full table scan for the above query.
>
> > I've added a dedicated index just for that org_id, to see if that helps:
> > "org123" btree (id) WHERE deleted IS NULL AND org_id = 123
> > The planner seems to use it now, however the row estimate is way off:
>
> Yeah, so that indicates that it isn't producing a good selectivity
> estimate for the combination of those two conditions: it will assume
> the org_id and deleted columns are independent, which per your statements
> they are not.
>
> If you are running a reasonably recent PG version you should be able to
> fix that by setting up "extended statistics" on that pair of columns:
>
> https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED
>
> (I might be wrong, but I think that will help even when one of
> the troublesome conditions is a null-check. If it doesn't, then
> we have something to improve there ...)

Thanks for your explanation, Tom.
I've setup a local test scenario, where I then add a "dependencies" stat, but
that doesn't give a better plan, unfortunately.

This is my test table (I use a boolean field for "deleted" to keep this test
case as simple as possible. In my real case this is a "timestamptz null"
field):

DROP table if exists contactsbool;
CREATE table contactsbool (id int not null, org_id int not null, deleted boolean not null, firstname text);
CREATE index contactsbool_orgs on contactsbool (org_id, id) where not deleted;

Testdata has a very low number of "orgs", and one org has almost only deleted
contacts:

WITH ids as (select * from generate_series(0, 10000000)) insert into contactsbool select ids.generate_series, mod(ids.generate_series, 7), false, 'hello world' from ids;
UPDATE contactsbool set deleted = true where id > 100 and org_id = 5;
ANALYZE contactsbool;

Now the new stats:

CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;
ANALYZE contactsbool;

harmen=> explain (analyze) select id, firstname from contactsbool where org_id = 5 and not deleted order by id;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gather Merge (cost=181983.91..299104.42 rows=1003820 width=16) (actual time=448.244..454.770 rows=14 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=180983.88..182238.66 rows=501910 width=16) (actual time=413.761..413.762 rows=5 loops=3)
Sort Key: id
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Parallel Seq Scan on contactsbool (cost=0.00..124881.86 rows=501910 width=16) (actual time=267.318..413.673 rows=5 loops=3)
Filter: ((NOT deleted) AND (org_id = 5))
Rows Removed by Filter: 3333329
Planning Time: 0.565 ms
JIT:
Functions: 12
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.444 ms, Inlining 0.000 ms, Optimization 1.163 ms,
Emission 13.288 ms, Total 16.895 ms
Execution Time: 456.498 ms
(17 rows)

The "rows=501910" is what I don't expect. I expect/want/hope the plan to use
the contactsbool_orgs index.
(If I really (hard) delete the "deleted" contacts everything works perfectly
for all orgs.)

Any ideas?
Thanks again,
Harmen

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Fred Habash 2023-01-16 12:48:24 Why is a Read-only Table Gets Autovacuumed "to prevent wraparound"
Previous Message HECTOR INGERTO 2023-01-16 08:41:19 RE: Are ZFS snapshots unsafe when PGSQL is spreading through multiple zpools?