Re: row estimate for partial index

From: Harmen <harmen(at)lijzij(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: row estimate for partial index
Date: 2023-01-19 09:04:25
Message-ID: 20230119090425.GA8608@arp.lijzij.de
Lists: pgsql-general

On Mon, Jan 16, 2023 at 09:59:38AM -0500, Tom Lane wrote:
> Harmen <harmen(at)lijzij(dot)de> writes:
> > On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> >> If you are running a reasonably recent PG version you should be able to
> >> fix that by setting up "extended statistics" on that pair of columns:
>
> > CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
> > CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;

Hello Tom,

thanks again for your help, it's much appreciated.

> 1. ndistinct is not the correct stats type for this problem.
> (I think dependencies is, but generally speaking, it's not worth
> trying to be smarter than the system about which ones you need.
> Just create 'em all.)

I don't understand how the "dependencies" stat applies here. The documentation
uses zip codes with city names, and month/day columns, as examples; those I
follow. But in this case I generally select a single "org_id", which can have
an arbitrary number of "deleted" rows, and I don't see how having the org_id
available gives a better prediction than taking the plain ratio of "deleted"
rows across the whole table.
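
Maybe inspecting what ANALYZE actually derived would help me understand. If I
read the docs right, the pg_stats_ext view should show it; a quick sketch,
using the statistics objects and table from my earlier test:

-- what did ANALYZE store for the extended statistics objects?
SELECT statistics_name, attnames, n_distinct, dependencies
FROM pg_stats_ext
WHERE tablename = 'contactsbool';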

> 2. Per the CREATE STATISTICS man page, the order of the columns is
> not significant, so you're just doubling the amount of work for
> ANALYZE without gaining anything.

Noted. Are these extended statistics expensive to compute, or do you expect
the extra ANALYZE work to be basically noise?

> I think you will find that
>
> CREATE STATISTICS stats1 ON deleted, org_id FROM contactsbool;
>
> is enough to fix this. It improved the estimate for me in
> v14 and HEAD, anyway.

It does indeed for me as well! I'm not 100% sure whether that's just because
the test data is too simple, but I'll test further.

Unfortunately, as you already remarked might be the case in your original
reply, I don't get the same good results if I change the "deleted" column
from "boolean not null" to "timestamptz null".

Then it's back to this plan (pg 15.1):

->  Parallel Seq Scan on contacts  (cost=0.00..126284.02 rows=513913 width=16) (actual time=254.677..393.448 rows=5 loops=3)
      Filter: ((deleted IS NULL) AND (org_id = 5))
      Rows Removed by Filter: 3333329

Full test script, in case that's convenient to have:

DROP table if exists contacts;
CREATE table contacts (id int not null, org_id int not null, deleted timestamptz null, firstname text);
CREATE index contacts_orgs on contacts (org_id, id) where deleted is null;

WITH ids as (select * from generate_series(0, 10000000))
insert into contacts
select ids.generate_series, mod(ids.generate_series, 7), null, 'hello world'
from ids;
UPDATE contacts set deleted = now() where id > 100 and org_id = 5;

CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contacts;
CREATE STATISTICS dist5 (dependencies) ON deleted, org_id FROM contacts;
ANALYZE contacts;

explain (analyze) select id, firstname from contacts where org_id = 5 and deleted is null order by id;
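
Side note: I also wondered whether the multivariate "mcv" statistics kind
would help with the timestamptz case, though I haven't verified whether the
planner can apply it to the IS NULL clause here. Creating it would just be:

-- untested idea: multivariate MCV stats (PG 12+); the name dist6 is arbitrary
CREATE STATISTICS dist6 (mcv) ON deleted, org_id FROM contacts;
ANALYZE contacts;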

Again, this is only a problem when a single org_id covers such a large
percentage of the table that the planner switches to sequential scans, which
can only ever be the case for a few org_ids. In production I've added an
index with the problematic org_id in the "where" part, and that avoids the
problem for now (milliseconds vs minutes), but it's not a great fix.
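
For reference, that workaround index looks roughly like this (the hardcoded
org id and the index name are just illustrative):

-- sketch of the production workaround; org id and index name illustrative
CREATE INDEX contacts_org5 ON contacts (id)
  WHERE org_id = 5 AND deleted IS NULL;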

Thanks!
Harmen
