Re: extended stats on partitioned tables

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Zhihong Yu <zyu(at)yugabyte(dot)com>, Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: extended stats on partitioned tables
Date: 2021-12-13 20:40:09
Message-ID: d93e6a93-5430-d0fe-7b5b-9cf06b6fc6a7@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/13/21 14:53, Justin Pryzby wrote:
> On Sun, Dec 12, 2021 at 11:23:19PM +0100, Tomas Vondra wrote:
>> On 12/12/21 22:32, Justin Pryzby wrote:
>>> On Sun, Dec 12, 2021 at 05:17:10AM +0100, Tomas Vondra wrote:
>>>> The one thing bugging me a bit is that the regression test checks only a
>>>> GROUP BY query. It'd be nice to add queries testing MCV/dependencies
>>>> too, but that seems tricky because most queries will use per-partitions
>>>> stats.
>>>
>>> You mean because the quals are pushed down to the scan node.
>>>
>>> Does that indicate a deficiency ?
>>>
>>> If extended stats are collected for a parent table, selectivity estimates based
>>> from the parent would be better; but instead we use uncorrected column
>>> estimates from the child tables.
>>>
>>> From what I see, we could come up with a way to avoid the pushdown, involving
>>> volatile functions/foreign tables/RLS/window functions/SRF/wholerow vars/etc.
>>> But would it be better if extended stats objects on partitioned tables were to
>>> collect stats for both parent AND CHILD ? I'm not sure. Maybe that's the
>>> wrong solution, but maybe we should still document that extended stats on
>>> (empty) parent tables are often themselves not used/useful for selectivity
>>> estimates, and the user should instead (or in addition) create stats on child
>>> tables.
>>>
>>> Or, maybe if there's no extended stats on the child tables, stats on the parent
>>> table should be consulted ?
>>
>> Maybe, but that seems like a mostly separate improvement. At this point I'm
>> interested only in testing the behavior implemented in the current patches.
>
> I don't want to change the scope of the patch, or this thread, but my point is
> that the behaviour already changed once (the original regression) and now we're
> planning to change it again to fix that, so we ought to decide on the expected
> behavior before writing tests to verify it.
>

OK. Makes sense.

> I think it may be impossible to use the "dependencies" statistic with inherited
> stats. Normally the quals would be pushed down to the child tables. But, if
> they weren't pushed down, they'd be attached to something other than a scan
> node on the parent table, so the stats on that table wouldn't apply (right?).
>

Yeah, that's probably right. But I'm not 100% sure the whole inheritance
tree can't be treated as a single relation by some queries ...

> Maybe the useless stats types should have been prohibited on partitioned tables
> since v10. It's too late to change that, but perhaps now they shouldn't even
> be collected during analyze. The dependencies and MCV paths are never called
> with rte->inh==true, so maybe we should Assert(!inh), or add a comment to that
> effect. Or the regression tests should "memorialize" the behavior. I'm still
> thinking about it.
>

Yeah, we can't really prohibit them in backbranches - that'd mean some
CREATE STATISTICS commands suddenly start failing, which would be quite
annoying. Not building them for partitioned tables seems like a better
option - BuildRelationExtStatistics can check relkind and pick what to
ignore. But I'd choose to do that in a separate patch, probably - after
all, it shouldn't really change the behavior of any tests/queries, no?

This reminds me we need to consider if these patches could cause any
issues. The way I see it is this:

1) If the table is a separate relation (not part of an inheritance
tree), this should make no difference. -> OK

2) If the table is using "old" inheritance, this reverts back to
pre-regression behavior. So people will keep using the old statistics
until the ANALYZE, and we need to tell them to ANALYZE or something.

3) If the table is using partitioning, it's guaranteed to be empty and
there are no stats at all. Again, we should tell people to run ANALYZE.

Of course, we can't be sure query plans will change in the right
direction :-(

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2021-12-13 20:53:27 Re: WIN32 pg_import_system_collations
Previous Message Robert Haas 2021-12-13 20:36:24 Re: O(n) tasks cause lengthy startups and checkpoints