Re: PoC: using sampling to estimate joins / complex conditions

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PoC: using sampling to estimate joins / complex conditions
Date: 2022-01-21 00:06:37
Message-ID: 280088a0-e364-a528-bcca-cc487bef2f84@enterprisedb.com

On 1/5/22 00:58, Andres Freund wrote:
> Hi,
>
> On 2021-06-27 19:55:24 +0200, Tomas Vondra wrote:
>> estimating joins is one of the significant gaps related to extended
>> statistics, and I've been regularly asked about what we might do about
>> that. This is an early experimental patch that I think might help us
>> with improving this, possibly even in PG15.
>
> The tests in this patch fail:
> https://cirrus-ci.com/task/5304621299138560
> https://api.cirrus-ci.com/v1/artifact/task/5304621299138560/regress_diffs/src/test/regress/regression.diffs
>
> Looks like the regression test output hasn't been updated?
>

Yeah, I haven't updated some of the test output because some of those
changes are a bit wrong (and I think that's fine for a PoC patch). I
should have mentioned that in the message, though. Sorry about that.

There are three types of failures:

1) Changes to the deparsed statistics definitions in \d output:

- "public.ctlt_all_a_b_stat" ON a, b FROM ctlt_all
+ "public.ctlt_all_a_b_stat" (ndistinct, dependencies, mcv) ON a, b
FROM ctlt_all

This happens because there's a new kind "sample", which is not set by
default when creating new statistics, so the deparsing logic decides it
has to list the kinds explicitly. I've fixed this in the attached patch,
but it was mostly harmless anyway, and I'm not sure this is how the
sample kind should behave.

2) Three GUC parameters that allow enabling/disabling sampling for
different parts of a query (scan, join, and correlated join sampling). I
still consider those GUCs temporary, meant for experiments, but I've
added them to the expected output.

3) Changes in estimates for OR conditions - a couple of estimates get
less accurate, because an OR clause is handled as a single clause the
first time we pass it to statext_clauselist_selectivity(), so we end up
combining the results incorrectly. The handling of OR clauses may need
some changes anyway, because it's causing issues in other patches too
(e.g. in the "Var op Var" one).
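
To illustrate the shape of the queries affected, here's a simplified
sketch (not the actual regression test; the "sample" statistics kind
follows the PoC syntax and may well change):

  CREATE TABLE t (a int, b int);
  INSERT INTO t SELECT i % 50, i % 50 FROM generate_series(1, 100000) s(i);
  -- "sample" is the new kind added by the PoC patch
  CREATE STATISTICS t_stat (mcv, sample) ON a, b FROM t;
  ANALYZE t;

  -- the whole OR clause reaches statext_clauselist_selectivity() as a
  -- single clause, and the per-statistics results get combined
  -- incorrectly, so this estimate may end up less accurate
  EXPLAIN SELECT * FROM t WHERE a = 1 OR b = 2;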

I haven't done anything about (3) yet - it's a valid issue and needs to
be fixed (either by changing how we handle OR clauses, or maybe handling
samples and MCVs at the same time). Or maybe some other way. In any
case, there is more important stuff that needs fixing first.

The main issue is planning overhead - for the example in the first
message, with a simple query joining two tables, you'll see this:

Planning Time: 603.442 ms
Execution Time: 1071.735 ms

There are a couple of reasons why it takes this long:

1) We sample the same relation repeatedly - once as a "scan" and then
while estimating joins. And for this query we do that in three different
contexts:

- set_joinrel_size_estimates
- populate_joinrel_with_paths (twice)

I guess we'll get a different number of samples for different queries,
but it'll get worse for queries joining more tables, etc. It seems
fairly simple to cache the samples - for example in StatisticExtInfo (or
maybe somewhere else, to keep just one sample per relation rather than
per RTE).

Unfortunately, I'm afraid this only works for "independent" samples,
not for correlated ones (which are just FK lookups driven by another
sample, so the result depends on what that other sample is). Which is a
bummer, because correlated samples are the more expensive ones :-(

2) The correlated samples are currently built using a query, executed
through SPI in a loop. So given a "driving" sample of 30k rows, we do
30k lookups - that'll take time, even if we do that just once and cache
the results.

I'm sure there's room for some improvement, though - for example, we
don't need to fetch all the columns included in the statistics object,
just the ones referenced by the clauses we're estimating. That could
improve the chances of using an index-only scan (IOS), etc.

I wonder if there's a more efficient way to do this, in a batched manner
or something ... But even then it'll still be quite expensive.
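
For concreteness, a rough sketch of the per-row vs. batched lookup
(hypothetical table and column names; the patch builds these queries
internally):

  -- current approach: one FK lookup per row of the driving sample,
  -- executed via SPI in a loop (30k executions for a 30k-row sample)
  SELECT d.a, d.b FROM dim d WHERE d.id = $1;

  -- a possible batched variant: pass the FK values from the driving
  -- sample all at once (e.g. as an array), fetching only the columns
  -- referenced by the clauses being estimated
  SELECT d.a, d.b
    FROM dim d
    JOIN unnest($1::int[]) AS k(id) ON d.id = k.id;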

The only other alternative I can think of is collecting samples during
ANALYZE, and storing them somewhere. That'll be difficult for correlated
samples, particularly for transitive cases (in a snowflake schema). But I
believe it's doable (at least for cases covered by FK constraints).
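
At least the FK-covered cases seem discoverable at ANALYZE time from the
catalog, e.g. something like this (simplified sketch; 'fact' is just a
made-up table name):

  -- find the foreign keys to follow when building a correlated sample
  SELECT c.conrelid::regclass  AS referencing_table,
         c.confrelid::regclass AS referenced_table,
         c.conkey, c.confkey
    FROM pg_constraint c
   WHERE c.contype = 'f'
     AND c.conrelid = 'fact'::regclass;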

But I'm not sure where/how to store the samples. An obvious option would
be to serialize them into pg_statistic_ext_data, and I'll try doing that
(it's a bit like a huge MCV without any cutoff). But maybe there's a
better way that would not require constant serialization/deserialization
of many tuples.
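
For a rough sense of scale, the existing serialized MCV lists can be
inspected like this - a serialized sample would presumably be a similar
but much larger column:

  SELECT s.stxname, pg_column_size(d.stxdmcv) AS mcv_bytes
    FROM pg_statistic_ext_data d
    JOIN pg_statistic_ext s ON s.oid = d.stxoid;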

Any ideas about these options?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-PoC-Cardinality-estimation-using-runtime-sampling-20220120.patch text/x-patch 83.3 KB
