Re: Support run-time partition pruning for hash join

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support run-time partition pruning for hash join
Date: 2023-08-25 08:48:35
Message-ID: CAMbWs4_4nVzr5=1NXyzsuu7qQ95nT=TF57WzEB+DZs8d+1YTxg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 25, 2023 at 11:03 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Thu, 24 Aug 2023 at 21:27, Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> > I think we need to solve this problem first before we can
> > make this new partition pruning mechanism useful in practice, but
> > how? Some thoughts currently on my mind include
> >
> > 1) we try our best to estimate the cost of this partition pruning when
> > creating hash join paths, and decide based on the cost whether to use it
> > or not. But this does not seem to be an easy task.
>
> I think we need to consider another Hash Join path when we detect that
> the outer side of the Hash Join involves scanning a partitioned table.
>
> I'd suggest writing some cost which costs an execution of run-time
> pruning. With LIST and RANGE you probably want something like
> cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
> for the binary search over the sorted datum array. For HASH
> partitions, something like cpu_operator_cost * npartcols once for each
> hashed tuple.
>
> You'll need to then come up with some counter costs to subtract from
> the Append/MergeAppend. This is tricky, as discussed. Just come up
> with something crude for now.
>
> To start with, it could just be as crude as:
>
> total_costs *= (Min(expected_outer_rows, n_append_subnodes) /
> n_append_subnodes);
>
> i.e. assume that every outer joined row will require exactly 1 new
> partition, up to the total number of partitions. That's pretty much
> worst-case, but it'll at least allow the optimisation to work for
> cases where the hash table is expected to contain just a tiny
> number of rows (fewer than the number of partitions).
>
> To make it better, you might want to look at join selectivity
> estimation and see if you can find something there to come up with a
> better adjustment.

Thank you for the suggestion. I will take some time to consider it.
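
Just to make sure I've read it correctly, a minimal sketch of that
costing, under the assumption that it is charged while building the
hash table, might look like the below. Here nparts, n_append_subnodes,
expected_outer_rows and append_total_cost are placeholder names rather
than existing planner fields; cpu_operator_cost, LOG2 and Min are the
existing ones from the core code.

    /* per-tuple charge for run-time pruning, once per hashed tuple */
    Cost        prune_per_tuple;

    if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
        prune_per_tuple = cpu_operator_cost * part_scheme->partnatts;
    else
        /* LIST/RANGE: binary search over the sorted bound datums */
        prune_per_tuple = cpu_operator_cost * LOG2(nparts);

    startup_cost += prune_per_tuple * inner_path_rows;

    /*
     * Crude counter-cost on the outer Append: assume every outer row
     * requires one new partition, capped at the number of subnodes.
     */
    append_total_cost *= Min(expected_outer_rows, n_append_subnodes) /
        n_append_subnodes;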

When we have multiple join levels, it seems the situation becomes even
more complex. One Append/MergeAppend node might be pruned by more than
one Hash node, and one Hash node might provide pruning for more than one
Append/MergeAppend node. For instance, below is the plan from the test
case added in the v1 patch:

explain (analyze, costs off, summary off, timing off)
select * from tprt p1
   inner join tprt p2 on p1.col1 = p2.col1
   right join tbl1 t on p1.col1 = t.col1 and p2.col1 = t.col1;
                               QUERY PLAN
-------------------------------------------------------------------------
 Hash Right Join (actual rows=2 loops=1)
   Hash Cond: ((p1.col1 = t.col1) AND (p2.col1 = t.col1))
   ->  Hash Join (actual rows=3 loops=1)
         Hash Cond: (p1.col1 = p2.col1)
         ->  Append (actual rows=3 loops=1)
               ->  Seq Scan on tprt_1 p1_1 (never executed)
               ->  Seq Scan on tprt_2 p1_2 (actual rows=3 loops=1)
               ->  Seq Scan on tprt_3 p1_3 (never executed)
               ->  Seq Scan on tprt_4 p1_4 (never executed)
               ->  Seq Scan on tprt_5 p1_5 (never executed)
               ->  Seq Scan on tprt_6 p1_6 (never executed)
         ->  Hash (actual rows=3 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Append (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_1 p2_1 (never executed)
                     ->  Seq Scan on tprt_2 p2_2 (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_3 p2_3 (never executed)
                     ->  Seq Scan on tprt_4 p2_4 (never executed)
                     ->  Seq Scan on tprt_5 p2_5 (never executed)
                     ->  Seq Scan on tprt_6 p2_6 (never executed)
   ->  Hash (actual rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on tbl1 t (actual rows=2 loops=1)
(23 rows)

In this plan, the Append node of 'p1' is pruned by two Hash nodes: the
Hash node of 't' and the Hash node of 'p2'. Meanwhile, the Hash node of
't' provides pruning for two Append nodes: the Append node of 'p1' and
the Append node of 'p2'.

In this case, costing the partition pruning meaningfully seems even
more difficult. Do you have any suggestions on that?

> > 2) we use some heuristics when executing hash join, such as when we
> > notice that a $threshold percentage of the partitions must be visited
> > we just abort the pruning and assume that no partitions can be pruned.
>
> You could likely code in something that checks
> bms_num_members(jpstate->part_prune_result) to see if it still remains
> below the total number of Append/MergeAppend subplans whenever, say,
> the lower 8 bits of hashtable->totalTuples are all off. You can just
> give up doing any further pruning when all partitions are already
> required.

Yeah, we can do that. This may not help much in the worst-case tests I
performed, because the table on the hash side is designed so that
tuples belonging to the same partition are placed together, meaning we
have to scan almost all of its tuples before we can tell that all
partitions are required. But I think this might help a lot in the real
world.
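
For reference, I read the suggested check as something like the sketch
below (part_prune_result and totalTuples are the names from your
description; n_append_subplans and the prune_disabled flag are
placeholders):

    /*
     * Re-check only when the lower 8 bits of the tuple counter are all
     * off, i.e. every 256 hashed tuples, so the check itself stays
     * cheap.
     */
    if ((((uint64) hashtable->totalTuples) & 0xFF) == 0 &&
        bms_num_members(jpstate->part_prune_result) >= n_append_subplans)
    {
        /* every partition is already required; stop pruning */
        jpstate->prune_disabled = true;
    }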

Thanks
Richard
