Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From: Donghang Lin <donghanglin(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, hlinnaka(at)iki(dot)fi, melanieplageman(at)gmail(dot)com
Subject: Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Date: 2024-03-25 06:28:52
Message-ID: CAA=D8a1ZGWTk0RDH-LyU==RsE-217T29eprS-NBhyj+eLHyRrQ@mail.gmail.com
Lists: pgsql-hackers

Hi Tomas,
>
> On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> Hi David,
>
> Do you plan to continue working on this patch? I did take a look,
> and on the whole it looks reasonable - it modifies the right places etc.
>
> I think there are two things that may need an improvement:
>
> 1) Storing variable-length data in ParallelBitmapHeapState
>
> I agree with Robert the snapshot_and_stats name is not great. I see
> Dmitry mentioned phs_snapshot_off as used by ParallelTableScanDescData -
> the reasons are somewhat different (phs_snapshot_off exists because we
> don't know which exact struct will be allocated), while here we simply
> need to allocate two variable-length pieces of memory. But it seems like
> it would work nicely for this. That is, we could try adding an offset
> for each of those pieces of memory:
>
> - snapshot_off
> - stats_off
>
> I don't like the GetSharedSnapshotData name very much, it seems very
> close to GetSnapshotData - quite confusing, I think.
>
> Dmitry also suggested we might add a separate piece of shared memory. I
> don't quite see how that would work for ParallelBitmapHeapState, but I
> doubt it'd be simpler than having two offsets. I don't think the extra
> complexity (paid by everyone) would be worth it just to make EXPLAIN
> ANALYZE work.

This issue is now gone after Heikki's fix.
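
For reference, the two-offset scheme would have looked roughly like the
sketch below; all names here are made up for illustration, this is not
code from the patch:

#include <stddef.h>

/*
 * Hypothetical sketch of the two-offset layout discussed above.  Both
 * variable-length pieces live at the end of a single shared-memory
 * chunk, and each is located by its byte offset from the start of the
 * struct, so no extra shared-memory segment is needed.
 */
typedef struct SharedStateSketch
{
    size_t  snapshot_off;   /* offset of serialized snapshot */
    size_t  stats_off;      /* offset of per-worker stats array */
    /* fixed-size fields would go here; variable-length data follows */
} SharedStateSketch;

static inline char *
sketch_snapshot_ptr(SharedStateSketch *state)
{
    return ((char *) state) + state->snapshot_off;
}

static inline char *
sketch_stats_ptr(SharedStateSketch *state)
{
    return ((char *) state) + state->stats_off;
}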

> 2) Leader vs. worker counters
>
> It seems to me this does nothing to add the per-worker values from "Heap
> Blocks" into the leader, which means we get stuff like this:
>
>    Heap Blocks: exact=102 lossy=10995
>    Worker 0:  actual time=50.559..209.773 rows=215253 loops=1
>      Heap Blocks: exact=207 lossy=19354
>    Worker 1:  actual time=50.543..211.387 rows=162934 loops=1
>      Heap Blocks: exact=161 lossy=14636
>
> I think this is wrong / confusing, and inconsistent with what we do for
> other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> where we *do* add the values to the leader:
>
>    Heap Blocks: exact=125 lossy=10789
>    Buffers: shared hit=11 read=45420
>    Worker 0:  actual time=51.419..221.904 rows=150437 loops=1
>      Heap Blocks: exact=136 lossy=13541
>      Buffers: shared hit=4 read=13541
>    Worker 1:  actual time=56.610..222.469 rows=229738 loops=1
>      Heap Blocks: exact=209 lossy=20655
>      Buffers: shared hit=4 read=20655
>
> Here it's not entirely obvious, because the leader participates in the
> execution, but once we disable leader participation, it's clearer:
>
>    Buffers: shared hit=7 read=45421
>    Worker 0:  actual time=28.540..247.683 rows=309112 loops=1
>      Heap Blocks: exact=282 lossy=27806
>      Buffers: shared hit=4 read=28241
>    Worker 1:  actual time=24.290..251.993 rows=190815 loops=1
>      Heap Blocks: exact=188 lossy=17179
>      Buffers: shared hit=3 read=17180
>
> Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> Blocks" simply disappeared because the leader does nothing and we don't
> print zeros.

Heap Blocks is specific to Bitmap Heap Scan. It seems that, for some
existing nodes, node-specific stats are likewise not aggregated from the
workers into the leader. For example, the Memoize node's Hits, Misses,
etc.:

->  Nested Loop (actual rows=166667 loops=3)
      ->  Parallel Seq Scan on t (actual rows=33333 loops=3)
      ->  Memoize (actual rows=5 loops=100000)
            Cache Key: t.j
            Cache Mode: logical
            Hits: 32991  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
            Worker 0:  Hits: 33551  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
            Worker 1:  Hits: 33443  Misses: 5  Evictions: 0  Overflows: 0  Memory Usage: 2kB
            ->  Index Scan using uj on u (actual rows=5 loops=15)
                  Index Cond: (j = t.j)

Sort and HashAggregate behave the same way.
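
If we did want to fold the workers' counters into the leader's line, the
aggregation itself is trivial; a minimal sketch with made-up types (not
the patch's code):

/*
 * Minimal sketch of summing per-worker counters into the leader's
 * totals before printing.  Types and names are illustrative only.
 */
typedef struct HeapBlockStatsSketch
{
    long    exact_pages;    /* blocks with exact tuple bitmaps */
    long    lossy_pages;    /* blocks from lossy bitmap pages */
} HeapBlockStatsSketch;

static void
sum_worker_heap_blocks(HeapBlockStatsSketch *leader_total,
                       const HeapBlockStatsSketch *workers, int nworkers)
{
    for (int i = 0; i < nworkers; i++)
    {
        leader_total->exact_pages += workers[i].exact_pages;
        leader_total->lossy_pages += workers[i].lossy_pages;
    }
}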

> 3) I'm not sure the handling of various EXPLAIN flags is entirely
> correct. Consider this:
>
> EXPLAIN (ANALYZE):
>
>    ->  Parallel Bitmap Heap Scan on t (...)
>          Recheck Cond: (a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Worker 0:  Heap Blocks: exact=168 lossy=15648
>          Worker 1:  Heap Blocks: exact=302 lossy=29337
>
> EXPLAIN (ANALYZE, VERBOSE):
>
>    ->  Parallel Bitmap Heap Scan on public.t (...)
>          Recheck Cond: (t.a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Worker 0:  actual time=35.067..300.882 rows=282108 loops=1
>            Heap Blocks: exact=257 lossy=25358
>          Worker 1:  actual time=32.827..302.224 rows=217819 loops=1
>            Heap Blocks: exact=213 lossy=19627
>
> EXPLAIN (ANALYZE, BUFFERS):
>
>    ->  Parallel Bitmap Heap Scan on t (...)
>          Recheck Cond: (a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Buffers: shared hit=7 read=45421
>          Worker 0:  Heap Blocks: exact=236 lossy=21870
>          Worker 1:  Heap Blocks: exact=234 lossy=23115
>
> EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
>
>    ->  Parallel Bitmap Heap Scan on public.t (...)
>          Recheck Cond: (t.a < 5000)
>          Rows Removed by Index Recheck: 246882
>          Buffers: shared hit=7 read=45421
>          Worker 0:  actual time=28.265..260.381 rows=261264 loops=1
>            Heap Blocks: exact=260 lossy=23477
>            Buffers: shared hit=3 read=23478
>          Worker 1:  actual time=28.224..261.627 rows=238663 loops=1
>            Heap Blocks: exact=210 lossy=21508
>            Buffers: shared hit=4 read=21943
>
> Why should the per-worker buffer info be shown when combined with the
> VERBOSE flag, and not just with BUFFERS, when the patch shows the
> per-worker info always?
>

It seems that the general EXPLAIN framework requires VERBOSE mode to show
per-worker stats; that is how Buffers hits and JIT details are printed,
for example. Some nodes that involve parallelism, however, always emit
their worker blocks, which is why we see worker blocks without buffer
stats in non-VERBOSE mode. Several existing nodes have the same issue as
this patch: Memoize, Sort, and HashAggregate.
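
In other words, the framework-style behavior would keep the totals on the
leader line and gate the per-worker lines on VERBOSE, roughly like this
self-contained sketch (hypothetical names, not the explain.c API):

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of VERBOSE gating: the leader line always carries aggregated
 * totals, while per-worker detail lines appear only in verbose mode.
 */
static void
show_heap_blocks_sketch(bool verbose, long total_exact, long total_lossy,
                        const long *w_exact, const long *w_lossy, int nworkers)
{
    printf("Heap Blocks: exact=%ld lossy=%ld\n", total_exact, total_lossy);

    if (verbose)
    {
        for (int i = 0; i < nworkers; i++)
            printf("Worker %d:  Heap Blocks: exact=%ld lossy=%ld\n",
                   i, w_exact[i], w_lossy[i]);
    }
}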

> 4) Now that I think about this, isn't the *main* problem really that we
> don't display the sum of the per-worker stats (which I think is wrong)?
> I mean, we can already get the worker details with VERBOSE, right? So the
> only reason to display that by default seems to be that the values in
> "Heap Blocks" are from the leader only.

Would it look better if we printed the aggregated Heap Blocks in the
leader's block and showed worker stats only in VERBOSE mode, as sketched
below? That would match how the general framework prints worker stats,
but it would differ from the existing nodes that also print worker stats,
i.e. the Memoize, Sort, and HashAggregate nodes mentioned above. By the
way, is that also a problem for those nodes?
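
Concretely, reusing the numbers from the first EXPLAIN (ANALYZE) example
above (168+302 exact, 15648+29337 lossy), the non-VERBOSE output would
become:

   ->  Parallel Bitmap Heap Scan on t (...)
         Recheck Cond: (a < 5000)
         Rows Removed by Index Recheck: 246882
         Heap Blocks: exact=470 lossy=44985

with the per-worker Heap Blocks lines shown only under VERBOSE.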

> BTW doesn't this also suggest some of the code added to explain.c may
> not be quite necessary? Wouldn't it be enough to just "extend" the
> existing code printing per-worker stats? (I haven't tried, so maybe I'm
> wrong and we need the new code.)

We do need the new code: these are node-specific stats, and we call the
node's worker print function to lay out the EXPLAIN output, roughly as in
the sketch below. I think the real question is in which mode we should
show the worker blocks, which is discussed in the section above.
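
For illustration, such a node-specific print function looks roughly like
the sketch below, modeled on how show_sort_info() in explain.c loops over
shared instrumentation with ExplainOpenWorker()/ExplainCloseWorker(); the
instrumentation struct and field names are placeholders, not necessarily
what the patch uses:

/*
 * Sketch of a node-specific per-worker print function.  Placeholder
 * instrumentation types; only the TEXT format branch is shown.
 */
typedef struct WorkerStatsSketch
{
    long    exact_pages;
    long    lossy_pages;
} WorkerStatsSketch;

typedef struct SharedStatsSketch
{
    int     num_workers;
    WorkerStatsSketch workers[FLEXIBLE_ARRAY_MEMBER];
} SharedStatsSketch;

static void
show_heap_blocks_worker_info(SharedStatsSketch *shared, ExplainState *es)
{
    for (int n = 0; n < shared->num_workers; n++)
    {
        WorkerStatsSketch *ws = &shared->workers[n];

        /* skip workers with nothing to report, as other nodes do */
        if (ws->exact_pages == 0 && ws->lossy_pages == 0)
            continue;

        if (es->workers_state)
            ExplainOpenWorker(n, es);

        if (es->format == EXPLAIN_FORMAT_TEXT)
        {
            ExplainIndentText(es);
            appendStringInfo(es->str, "Heap Blocks: exact=%ld lossy=%ld\n",
                             ws->exact_pages, ws->lossy_pages);
        }

        if (es->workers_state)
            ExplainCloseWorker(n, es);
    }
}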

v3 no longer applied on master, so attached is a rebased version.

Regards,
Donghang Lin
(ServiceNow)

Attachment Content-Type Size
v4-0001-Parallel-Bitmap-Heap-Scan-reports-per-worker-stat.patch application/octet-stream 11.3 KB
