Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Donghang Lin <donghanglin(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, hlinnaka(at)iki(dot)fi
Subject: Re: Parallel Bitmap Heap Scan reports per-worker stats in EXPLAIN ANALYZE
Date: 2024-03-25 21:11:20
Message-ID: CAAKRu_YjBPfGp85ehY1t9NN=R9pB9k=6rztaeVkAm-OeTqUK4g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 25, 2024 at 2:29 AM Donghang Lin <donghanglin(at)gmail(dot)com> wrote:
>
>
> > On Sat, Feb 17, 2024 at 2:31 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> > 2) Leader vs. worker counters
> >
> > It seems to me this does nothing to add the per-worker values from "Heap
> > Blocks" into the leader, which means we get stuff like this:
> >
> > Heap Blocks: exact=102 lossy=10995
> > Worker 0: actual time=50.559..209.773 rows=215253 loops=1
> > Heap Blocks: exact=207 lossy=19354
> > Worker 1: actual time=50.543..211.387 rows=162934 loops=1
> > Heap Blocks: exact=161 lossy=14636
> >
> > I think this is wrong / confusing, and inconsistent with what we do for
> > other nodes. It's also inconsistent with how we deal e.g. with BUFFERS,
> > where we *do* add the values to the leader:
> >
> > Heap Blocks: exact=125 lossy=10789
> > Buffers: shared hit=11 read=45420
> > Worker 0: actual time=51.419..221.904 rows=150437 loops=1
> > Heap Blocks: exact=136 lossy=13541
> > Buffers: shared hit=4 read=13541
> > Worker 1: actual time=56.610..222.469 rows=229738 loops=1
> > Heap Blocks: exact=209 lossy=20655
> > Buffers: shared hit=4 read=20655
> >
> > Here it's not entirely obvious, because leader participates in the
> > execution, but once we disable leader participation, it's clearer:
> >
> > Buffers: shared hit=7 read=45421
> > Worker 0: actual time=28.540..247.683 rows=309112 loops=1
> > Heap Blocks: exact=282 lossy=27806
> > Buffers: shared hit=4 read=28241
> > Worker 1: actual time=24.290..251.993 rows=190815 loops=1
> > Heap Blocks: exact=188 lossy=17179
> > Buffers: shared hit=3 read=17180
> >
> > Not only is "Buffers" clearly a sum of per-worker stats, but the "Heap
> > Blocks" simply disappeared because the leader does nothing and we don't
> > print zeros.
>
> Heap Blocks is specific to Bitmap Heap Scan. It seems that node-specific stats
> do not get aggregated from the workers into the leader for some existing nodes.
> For example, the Memoize node for Hits, Misses, etc.
>
> -> Nested Loop (actual rows=166667 loops=3)
> -> Parallel Seq Scan on t (actual rows=33333 loops=3)
> -> Memoize (actual rows=5 loops=100000)
> Cache Key: t.j
> Cache Mode: logical
> Hits: 32991 Misses: 5 Evictions: 0 Overflows: 0 Memory Usage: 2kB
> Worker 0: Hits: 33551 Misses: 5 Evictions: 0 Overflows: 0 Memory Usage: 2kB
> Worker 1: Hits: 33443 Misses: 5 Evictions: 0 Overflows: 0 Memory Usage: 2kB
> -> Index Scan using uj on u (actual rows=5 loops=15)
> Index Cond: (j = t.j)
>
> Sort and HashAggregate also do the same thing.
>
> > 3) I'm not sure the handling of the various EXPLAIN flags is entirely
> > correct. Consider this:
> >
> > EXPLAIN (ANALYZE):
> >
> > -> Parallel Bitmap Heap Scan on t (...)
> > Recheck Cond: (a < 5000)
> > Rows Removed by Index Recheck: 246882
> > Worker 0: Heap Blocks: exact=168 lossy=15648
> > Worker 1: Heap Blocks: exact=302 lossy=29337
> >
> > EXPLAIN (ANALYZE, VERBOSE):
> >
> > -> Parallel Bitmap Heap Scan on public.t (...)
> > Recheck Cond: (t.a < 5000)
> > Rows Removed by Index Recheck: 246882
> > Worker 0: actual time=35.067..300.882 rows=282108 loops=1
> > Heap Blocks: exact=257 lossy=25358
> > Worker 1: actual time=32.827..302.224 rows=217819 loops=1
> > Heap Blocks: exact=213 lossy=19627
> >
> > EXPLAIN (ANALYZE, BUFFERS):
> >
> > -> Parallel Bitmap Heap Scan on t (...)
> > Recheck Cond: (a < 5000)
> > Rows Removed by Index Recheck: 246882
> > Buffers: shared hit=7 read=45421
> > Worker 0: Heap Blocks: exact=236 lossy=21870
> > Worker 1: Heap Blocks: exact=234 lossy=23115
> >
> > EXPLAIN (ANALYZE, VERBOSE, BUFFERS):
> >
> > -> Parallel Bitmap Heap Scan on public.t (...)
> > Recheck Cond: (t.a < 5000)
> > Rows Removed by Index Recheck: 246882
> > Buffers: shared hit=7 read=45421
> > Worker 0: actual time=28.265..260.381 rows=261264 loops=1
> > Heap Blocks: exact=260 lossy=23477
> > Buffers: shared hit=3 read=23478
> > Worker 1: actual time=28.224..261.627 rows=238663 loops=1
> > Heap Blocks: exact=210 lossy=21508
> > Buffers: shared hit=4 read=21943
> >
> > Why should the per-worker buffer info be shown when combined with the
> > VERBOSE flag, and not just with BUFFERS, when the patch shows the
> > per-worker info always?
> >
>
> It seems that the general EXPLAIN printing framework requires VERBOSE mode to
> show per-worker stats -- for example, that is how buffer hits and JIT counters
> are printed. However, some nodes that involve parallelism always print their
> worker blocks. This is why some worker blocks don't include buffer stats in
> non-VERBOSE mode. Several existing nodes have the same issue as this patch:
> Memoize, Sort, HashAggregate.

I don't think passing EXPLAIN the BUFFERS option should affect what is
shown for the bitmap heap scan's lossy/exact counts. BUFFERS has to do
with buffer usage. "Heap Blocks" here would more accurately be called
"Bitmap Blocks": 1) it is not heap-specific, so at least changing it to
"Table Blocks" would be better, and 2) the meaning is about whether or
not the blocks were represented lossily in the bitmap -- yes, the blocks
we are talking about are table blocks (as opposed to index blocks), but
the important part is the bitmap.

So, BUFFERS shouldn't cause this info to show.
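
For illustration, here's a minimal sketch of the gating I have in mind
in explain.c, using the existing ExplainState flags and the existing
show_tidbitmap_info() helper (the exact call site in the patch may
differ):

    /*
     * Sketch only: the node-specific exact/lossy bitmap-block counts are
     * driven by ANALYZE alone; the BUFFERS option keeps controlling only
     * the buffer-usage lines.
     */
    if (es->analyze)
        show_tidbitmap_info((BitmapHeapScanState *) planstate, es);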

As for whether the leader's number should be inclusive of all the
worker numbers: if there is any combination of options in which the
per-worker stats are not displayed, then the leader count should be
inclusive of the worker numbers. However, if the worker numbers are
always displayed, I think it is okay for the leader to display only
its own count. Though I agree with Tomas that that can be confusing
when parallel_leader_participation is off -- you end up without a
topline number.
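
If we did want the inclusive total, I'd expect the leader's line to be
built roughly like this (a sketch only; shared_info, sinstrument, and
the stats field names are hypothetical stand-ins, not taken from the
patch):

    /*
     * Hypothetical sketch: fold each worker's counts into the totals the
     * leader reports, the same way buffer usage is summed.
     */
    long    exact_pages = planstate->stats.exact_pages;
    long    lossy_pages = planstate->stats.lossy_pages;

    for (int n = 0; n < shared_info->num_workers; n++)
    {
        exact_pages += shared_info->sinstrument[n].exact_pages;
        lossy_pages += shared_info->sinstrument[n].lossy_pages;
    }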

As for whether per-worker stats should be displayed by default or only
with VERBOSE, it sounds like there are two different precedents. I
don't have a strong feeling one way or the other -- whichever is most
consistent. Donghang, could you list again which plan nodes and
EXPLAIN options always print per-worker stats and which only do so
with the VERBOSE option?

I think there is an issue with the worker counts on rescan, though. I
was playing around with this patch using one of the regression test
suite tables:

drop table if exists tenk1;
drop table if exists tenk2;
CREATE TABLE tenk1 (
unique1 int4,
unique2 int4,
two int4,
four int4,
ten int4,
twenty int4,
hundred int4,
thousand int4,
twothousand int4,
fivethous int4,
tenthous int4,
odd int4,
even int4,
stringu1 name,
stringu2 name,
string4 name
) with (autovacuum_enabled = false);

COPY tenk1 FROM '/[source directory]/src/test/regress/data/tenk.data';

CREATE TABLE tenk2 AS SELECT * FROM tenk1;
CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
VACUUM ANALYZE tenk1;
VACUUM ANALYZE tenk2;

set enable_seqscan to off;
set enable_indexscan to off;
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_material to off;
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_table_scan_size=0;
set max_parallel_workers_per_gather=2;
set parallel_leader_participation = off;
explain (analyze, costs off, verbose)
select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
tenk2.thousand=0;

I don't think the worker counts are getting carried across rescans.
For this query, with parallel_leader_participation on, you can see
that the leader has a high number of exact heap blocks (there are 30
rescans, so it is the sum across all of those), but the workers have
low counts. That might be normal because maybe they did less work, but
if you set parallel_leader_participation = off, the worker numbers are
still small. I did some logging, and I do see workers whose
lossy/exact counts are not making it into the final count. I haven't
had time to debug further, but it is worth looking into.

parallel_leader_participation = on
-----------------------------------------------------------------------------------------------
Aggregate (actual time=62.321..63.178 rows=1 loops=1)
Output: count(*)
-> Nested Loop (actual time=2.058..55.202 rows=98000 loops=1)
-> Seq Scan on public.tenk2 (actual time=0.182..3.699 rows=10 loops=1)
Output: tenk2.unique1 ...
Filter: (tenk2.thousand = 0)
Rows Removed by Filter: 9990
-> Gather (actual time=1.706..4.142 rows=9800 loops=10)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on public.tenk1 (actual
time=0.365..0.958 rows=3267 loops=30)
Recheck Cond: (tenk1.hundred > 1)
Heap Blocks: exact=1801
Worker 0: actual time=0.033..0.414 rows=1993 loops=10
Heap Blocks: exact=78 lossy=0
Worker 1: actual time=0.032..0.550 rows=2684 loops=10
Heap Blocks: exact=86 lossy=0
-> Bitmap Index Scan on tenk1_hundred (actual
time=0.972..0.972 rows=9800 loops=10)
Index Cond: (tenk1.hundred > 1)

parallel_leader_participation = off
-----------------------------------------------------------------------------------------------
Aggregate (actual time=84.502..84.977 rows=1 loops=1)
Output: count(*)
-> Nested Loop (actual time=6.185..77.085 rows=98000 loops=1)
-> Seq Scan on public.tenk2 (actual time=0.182..3.709 rows=10 loops=1)
Output: tenk2.unique1...
Filter: (tenk2.thousand = 0)
Rows Removed by Filter: 9990
-> Gather (actual time=5.265..6.355 rows=9800 loops=10)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on public.tenk1 (actual
time=0.951..1.863 rows=4900 loops=20)
Recheck Cond: (tenk1.hundred > 1)
Worker 0: actual time=0.794..1.705 rows=4909 loops=10
Heap Blocks: exact=168 lossy=0
Worker 1: actual time=1.108..2.021 rows=4891 loops=10
Heap Blocks: exact=177 lossy=0
-> Bitmap Index Scan on tenk1_hundred (actual
time=1.024..1.024 rows=9800 loops=10)
Index Cond: (tenk1.hundred > 1)
Worker 1: actual time=1.024..1.024
rows=9800 loops=10
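
One pattern that would produce exactly this symptom is publishing the
counters into shared memory with a plain assignment, so each rescan
overwrites the previous one's numbers instead of accumulating them. A
sketch of what I mean (again, the struct and field names here are
hypothetical, only ParallelWorkerNumber is the existing global):

    /*
     * Hypothetical sketch: when a worker publishes its counters,
     * accumulate into its shared slot rather than overwriting it, so
     * counts from earlier rescans are not lost.
     */
    BitmapHeapScanInstrumentation *si;

    si = &shared_info->sinstrument[ParallelWorkerNumber];
    si->exact_pages += node->stats.exact_pages;  /* not "=", which drops prior rescans */
    si->lossy_pages += node->stats.lossy_pages;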

- Melanie
