Re: explain HashAggregate to report bucket and memory stats

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: explain HashAggregate to report bucket and memory stats
Date: 2020-02-22 21:53:35
Message-ID: 20200222215335.bu4kky7el4nccls5@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I've started looking at this patch, because I've been long missing the
information about hashagg hash table, so I'm pleased someone is working
on this. In general I agree it may be useful to add simila information
to other nodes using a hashtable, but IMHO the hashagg bit is the most
useful, so maybe we should focus on it first.

A couple of comments after an initial review of the patches:

1) explain.c API

The functions in explain.c (even the static ones) follow the convention
that the last parameter is (ExplainState *es). I think we should stick
to this, so the new parameters should be added before it.

For example instead of

static void ExplainNode(PlanState *planstate, List *ancestors,
const char *relationship, const char *plan_name,
ExplainState *es, SubPlanState *subplanstate);

we should do e.g.

static void ExplainNode(PlanState *planstate, SubPlanState *subplanstate,
List *ancestors,
const char *relationship, const char *plan_name,
ExplainState *es);

Also, the first show_grouping_sets should be renamed to aggstate to make
it consistent with the type change.

2) The hash_instrumentation is a bit inconsistent with what we already
have. We have Instrumentation, JitInstrumentation, WorkerInstrumentation
and HashInstrumentation, so I suggest we follow the same patter and call
this HashTableInstrumentation or something like that.

3) Almost all executor nodes that are modified to include this new
instrumentation struct also include TupleHashTable, and the data are
essentially about the hash table. So my question is why not to include
this into TupleHashTable - that would mean we don't need to modify any
executor nodes, and it'd probably simplify code in explain.c too because
we could simply pass the hashtable.

It'd also simplify e.g. SubPlanState where we have to add two new fields
and make sure to update the right one.

4) The one exception to (3) is BitmapHeapScanState, which does include
TIDBitmap and not TupleHashTable. And then we have tbm_instrumentation
which "fakes" the data based on the pagetable. Maybe this is a sign that
TIDBitmap needs a slightly different struct? Also, I'm not sure why we
actually need tbm_instrumentation()? It just copies the instrumentation
data from TIDBitmap into the node level, but why couldn't we just look
at the instrumentation data in TIDBitmap directly?

5) I think the explain for grouping sets need a rething. Teh function
show_grouping_set_keys was originally meant to print just the keys, but
now it's also printing the hash table stats. IMO we need a new function
printing a grouping set info - calling show_grouping_set_keys to print
the keys, but then also printing the extra hashtable info.

6) subplan explain

I probably agree the hashtable info should be included in the subplan,
not in the parent node. Otherwise it's confusing, particularly when the
node has multiple subplans. The one thing I find a bit strange is this:

explain (analyze, timing off, summary off, costs off)
select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
QUERY PLAN
----------------------------------------------
Result (actual rows=1 loops=1)
SubPlan 1
Buckets: 4 (originally 2)
Null hashtable: Buckets: 2
-> Append (actual rows=2 loops=1)
-> Result (actual rows=1 loops=1)
-> Result (actual rows=1 loops=1)
(7 rows)

That is, there's no indication why would this use a hash table, because
the "hashed subplan" is included only in verbose mode:

explain (analyze, verbose, timing off, summary off, costs off)
select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
QUERY PLAN
----------------------------------------------------------------------------------
Result (actual rows=1 loops=1)
Output: (hashed SubPlan 1)
SubPlan 1
Buckets: 4 (originally 2) Memory Usage Hash: 1kB Memory Usage Tuples: 1kB
Null hashtable: Buckets: 2 Memory Usage Hash: 1kB Memory Usage Tuples: 0kB
-> Append (actual rows=2 loops=1)
-> Result (actual rows=1 loops=1)
Output: 'bar'::name
-> Result (actual rows=1 loops=1)
Output: 'bar'::name
(10 rows)

Not sure if this is an issue, maybe it's fine. But it's definitely
strange that we only print memory info in verbose mode - IMHO it's much
more useful info than the number of buckets etc.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-02-22 23:00:58 Re: explain HashAggregate to report bucket and memory stats
Previous Message Daniel Gustafsson 2020-02-22 21:22:27 Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?