Re: Hash support for grouping sets

From: Andres Freund <andres(at)anarazel(dot)de>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash support for grouping sets
Date: 2017-03-23 00:14:41
Message-ID: 20170323001441.akmwcwumbxtd6sbg@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

> +/*
> + * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
> * current_phase + 1. Juggle the tuplesorts accordingly.
> */
> static void
> initialize_phase(AggState *aggstate, int newphase)
> {
> - Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
> + Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);

I think this needs an expanded explanation somewhere in the file header
about what these "special" phases mean.

> @@ -838,17 +897,21 @@ advance_transition_function(AggState *aggstate,
> /*
> * Advance each aggregate transition state for one input tuple. The input
> * tuple has been stored in tmpcontext->ecxt_outertuple, so that it is
> - * accessible to ExecEvalExpr. pergroup is the array of per-group structs to
> - * use (this might be in a hashtable entry).
> + * accessible to ExecEvalExpr.
> + *
> + * We have two sets of transition states to handle: one for sorted aggregation
> + * and one for hashed; we do them both here, to avoid multiple evaluation of
> + * the inputs.
> *
> * When called, CurrentMemoryContext should be the per-query context.
> */

Changes to advance_aggregates() are, in my experience, quite likely to
have performance effects. This needs some performance tests.

I scheduled a small TPCH comparison run:

master q01 min: 28924.766   dev-hash min: 28893.923   [diff -0.11]
master q02 min:  2448.135   dev-hash min:  2430.589   [diff -0.72]
master q03 min: 14596.292   dev-hash min: 14421.054   [diff -1.22]
master q04 min:  1999.684   dev-hash min:  1958.727   [diff -2.09]
master q05 min: 10638.148   dev-hash min: 10753.075   [diff +1.07]
master q06 min:  3679.955   dev-hash min:  3649.114   [diff -0.85]
master q07 min: 10097.272   dev-hash min: 10358.089   [diff +2.52]
master q08 min:  3103.698   dev-hash min:  3078.108   [diff -0.83]
master q09 min:  13034.87   dev-hash min: 13049.402   [diff +0.11]
master q10 min: 11702.966   dev-hash min:  11535.05   [diff -1.46]
master q11 min:   577.114   dev-hash min:   586.767   [diff +1.65]
master q12 min:  9087.413   dev-hash min:  9272.874   [diff +2.00]
master q13 min: 16353.813   dev-hash min: 16473.882   [diff +0.73]
master q14 min:  1564.321   dev-hash min:   1552.31   [diff -0.77]
master q15 min:  3958.565   dev-hash min:  3946.728   [diff -0.30]
master q16 min:  3793.345   dev-hash min:  3784.996   [diff -0.22]
master q17 min:   828.286   dev-hash min:   834.929   [diff +0.80]
master q18 min: 13473.084   dev-hash min: 13777.327   [diff +2.21]
master q19 min:   654.241   dev-hash min:   673.863   [diff +2.91]
master q20 min:  2906.811   dev-hash min:  2882.793   [diff -0.83]
master q22 min:  1226.397   dev-hash min:  1262.045   [diff +2.82]

master total min: 154649.176 dev-hash min: 155175.645 [diff +0.34]

Looks like it could all be noise, but it seems worthwhile to look into
some of the larger regressions (e.g. q07, q19, q22).

> * GROUP BY columns. The per-group data is allocated in lookup_hash_entry(),
> * for each entry.
> *
> - * The hash table always lives in the aggcontext memory context.
> + * We have a separate hashtable and associated perhash data structure for each
> + * grouping set for which we're doing hashing.
> + *
> + * The hash tables always live in the hashcontext's per-tuple memory context
> + * (there is only one of these for all tables together, since they are all
> + * reset at the same time).
> */

I've not yet read up on the thread, or the whole patch, but an
explanation of the memory management for all those tables would be good
somewhere (maybe I've just not read that far).

> @@ -2388,7 +2597,36 @@ agg_retrieve_hash_table(AggState *aggstate)
> * ExecInitAgg
> *
> * Creates the run-time information for the agg node produced by the
> - * planner and initializes its outer subtree
> + * planner and initializes its outer subtree.
> + *
> + * What we get from the planner is actually one "real" Agg node which is part
> + * of the plan tree proper, but which optionally has an additional list of Agg
> + * nodes hung off the side via the "chain" field. This is because an Agg node
> + * happens to be a convenient representation of all the data we need for
> + * grouping sets.
> + *
> + * For many purposes, we treat the "real" node as if it were just the first
> + * node in the chain. The chain must be ordered such that hashed entries come
> + * before sorted/plain entries; the real node is marked AGG_MIXED if there are
> + * mixed types (in which case the real node describes one of the hashed

"mixed types" sounds a bit weird.

> + * We collect all hashed nodes into a single "phase", numbered 0, and create a
> + * sorted phase (numbered 1..n) for each AGG_SORTED or AGG_PLAIN node. Phase
> + * 0 is allocated even if there are no hashes, but remains unused in that
> + * case.

Ah, here's the explanation I'd been looking for ;)
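
FWIW, something along these lines in the nodeAgg.c header (or above
initialize_phase()) would have answered my question above - just a rough
sketch of what I'd expect, wording obviously up to you:

    /*
     * All hashed grouping sets are collected into a single phase, numbered
     * 0; each AGG_SORTED or AGG_PLAIN rollup gets its own phase numbered
     * 1..n.  Phase 0 is allocated even when nothing is hashed, it just
     * stays unused then.  "Resetting" therefore means going back to phase
     * 0 or 1 depending on strategy, which is why initialize_phase() now
     * accepts newphase <= 1.
     */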

> +Bitmapset *
> +DiscreteKnapsack(int max_weight, int num_items,
> + int *item_weights, double *item_values)
> +{
> + MemoryContext local_ctx = AllocSetContextCreate(CurrentMemoryContext,
> + "Knapsack",
> + ALLOCSET_SMALL_MINSIZE,
> + ALLOCSET_SMALL_INITSIZE,
> + ALLOCSET_SMALL_MAXSIZE);
> + MemoryContext oldctx = MemoryContextSwitchTo(local_ctx);
> + double *values;
> + Bitmapset **sets;
> + Bitmapset *result;
> + int i,
> + j;
> +
> + Assert(max_weight >= 0);
> + Assert(num_items > 0 && item_weights);
> +
> + values = palloc((1 + max_weight) * sizeof(double));
> + sets = palloc((1 + max_weight) * sizeof(Bitmapset *));

We usually cast the result of palloc.
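
I.e. something like (just illustrating the convention, not compiled):

    values = (double *) palloc((1 + max_weight) * sizeof(double));
    sets = (Bitmapset **) palloc((1 + max_weight) * sizeof(Bitmapset *));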

> + for (i = 0; i <= max_weight; ++i)
> + {
> + values[i] = 0;
> + sets[i] = bms_make_singleton(num_items);
> + }
> +
> + for (i = 0; i < num_items; ++i)
> + {
> + int iw = item_weights[i];
> + double iv = item_values ? item_values[i] : 1;
> +
> + for (j = max_weight; j >= iw; --j)
> + {
> + int ow = j - iw;
> +
> + if (values[j] <= values[ow] + iv)
> + {
> + /* copy sets[ow] to sets[j] without realloc */
> + if (j != ow)
> + {
> + sets[j] = bms_del_members(sets[j], sets[j]);
> + sets[j] = bms_add_members(sets[j], sets[ow]);
> + }
> +
> + sets[j] = bms_add_member(sets[j], i);
> +
> + values[j] = values[ow] + iv;
> + }
> + }
> + }
> +
> + MemoryContextSwitchTo(oldctx);
> +
> + result = bms_del_member(bms_copy(sets[max_weight]), num_items);
> +
> + MemoryContextDelete(local_ctx);
> +
> + return result;
> +}

> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 252af5c870..ea1c538287 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -21,7 +21,7 @@
> #include "postgres.h"
>
> #include "access/hash.h"
> -
> +#include "nodes/pg_list.h"

Pedanticism^7: You're deleting a line here...

> +/*
> * bms_nonempty_difference - do sets have a nonempty difference?
> */
> bool
> diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
> index 7418fbeded..d2d66f915b 100644
> --- a/src/backend/nodes/outfuncs.c
> +++ b/src/backend/nodes/outfuncs.c
> @@ -1936,6 +1936,28 @@ _outAggPath(StringInfo str, const AggPath *node)
> }
>
> static void
> +_outRollupData(StringInfo str, const RollupData *node)
> +{
> + WRITE_NODE_TYPE("ROLLUP");
> +
> + WRITE_NODE_FIELD(groupClause);
> + WRITE_NODE_FIELD(gsets);
> + WRITE_NODE_FIELD(gsets_data);
> + WRITE_FLOAT_FIELD(numGroups, "%.0f");
> + WRITE_BOOL_FIELD(hashable);
> + WRITE_BOOL_FIELD(is_hashed);
> +}
> +
> +static void
> +_outGroupingSetData(StringInfo str, const GroupingSetData *node)
> +{
> + WRITE_NODE_TYPE("GSDATA");
> +
> + WRITE_NODE_FIELD(set);
> + WRITE_FLOAT_FIELD(numGroups, "%.0f");
> +}

.0f?

> +static grouping_sets_data *
> +preprocess_grouping_sets(PlannerInfo *root)
> +{

It'd not hurt to add a comment as to what this function is actually
doing.

Btw, this would be a lot easier to review if the move to
preprocess_grouping_sets were done in a separate preliminary commit.

> + if (!bms_is_empty(gd->unsortable_refs))
> + {
...
> + }
> + else
> + sets = extract_rollup_sets(parse->groupingSets);

I hope there's tests for both these branches.

> + /*
> + * Is it hashable? We pretend empty sets are hashable even though we
> + * actually force them not to be hashed later. But don't bother if
> + * there's nothing but empty sets.
> + */

Why?

> @@ -3214,6 +3407,11 @@ get_number_of_groups(PlannerInfo *root,
> * estimate_hashagg_tablesize
> * estimate the number of bytes that a hash aggregate hashtable will
> * require based on the agg_costs, path width and dNumGroups.
> + *
> + * XXX this may be over-estimating the size now that hashagg knows to omit
> + * unneeded columns from the hashtable. Also for mixed-mode grouping sets,
> + * grouping columns not in the hashed set are counted here even though hashagg
> + * won't store them. Is this a problem?
> */

Hm. This seems like it could move us to use sorting, even if not
actually warranted.

> +/*
> + * For a given input path, consider the possible ways of doing grouping sets on
> + * it, by combinations of hashing and sorting. This can be called multiple
> + * times, so it's important that it not scribble on input. No result is
> + * returned, but any generated paths are added to grouped_rel.
> + */
> +
> +static void
> +consider_groupingsets_paths(PlannerInfo *root,

What's the newline after the comment about?

> + RelOptInfo *grouped_rel,
> + Path *path,
> + bool is_sorted,
> + bool can_hash,
> + PathTarget *target,
> + grouping_sets_data *gd,
> + const AggClauseCosts *agg_costs,
> + double dNumGroups)
> +{
> + Query *parse = root->parse;
> +
> + /*
> + * If we're not being offered sorted input, then only consider plans that
> + * can be done entirely by hashing.
> + *
> + * We can hash everything if it looks like it'll fit in work_mem. But if
> + * the input is actually sorted despite not being advertised as such, we
> + * prefer to make use of that in order to use less memory.
> + * If none of the grouping sets are sortable, then ignore the work_mem
> + * limit and generate a path anyway, since otherwise we'll just fail.
> + */
> + if (!is_sorted)
> + {
> + List *new_rollups = NIL;
> + RollupData *unhashable_rollup = NULL;
> + List *sets_data;
> + List *empty_sets_data = NIL;
> + List *empty_sets = NIL;
> + ListCell *lc;
> + ListCell *l_start = list_head(gd->rollups);
> + AggStrategy strat = AGG_HASHED;
> + Size hashsize;
> + double exclude_groups = 0.0;
> +
> + Assert(can_hash);
> +
> + if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
> + {
> + unhashable_rollup = lfirst(l_start);

a) cast result of lfirst/lnext/whatnot. b) Why is this guaranteed to be
unhashable?
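
For a), presumably:

    unhashable_rollup = (RollupData *) lfirst(l_start);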

> + exclude_groups = unhashable_rollup->numGroups;
> + l_start = lnext(l_start);
> + }

What guarantees that such a rollup would be at the start?

> + hashsize = estimate_hashagg_tablesize(path,
> + agg_costs,
> + dNumGroups - exclude_groups);
> +
> + if (hashsize > work_mem * 1024L && gd->rollups)
> + return; /* nope, won't fit */

Didn't you talk above about generating a path anyway? Or is gd->rollups
guaranteed to be empty in that case (if so, it's not obvious on a casual
read)?

> + /*
> + * If we have sorted input but nothing we can do with it, bail.
> + */
> +
> + if (list_length(gd->rollups) == 0)
> + return;

Postgres style usually doesn't add a newline between a comment and the
block it concerns - you do that all over.
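
I.e. I'd write the above as:

    /*
     * If we have sorted input but nothing we can do with it, bail.
     */
    if (list_length(gd->rollups) == 0)
        return;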

> + if (can_hash && gd->any_hashable)
> + {
> + List *rollups = NIL;
> + List *hash_sets = list_copy(gd->unsortable_sets);
> + double availspace = (work_mem * 1024.0);

Why the parens, and why the .0?

> + ListCell *lc;
> +
> + /*
> + * Account first for space needed for groups we can't sort at all.
> + */
> + availspace -= (double) estimate_hashagg_tablesize(path,
> + agg_costs,
> + gd->dNumHashGroups);
> +
> + if (availspace > 0 && list_length(gd->rollups) > 1)
> + {
> + double scale;
> + int num_rollups = list_length(gd->rollups);
> + int k_capacity;
> + int *k_weights = palloc(num_rollups * sizeof(int));
> + Bitmapset *hash_items = NULL;
> + int i;
> +
> + /*
> + * To use the discrete knapsack, we need to scale the values to a
> + * reasonably small bounded range. We choose to allow a 5% error
> + * margin; we have no more than 4096 rollups in the worst possible
> + * case, which with a 5% error margin will require a bit over 42MB
> + * of workspace. (Anyone wanting to plan queries that complex had
> + * better have the memory for it. In more reasonable cases, with
> + * no more than a couple of dozen rollups, the memory usage will
> + * be negligible.)
> + *
> + * k_capacity is naturally bounded, but we clamp the values for
> + * scale and weight (below) to avoid overflows or underflows (or
> + * uselessly trying to use a scale factor less than 1 byte).
> + */

I think you need a higher-level explanation somewhere of how you're mapping
the work_mem problem onto the knapsack problem.
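
To check my own understanding, something like this near the top of the block
is roughly what I'd have liked to see (my reading, so possibly wrong):

    /*
     * Knapsack mapping: the capacity is the work_mem budget left after the
     * always-hashed unsortable sets, scaled down by "scale" so it fits in
     * an int; the items are the rollups that could be converted to hashing;
     * each item's weight is that rollup's estimated hashtable size (also
     * scaled); and each item's value is 1, i.e. one sort saved.
     * DiscreteKnapsack() then picks the subset of rollups to hash that
     * saves the most sorts while (approximately) staying within work_mem.
     */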

> + /*
> + * Apply knapsack algorithm; compute the set of items which
> + * maximizes the value stored (in this case the number of sorts
> + * saved) while keeping the total size (approximately) within
> + * capacity.
> + */

Hm. So we're solely optimizing for the number of sorts, rather than the
cost of the sorts. Probably an acceptable tradeoff.

> +}

Phew, this isn't simple. Need to be more awake than I am 8 hours into a
flight.

> @@ -737,7 +739,8 @@ typedef enum AggStrategy
> {
> AGG_PLAIN, /* simple agg across all input rows */
> AGG_SORTED, /* grouped agg, input must be sorted */
> - AGG_HASHED /* grouped agg, use internal hashtable */
> + AGG_HASHED, /* grouped agg, use internal hashtable */
> + AGG_MIXED /* grouped agg, hash and sort concurrently */
> } AggStrategy;

Concurrently sounds a bit like using multiple processes / parallelism...

- Andres
