Re: accounting for memory used for BufFile during hash joins

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-07-11 17:16:11
Message-ID: CA+TgmoYqpbzC1g+y0bxDFkpM60Kr2fnn0hVvT-RfVWonRY2dMA@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 6, 2019 at 9:49 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> Stepping back a bit, I think there is something fishy about the way we
> detect extreme skew. Is that a factor in this case? Right now we
> wait until we have a batch that gets split into child batches
> containing exactly 0% and 100% of the tuples before we give up.
> Previously I had thought of that as merely a waste of time, but
> clearly it's also a waste of unmetered memory. Oops.
>
> I think our extreme skew detector should go off sooner, because
> otherwise if you have N nicely distributed unique keys and also M
> duplicates of one bad egg key that'll never fit in memory, we keep
> repartitioning until none of the N keys fall into the batch containing
> the key for the M duplicates before we give up! You can use
> balls-into-bins maths to figure out the number, but I think that means
> we expect to keep splitting until we have N * some_constant batches,
> and that's just silly and liable to create massive numbers of
> partitions proportional to N, even though we're trying to solve a
> problem with M. In another thread I suggested we should stop when
> (say) 95% of the tuples go to one child batch. I'm not sure how you
> pick the number.

Another thing that is fishy about this is that we can't split a batch
or a bucket without splitting them all. Let's say that nbatches *
nbuckets = 16 million. One bucket in one batch contains 90% of the
tuples. Splitting *that* bucket might be a good idea if only 5% of the
tuples end up moving, perhaps even if only 1% end up moving. But, if
you have to double the total number of batches to get that benefit,
it's a lot less compelling, because now you have to rescan the outer
side more times.

I wonder whether we should be dividing things into batches unevenly,
based on the distribution of tuples we've seen so far. For example,
suppose we've gotten to 1024 buckets and that's all we can fit in
memory. If we decide to go to 2 batches, we'll use the next bit of the
hash key to decide which things go into batch 0 and which things go
into batch 1. But if we know that 50% of the data is in bucket 17, why
are we not making bucket 17 into a batch and everything else into
another batch? Then, when we process the batch that was derived from
bucket-17, we can use 10 completely new bits from the hash key to
slice the data from that bucket as finely as possible.
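
To make that concrete, here is a very rough sketch of the kind of
routing I have in mind (names invented for illustration; this is not
the existing ExecHashGetBucketAndBatch logic):

#include <stdint.h>

/*
 * Illustrative only: the overfull bucket gets a batch to itself, and
 * everything else stays together.  When the hot batch is processed
 * later, it is re-bucketed with ten hash bits that the original bucket
 * number never consumed.
 */
#define NBUCKETS    1024            /* bucket number uses bits 0..9 */
#define HOT_BUCKET  17              /* the bucket holding ~50% of the data */

static int
batch_for_tuple(uint32_t hashvalue)
{
    return (hashvalue % NBUCKETS) == HOT_BUCKET ? 1 : 0;
}

static int
rebucket_within_hot_batch(uint32_t hashvalue)
{
    /* bits 10..19 are "completely new" from the hot bucket's point of view */
    return (hashvalue >> 10) & (NBUCKETS - 1);
}

Every tuple routed to the hot batch shares the same low ten bits by
construction, so the second function really does get ten fresh bits to
spread that data out with.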

Now the bucket might be entirely duplicates, in which case no number
of additional bits will help. However, even in that case it's still a
good idea to make it its own batch, and then use some other algorithm
to process that batch. And if it's *not* entirely duplicates, but
there are say 2 or 3 really common values that unluckily hash to the
same bucket, then being able to use a lot more bits for that portion
of the data gives us the best chance of managing to spread it out into
different buckets.
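
Just to illustrate the kind of fallback I mean (purely as a sketch, not
a concrete proposal), a block nested loop over just that batch would
do: load as much of the inner side as fits, scan the whole outer side,
and repeat. A toy version:

#include <stdio.h>

/*
 * Toy illustration, not PostgreSQL code: join a batch whose key is one
 * big duplicate by taking the inner side in chunks that fit in "memory"
 * and rescanning the outer side once per chunk.
 */
#define NINNER  10
#define NOUTER  3
#define CHUNK   4               /* pretend only 4 inner tuples fit at once */

int
main(void)
{
    int     inner[NINNER] = {42, 42, 42, 42, 42, 42, 42, 42, 42, 42};
    int     outer[NOUTER] = {42, 7, 42};
    long    matches = 0;

    for (int lo = 0; lo < NINNER; lo += CHUNK)
    {
        int     hi = (lo + CHUNK < NINNER) ? lo + CHUNK : NINNER;

        for (int o = 0; o < NOUTER; o++)    /* rescan the whole outer batch */
            for (int i = lo; i < hi; i++)
                if (inner[i] == outer[o])
                    matches++;
    }
    printf("%ld matches\n", matches);       /* 10 inner * 2 outer = 20 */
    return 0;
}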

Similarly, if we split the hash join into four batches, and batch 0
fits in memory but batch 1 does not, we cannot further split batch 1
without splitting batch 2 and batch 3 also. That's not good either,
because those batches might be small and not need splitting.
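
One way to get there would be to track splits per batch instead of
globally. Here's a sketch of the bookkeeping with invented names, not
the current HashJoinTable layout:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-batch descriptor: each batch remembers which
 * hash-bit prefix routes tuples to it and how many bits that prefix
 * covers.  Splitting batch 1 then just replaces its descriptor with two
 * children whose prefixes differ in the next bit, without touching
 * batches 2 and 3.
 */
typedef struct BatchDesc
{
    uint32_t    prefix;         /* low-order hash bits that route here */
    int         nbits;          /* number of bits the prefix covers */
} BatchDesc;

static bool
batch_matches(const BatchDesc *batch, uint32_t hashvalue)
{
    return (hashvalue & ((UINT32_C(1) << batch->nbits) - 1)) == batch->prefix;
}

This is more or less the extendible-hashing idea of a local depth per
bucket, applied to batches: splitting batch 1 extends its prefix by one
bit and leaves everyone else's files alone.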

I guess what I'm trying to say is that our algorithms for dealing with
mis-estimation seem to be largely oblivious to the problem of skew,
and I don't think the problem is confined to extreme skew. Suppose you
have some data that is only moderately skewed, so that when you build
a hash table with 1024 buckets, 25% of the data is in buckets 0-19,
25% in buckets 20-768, 25% in buckets 769-946, and the last 25% in
buckets 947-1023. If you knew that, then when you discover that the
data is 4x too large to fit in memory, you can divide the data into 4
batches using those bucket number ranges, and get it done in exactly 4
batches. As it is, you'll need to split until every uniform range of
buckets fits in memory: 0-31 is going to be too big a range, so you're
going to go with 0-15, which means you'll have 64 batches instead of
4.
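
Mechanically, that could be as simple as walking the per-bucket sizes
and cutting a new batch whenever the running total would blow past what
fits in memory. A rough sketch, with invented names rather than actual
PostgreSQL code:

#include <stddef.h>

/*
 * Illustrative only: carve contiguous bucket ranges into batches so
 * each batch holds roughly one memory-load of data.  In the idealized
 * 25/25/25/25 example above this produces 4 batches rather than 64.
 */
static int
assign_batches_by_range(const size_t bucket_bytes[], int nbuckets,
                        size_t mem_limit, int batch_of_bucket[])
{
    int     batchno = 0;
    size_t  used = 0;

    for (int i = 0; i < nbuckets; i++)
    {
        if (used > 0 && used + bucket_bytes[i] > mem_limit)
        {
            batchno++;          /* start a new batch at this bucket */
            used = 0;
        }
        used += bucket_bytes[i];
        batch_of_bucket[i] = batchno;
    }
    return batchno + 1;         /* total number of batches used */
}

The obvious wrinkle is that a single bucket bigger than the memory
limit still comes out as an oversized batch, which is where
re-bucketing with fresh hash bits, or falling back to some other
algorithm, comes in.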

It seems to me that a good chunk of what's being proposed right now
basically ignores the fact that we're not really responding to the
skew in a very effective way. Thomas wants to stop splitting all the
buckets when splitting one of the buckets produces only a very small
benefit rather than when it produces no benefit at all, but he's not
asking why we're splitting all of the buckets in the first place.
Tomas wants to slice the array of batches because there are so many of
them, but why are there so many? As he said himself, "it gets to that
many batches because some of the values are very common and we don't
disable the growth earlier." Realistically, I don't see how there can
be so many batches that we can't even fit the metadata about the
batches into memory unless we're unnecessarily creating a lot of
little tiny batches that we don't really need.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
