Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-05-22 11:19:23
Message-ID: 20190522111923.qdkxdj7ssk7rsvxn@development
Lists: pgsql-hackers

On Tue, May 21, 2019 at 05:38:50PM -0700, Melanie Plageman wrote:
>On Wed, May 8, 2019 at 8:08 AM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
>wrote:
>
>> On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
>> > One thing I was a little confused by was the nbatch_inmemory member
>> > of the hashtable. The comment in ExecChooseHashTableSize says that
>> > it is determining the number of batches we can fit in memory. I
>> > thought that the problem was the amount of space taken up by the
>> > BufFile data structure itself--which is related to the number of
>> > open BufFiles you need at a time. This comment in
>> > ExecChooseHashTableSize makes it sound like you are talking about
>> > fitting more than one batch of tuples into memory at a time. I was
>> > under the impression that you could only fit one batch of tuples in
>> > memory at a time.
>>
>> I suppose you mean this chunk:
>>
>>    /*
>>     * See how many batches we can fit into memory (driven mostly by size
>>     * of BufFile, with PGAlignedBlock being the largest part of that).
>>     * We need one BufFile for inner and outer side, so we count it twice
>>     * for each batch, and we stop once we exceed (work_mem/2).
>>     */
>>    while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
>>           <= (work_mem * 1024L / 2))
>>        nbatch_inmemory *= 2;
>>
>> Yeah, that comment is a bit confusing. What the code actually does is
>> computing the largest "slice" of batches for which we can keep the
>> BufFile structs in memory, without exceeding work_mem/2.
>>
>> Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.
>>
>
>I definitely would prefer to see hashtable->nbatch_inmemory renamed to
>hashtable->nbatch_slice--or maybe hashtable->nbuff_inmemory?
>
>I've been poking around the code for a while today, and even though I
>know that nbatch_inmemory refers to the BufFiles that can fit in
>memory, I keep forgetting and thinking it refers to the tuple data
>that can fit in memory.
>

That's a fair point. I think nbatch_slice is a good name.
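
So in ExecChooseHashTableSize the loop would end up looking roughly like
this (the same snippet as above, just renamed and with a hopefully less
confusing comment - not meant as final wording):

    /*
     * Determine the size of a "slice" - the largest number of batches for
     * which we can keep the BufFile structs in memory. Each batch needs a
     * BufFile for the inner and the outer side, and the dominant part of
     * a BufFile is the 8kB PGAlignedBlock buffer, so we keep doubling the
     * slice size until the next doubling would exceed (work_mem / 2).
     */
    while ((nbatch_slice * 2) * sizeof(PGAlignedBlock) * 2
           <= (work_mem * 1024L / 2))
        nbatch_slice *= 2;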

>It might be worth explicitly calling out somewhere in the comments
>that overflow slices are only created in two cases: when the number of
>batches was underestimated and ExecHashIncreaseNumBatches raises it
>past hashtable->nbatch_inmemory, or when the number of batches chosen
>at hashtable creation time already exceeds
>hashtable->nbatch_inmemory -- i.e. the number of BufFiles that can
>actually be managed in memory (the name confuses me especially at
>hashtable creation time).
>

Yes, this definitely needs to be explained somewhere - possibly in a
comment at the beginning of nodeHash.c or something like that.
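
Something along these lines, perhaps (very much a first stab at the
wording, for the top of nodeHash.c):

    /*
     * To keep the memory needed for BufFile structs under control, batches
     * are grouped into "slices" of nbatch_slice batches, sized so that the
     * BufFiles of a single slice (one per batch, for both the inner and
     * outer side) fit into work_mem/2. Only the BufFiles of the current
     * slice are kept in memory - batches from other slices are written to
     * a single per-slice overflow file instead. Such overflow slices only
     * come into existence in two cases: when the initial number of batches
     * chosen by ExecChooseHashTableSize already exceeds nbatch_slice, or
     * when an underestimate forces ExecHashIncreaseNumBatches to grow the
     * number of batches past nbatch_slice.
     */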

FWIW I wonder if this "slicing" would be useful even with correct
estimates. E.g. let's say we can fit 128 batches into work_mem, but we
expect to need 256 (and the estimate is accurate). At that point it's
probably too aggressive to disable hash joins entirely - a merge join is
likely more expensive than just using the slicing. But that should be a
cost-based decision.
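
Back-of-the-envelope for that example, assuming the usual 8kB buffer per
BufFile (hypothetical numbers, not from the patch):

    int     nbatch = 256;       /* correctly estimated number of batches */
    int     nbatch_slice = 128; /* batches whose BufFiles fit in work_mem/2 */
    int     nslices = nbatch / nbatch_slice;    /* = 2 slices */

    /*
     * Per side that's 128 in-memory BufFiles (128 x 8kB = 1MB) plus a
     * single overflow file for the one non-current slice, so the extra
     * memory needed for slicing is negligible - the cost is mostly the
     * extra I/O through the overflow file.
     */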

>
>>
>> Attached is an updated patch, fixing this. I tried to clarify some of
>> the comments too, and I fixed another bug I found while running the
>> regression tests. It's still very much a crappy PoC code, though.
>>
>>
>So, I ran the following example on master and with your patch.
>
>drop table foo;
>drop table bar;
>create table foo(a int, b int);
>create table bar(c int, d int);
>insert into foo select i, i from generate_series(1,10000)i;
>insert into bar select 1, 1 from generate_series(1,1000)i;
>insert into bar select i%3, i%3 from generate_series(1000,10000)i;
>insert into foo select 1,1 from generate_series(1,1000)i;
>analyze foo; analyze bar;
>set work_mem=64;
>
>On master, explain analyze looked like this
>
>postgres=# explain analyze verbose select * from foo, bar where a = c;
>                                                     QUERY PLAN
>--------------------------------------------------------------------------------------------------------------------------
> Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.962..1048.442 rows=4008001 loops=1)
>   Output: foo.a, foo.b, bar.c, bar.d
>   Hash Cond: (bar.c = foo.a)
>   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.030..1.777 rows=10001 loops=1)
>         Output: bar.c, bar.d
>   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.285..12.285 rows=11000 loops=1)
>         Output: foo.a, foo.b
>         Buckets: 2048 (originally 2048)  Batches: 64 (originally 16)  Memory Usage: 49kB
>         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.023..3.786 rows=11000 loops=1)
>               Output: foo.a, foo.b
> Planning Time: 0.435 ms
> Execution Time: 1206.904 ms
>(12 rows)
>
>and with your patch, it looked like this.
>
>postgres=# explain analyze verbose select * from foo, bar where a = c;
>                                                     QUERY PLAN
>--------------------------------------------------------------------------------------------------------------------------
> Hash Join  (cost=339.50..53256.27 rows=4011001 width=16) (actual time=28.256..1102.026 rows=4008001 loops=1)
>   Output: foo.a, foo.b, bar.c, bar.d
>   Hash Cond: (bar.c = foo.a)
>   ->  Seq Scan on public.bar  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.040..1.717 rows=10001 loops=1)
>         Output: bar.c, bar.d
>   ->  Hash  (cost=159.00..159.00 rows=11000 width=8) (actual time=12.327..12.327 rows=11000 loops=1)
>         Output: foo.a, foo.b
>         Buckets: 2048 (originally 2048)  Batches: 16384 (originally 16, in-memory 2)  Memory Usage: 131160kB
>         ->  Seq Scan on public.foo  (cost=0.00..159.00 rows=11000 width=8) (actual time=0.029..3.569 rows=11000 loops=1)
>               Output: foo.a, foo.b
> Planning Time: 0.260 ms
> Execution Time: 1264.995 ms
>(12 rows)
>
>I noticed that the number of batches is much higher with the patch,
>and, while checking $PGDATA/base/pgsql_tmp, I saw that the number of
>temp files (the overflow files) present at any given time was quite
>high.
>
>I would imagine that the desired behaviour is to keep memory usage
>within work_mem.

There's definitely something fishy going on. I suspect it's either because
of the duplicate values (which might fit into 64kB on master, but not once
the BufFile buffers are accounted for), or because the initial 16 batches
can't possibly fit into work_mem in the first place (16 batches means 32
BufFiles, i.e. 256kB of buffers alone, compared to 64kB of work_mem).

If you try with a larger work_mem, say 256kB, does that behave OK?

>In this example, the number of slices is about 8000, each of which
>would have an overflow file. Is this the case you mention in the
>comment in ExecChooseHashTableSize ?
>
>* We ignore (per-slice)
>* overflow files, because those serve as "damage control" for cases
>* when per-batch BufFiles would exceed work_mem. Given enough batches
>* it's impossible to enforce work_mem strictly, because the overflow
>* files alone will consume more memory.
>

Yes. 8000 slices is ~64MB of overflow BufFiles, and since we need them on
both sides of the join that'd be ~128MB - which is pretty much exactly the
131160kB reported above.
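
Spelling the arithmetic out (counting only the 8kB buffer of each overflow
BufFile and ignoring everything else):

    #include <stdio.h>

    int
    main(void)
    {
        long    nbatch = 16384;     /* "Batches" from the explain output */
        long    nbatch_slice = 2;   /* "in-memory 2" from the same line */
        long    buffile_kb = 8;     /* 8kB PGAlignedBlock buffer per BufFile */

        long    nslices = nbatch / nbatch_slice;    /* 8192 slices */
        long    per_side_kb = nslices * buffile_kb; /* ~64MB per side */
        long    total_kb = 2 * per_side_kb;         /* inner + outer */

        /* prints: slices = 8192, overflow BufFiles ~ 131072 kB */
        printf("slices = %ld, overflow BufFiles ~ %ld kB\n",
               nslices, total_kb);
        return 0;
    }

which lands within ~100kB of the 131160kB the explain output reports.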

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
