Re: accounting for memory used for BufFile during hash joins

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: accounting for memory used for BufFile during hash joins
Date: 2019-05-08 15:08:44
Message-ID: 20190508150844.rij36rtuk4lhvztw@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, May 07, 2019 at 05:30:27PM -0700, Melanie Plageman wrote:
> On Mon, May 6, 2019 at 8:15 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.
> It does slightly increase the amount of data we need to shuffle between
> the temp files, because we can't write the data directly to batches in
> "future" slices. But that amplification is capped to ~2.2x (compared to
> the ~1.4x in master) - I've shared some measurements in [1].
> [1]
> Cool, I misunderstood. I looked at the code again today, and, at the email
> thread where you measured "amplification".

Oh! I hope you're not too disgusted by the code in that PoC patch ;-)

> In terms of how many times you write each tuple, is it accurate to
> say that a tuple can now be spilled three times (in the worst case)
> whereas, before, it could be spilled only twice?
> 1 - when building the inner side hashtable, tuple is spilled to a "slice"
> file
> 2 - (assuming the number of batches was increased) during execution, when
> a tuple belonging to a later slice's spill file is found, it is re-spilled
> to that slice's spill file
> 3 - during execution, when reading from its slice file, it is re-spilled
> (again) to its batch's spill file

Yes, that's mostly accurate understanding. Essentially this might add
one extra step of "reshuffling" from the per-slice to per-batch files.

> Is it correct that the max number of BufFile structs you will have
> is equal to the number of slices + number of batches in a slice
> because that is the max number of open BufFiles you would have at a
> time?

Yes. With the caveat that we need twice that number of BufFile structs,
because we need them on both sides of the join.

> By the way, applying v4 patch on master, in an assert build, I am tripping
> some
> asserts -- starting with
> Assert(!file->readOnly);
> in BufFileWrite

Whoooops :-/

> One thing I was a little confused by was the nbatch_inmemory member
> of the hashtable.  The comment in ExecChooseHashTableSize says that
> it is determining the number of batches we can fit in memory.  I
> thought that the problem was the amount of space taken up by the
> BufFile data structure itself--which is related to the number of
> open BufFiles you need at a time. This comment in
> ExecChooseHashTableSize makes it sound like you are talking about
> fitting more than one batch of tuples into memory at a time. I was
> under the impression that you could only fit one batch of tuples in
> memory at a time.

I suppose you mean this chunk:

* See how many batches we can fit into memory (driven mostly by size
* of BufFile, with PGAlignedBlock being the largest part of that).
* We need one BufFile for inner and outer side, so we count it twice
* for each batch, and we stop once we exceed (work_mem/2).
while ((nbatch_inmemory * 2) * sizeof(PGAlignedBlock) * 2
<= (work_mem * 1024L / 2))
nbatch_inmemory *= 2;

Yeah, that comment is a bit confusing. What the code actually does is
computing the largest "slice" of batches for which we can keep the
BufFile structs in memory, without exceeding work_mem/2.

Maybe the nbatch_inmemory should be renamed to nbatch_slice, not sure.

> So, I was stepping through the code with work_mem set to the lower
> bound, and in ExecHashIncreaseNumBatches, I got confused.
> hashtable->nbatch_inmemory was 2 for me, thus, nbatch_tmp was 2 so,
> I didn't meet this condition if (nbatch_tmp >
> hashtable->nbatch_inmemory) since I just set nbatch_tmp using
> hashtable->nbatch_inmemory So, I didn't increase the number of
> slices, which is what I was expecting. What happens when
> hashtable->nbatch_inmemory is equal to nbatch_tmp?

Ah, good catch. The condition you're refering to

if (nbatch_tmp > hashtable->nbatch_inmemory)

should actually be

if (nbatch > hashtable->nbatch_inmemory)

because the point is to initialize BufFile structs for the overflow
files, and we need to do that once we cross nbatch_inmemory.

And it turns out this actually causes the assert failures in regression
tests, you reported earlier. It failed to initialize the overflow files
in some cases, so the readOnly flag seemed to be set.

Attached is an updated patch, fixing this. I tried to clarify some of
the comments too, and I fixed another bug I found while running the
regression tests. It's still very much a crappy PoC code, though.


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v4-per-slice-overflow-file-20190508.patch text/plain 27.8 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2019-05-08 15:19:18 Re: [HACKERS] Detrimental performance impact of ringbuffers on performance
Previous Message Laurenz Albe 2019-05-08 14:49:23 Re: Identity columns should own only one sequence