Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-03-22 10:17:19
Message-ID: CAEepm=2ug5CZ7S9c6yqaDH0er9oLNH=eXa7WzkFHrBV_FDmTag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi,

Here is a new version addressing feedback from Peter and Andres.
Please see below.

On Wed, Mar 22, 2017 at 3:18 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Mar 21, 2017 at 5:07 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> buffile.c should stop pretending to care about anything other than
>>> temp files, IMV. 100% of all clients that want temporary files go
>>> through buffile.c. 100% of all clients that want non-temp files (files
>>> which are not marked FD_TEMPORARY) access fd.c directly, rather than
>>> going through buffile.c.
>>
>> I still need BufFile because I want buffering.
>>
>> There are 3 separate characteristics enabled by flags with 'temporary'
>> in their name. I think we should consider separating the concerns by
>> splitting and renaming them:
>>
>> 1. Segmented BufFile behaviour. I propose renaming BufFile's isTemp
>> member to isSegmented, because that is what it really does. I want
>> that feature independently without getting confused about lifetimes.
>> Tested with small MAX_PHYSICAL_FILESIZE as you suggested.
>
> I would have proposed to get rid of the isTemp field entirely. It is
> always true with current usage, any only #ifdef NOT_USED code presumes
> that it could be any other way. BufFile is all about temp files, which
> ISTM should be formalized. The whole point of BufFile is to segment
> fd.c temp file segments. Who would ever want to use BufFile without
> that capability anyway?

Yeah, it looks like you're probably right, but I guess others could
have uses for BufFile that we don't know about. It doesn't seem like
it hurts to leave the variable in existence.

>> 2. The temp_file_limit system. Currently this applies to fd.c files
>> opened with FD_TEMPORARY. You're right that we shouldn't be able to
>> escape that sanity check on disk space just because we want to manage
>> disk file ownership differently. I propose that we create a new flag
>> FD_TEMP_FILE_LIMIT that can be set independentlyisTemp of the flags
>> controlling disk file lifetime. When working with SharedBufFileSet,
>> the limit applies to each backend in respect of files it created,
>> while it has them open. This seems a lot simpler than any
>> shared-temp-file-limit type scheme and is vaguely similar to the way
>> work_mem applies in each backend for parallel query.
>
> I agree that that makes sense as a user-visible behavior of
> temp_file_limit. This user-visible behavior is what I actually
> implemented for parallel CREATE INDEX.

Ok, good.

>> 3. Delete-on-close/delete-at-end-of-xact. I don't want to use that
>> facility so I propose disconnecting it from the above. We c{ould
>> rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to
>> FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT.
>
> This reliably unlink()s all files, albeit while relying on unlink()
> ENOENT as a condition that terminates deletion of one particular
> worker's BufFile's segments. However, because you effectively no
> longer use resowner.c, ISTM that there is still a resource leak in
> error paths. ResourceOwnerReleaseInternal() won't call FileClose() for
> temp-ish files (that are not quite temp files in the current sense) in
> the absence of no other place managing to do so, such as
> BufFileClose(). How can you be sure that you'll actually close() the
> FD itself (not vFD) within fd.c in the event of an error? Or Delete(),
> which does some LRU maintenance for backend's local VfdCache?

Yeah, I definitely need to use resowner.c. The only thing I want to
opt out of is automatic file deletion in that code path.

> If I follow the new code correctly, then it doesn't matter that you've
> unlink()'d to take care of the more obvious resource management chore.
> You can still have a reference leak like this, if I'm not mistaken,
> because you still have backend local state (local VfdCache) that is
> left totally decoupled with the new "shadow resource manager" for
> shared BufFiles.

You're right. The attached version fixes these problems. The
BufFiles created or opened in this new way now participate in both of
our leak-detection and clean-up schemes: the one in resowner.c
(because I'm now explicitly registering with it as I had failed to do
before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is
set, which I already had in the previous version for the creator, but
not the opener of such a file). I tested by commenting out my
explicit BufFileClose calls to check that resowner.c starts
complaining, and then by commenting out the resowner registration too
to check that CleanupTempFiles starts complaining.

>> As shown in 0008-hj-shared-buf-file-v8.patch. Thoughts?
>
> A less serious issue I've also noticed is that you add palloc() calls,
> implicitly using the current memory context, within buffile.c.
> BufFileOpenTagged() has some, for example. However, there is a note
> that we don't need to save the memory context when we open a BufFile
> because we always repalloc(). That is no longer the case here.

I don't see a problem here. BufFileOpenTagged() is similar to
BufFileCreateTemp() which calls makeBufFile() and thereore returns a
result that is allocated in the current memory context. This seems
like the usual deal.

Thanks for the review!

On Wed, Mar 22, 2017 at 1:07 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Thu, Feb 16, 2017 at 3:36 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> I think the synchronization protocol with the various phases needs to be
>> documented somewhere. Probably in nodeHashjoin.c's header.
>
> I will supply that shortly.

Added in the attached version.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
parallel-shared-hash-v9.tgz application/x-gzip 64.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2017-03-22 10:20:52 Re: Push down more UPDATEs/DELETEs in postgres_fdw
Previous Message Pavan Deolasee 2017-03-22 10:14:02 Re: Patch: Write Amplification Reduction Method (WARM)