Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-03-26 22:03:24
Message-ID: CAEepm=21SKC-MBGqWUC4FXvh37=o2faXWAbF8A=LGW0Wfbubhw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 27, 2017 at 9:41 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
>
> SharedBufFile allows temporary files to be created by one backend and
> then exported for read-only access by other backends, with clean-up
> managed by reference counting associated with a DSM segment. This includes
> changes to fd.c and buffile.c to support new kinds of temporary file.
>
>
> diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
> index 4ca0ea4..a509c05 100644
> --- a/src/backend/storage/file/buffile.c
> +++ b/src/backend/storage/file/buffile.c
>
> I think the new facilities should be explained in the file's header.

Will do.

> @@ -68,9 +71,10 @@ struct BufFile
> * avoid making redundant FileSeek calls.
> */
>
> - bool isTemp; /* can only add files if this is TRUE */
> + bool isSegmented; /* can only add files if this is TRUE */
>
> That's a bit of a weird and uncommented upon change.

I was trying to cut down on the number of places where we use the word
'temporary' to activate various behaviours. In this case, the only
thing it controls is whether the BufFile is backed by a single fd.c
File or by many segments, so I figured it should be renamed.

As Peter and you have pointed out, there may be a case for removing it
altogether.

> @@ -79,6 +83,8 @@ struct BufFile
> */
> ResourceOwner resowner;
>
> + BufFileTag tag; /* for discoverability between backends */
>
> Not perfectly happy with the name tag here, the name is a bit too
> similar to BufferTag - something quite different.

Yeah, will rename.

> +static void
> +make_tagged_path(char *tempdirpath, char *tempfilepath,
> + const BufFileTag *tag, int segment)
> +{
> + if (tag->tablespace == DEFAULTTABLESPACE_OID ||
> + tag->tablespace == GLOBALTABLESPACE_OID)
> + snprintf(tempdirpath, MAXPGPATH, "base/%s", PG_TEMP_FILES_DIR);
> + else
> + {
> + snprintf(tempdirpath, MAXPGPATH, "pg_tblspc/%u/%s/%s",
> + tag->tablespace, TABLESPACE_VERSION_DIRECTORY,
> + PG_TEMP_FILES_DIR);
> + }
> +
> + snprintf(tempfilepath, MAXPGPATH, "%s/%s%d.%d.%d.%d.%d", tempdirpath,
> + PG_TEMP_FILE_PREFIX,
> + tag->creator_pid, tag->set, tag->partition, tag->participant,
> + segment);
>
> Is there a risk that this ends up running afoul of filename length
> limits on some platforms?

Hmm. I didn't think so. Do we have a project guideline on maximum
path lengths based on some kind of survey? There are various limits
involved (filesystem and OS per-path-component limits, total limits,
and the confusing PATH_MAX, MAX_PATH etc macros), but I was under the
impression that these numbers were always at least 255. This scheme
seems capable of producing ~50 bytes in the final component
(admittedly more if int is 64 bits), and then nowhere near enough to
reach a limit of that order in the earlier components.

> +}
> +
> +static File
> +make_tagged_segment(const BufFileTag *tag, int segment)
> +{
> + File file;
> + char tempdirpath[MAXPGPATH];
> + char tempfilepath[MAXPGPATH];
> +
> + /*
> + * There is a remote chance that disk files with this (pid, set) pair
> + * already exist after a crash-restart. Since the presence of
> + * consecutively numbered segment files is used by BufFileOpenShared to
> + * determine the total size of a shared BufFile, we'll defend against
> + * confusion by unlinking segment 1 (if it exists) before creating segment
> + * 0.
> + */
>
> Gah. Why on earth aren't we removing temp files when restarting, not
> just on the initial start? That seems completely wrong?

See the comment above RemovePgTempFiles in fd.c. From comments on
this list I understand that this is a subject that Robert and Tom
don't agree on. I don't mind either way, but as long as
RemovePgTempFiles works that way and my patch uses the existence of
files to know how many files there are, I have to defend against that
danger by making sure that I don't accidentally identify files from
before a crash/restart as active.

> If we do decide not to change this: Why is that sufficient? Doesn't the
> same problem exist for segments later than the first?

It does exist and it is handled. The comment really should say
"unlinking segment N + 1 (if it exists) before creating segment N".
Will update.

> +/*
> + * Open a file that was previously created in another backend with
> + * BufFileCreateShared.
> + */
> +BufFile *
> +BufFileOpenTagged(const BufFileTag *tag)
> +{
> + BufFile *file = (BufFile *) palloc(sizeof(BufFile));
> + char tempdirpath[MAXPGPATH];
> + char tempfilepath[MAXPGPATH];
> + Size capacity = 1024;
> + File *files = palloc(sizeof(File) * capacity);
> + int nfiles = 0;
> +
> + /*
> + * We don't know how many segments there are, so we'll probe the
> + * filesystem to find out.
> + */
> + for (;;)
> + {
> + /* See if we need to expand our file space. */
> + if (nfiles + 1 > capacity)
> + {
> + capacity *= 2;
> + files = repalloc(files, sizeof(File) * capacity);
> + }
> + /* Try to load a segment. */
> + make_tagged_path(tempdirpath, tempfilepath, tag, nfiles);
> + files[nfiles] = PathNameOpenTemporaryFile(tempfilepath);
> + if (files[nfiles] <= 0)
> + break;
>
> Isn't 0 a theoretically valid return value from
> PathNameOpenTemporaryFile?

I was confused by that too, because it isn't the way normal OS fds
work. But existing code dealing with Postgres vfd return values
treats 0 as an error. See for example OpenTemporaryFile and
OpenTemporaryFileInTablespace.

> +/*
> + * Delete a BufFile that was created by BufFileCreateTagged. Return true if
> + * at least one segment was deleted; false indicates that no segment was
> + * found, or an error occurred while trying to delete. Errors are logged but
> + * the function returns normally because this is assumed to run in a clean-up
> + * path that might already involve an error.
> + */
> +bool
> +BufFileDeleteTagged(const BufFileTag *tag)
> +{
> + char tempdirpath[MAXPGPATH];
> + char tempfilepath[MAXPGPATH];
> + int segment = 0;
> + bool found = false;
> +
> + /*
> + * We don't know if the BufFile really exists, because SharedBufFile
> + * tracks only the range of file numbers. If it does exist, we don't
> + * know how many 1GB segments it has, so we'll delete until we hit ENOENT or
> + * an IO error.
> + */
> + for (;;)
> + {
> + make_tagged_path(tempdirpath, tempfilepath, tag, segment);
> + if (!PathNameDeleteTemporaryFile(tempfilepath, false))
> + break;
> + found = true;
> + ++segment;
> + }
> +
> + return found;
> +}
>
> If we crash in the middle of this, we'll leave the later files abandoned,
> no?

Yes. In general, there are places we can crash or unplug the server
etc and leave files behind. In that case, RemovePgTempFiles cleans up
(or declines to do so deliberately to support debugging, as
discussed).

> +/*
> + * BufFileSetReadOnly --- flush and make read-only, in preparation for sharing
> + */
> +void
> +BufFileSetReadOnly(BufFile *file)
> +{
> + BufFileFlush(file);
> + file->readOnly = true;
> +}
>
> That flag is unused, right?

It's used for an assertion in BufFileWrite. Maybe it could be
elog(ERROR, ...) instead, but either way it's a debugging aid to
report misuse.

> + * PathNameCreateTemporaryFile, PathNameOpenTemporaryFile and
> + * PathNameDeleteTemporaryFile are used for temporary files that may be shared
> + * between backends. A File created or opened with these functions is not
> + * automatically deleted when the file is closed, but it is automatically
> + * closed at end of transaction and counts against the temporary file limit of
> + * the backend that created it. Any File created this way must be explicitly
> + * deleted with PathNameDeleteTemporaryFile. Automatic file deletion is not
> + * provided because this interface is designed for use by buffile.c and
> + * indirectly by sharedbuffile.c to implement temporary files with shared
> + * ownership and cleanup.
>
> Hm. Those name are pretty easy to misunderstand, no? s/Temp/Shared/?

Hmm. Yeah these may be better. Will think about that.

> /*
> + * Called whenever a temporary file is deleted to report its size.
> + */
> +static void
> +ReportTemporaryFileUsage(const char *path, off_t size)
> +{
> + pgstat_report_tempfile(size);
> +
> + if (log_temp_files >= 0)
> + {
> + if ((size / 1024) >= log_temp_files)
> + ereport(LOG,
> + (errmsg("temporary file: path \"%s\", size %lu",
> + path, (unsigned long) size)));
> + }
> +}
>
> Man, the code for this sucks (not your fault). Shouldn't this properly
> be at the buffile.c level, where we could implement limits above 1GB
> properly?

+1

> +/*
> + * Open a file that was created with PathNameCreateTemporaryFile in another
> + * backend. Files opened this way don't count against the temp_file_limit of
> + * the caller, are read-only and are automatically closed at the end of the
> + * transaction but are not deleted on close.
> + */
>
> This really reinforces my issues with the naming scheme. This ain't a
> normal tempfile.

It sort of makes sense if you consider that a 'named' temporary file
is different... but yeah, point taken.

> +File
> +PathNameOpenTemporaryFile(char *tempfilepath)
> +{
> + File file;
> +
> + /*
> + * Open the file. Note: we don't use O_EXCL, in case there is an orphaned
> + * temp file that can be reused.
> + */
> + file = PathNameOpenFile(tempfilepath, O_RDONLY | PG_BINARY, 0);
>
> If so, wouldn't we need to truncate the file?

Yes, this lacks O_TRUNC. Thanks.

> + * A single SharedBufFileSet can manage any number of 'tagged' BufFiles that
> + * are shared between a fixed number of participating backends. Each shared
> + * BufFile can be written to by a single participant but can be read by any
> + * backend after it has been 'exported'. Once a given BufFile is exported, it
> + * becomes read-only and cannot be extended. To create a new shared BufFile,
> + * a participant needs its own distinct participant number, and needs to
> + * specify an arbitrary partition number for the file. To make it available
> + * to other backends, it must be explicitly exported, which flushes internal
> + * buffers and renders it read-only. To open a file that has been shared, a
> + * backend needs to know the number of the participant that created the file,
> + * and the partition number. It is the responsibility of calling code to ensure
> + * that files are not accessed before they have been shared.
>
> Hm. One way to make this safer would be to rename files when exporting.
> Should be sufficient to do this to the first segment, I guess.

Interesting idea. Will think about that. That comment isn't great
and repeats itself. Will improve.

> + * Each file is identified by a partition number and a participant number, so
> + * that a SharedBufFileSet can be viewed as a 2D table of individual files.
>
> I think using "files" as a term here is a bit dangerous - they're
> individually segmented again, right?

True. It's a 2D matrix of BufFiles. The word "file" is super
overloaded here. Will fix.

> +/*
> + * The number of bytes of shared memory required to construct a
> + * SharedBufFileSet.
> + */
> +Size
> +SharedBufFileSetSize(int participants)
> +{
> + return offsetof(SharedBufFileSet, participants) +
> + sizeof(SharedBufFileParticipant) * participants;
> +}
>
> The function name sounds a bit like a function actuallize setting some
> size... s/Size/DetermineSize/?

Hmm yeah "set" as verb vs "set" as noun. I think "estimate" is the
established word for this sort of thing (even though that seems
strange because it sounds like it doesn't have to be exactly right:
clearly in all these shmem-space-reservation functions it has to be
exactly right). Will change.

>
> +/*
> + * Create a new file suitable for sharing. Each backend that calls this must
> + * use a distinct participant number. Behavior is undefined if a participant
> + * calls this more than once for the same partition number. Partitions should
> + * ideally be numbered consecutively or in as small a range as possible,
> + * because file cleanup will scan the range of known partitions looking for
> + * files.
> + */
>
> Wonder if we shouldn't just create a directory for all such files.

Hmm. Yes, that could work well. Will try that.

> I'm a bit unhappy with the partition terminology around this. It's
> getting a bit confusing. We have partitions, participants and
> segements. Most of them could be understood for something entirely
> different than the meaning you have here...

Ok. Let me try to explain and defend them and see if we can come up
with something better.

1. Segments are what buffile.c already calls the individual
capped-at-1GB files that it manages. They are an implementation
detail that is not part of buffile.c's user interface. There seems to
be no reason to change that.

My understanding is that this was done to support pre-large-file
filesystems/OSs which limited files to 2^31 or 2^32 bytes, and we
decided to cap the segments at 1GB (maybe some ancient OS had a 2^30
limit, or maybe it was just a conservative choice that's easy for
humans to think about). We could perhaps get rid of that entirely
today without anyone complaining and just use one big file, though I
don't know that and I'm not suggesting it. (One argument against that
is that the parallel CREATE INDEX patch actually makes use of the
segmented nature of BufFiles to splice them together, to 'unify' a
bunch of worker LogicalTapeSets to create one LogicalTapeSet. That's
off topic here but it's in the back of my mind as a potential client
of this code. I'll have more to say about that over on the parallel
CREATE INDEX thread shortly.)

2. Partitions here = 'batches'. The 'batches' used by the hash join
code are formally partitions in all the literature on hash joins, and
I bet that anyone else doing parallel work that involves sharing
temporary disk files will run into the need for partitioning. I think
you are complaining that we now have a database object called a
PARTITION, and that may be a problem because we're overloading the
term. But it's the same name because it's mathematically the same
thing. We don't complain about the existence of 'lock tables' or
'hash tables' just because there is a database object called a TABLE.
I considered other names for this, like "file number", but it was
confusing and vague. I keep coming back to "partition" for this,
because fundamentally this is for partitioning temporary data. I
could maybe call it "file_partition" or something?

3. Participants are what I have taken to calling the processes
involved in parallel query, when I mean the larger set that includes
workers + leader. It may seem a little odd that such a thing appears
in an API that deals with temporary files. But the basic idea here is
that each participant gets to write out its own partial results, for
each partition. Stepping back a bit, that means that there are two
kinds of partitioning going on at the same time. Partitioning the
keyspace into batch numbers, and then the arbitrary partitioning that
comes from each participant processing partial plans. This is how
SharedBufFileSet finishes up managing a 2D matrix of BufFiles.

You might argue that buffile.c shouldn't know about partitions and
participants. The only thing I really need here is for BufFileTag (to
be renamed) to be able to name things differently. Perhaps it should
just include a char[] buffer for a name fragment, and the
SharedBufFileSet should encode the partition and participant numbers
into it, rather than exposing these rather higher level concepts to
buffile.c. I will think about that.

(Perhaps SharedBufFileSet should be called PartitionedBufFileSet?)

> +static void
> +shared_buf_file_on_dsm_detach(dsm_segment *segment, Datum datum)
> +{
> + bool unlink_files = false;
> + SharedBufFileSet *set = (SharedBufFileSet *) DatumGetPointer(datum);
> +
> + SpinLockAcquire(&set->mutex);
> + Assert(set->refcount > 0);
> + if (--set->refcount == 0)
> + unlink_files = true;
> + SpinLockRelease(&set->mutex);
>
> I'm a bit uncomfortable with releasing a refcount, and then still using
> the memory from the set... I don't think there's a concrete danger
> here as the code stands, but it's a fairly dangerous pattern.

Will fix.

--
Thomas Munro
http://www.enterprisedb.com
