Quick Links

Re: WIP: [[Parallel] Shared] Hash

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject:	Re: WIP: [[Parallel] Shared] Hash
Date:	2017-03-23 07:35:09
Message-ID:	CAEepm=0vp18JyWa5o_=ehK1QTZrymXk3Q0NPJoSqb=tdQTifGw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Here is a new patch series responding to feedback from Peter and Andres:

1. Support pgstat_report_tempfile and log_temp_files, which I had
overlooked as Peter pointed out.

2. Use a patch format that is acceptable to git am, per complaint
off-list from Andres. (Not actually made with git format-patch; I
need to learn some more git-fu, but they now apply cleanly with git
am).

On Thu, Mar 23, 2017 at 12:55 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I took a quick look at your V9 today. This is by no means a
> comprehensive review of 0008-hj-shared-buf-file-v9.patch, but it's
> what I can manage right now.

Thanks. I really appreciate your patience with the resource
management stuff I had failed to think through.

> ...
>
> However, I notice that the place that this happens instead,
> PathNameDelete(), does not repeat the fd.c step of doing a final
> stat(), and using the stats for a pgstat_report_tempfile(). So, a new
> pgstat_report_tempfile() call is simply missing. However, the more
> fundamental issue in my mind is: How can you fix that? Where would it
> go if you had it?

You're right. I may be missing something here (again), but it does
seem straightforward to implement because we always delete each file
that really exists exactly once (and sometimes we also try to delete
files that don't exist due to imprecise meta-data, but that isn't
harmful and we know when that turns out to be the case).

> If you do the obvious thing of just placing that before the new
> unlink() within PathNameDelete(), on the theory that that needs parity
> with the fd.c stuff, that has non-obvious implications. Does the
> pgstat_report_tempfile() call need to happen when going through this
> path, for example?:
>
>> +/*
>> + * Destroy a shared BufFile early. Files are normally cleaned up
>> + * automatically when all participants detach, but it might be useful to
>> + * reclaim disk space sooner than that. The caller asserts that no backends
>> + * will attempt to read from this file again and that only one backend will
>> + * destroy it.
>> + */
>> +void
>> +SharedBufFileDestroy(SharedBufFileSet *set, int partition, int participant)
>> +{

Yes, I think it should definitely go into
PathNameDeleteTemporaryFile() (formerly PathNameDelete()).

> The theory with the unlink()'ing() function PathNameDelete(), I
> gather, is that it doesn't matter if it happens to be called more than
> once, say from a worker and then in an error handling path in the
> leader or whatever. Do I have that right?

Yes, it may be called for a file that doesn't exist either because it
never existed, or because it has already been deleted. To recap,
there are two reasons it needs to tolerate attempts to delete files
that aren't there:

1. To be able to delete the fd.c files backing a BufFile given only a
BufFileTag. We don't know how many segment files there are, but we
know how to build the prefix of the filename so we try to delete
[prefix].0, [prefix].1, [prefix].2 ... until we get ENOENT and
terminate. I think this sort of thing would be more questionable for
durable storage backing a database object, but for temporary files I
can't think of a problem with it.

2. SharedBufFileSet doesn't actually know how many partitions exist,
it just knows the *range* of partition numbers (because of its
conflicting fixed space and increasable partitions requirements).
From that information it can loop building BufFileTags for all backing
files that *might* exist, and in practice they usually do because we
don't tend to have a 'sparse' range of partitions.

The error handling path isn't a special case: whoever is the last to
detach from the DSM segment will delete all the files, whether that
results from an error or not. Now someone might call
SharedBufFileDestroy() to delete files sooner, but that can't happen
at the same time as a detach cleanup (the caller is still attached).

As a small optimisation avoiding a bunch of pointless unlink syscalls,
I shrink the SharedBufFileSet range if you happen to delete explicitly
with a partition number at the extremities of the range, and it so
happens that Parallel Hash Join explicitly deletes them in partition
order as the join runs, so in practice the range is empty by the time
SharedBufFileSet's cleanup runs and there is nothing to do, unless an
error occurs.

> Obviously the concern I have about that is that any stat() call you
> might add for the benefit of a new pgstat_report_tempfile() call,
> needed to keep parity with fd.c, now has a risk of double counting in
> error paths, if I'm not mistaken. We do need to do that accounting in
> the event of error, just as we do when there is no error, at least if
> current stats collector behavior is to be preserved. How can you
> determine which duplicate call here is the duplicate? In other words,
> how can you figure out which one is not supposed to
> pgstat_report_tempfile()? If the size of temp files in each worker is
> unknowable to the implementation in error paths, does it not follow
> that it's unknowable to the user that queries pg_stat_database?

There is no double counting, if you only report after you successfully
unlink (ie if you don't get ENOENT).

In the attached patch I have refactored the reporting code into a
small function, and I added a stat call to
PathNameDeleteTemporaryFile() which differs from the FileClose()
coding only in that it tolerates ENOENT.

Now when I SET log_temp_files = 1 and then \i hj-test-queries.sql[1] I
see temporary file log messages resulting from both private and shared
temporary files being deleted:

2017-03-23 18:59:55.999 NZDT [30895] LOG: temporary file: path
"base/pgsql_tmp/pgsql_tmp30895.203", size 920400
2017-03-23 18:59:55.999 NZDT [30895] STATEMENT: EXPLAIN ANALYZE
SELECT COUNT(*) FROM simple r JOIN bigger_than_it_looks s USING (id);

2017-03-23 19:00:03.007 NZDT [30903] LOG: temporary file: path
"base/pgsql_tmp/pgsql_tmp30895.8.1.0.0", size 9749868
2017-03-23 19:00:03.007 NZDT [30903] STATEMENT: EXPLAIN ANALYZE
SELECT COUNT(*) FROM simple r JOIN awkwardly_skewed s USING (id);

Am I missing something?

> Now, I don't imagine that this should stump you. Maybe I'm wrong about
> that possibility (that you cannot have exactly once
> unlink()/stat()/whatever), or maybe I'm right and you can fix it while
> preserving existing behavior, for example by relying on unlink()
> reliably failing when called a second time, no matter how tight any
> race was. What exact semantics does unlink() have with concurrency, as
> far as the link itself goes?

On Unixoid systems at least, concurrent unlink() for the same file
must surely only succeed in one process and fail with ENOENT in any
others, but there is no chance for this to happen anyway:
SharedBufFileDestroy() is documented as only callable once for a given
set of parameters (even though nothing bad would happen if you broke
that rule AFAIK), and the code in the later patch that uses it adheres
to that rule, and the SharedBufFileSet cleanup can only run when the
last person detaches so there can't be a concurrent call to
SharedBufFileDestroy().

> If I'm not wrong about the general possibility, then maybe the
> existing behavior doesn't need to be preserved in error paths, which
> are after all exceptional -- it's not as if the statistics collector
> is currently highly reliable. It's not obvious that you are
> deliberately accepting of any of these risks or costs, though, which I
> think needs to be clearer, at a minimum. What trade-off are you making
> here?

There seems no reason not to make every effort to keep the stats
collector and logs posted on these files just as we do with regular
private temporary files, and it was pure oversight that I didn't.
Thanks!

> Unfortunately, that's about the only useful piece of feedback that I
> can think of right now -- be more explicit about what is permissible
> and not permissible in this area, and do something with
> pgstat_report_tempfile(). This is a bit like the
> unlink()-ENOENT/-to-terminate (ENOENT ignore) issue. There are no
> really hard questions here, but there certainly are some awkward
> questions.

Much appreciated.

[1] https://www.postgresql.org/message-id/CAEepm%3D2PRCtpo6UL4RxSbp%3DOXpyty0dg3oT3Vyk0eb%3Dr8JwZhg@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment	Content-Type	Size
parallel-shared-hash-v10.tgz	application/x-gzip	66.0 KB

In response to

Re: WIP: [[Parallel] Shared] Hash at 2017-03-22 23:55:28 from Peter Geoghegan

Responses

Re: WIP: [[Parallel] Shared] Hash at 2017-03-26 00:53:23 from Peter Geoghegan
Re: WIP: [[Parallel] Shared] Hash at 2017-03-26 20:41:28 from Andres Freund
Re: WIP: [[Parallel] Shared] Hash at 2017-03-26 21:47:26 from Andres Freund
Re: WIP: [[Parallel] Shared] Hash at 2017-03-28 05:33:03 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ashutosh Sharma	2017-03-23 07:48:22	Re: Speed up Clog Access by increasing CLOG buffers
Previous Message	Amit Kapila	2017-03-23 06:53:32	Re: pageinspect and hash indexes