Quick Links

Re: Parallel tuplesort (for parallel B-Tree index creation)

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject:	Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:	2017-11-01 00:07:39
Message-ID:	CAEepm=3i==PySOcT0js4EofAE=eZ6DR2tB_-q77fQbmXjgLD-Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Nov 1, 2017 at 11:29 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
> <rushabh(dot)lathia(at)gmail(dot)com> wrote:
>> Attaching the re based patch according to the v22 parallel-hash patch sets
>
> I took a quick look at this today, and noticed a few issues:
>
> * make_name() is used to name files in sharedtuplestore.c, which is
> what is passed to BufFileOpenShared() for parallel hash join. Your
> using your own logic for that within the equivalent logtape.c call to
> BufFileOpenShared(), presumably because make_name() wants to identify
> participants by PID rather than by an ordinal identifier number.

So that's this bit:

+ pg_itoa(worker, filename);
+ lts->pfile = BufFileCreateShared(fileset, filename);

... and:

+ pg_itoa(i, filename);
+ file = BufFileOpenShared(fileset, filename);

What's wrong with using a worker number like this?

> I think that we need some kind of central registry for things that use
> shared buffiles. It could be that sharedtuplestore.c is further
> generalized to support this, or it could be that they both call
> something else that takes care of naming. It's not okay to have this
> left to random chance.

It's not random choice: buffile.c creates a uniquely named directory
(or directories, if you have more than one location configured in the
temp_tablespaces GUC) to hold all the backing files involved in each
BufFileSet. Naming of BufFiles within the BufFileSet is the caller's
problem, and a worker number seems like a reasonable choice to me. It
won't collide with a concurrent parallel CREATE INDEX because that'll
be using its own BufFileSet.

> You're going to have to ask Thomas about this. You should also use
> MAXPGPATH for the char buffer on the stack.

Here's a summary of namespace management scheme I currently have at
the three layers fd.c, buffile.c, sharedtuplestore.c:

1. fd.c has new lower-level functions provides
PathNameCreateTemporaryFile(const char *path) and
PathNameOpenTemporaryFile(const char *path). It also provides
PathNameCreateTemporaryDir(). Clearly callers of these interfaces
will need to be very careful about managing the names they use.
Callers also own the problem of cleaning up files, since there is no
automatic cleanup of files created this way. My intention was that
these facilities would *only* be used by BufFileSet, since it has
machinery to manage those things.

2. buffile.c introduces BufFileSet, which is conceptually a set of
BufFiles that can be shared by multiple backends with DSM
segment-scoped cleanup. It is implemented as a set of directories:
one for each tablespace in temp_tablespaces. It controls the naming
of those directories. The BufFileSet directories are named similarly
to fd.c's traditional temporary file names using the usual recipe
"pgsql_tmp" + PID + per-process counter but have an additional ".set"
suffix. RemovePgTempFilesInDir() recognises directories with that
prefix and suffix as junk left over from a crash when cleaning up. I
suppose it's that knowledge about reserved name patterns and cleanup
that you are thinking of as a central registry? As for the BufFiles
that are in a BufFileSet, buffile.c has no opinion on that: the
calling code (parallel CREATE INDEX, sharedtuplestore.c, ...) is
responsible for coming up with its own scheme. If parallel CREATE
INDEX wants to name shared BufFiles "walrus" and "banana", that's OK
by me, and those files won't collide with anything in another
BufFileSet because each BufFileSet has its own directory (-ies).

One complaint about the current coding that someone might object to:
MakeSharedSegmentPath() just dumps the caller's BufFile name into a
path without sanitisation: I should fix that so that we only accept
fairly limited strings here. Another complaint is that perhaps fd.c
knows too much about buffile.c's business. For example,
RemovePgTempFilesInDir() knows about the ".set" directories created by
buffile.c, which might be called a layering violation. Perhaps the
set/directory logic should move entirely into fd.c, so you'd call
FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
Thoughts?

3. sharedtuplestore.c takes a caller-supplied BufFileSet and creates
its shared BufFiles in there. Earlier versions created and owned a
BufFileSet, but in the current Parallel Hash patch I create loads of
separate SharedTuplestore objects but I didn't want to create load of
directories to back them, so you can give them all the same
BufFileSet. That works because SharedTuplestores are also given a
name, and it's the caller's job (in my case nodeHash.c) to make sure
the SharedTuplestores are given unique names within the same
BufFileSet. For Parallel Hash you'll see names like 'i3of8' (inner
batch 3 of 8). There is no need for there to be in any sort of
central registry for that though, because it rides on top of the
guarantees from 2 above: buffile.c will put those files into a
uniquely named directory, and that works as long as no one else is
allowed to create files or directories in the temp directory that
collide with its reserved pattern /^pgsql_tmp.+\.set$/. For the same
reason, parallel CREATE INDEX is free to use worker numbers as BufFile
names, since it has its own BufFileSet to work within.

> * What's this all about?:
>
> + /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
> + #define GetSharedBufFileSet(shared) \
> + ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

In an earlier version, BufFileSet was one of those annoying data
structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
incomplete type (declared but not defined in the includable header),
and here it was being used "inside" (or rather after) SharedSort,
which *itself* had a FLEXIBLE_ARRAY_MEMBER. The reason for the
variable sized object was that I needed all backends to agree on the
set of temporary tablespace OIDs, of which there could be any number,
but I also needed a 'flat' (pointer-free) object I could stick in
relocatable shared memory. In the newest version I changed that
flexible array to tablespaces[8], because 8 should be enough
tablespaces for anyone (TM). I don't really believe anyone uses
temp_tablespaces for IO load balancing anymore and I hate code like
the above. So I think Rushabh should now remove the above-quoted code
and just use a BufFileSet directly as a member of SharedSort.

--
Thomas Munro
http://www.enterprisedb.com

In response to

Re: Parallel tuplesort (for parallel B-Tree index creation) at 2017-10-31 22:29:40 from Peter Geoghegan

Responses

Re: Parallel tuplesort (for parallel B-Tree index creation) at 2017-11-01 01:11:29 from Peter Geoghegan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tsunakawa, Takayuki	2017-11-01 00:07:44	Re: [bug fix] postgres.exe crashes with access violation on Windows while starting up
Previous Message	Tom Lane	2017-10-31 23:59:33	Re: Account for cost and selectivity of HAVING quals