Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-09-20 09:32:46
Message-ID: CAGPqQf1rb+dvUtm-8JPzSTZCYSPiZbmW8NCRGxNnqkEviqbRTw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
> <rushabh(dot)lathia(at)gmail(dot)com> wrote:
> > As per the earlier discussion in the thread, I did experiment using
> > BufFileSet interface from parallel-hash-v18.patchset. I took the
> > reference of parallel-hash other patches to understand the BufFileSet
> > APIs, and incorporate the changes to parallel create index.
> >
> > In order to achieve the same:
> >
> > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
> > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
> > parallel-hash-v18.patchset.
> > - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
> > INDEX v10 patch.
> > - incorporate the BufFileSet API to the parallel tuple sort for CREATE
> > INDEX.
> > - Changes into few existing functions as well as added few to support the
> > BufFileSet changes.
>
> I'm glad that somebody is working on this. (Someone closer to the more
> general work on shared/parallel BufFile infrastructure than I am.)
>
> I do have some quick feedback, and I hope to be able to provide that
> to both you and Thomas, as needed to see this one through. I'm not
> going to get into the tricky details around resource management just
> yet. I'll start with some simpler questions, to get a general sense of
> the plan here.
>
>
Thanks, Peter.

> I gather that you're at least aware that your v11 of the patch doesn't
> preserve randomAccess support for parallel sorts, because you didn't
> include my 0002-* testing GUCs patch, which was specifically designed
> to make various randomAccess stuff testable. I also figured this to be
> true because I noticed this FIXME among (otherwise unchanged)
> tuplesort code:
>
>
Yes, I haven't touched the randomAccess part yet. My initial goal was
to incorporate the BufFileSet APIs here.
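
To make sure we're talking about the same interface, here is a rough
sketch of the worker/leader handoff I have in mind with the fileset.
I'm writing the calls from memory, so the exact spellings in
parallel-hash-v18 may differ, and the file name format is just a
placeholder:

/*
 * Sketch only, not the actual patch.  Worker side: create a named temp
 * file inside the shared fileset, write the sorted runs into it, and
 * export it so it survives worker exit and the leader can open it.
 */
#include "postgres.h"
#include "storage/buffile.h"	/* BufFileSet and friends, per the patchset */

static void
worker_write_runs(BufFileSet *fileset, int worker)
{
	char		name[MAXPGPATH];
	BufFile    *file;

	snprintf(name, sizeof(name), "tuplesort.%d", worker);
	file = BufFileCreateShared(fileset, name);
	/* ... logtape.c drives all of this worker's I/O through this BufFile ... */
	BufFileExportShared(file);
}

/* Leader side: open each worker's file read-only using the agreed name. */
static BufFile *
leader_open_worker_file(BufFileSet *fileset, int worker)
{
	char		name[MAXPGPATH];

	snprintf(name, sizeof(name), "tuplesort.%d", worker);
	return BufFileOpenShared(fileset, name);
}

The point, as I understand it, is that the files are named within the
fileset, so the leader can open a worker's file after the worker has
exited without any extra bookkeeping in fd.c.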

> > +static void
> > +leader_takeover_tapes(Tuplesortstate *state)
> > +{
> > + Sharedsort *shared = state->shared;
> > + int nLaunched = state->nLaunched;
> > + int j;
> > +
> > + Assert(LEADER(state));
> > + Assert(nLaunched >= 1);
> > + Assert(nLaunched == shared->workersFinished);
> > +
> > + /*
> > + * Create the tapeset from worker tapes, including a leader-owned tape at
> > + * the end. Parallel workers are far more expensive than logical tapes,
> > + * so the number of tapes allocated here should never be excessive. FIXME
> > + */
> > + inittapestate(state, nLaunched + 1);
> > + state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> > + state->fileset, state->worker);
>
> It's not surprising to me that you do not yet have this part working,
> because much of my design was about changing as little as possible
> above the BufFile interface, in order for tuplesort.c (and logtape.c)
> code like this to "just work" as if it was the serial case.

Right. I just followed your design from your earlier patches.

> It doesn't
> look like you've added the kind of BufFile multiplexing code that I
> expected to see in logtape.c. This is needed to compensate for the
> code removed from fd.c and buffile.c. Perhaps it would help me to go
> look at Thomas' latest parallel hash join patch -- did it gain some
> kind of transparent multiplexing ability that you actually (want to)
> use here?
>
>
Sorry, I didn't get this part. Are you talking about your patch's changes
to OpenTemporaryFileInTablespace(), BufFileUnify(), and the other changes
related to ltsUnify()? If that's the case, I don't think those are required
with the BufFileSet. Correct me if I am wrong here.
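
For reference, here is roughly how I picture the worker side pairing
with the leader_takeover_tapes() hunk quoted above. This is simplified
from memory rather than copied from the patch, so the helper name and
the locking detail are approximations:

/*
 * Simplified sketch (not verbatim from the patch): each worker freezes
 * its materialized result tape and publishes the tape metadata through
 * shared memory, so that leader_takeover_tapes() can later build a
 * unified tapeset on top of the same fileset.
 */
static void
worker_freeze_result_tape(Tuplesortstate *state)
{
	Sharedsort *shared = state->shared;
	TapeShare	output;

	Assert(WORKER(state));

	/*
	 * Freeze the output tape: flush buffered data and record where its
	 * first block lives, so the leader can start reading from there.
	 */
	LogicalTapeFreeze(state->tapeset, state->result_tape, &output);

	/* Publish this worker's tape and bump the finished-worker count. */
	SpinLockAcquire(&shared->mutex);
	shared->tapes[state->worker] = output;
	shared->workersFinished++;
	SpinLockRelease(&shared->mutex);
}

If I've understood the design correctly, the remaining "multiplexing"
then lives inside logtape.c: when the leader's tapeset is created over
the fileset, it opens each worker's BufFile and treats it as just
another tape.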

> Though randomAccess isn't used by CREATE INDEX in general, and so not
> supporting randomAccess within tuplesort.c for parallel callers
> doesn't matter as far as this CREATE INDEX user-visible feature is
> concerned, I still believe that randomAccess is important (IIRC,
> Robert thought so too). Specifically, it seems like a good idea to
> have randomAccess support, both on general principle (why should the
> parallel case be different?), and because having it now will probably
> enable future enhancements to logtape.c. Enhancements that have it
> manage parallel sorts based on partitioning/distribution/bucketing
> [1]. I'm pretty sure that partitioning-based parallel sort is going to
> become very important in the future, especially for parallel
> GroupAggregate. The leader needs to truly own the tapes it reclaims
> from workers in order for all of this to work.
>
>
The first application of parallel tuplesort here is CREATE INDEX, which
doesn't need randomAccess. But as you said, and as has been discussed in
the thread, randomAccess is important and we should certainly put in the
effort to support it.

> Questions on where you're going with randomAccess support:
>
> 1. Is randomAccess support a goal for you here at all?
>
> 2. If so, is preserving eager recycling of temp file space during
> randomAccess (materializing a final output tape within the leader)
> another goal for you here? Do we need to preserve that property of
> serial external sorts, too, so that it remains true that logtape.c
> ensures that "the total space usage is essentially just the actual
> data volume, plus insignificant bookkeeping and start/stop overhead"?
> (I'm quoting from master's logtape.c header comments.)
>
> 3. Any ideas on next steps in support of those 2 goals? What problems
> do you foresee, if any?
>
>
To be frank, it's too early for me to comment on anything in this area; I
need to study it more closely. My initial goal was just to understand the
current implementation of the patch and incorporate the BufFileSet APIs.

> > CREATE INDEX serial_idx ON parallel_sort_test (randint);
> >
> > Without patch:
> >
> > Time: 3430054.220 ms (57:10.054)
> >
> > With patch (max_parallel_workers_maintenance = 8):
> >
> > Time: 1163445.271 ms (19:23.445)
>
> This looks very similar to my v10. While I will need to follow up on
> this, to make sure, it seems likely that this patch has exactly the
> same performance characteristics as v10.
>
>
That's about a 2.95x speedup, more or less in line with your v10. Any
difference is probably just down to the different testing environment.

> Thanks
>
> [1] https://wiki.postgresql.org/wiki/Parallel_External_Sort#Partitioning_for_parallelism_.28parallel_external_sort_beyond_CREATE_INDEX.29
> --
> Peter Geoghegan
>

Thanks,
Rushabh Lathia
