Re: Parallel bitmap heap scan

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2017-02-05 14:48:51
Message-ID: CAFiTN-tGOBVVLRiBs7aS3hyUh=XJKdkzbWs879nXCgUL0JQNdg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

Here are the latest version of the patch, which address all the
comments given by Robert.

> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3mst@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks. I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash". But the patch set
> you've implemented here doesn't behave that way. Instead, you've got
> the array containing the hash table elements shared (which is good)
> plus there's a sort of dummy hash table in every worker which copies
> some but not all of the members of the original hash table, leading to
> the otherwise-unnecessary if-test that Haribabu is complaining about.
> So the hash table is kinda-shared-kinda-not-shared, which I don't
> *think* is really what Andres had in mind.
>
> In short, I think SH_COPY (implemented here as pagetable_copy) needs
> to go away. The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.

I have removed the SH_COPY and now leader performs the
tbm_begin_shared_iterate before waking up the worker. Basically, the
leader will create the page and chunk array and that is the array of
the "relptr" (offlist, suggested by Robert).

> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer. And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table. That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public. You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK. Those details need to remain
> private to tidbitmap.c.
Fixed

pbms_parallel_iterate() is a nasty kludge; we
> need some better solution. The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace. And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.

I have fixed this, now there is new function called tbm_shared_iterate
which will directly iterate using shared iterator. So now no need to
copy member back and forth between local and shared iterator.

>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate(). Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README. I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).

Done that way.
>
> In 0002, you have this:
>
> + tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table. But TBH, I don't really see why we
> want this to be a separate object. Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly? That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).

Done as suggested
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that. The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.

Fixed as per the suggestion

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
0002-hash-support-alloc-free-v16.patch application/octet-stream 5.1 KB
0003-parallel-bitmap-heap-scan-v16.patch application/octet-stream 51.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Borodin 2017-02-05 16:04:32 Re: Review: GIN non-intrusive vacuum of posting tree
Previous Message Michael Paquier 2017-02-05 13:03:07 Re: Index corruption with CREATE INDEX CONCURRENTLY