Re: Parallel bitmap heap scan

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2016-11-18 04:29:16
Message-ID: CAJ3gD9dYD7RhKsyRBnta6=g_0+s7y6r1SpEgW_GsnSeDSJRnPQ@mail.gmail.com
Lists: pgsql-hackers

On 19 October 2016 at 09:47, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Tue, Oct 18, 2016 at 1:45 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> I don't quite understand why the bitmap has to be parallel at all. As
>> far as I understand your approach as described here, the only thing that
>> needs to be shared are the iteration arrays. Since they never need to
>> be resized and such, it seems to make a lot more sense to just add an
>> API to share those, instead of the whole underlying hash.
>
> You are right that we only share the iteration arrays. But the point is
> that each entry of an iteration array is just a pointer to a hash entry.
> So either we need to build the hash in shared memory (my current
> approach), or we need to copy each hash element to a shared location (I
> think this is going to be expensive).

While the discussion on how to implement the shared TIDBitmap is still
going on, I am starting with a review of the bitmap heap scan part,
i.e. nodeBitmapHeapscan.c, since that looks mostly independent of the
tidbitmap implementation.

> Brief design idea:
> -----------------------
> #1. Shared TIDBitmap creation and initialization
> The first worker to see the parallel bitmap info state as PBM_INITIAL
> becomes the leader and sets the state to PBM_INPROGRESS. All other
> workers that see the state as PBM_INPROGRESS will wait for the leader
> to complete the TIDBitmap.
>
> #2 At this level TIDBitmap is ready and all workers are awake.

As far as correctness is concerned, the logic where the first worker
becomes the leader while the others wait synchronously looks good.
Workers get allocated right from the beginning even though they stay
idle for some percentage of time (5-20%?), but I guess there is nothing
we can do about that with the current parallel query infrastructure.

In pbms_is_leader(), I didn't clearly understand the significance of
the for-loop. If it is a worker, it can call
ConditionVariablePrepareToSleep() followed by
ConditionVariableSleep(). Once it comes out of
ConditionVariableSleep(), isn't it guaranteed that the leader has
finished the bitmap? If yes, then it looks unnecessary to iterate again
and re-check pbminfo->state. Also, with this, the variable queuedSelf
might not be needed. But I might be missing something here. I am not
sure what happens if a worker calls ConditionVariable[Prepare]Sleep()
after the leader has already called ConditionVariableBroadcast(). Does
the for-loop have something to do with that? But that can happen even
with the current for-loop, it seems.
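
To make the question concrete, here is roughly the simplified flow I had
in mind (just a sketch: the ParallelBitmapInfo fields, the PBM_* states
and the wait event name are my guesses from this thread, not the actual
patch):

/*
 * Sketch only: field names, PBM_* states and the wait event are
 * illustrative guesses based on this thread.
 */
static bool
pbms_is_leader(ParallelBitmapInfo *pbminfo)
{
    bool        leader = false;

    /* Claim leadership if nobody has started building the bitmap yet. */
    SpinLockAcquire(&pbminfo->state_mutex);
    if (pbminfo->state == PBM_INITIAL)
    {
        pbminfo->state = PBM_INPROGRESS;
        leader = true;
    }
    SpinLockRelease(&pbminfo->state_mutex);

    if (!leader)
    {
        /*
         * Wait for the leader to finish the bitmap.  Whether the re-check
         * of pbminfo->state is really needed depends on whether
         * ConditionVariableSleep() can return before the leader's
         * broadcast, which is exactly the open question above.
         */
        ConditionVariablePrepareToSleep(&pbminfo->cv);
        while (pbminfo->state != PBM_FINISHED)
            ConditionVariableSleep(&pbminfo->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
        ConditionVariableCancelSleep();
    }

    return leader;
}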

> #3. Bitmap processing (Iterate and process the pages).
> In this phase each worker will iterate over the page and chunk arrays
> and select heap pages one by one. If prefetch is enabled then there
> will be two iterators. Since multiple workers are iterating over the
> same page and chunk arrays we need a shared iterator, so we grab a
> spinlock and iterate within the lock, so that each worker gets a
> different page to process.

tbm_iterate() call under SpinLock:
For parallel tbm iteration, tbm_iterate() is called while a SpinLock is
held. Generally we try to keep the code inside a spinlock limited to a
few lines, and without any function calls. Although the tbm_iterate()
code itself looks safe under a spinlock, I was checking whether we can
bring SpinLockAcquire() and SpinLockRelease() closer to each other. One
thought is: in tbm_iterate(), acquire the SpinLock before the while loop
that iterates over lossy chunks. Then, if both chunk and per-page data
remain, release the spinlock just before returning (the first return
statement). And just before scanning the bitmap of an exact page, i.e.
just after "if (iterator->spageptr < tbm->npages)", save the page
handle, increment iterator->spageptr, release the SpinLock, and then use
the saved page handle to iterate over the page bitmap.
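
For the exact-page branch, what I mean is roughly the following sketch
("iterator->lock" is just a placeholder name for whatever spinlock the
shared iterator ends up using; the elided parts are the existing
tbm_iterate() code, unchanged):

    SpinLockAcquire(&iterator->lock);

    /*
     * ... existing code: advance over the lossy chunks; if a chunk page
     * is due next, fill output, advance schunkbit, release the lock and
     * return (the first return statement mentioned above) ...
     */

    if (iterator->spageptr < tbm->npages)
    {
        /* Save the page handle and claim the page while still locked. */
        PagetableEntry *page = tbm->spages[iterator->spageptr];

        iterator->spageptr++;
        SpinLockRelease(&iterator->lock);

        /*
         * ... existing code: scan 'page' into output->offsets, now
         * outside the lock ...
         */
        return output;
    }

    SpinLockRelease(&iterator->lock);
    return NULL;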

prefetch_pages() call under SpinLock:
Here again, prefetch_pages() is called while the pbminfo->prefetch_mutex
SpinLock is held. Effectively, heavy functions like PrefetchBuffer()
would get called while under the SpinLock. These can even ereport().
One option is to use a mutex lock for this purpose, but I think that
would slow things down. Moreover, the complete set of prefetch pages
would be scanned by a single worker, and the others might wait for this
one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
SpinLock only while incrementing pbminfo->prefetch_pages. The rest,
viz. iterating over the prefetch pages and doing the PrefetchBuffer()
calls, need not be synchronized using this pbminfo->prefetch_mutex
SpinLock. pbms_parallel_iterate() already has its own iterator spinlock.
The only thing is, workers may not do the actual PrefetchBuffer() calls
sequentially: one of them might shoot ahead and prefetch 3-4 pages while
another is lagging behind with lower page numbers; but I believe this is
fine, as long as they all prefetch all the required blocks.
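
In code, the idea is roughly this (a sketch only; the ParallelBitmapInfo
field names, the prefetch_iterator variable and the
pbms_parallel_iterate() signature are my assumptions based on this
thread):

    for (;;)
    {
        TBMIterateResult *tbmpre;
        bool        claimed = false;

        /* Hold the spinlock only to claim one prefetch slot. */
        SpinLockAcquire(&pbminfo->prefetch_mutex);
        if (pbminfo->prefetch_pages < pbminfo->prefetch_target)
        {
            pbminfo->prefetch_pages++;
            claimed = true;
        }
        SpinLockRelease(&pbminfo->prefetch_mutex);

        if (!claimed)
            break;

        /*
         * The shared prefetch iterator is already protected by its own
         * spinlock inside pbms_parallel_iterate(), so no extra locking
         * is needed here.
         */
        tbmpre = pbms_parallel_iterate(prefetch_iterator);
        if (tbmpre == NULL)
            break;

        /* The actual prefetch hint is issued outside any spinlock. */
        PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
    }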
