Re: Parallel bitmap heap scan

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel bitmap heap scan
Date: 2016-11-26 12:48:59
Message-ID: CAFiTN-txEPkB8pJEYqe8E-HRi_4BYo9kKXvbfVm1w7fOxgEdYw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>
> Thanks for the review.

I have worked on these comments.
>
>> In pbms_is_leader(), I didn't clearly understand the significance of
>> the for-loop. If it is a worker, it can call
>> ConditionVariablePrepareToSleep() followed by
>> ConditionVariableSleep(). Once it comes out of
>> ConditionVariableSleep(), isn't it guaranteed that the leader has
>> finished the bitmap? If yes, then it looks like it is not necessary
>> to iterate again and go back through the pbminfo->state checking.
>> Also, with this, the variable queuedSelf might not be needed. But I
>> might be missing something here.
>
> I have taken this logic from the example posted on the condition
> variable thread [1]:
>
> for (;;)
> {
>     ConditionVariablePrepareToSleep(cv);
>     if (condition for which we are waiting is satisfied)
>         break;
>     ConditionVariableSleep();
> }
> ConditionVariableCancelSleep();
>
> [1] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

>
> So it appears to me that even after we come out of
> ConditionVariableSleep(), we need to verify our condition and only
> then break.
>
>> Not sure what happens if a worker calls
>> ConditionVariable[Prepare]Sleep() but the leader has already called
>> ConditionVariableBroadcast(). Does the for loop have something to do
>> with this? But this can happen even with the current for-loop, it
>> seems.
> Even if the leader has already called ConditionVariableBroadcast, after
> ConditionVariablePrepareToSleep we will check the condition again
> before calling ConditionVariableSleep. And the condition check is done
> under SpinLockAcquire(&pbminfo->state_mutex).
>
> However, I think there is one problem in my code (I think you might be
> pointing at the same thing): after ConditionVariablePrepareToSleep, if
> pbminfo->state is already PBM_FINISHED, I am not resetting needWait to
> false, which can lead to the problem.

I have fixed the defect I mentioned above; the corrected wait loop is
sketched below.
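
To make the fix concrete, here is a minimal sketch of how the fixed
wait loop looks, following the pattern from [1]. The struct and state
names (ParallelBitmapInfo, PBM_INITIAL, PBM_INPROGRESS, PBM_FINISHED,
state_mutex, cv) follow the patch, and the ConditionVariable calls use
the then-proposed API, so treat this as illustrative rather than the
exact patch code:

#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

/* Illustrative sketch, not the exact patch code. */
static bool
pbms_is_leader(ParallelBitmapInfo *pbminfo)
{
    bool        leader = false;

    for (;;)
    {
        bool        needWait = false;

        ConditionVariablePrepareToSleep(&pbminfo->cv);

        SpinLockAcquire(&pbminfo->state_mutex);
        if (pbminfo->state == PBM_INITIAL)
        {
            /* First worker in: become the leader and build the bitmap. */
            pbminfo->state = PBM_INPROGRESS;
            leader = true;
        }
        else if (pbminfo->state != PBM_FINISHED)
            needWait = true;    /* bitmap still being built; must sleep */

        /*
         * If state is already PBM_FINISHED, needWait stays false here --
         * exactly the reset the earlier version was missing.
         */
        SpinLockRelease(&pbminfo->state_mutex);

        if (!needWait)
            break;
        ConditionVariableSleep();
    }
    ConditionVariableCancelSleep();

    return leader;
}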

>
>>
>>
>>> #3. Bitmap processing (Iterate and process the pages).
>>> In this phase each worker will iterate over the page and chunk arrays
>>> and select heap pages one by one. If prefetch is enabled then there
>>> will be two iterators. Since multiple workers are iterating over the
>>> same page and chunk arrays, we need to have a shared iterator, so we
>>> grab a spinlock and iterate within the lock, so that each worker gets
>>> a different page to process.
>>
>> tbm_iterate() call under SpinLock:
>> For parallel tbm iteration, tbm_iterate() is called while a SpinLock
>> is held. Generally we try to keep the code inside a SpinLock limited
>> to a few lines, and that too without any function calls.
>> Although the tbm_iterate() code itself looks safe under a spinlock, I
>> was checking whether we can squeeze SpinLockAcquire() and
>> SpinLockRelease() closer to each other. One thought is: in
>> tbm_iterate(), acquire the SpinLock before the while loop that
>> iterates over lossy chunks. Then, if both chunk and per-page data
>> remain, release the spinlock just before returning (the first return
>> statement). And then, just before scanning the bitmap of an exact
>> page, i.e. just after "if (iterator->spageptr < tbm->npages)", save
>> the page handle, increment iterator->spageptr, release the SpinLock,
>> and then use the saved page handle to iterate over the page bitmap.
>
> The main reason to keep the spinlock out of this function is to avoid
> changes inside this function; also, this function takes a local
> iterator as input, which doesn't have a spinlock associated with it.
> But that can be changed; we can pass the shared iterator to it.
>
> I will think about this logic and try to update it in the next version.
This issue is still not addressed.
The logic inside tbm_iterate uses the same variables, like spageptr, in
multiple places. IMHO this complete logic needs to be done under one
spinlock, as sketched below.
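
To illustrate, here is a minimal sketch of how the shared-iterator
wrapper could serialize the whole call. The ParallelIterator struct and
its mutex field are hypothetical stand-ins for the patch's shared
iterator state; tbm_iterate() is the existing tidbitmap function:

#include "postgres.h"
#include "nodes/tidbitmap.h"
#include "storage/spin.h"

/*
 * Sketch: serialize the entire tbm_iterate() call under the shared
 * iterator's spinlock, since fields such as spageptr and schunkptr are
 * read and updated at several points inside tbm_iterate().
 */
static TBMIterateResult *
pbms_parallel_iterate(TBMIterator *iterator, ParallelIterator *parallel_it)
{
    TBMIterateResult *output;

    SpinLockAcquire(&parallel_it->mutex);
    output = tbm_iterate(iterator);
    SpinLockRelease(&parallel_it->mutex);

    /*
     * Handing the result to the caller safely (it points into
     * iterator-local storage) is elided here; the real code has to
     * deal with that as well.
     */
    return output;
}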

>
>>
>> prefetch_pages() call under SpinLock:
>> Here again, prefetch_pages() is called while the
>> pbminfo->prefetch_mutex SpinLock is held. Effectively, heavy functions
>> like PrefetchBuffer() would get called while under the SpinLock. These
>> can even ereport().
>> One option is to use a mutex lock for this purpose. But I think that
>> would slow things down. Moreover, the complete set of prefetch pages
>> would be scanned by a single worker, and others might wait for this
>> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
>> SpinLock only while incrementing pbminfo->prefetch_pages. The rest,
>> viz. iterating over the prefetch pages and doing the
>> PrefetchBuffer(), need not be synchronised using this
>> pbminfo->prefetch_mutex SpinLock. pbms_parallel_iterate() already has
>> its own iterator spinlock. The only thing is, workers may not do the
>> actual PrefetchBuffer() calls sequentially. One of them might shoot
>> ahead and prefetch 3-4 pages while another is lagging with a
>> sequentially lesser page number; but I believe this is fine, as long
>> as they all prefetch all the required blocks.
>
> I agree with your point, will try to fix this as well.

I have fixed this part; a sketch of the reworked logic follows.
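
For reference, here is a sketch of the reworked prefetch logic. The
field names (prefetch_mutex, prefetch_pages, prefetch_target,
prefetch_iterator) follow the patch, and it assumes the
pbms_parallel_iterate() wrapper sketched earlier; again, a simplified
illustration rather than the exact patch code:

#include "postgres.h"
#include "nodes/execnodes.h"
#include "storage/bufmgr.h"
#include "storage/spin.h"

static void
pbms_do_prefetch(BitmapHeapScanState *node, ParallelBitmapInfo *pbminfo)
{
    for (;;)
    {
        TBMIterateResult *tbmpre;
        bool        claimed = false;

        /* Hold the spinlock only to claim one unit of prefetch work. */
        SpinLockAcquire(&pbminfo->prefetch_mutex);
        if (pbminfo->prefetch_pages < pbminfo->prefetch_target)
        {
            pbminfo->prefetch_pages++;
            claimed = true;
        }
        SpinLockRelease(&pbminfo->prefetch_mutex);

        if (!claimed)
            break;

        /* The shared iterator has its own spinlock (see above). */
        tbmpre = pbms_parallel_iterate(node->prefetch_iterator,
                                       pbminfo->prefetch_iterator);
        if (tbmpre == NULL)
            break;

        /* Heavy call that may ereport() -- now outside any spinlock. */
        PrefetchBuffer(node->ss.ss_currentRelation, MAIN_FORKNUM,
                       tbmpre->blockno);
    }
}

Workers may reach PrefetchBuffer() out of order, as you noted, but
every required block still gets prefetched.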

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
