Re: Parallel Seq Scan

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-10-04 04:51:21
Message-ID: CAA4eK1JqeUxRJ4svJXG6Z6s7EzDBz-w7ES=jbYi6iTd5zEZbMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Oct 3, 2015 at 11:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
> >> Does heap_parallelscan_nextpage really need a pscan_finished output
> >> parameter, or can it just return InvalidBlockNumber to indicate end of
> >> scan? I think the latter can be done and would be cleaner.
> >
> > I think we can do that way as well, however we have to keep the check
> > of page == InvalidBlockNumber at all the callers to indicate finish
> > of scan which made me write the function as it exists in patch.
However,
> > I don't mind changing it the way you have suggested if you feel that
would
> > be cleaner.
>
> I think it would. I mean, you just end up testing the other thing
instead.
>

No issues, will change in next version of patch.

> >> I think that heap_parallelscan_initialize() should taken an
> >> additional Boolean argument, allow_sync. It should decide whether to
> >> actually perform a syncscan using the logic from initscan(), and then
> >> it should store a phs_syncscan flag into the ParallelHeapScanDesc.
> >> heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
> >> regardless of anything else.
> >
> > I think this will ensure that any future change in this area won't
break the
> > guarantee for sync scans for parallel workers, is that the reason you
> > prefer to implement this function in the way suggested by you?
>
> Basically. It seems pretty fragile the way you have it now.
>

Okay, in that case I will change it as per your suggestion.

> >> I think heap_parallel_rescan() is an unworkable API. When rescanning
> >> a synchronized scan, the existing logic keeps the original start-block
> >> so that the scan's results are reproducible,
> >
> > It seems from the code comments in initscan, the reason for keeping
> > previous startblock is to allow rewinding the cursor which doesn't hold
for
> > parallel scan. We might or might not want to support such cases with
> > parallel query, but even if we want to there is a way we can do with
> > current logic (as mentioned in one of my replies below).
>
> You don't need to rewind a cursor; you just need to restart the scan.
> So for example if we were on the inner side of a NestLoop, this would
> be a real issue.
>

Sorry, but I am not able to see the exact issue you have in mind for
NestLoop,
if we don't preserve the start block position for parallel scan. The code
in
discussion is added by below commit which doesn't indicate any such problem.

******
commit 61dd4185ffb034a22b4b40425d56fe37e7178488
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> Thu Jun 11 00:24:16 2009
Committer: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> Thu Jun 11 00:24:16 2009

Keep rs_startblock the same during heap_rescan, so that a rescan of a
SeqScan
node starts from the same place as the first scan did. This avoids
surprising
behavior of scrollable and WITH HOLD cursors, as seen in Mark Kirkwood's bug
report of yesterday.

It's not entirely clear whether a rescan should be forced to drop out of the
syncscan mode, but for the moment I left the code behaving the same on that
point. Any change there would only be a performance and not a correctness
issue, anyway.
******

> >> but no longer reports the
> >> scan position since we're presumably out of step with other backends.
> >
> > Is it true for all form of rescans or are you talking about some
> > case (like SampleScan) in particular? As per my reading of code
> > (and verified by debugging that code path), it doesn't seem to be true
> > for rescan in case of seqscan.
>
> I think it is:
>
> if (keep_startblock)
> {
> /*
> * When rescanning, we want to keep the previous startblock
setting,
> * so that rewinding a cursor doesn't generate surprising results.
> * Reset the active syncscan setting, though.
> */
> scan->rs_syncscan = (allow_sync && synchronize_seqscans);
> }
>

Okay, but this code doesn't indicate that scan position will not be
reported.
It just resets the flag based on which scan position is reported. It won't
change during rescan as compare to first scan unless the value of
synchronize_seqscans is changed in-between. So under some special
circumstances, it may change but not as a general rule.

> >> This isn't going to work at all with this API. I don't think you can
> >> swap out the ParallelHeapScanDesc for another one once you've
> >> installed it;
> >>
> >
> > Sure, but if we really need some such parameters like startblock
position,
> > then we can preserve those in ScanDesc.
>
> Sure, we could transfer the information out of the
> ParallelHeapScanDesc and then transfer it back into the new one, but I
> have a hard time thinking that's a good design.
>

I also don't think that is the best way to achieve it, but on the other hand
doing it the other way will pose some restrictions which doesn't matter much
for the current cases, but has the potential to pose similar restrictions
to support
rescan for future operations (assume a case where the scan has to be done
by worker based on some key which will be regenerated for each rescan). I
am not telling that we can't find some way to support the same, but the need
of doing so doesn't seem to be big enough.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shay Rojansky 2015-10-04 05:59:17 Odd query execution behavior with extended protocol
Previous Message Paul Ramsey 2015-10-04 02:40:40 Re: [PATCH] postgres_fdw extension support