Re: Parallel Seq Scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-10-03 18:05:03
Message-ID: CA+TgmoazFeUhBNYYUE0K4rhtqQAQT4=RBVkKuvod+dOxHTgvDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Does heap_parallelscan_nextpage really need a pscan_finished output
>> parameter, or can it just return InvalidBlockNumber to indicate end of
>> scan? I think the latter can be done and would be cleaner.
>
> I think we can do that way as well, however we have to keep the check
> of page == InvalidBlockNumber at all the callers to indicate finish
> of scan which made me write the function as it exists in patch. However,
> I don't mind changing it the way you have suggested if you feel that would
> be cleaner.

I think it would. I mean, you just end up testing the other thing instead.

>> I think that heap_parallelscan_initialize() should taken an
>> additional Boolean argument, allow_sync. It should decide whether to
>> actually perform a syncscan using the logic from initscan(), and then
>> it should store a phs_syncscan flag into the ParallelHeapScanDesc.
>> heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
>> regardless of anything else.
>
> I think this will ensure that any future change in this area won't break the
> guarantee for sync scans for parallel workers, is that the reason you
> prefer to implement this function in the way suggested by you?

Basically. It seems pretty fragile the way you have it now.

>> I think heap_parallel_rescan() is an unworkable API. When rescanning
>> a synchronized scan, the existing logic keeps the original start-block
>> so that the scan's results are reproducible,
>
> It seems from the code comments in initscan, the reason for keeping
> previous startblock is to allow rewinding the cursor which doesn't hold for
> parallel scan. We might or might not want to support such cases with
> parallel query, but even if we want to there is a way we can do with
> current logic (as mentioned in one of my replies below).

You don't need to rewind a cursor; you just need to restart the scan.
So for example if we were on the inner side of a NestLoop, this would
be a real issue.

>> but no longer reports the
>> scan position since we're presumably out of step with other backends.
>
> Is it true for all form of rescans or are you talking about some
> case (like SampleScan) in particular? As per my reading of code
> (and verified by debugging that code path), it doesn't seem to be true
> for rescan in case of seqscan.

I think it is:

if (keep_startblock)
{
/*
* When rescanning, we want to keep the previous startblock setting,
* so that rewinding a cursor doesn't generate surprising results.
* Reset the active syncscan setting, though.
*/
scan->rs_syncscan = (allow_sync && synchronize_seqscans);
}

>> This isn't going to work at all with this API. I don't think you can
>> swap out the ParallelHeapScanDesc for another one once you've
>> installed it;
>>
>
> Sure, but if we really need some such parameters like startblock position,
> then we can preserve those in ScanDesc.

Sure, we could transfer the information out of the
ParallelHeapScanDesc and then transfer it back into the new one, but I
have a hard time thinking that's a good design.

> PARAM_EXEC parameters will be used to support initPlan in parallel query (as
> it is done in the initial patch), so it seems to me this is the main
> downside of
> this approach. I think rather than trying to come up with various possible
> solutions for this problem, lets prohibit sync scans with parallel query if
> you
> see some problem with the suggestions made by me above. Do you see any
> main use case getting hit due to non support of sync scans with
> parallel seq. scan?

Yes. Synchronized scans are particularly important with large tables,
and those are the kind you're likely to want to use a parallel
sequential scan on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2015-10-03 18:37:35 Re: Less than ideal error reporting in pg_stat_statements
Previous Message Peter Geoghegan 2015-10-03 17:07:49 Re: Refactoring speculative insertion with unique indexes a little