Re: Parallel Seq Scan

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-02-10 20:56:53
Message-ID: 20150210205653.GM21017@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-02-10 09:23:02 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > And good chunk sizes et al depend on higher layers,
> > selectivity estimates and such. And that's planner/executor work, not
> > the physical layer (which heapam.c pretty much is).
>
> If it's true that a good chunk size depends on the higher layers, then
> that would be a good argument for doing this differently, or at least
> exposing an API for the higher layers to tell heapam.c what chunk size
> they want. I hadn't considered that possibility - can you elaborate
> on why you think we might want to vary the chunk size?

Because things like chunk size depend on the shape of the entire
plan. If you have a 1TB table and want to sequentially scan it in
parallel with 10 workers, you'd better use some rather large chunks. That
way readahead will be efficient in a cpu/socket local manner,
i.e. directly reading the pages into the directly connected memory of
that cpu. That's important for performance on a NUMA system; otherwise
you'll constantly have everything go over the shared bus. But if you
instead have a plan where the sequential scan goes over a 1GB table,
perhaps with some relatively expensive filters, you'll really want a
small chunk size to avoid waiting. The chunk size will also really
depend on what other nodes are doing, at least if they can run in the
same worker.

Even without things like NUMA and readahead I'm pretty sure that you'll
want a chunk size a good bit above one page. The locks we acquire for
the buffercache lookup and for reading the page are already quite bad
for performance/scalability, even if we don't always/often hit the same
lock. Making 20 processes that scan pages in parallel acquire yet
another lock (that's shared between all of them!) for every single page
won't be fun, especially with no or only fast filters.
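
To make that concrete, here's a rough sketch (not from the patch;
ParallelScanShared, WorkerChunk, CHUNK_SIZE and parallel_scan_nextblock
are names I just made up) of handing out whole chunks instead of single
pages, so the shared state is only touched once per chunk:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/block.h"

#define CHUNK_SIZE 64				/* blocks a worker claims at once */

typedef struct ParallelScanShared
{
	BlockNumber			phs_nblocks;	/* total blocks in the relation */
	pg_atomic_uint32	phs_nextblock;	/* next block not yet handed out */
} ParallelScanShared;

typedef struct WorkerChunk			/* backend-local, one per worker */
{
	BlockNumber	cur;				/* next block to return locally */
	BlockNumber	end;				/* one past the chunk's last block */
} WorkerChunk;

static BlockNumber
parallel_scan_nextblock(ParallelScanShared *shared, WorkerChunk *chunk)
{
	if (chunk->cur >= chunk->end)
	{
		/* claim a new chunk with a single atomic operation */
		uint32		start = pg_atomic_fetch_add_u32(&shared->phs_nextblock,
													CHUNK_SIZE);

		if (start >= shared->phs_nblocks)
			return InvalidBlockNumber;	/* scan is done */

		chunk->cur = start;
		chunk->end = Min(start + CHUNK_SIZE, shared->phs_nblocks);
	}

	return chunk->cur++;
}

Obviously CHUNK_SIZE would have to come from somewhere sensible rather
than a #define, which is exactly the problem discussed above.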

> >> For this case, what I would imagine is that there is one parallel heap
> >> scan, and each PartialSeqScan attaches to it. The executor says "give
> >> me a tuple" and heapam.c provides one. Details like the chunk size
> >> are managed down inside heapam.c, and the executor does not know about
> >> them. It just knows that it can establish a parallel scan and then
> >> pull tuples from it.
> >
> > I think that's a horrible approach that'll end up with far more
> > entangled pieces than what you're trying to avoid. Unless the tuple flow
> > is organized to only happen in the necessary cases the performance will
> > be horrible.
>
> I can't understand this at all. A parallel heap scan, as I've coded
> it up, involves no tuple flow at all. All that's happening at the
> heapam.c layer is that we're coordinating which blocks to scan. Not
> to be disrespectful, but have you actually looked at the patch?

No, and I said so upthread. I started commenting because you argued that
architecturally parallelism belongs in heapam.c instead of upper layers,
and I can't agree with that. I've now looked at it, and it looks less
bad than I had assumed, sorry.

Unfortunately I still think it's the wrong approach, also sorry.

As pointed out above (moved there after reading the patch...), I don't
think a chunk size of 1, or any other constant size, can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end). Now, we could move
determining that to before query execution, into executor
initialization, but then we won't yet know how many workers we're going
to get. We could add a function setting that at runtime, but that'd mix
up responsibilities quite a bit.

I also can't agree with having a static snapshot in shared memory put
there by the initialization function. For one it's quite awkward to end
up with several equivalent snapshots at various places in shared
memory. Right now the entire query execution can share one snapshot,
this way we'd end up with several of them. Imo for actual parallel
query execution the plan should be shared once and then be reused for
everything done in the name of the query.

Without the need to do that you end up pretty much only with the setup
infrastructure so that heap_parallelscan_nextpage can be called. How
about instead renaming heap_beginscan_internal() to _extended and
offering an option to provide a callback + state that determines the
next page? Additionally, provide some separate functions managing a
simple implementation of such a callback + state?
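
Very roughly something like this, just as a sketch of the API shape
(none of these declarations exist; the names are made up):

typedef BlockNumber (*NextPageCallback) (HeapScanDesc scan, void *cbstate);

extern HeapScanDesc heap_beginscan_extended(Relation relation,
											Snapshot snapshot,
											int nkeys, ScanKey key,
											NextPageCallback nextpage_cb,
											void *cbstate);

plus a trivial heapam-provided implementation of such a callback (a
shared counter over a block range) that callers can use when they don't
need anything smarter.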

Btw, using an atomic uint32 you'd end up without the spinlock and with
just about the same amount of code... Just do an
atomic_fetch_add_until32(var, 1, InvalidBlockNumber)... ;)
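
Sketched on top of the existing 32bit compare-exchange (the _until
primitive itself obviously doesn't exist yet):

#include "postgres.h"
#include "port/atomics.h"

/* add 'add_' to *var, but never let it go past 'until'; returns old value */
static uint32
pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *var,
							  uint32 add_, uint32 until)
{
	uint32		old = pg_atomic_read_u32(var);

	while (old < until)
	{
		uint32		newval = Min(old + add_, until);

		/* on failure 'old' is updated to the current value; retry */
		if (pg_atomic_compare_exchange_u32(var, &old, newval))
			break;
	}

	return old;
}

That'd replace the spinlock-protected counter with a single lockless
call per page (or per chunk).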

> >> I think we're in violent agreement here, except for some
> >> terminological confusion. Are there N PartialSeqScan nodes, one
> >> running in each node, or is there one ParallelSeqScan node, which is
> >> copied and run jointly across N nodes? You can talk about either way
> >> and have it make sense, but we haven't had enough conversations about
> >> this on this list to have settled on a consistent set of vocabulary
> >> yet.
> >
> > I pretty strongly believe that it has to be independent scan nodes. Both
> > from a implementation and a conversational POV. They might have some
> > very light cooperation between them (e.g. coordinating block ranges or
> > such), but everything else should be separate. From an implementation
> > POV it seems pretty awful to have executor node that's accessed by
> > multiple separate backends - that'd mean it have to be concurrency safe,
> > have state in shared memory and everything.
>
> I don't agree with that, but again I think it's a terminological
> dispute. I think what will happen is that you will have a single node
> that gets copied into multiple backends, and in some cases a small
> portion of its state will live in shared memory. That's more or less
> what you're thinking of too, I think.

Well, let me put it this way: I think that the tuple flow has to be
pretty much like I'd ascii-art'ed earlier, and that only very few nodes
will need to coordinate between query execution happening in different
workers. With that I mean it has to be possible to have queries like:

   ParallelismDrivingNode
             |
    ---------------- Parallelism boundary
             |
          NestLoop
         /        \
   CSeqScan     IndexScan

Where the 'coordinated seqscan' scans a relation so that each tuple
eventually gets returned once across all nodes, but the nested loop (and
through it the index scan) will just run normally, without any
coordination and parallelism. But everything below the --- would happen
in multiple workers. If you agree, yes, then we're in violent agreement
;). The "single node that gets copied" bit above makes me a bit unsure
whether we are though.

To me, given the existing executor code, it seems easiest to achieve
that by giving the ParallelismDrivingNode above a dynamic number of
nestloop children in different backends and pointing the coordinated
seqscan to some shared state. As you point out, the number of these
children cannot be known for certain at plan time (only targeted for);
that puts a certain limit on how independent they are. But since a large
part of them can run independently between workers, it seems awkward to
generally treat them as being the same node across workers. But maybe
that's just an issue with my mental model.

> But what I don't want is - if we've got a parallel scan-and-aggregate
> happening in N nodes, EXPLAIN shows N copies of all of that - not only
> because it's display clutter, but also because a plan to do that thing
> with 3 workers is fundamentally the same as a plan to do it with 30
> workers. Those plans shouldn't look different, except perhaps for a
> line some place that says "Number of Workers: N".

I'm really not concerned with what explain is going to show. We can do
quite some fudging there - it's not like it's a 1:1 representation of
the query plan.

I think we're getting to the point where having a unique mapping from
the plan to the execution tree is proving to be rather limiting
anyway. Check, for example, the discussion about join removal. But even
for current code, showing only the custom plans for the first five
EXPLAIN EXECUTEs is pretty nasty (try explaining that to somebody who
doesn't know pg internals; their looks are worth gold and can kill you
at the same time) and should be done differently.

And I actually can very well imagine that you'd want an option to show
the different execution statistics for every worker in the ANALYZE case.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
