Re: Parallel Seq Scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-02-06 19:13:33
Message-ID: CA+TgmobzZjGO9voG9J_7=U8bwbrLb1kzo5AZ2Q9JT12NH8j74g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 6, 2015 at 12:34 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> 4.

Obviously that went out a bit too soon. Anyway, what I think we
should do here is back up a bit and talk about what problems we
actually need to solve and how each of them should be solved. I
think there is some good code in this patch, but we really need to
think about what the interfaces should look like and achieve a clean
separation of concerns.

Looking at the code for the non-parallel SeqScan node, there are
basically two things going on here:

1. We call heap_getnext() to get the next tuple and store it into a
TupleTableSlot.
2. Via ExecScan(), we do projection and apply quals.
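
For reference, the guts of the non-parallel case are pretty small.  This
is a simplified sketch of what SeqNext() in nodeSeqscan.c boils down to
today (not the verbatim source):

static TupleTableSlot *
SeqNext(SeqScanState *node)
{
    HeapScanDesc    scandesc = node->ss_currentScanDesc;
    TupleTableSlot *slot = node->ss_ScanTupleSlot;
    HeapTuple       tuple;

    /* Fetch the next tuple in the current scan direction. */
    tuple = heap_getnext(scandesc, node->ps.state->es_direction);

    if (tuple != NULL)
        ExecStoreTuple(tuple, slot, scandesc->rs_cbuf, false);
    else
        ExecClearTuple(slot);

    return slot;
}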

My first comment here is that I think we should actually teach
heapam.c about parallelism. In other words, let's have an interface
like this:

extern Size heap_parallelscan_estimate(Snapshot snapshot);
extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
                                         Relation relation, Snapshot snapshot);
extern HeapScanDesc heap_beginscan_parallel(ParallelHeapScanDesc);

So the idea is that if you want to do a parallel scan, you call
heap_parallelscan_estimate() to figure out how much space to reserve
in your dynamic shared memory segment. Then you call
heap_parallelscan_initialize() to initialize the chunk of memory once
you have it. Then each backend that wants to assist in the parallel
scan calls heap_beginscan_parallel() on that chunk of memory and gets
its own HeapScanDesc. Then, they can all call heap_getnext() just as
in the non-parallel case. The ParallelHeapScanDesc needs to contain
the relation OID, the snapshot, the ending block number, and a
current-block counter. Instead of automatically advancing to the next
block, they use one of Andres's nifty new atomic ops to bump the
current-block counter and then scan the block just before the new
value. All this seems pretty straightforward, and if we decide to
later change the way the relation gets scanned (e.g. in 1GB chunks
rather than block-by-block) it can be handled here pretty easily.
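
To make that concrete, here is a rough sketch of what I'm imagining; the
struct layout, field names, and the helper function are just placeholders,
not a worked-out proposal:

/* Shared state for a parallel heap scan, living in dynamic shared memory. */
typedef struct ParallelHeapScanDescData
{
    Oid              phs_relid;     /* OID of the relation being scanned */
    BlockNumber      phs_nblocks;   /* ending block number (scan stops here) */
    pg_atomic_uint32 phs_cblock;    /* next block number to hand out */
    char             phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER]; /* snapshot */
} ParallelHeapScanDescData;

/*
 * Atomically claim the next block to scan, or return InvalidBlockNumber
 * if the scan is finished.  fetch-and-add returns the old value, which
 * is exactly "the block just before the new value".
 */
static BlockNumber
heap_parallelscan_nextblock(ParallelHeapScanDesc parallel_scan)
{
    uint32      blkno = pg_atomic_fetch_add_u32(&parallel_scan->phs_cblock, 1);

    if (blkno >= parallel_scan->phs_nblocks)
        return InvalidBlockNumber;
    return (BlockNumber) blkno;
}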

Now, let's suppose that we have this interface and for some reason we
don't care about quals and projection - we just want to get the tuples
back to the master. It's easy enough to create a parallel context
that fires up a worker and lets the worker call
heap_beginscan_parallel() and then heap_getnext() in a loop, but what
does it do with the resulting tuples? We need a tuple queue that can
be used to send the tuples back to the master. That's also pretty easy:
just set up a shm_mq and use shm_mq_send() to send each tuple. Use
shm_mq_receive() in the master to read them back out. The only thing
we need to be careful about is that the tuple descriptors match. It
must be that they do, because the way the current parallel context
patch works, the master is guaranteed to hold a lock on the relation
from before the worker starts up until after it dies. But we could
stash the tuple descriptor in shared memory and cross-check that it
matches just to be sure. Anyway, this doesn't seem terribly complex
although we might want to wrap some abstraction around it somehow so
that every kind of parallelism that uses tuple queues can benefit from
it. Perhaps this could even be built into the parallel context
machinery somehow, or maybe it's something executor-specific. At any
rate it looks simpler than what you've got now.
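
As a strawman, the worker side of that could be not much more than the
loop below.  Error handling, attaching to the DSM segment, and setting
up the queue are omitted; "mqh" here is a shm_mq_handle for a queue the
master created with shm_mq_create() and the worker attached to with
shm_mq_attach():

/* Worker: scan the relation and push every tuple onto the queue. */
static void
worker_scan_to_queue(ParallelHeapScanDesc pscan, shm_mq_handle *mqh)
{
    HeapScanDesc    scan = heap_beginscan_parallel(pscan);
    HeapTuple       tuple;

    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        /* Ship the raw tuple; nowait = false, so this blocks when full. */
        shm_mq_result   res = shm_mq_send(mqh, tuple->t_len,
                                          tuple->t_data, false);

        if (res != SHM_MQ_SUCCESS)
            break;              /* the master detached; give up */
    }
    heap_endscan(scan);
}

The master then just loops over shm_mq_receive(), reconstructing a
HeapTuple from the returned length and pointer and stuffing it into a
slot.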

The complicated part here, it seems to me, is figuring out what we need
to pass from the parallel leader to the parallel worker to create enough
state for quals and projection. If we want to be able to call
ExecScan() without modification, which seems like a good goal, we're
going to need a ScanState node, which is going to need to contain
valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
List of quals. That in turn is going to require an ExecutorState.
Serializing those things directly doesn't seem very practical; what we
instead want to do is figure out what we can pass that will allow easy
reconstruction of those data structures. Right now, you're passing
the target list, the qual list, the range table, and the params, but
the range table doesn't seem to be getting used anywhere. I wonder if
we need it. If we could get away with just passing the target list,
the qual list, and the params, we'd be doing pretty well, I think. But
I'm not sure exactly what that looks like.
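
One possibility, purely as a sketch, is to lean on the existing node
serialization machinery: the leader runs nodeToString() on the pieces
the worker needs, copies the strings into the DSM segment, and the
worker turns them back into node trees with stringToNode() and runs the
usual ExecInit* code on them.  The variables and DSM layout below are
hypothetical:

/* Leader: flatten the target list and quals (scan_plan is the SeqScan node). */
char   *tlist_str = nodeToString(scan_plan->plan.targetlist);
char   *qual_str = nodeToString(scan_plan->plan.qual);
/* ...copy both strings (and the params) into the DSM segment... */

/* Worker: rebuild the node trees and initialize them as usual. */
List   *tlist = (List *) stringToNode(tlist_str);
List   *qual = (List *) stringToNode(qual_str);
/* ...ExecInitExpr() these against the worker's ScanState, build a
 * ProjectionInfo and ExprContext, and ExecScan() should be happy... */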

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
