Re: How to estimate the shared memory size required for parallel scan?

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Masayuki Takahashi <masayuki038(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How to estimate the shared memory size required for parallel scan?
Date: 2018-08-19 23:08:25
Message-ID: CAEepm=1Z_QY9fK6sSro-3AV+LfvGXpO52WBKrqTn7DzYyocQqQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Aug 19, 2018 at 4:28 PM, Masayuki Takahashi
<masayuki038(at)gmail(dot)com> wrote:
>> If you just supply an IsForeignScanParallelSafe function that returns
>> true, that would allow your FDW to be used inside parallel workers and
>> wouldn't need any extra shared memory, but it wouldn't be a "parallel
>> scan". It would just be "parallel safe". Each process that does a
>> scan of your FDW would expect a full normal scan (presumably returning
>> the same tuples in each process).
>
> I think the parallel scan mechanism applies each worker's full
> normal scan to a different partition, right?
> For example, I turned IsForeignScanParallelSafe to true in cstore_fdw
> and compared partitioned/non-partitioned scan.
>
> https://gist.github.com/masayuki038/daa63a21f8c16ffa8138b50db9129ced
>
> This shows that the count is computed per partition and 'Gather Merge'
> merges the results. As a result, the parallel scan plus aggregation
> gives the correct count.

Ah, so here you have a Parallel Append node. That is a way to get
coarse-grained parallelism when you have only parallel-safe (not
parallel-aware) scans, but you have partitions. Technically (in our
jargon) there is no parallel scan happening here, but Parallel Append
is smart enough to scan each partition in a different worker. That
means that the 'granularity' of parallelism is whole tables
(partitions), so if you have (say) 3 partitions of approximately the
same size and 2 processes, you'll probably see that one of the
processes scans 1 partition and the other process scans 2 partitions,
so the work can be quite unbalanced. But if you have lots of
partitions, it's good, and in any case it's certainly better than no
parallelism.

> Then, in the case of cstore_fdw, it may not be necessary to reserve
> the shared memory in EstimateDSMForeignScan.

Correct. If all you need is parallel-safe scans, then you probably
don't need any shared memory.

BTW to be truly pedantically parallel-safe, I think it should ideally
be the case that each process has the same "snapshot" when scanning,
or subtle inconsistencies could arise (a transaction could be visible
to one process, but not to another; this would be weirder if it
applied to concurrent scans of the *same* foreign table, but it could
still be strange when scanning different partitions in a Parallel
Append). For file_fdw, we just didn't worry about that because plain
old text files are not transactional anyway, so we shrugged and
declared its scans to be parallel safe. I suppose that any FDW that
is backed by a non-snapshot-based system (including other RDBMSs)
would probably have no way to do better than that, and you might make
the same decision we made for file_fdw. When the foreign table is
PostgreSQL, or an extension that is tightly integrated into our
transaction system, I suppose you might want to think harder and maybe
even give the user some options?

>> So I guess this hasn't been done before and would require some more
>> research.
>
> I agree. I will try some query patterns.
> thanks.

Just to be clear, there I was talking about true Parallel Foreign
Scan, which is aiming a bit higher than mere parallel safety. After
looking at this again, this time with the benefit of coffee, I *think*
it should be possible without modifying core, if you do this:

1. As already mentioned, you need to figure out a way for cstore_fdw
to hand out a disjoint set of tuples to different processes. That
seems quite doable, since cstore is apparently block-structured
(though I only skim-read it for about 7 seconds and could be wrong
about that). You apparently have blocks and stripes: hopefully they
are of fixed size so you might be able to teach each process to
advance some kind of atomic variable in shared memory so that each
process eats different blocks?

2. Teach your GetForeignPaths function to do something like this:

    ForeignPath *partial_path;
    double       parallel_divisor;
    int          parallel_workers;

    ... existing code that adds regular non-partial path here ...

    /* Should we add a partial path to enable a parallel scan? */
    partial_path = create_foreignscan_path(root, baserel, NULL,
                                           baserel->rows,
                                           startup_cost,
                                           total_cost,
                                           NIL, NULL, NULL, coptions);
    parallel_workers = compute_parallel_worker(baserel,
                                               expected_num_pages, -1,
                                               max_parallel_workers_per_gather);
    partial_path->path.parallel_workers = parallel_workers;
    partial_path->path.parallel_aware = true;
    parallel_divisor = get_parallel_divisor(&partial_path->path);
    partial_path->path.rows /= parallel_divisor;
    partial_path->path.total_cost = startup_cost +
        ((total_cost - startup_cost) / parallel_divisor);
    if (parallel_workers > 0)
        add_partial_path(baserel, (Path *) partial_path);

You don't really have to use compute_parallel_worker() and
get_parallel_divisor() if you have a smarter way of coming up with
those numbers, but I'd probably use that logic to get started.
Unfortunately get_parallel_divisor() is not an extern function so
you'd need a clone of it, or equivalent logic. It's also a bit
inconvenient that it takes a Path * instead of just parallel_workers,
which would allow tidier coding here. It's also inconvenient that you
can't ALTER TABLE my_foreign_table SET (parallel_workers = N) today,
which compute_parallel_worker() would respect.

--
Thomas Munro
http://www.enterprisedb.com
