Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-03-26 18:01:58
Message-ID: CAH2-WzmQUhfpSZrvcDmyk1OtydYv=TWRYJdp75m6=9hUaeHbFA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 21, 2024 at 8:32 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v16. This has some of the changes that you asked for. But
> my main focus has been on fixing regressions that were discovered
> during recent performance validation.

Attached is v17. This revision deals with problems with parallel index
scans that use array keys.

This patch is now *very* close to being committable. My only
outstanding concern is the costing/selfuncs.c aspects of the patch,
which I haven't thought about in many months. I'd estimate that I'm
now less than a week away from pushing this patch. (Plus I need to
decide which tests from my massive test suite should be included in
the committed patch, as standard regression tests.)

Back to v17. Earlier versions dealt with parallel index scans in a way
that could lead to very unbalanced utilization of worker processes (at
least with the wrong sort of query/index), which was clearly
unacceptable. The underlying issue was the continued use of the
BTParallelScanDescData.btps_arrayKeyCount field in shared memory,
which was still used to stop workers from moving on to the next
primitive scan. That old approach made little sense with this new high
level design. (I have been aware of the problem for months, but didn't
get around to it until now.)

I wasn't specifically trying to win any benchmarks here -- I really
just wanted to make the design for parallel index scans with SAOPs
into a coherent part of the wider design, without introducing any
regressions. I applied my usual charter for the patch when working on
v17: more or less removing any nbtree.c parallel index scan code
that had direct awareness of array keys as distinct primitive index
scans. This fixed the problem of unbalanced worker utilization, as
expected, but it also seems to have benefits particular to parallel
index scans with array keys -- which was unexpected.

We now use an approach that's almost identical to the approach taken
by every other type of parallel scan. My new approach naturally
reduces contention among backends that want to advance the scan. (Plus
parallel index scans get all of the same benefits as serial index
scans, of course.)

v17 replaces _bt_parallel_advance_array_keys with a new function,
called _bt_parallel_primscan_advance, which is now called from
_bt_advance_array_keys. The new function doesn't need to increment
btscan->btps_arrayKeyCount (nor does it increment a local copy in
BTScanOpaqueData.arrayKeyCount). It is called at the specific point
that array key advancement *tries* to start another primitive index
scan. The difference (relative to the serial case) is that it might
not succeed in doing so. It might not be possible to start another
primitive index scan, since it might turn out that some other backend
(one page ahead, or even one page behind) already did it for
everybody. At that point it's easy for the backend that failed to
start the next primitive scan to have _bt_advance_array_keys back out
of everything. Then the backend just goes back to consuming pages by
seizing the scan.

This approach preserves the ability of parallel scans to have
_bt_readpage release the scan (making the next page available to
other backends) before it even starts examining tuples from the
current page. It's possible that 2 or 3
backends (each of which scans its own leaf pages from a group of
adjacent pages) will "independently" decide that it's time to start
another scan -- but at most one backend can be granted the right to
do so.

It's even possible that one such backend among several (all of which
might be expected to try to start the next primitive scan in unison)
will *not* want to do so -- it could know better than the rest. This
can happen when the backend lands one page ahead of the rest, and
just so happens to see no reason not to simply continue the scan on
the leaf level for now. Such a backend can effectively veto the idea of
starting another primitive index scan for the whole top-level parallel
scan. This probably doesn't really matter much, in and of itself
(skipping ahead by only one or two pages is suboptimal but not really
a problem), but that's beside the point. The point is that having very
flexible rules like this works out to be both simpler and better
performing than a more rigid, pessimistic approach would be. In
particular, keeping the time that the scan is seized by any one
backend to an absolute minimum is important -- it may be better to
detect and recover from contention than to try to prevent it
altogether.

Just to be clear, all of this only comes up during parts of the scan
where primitive index scans actually look like a good idea in the
first place. It's perfectly possible for the scan to process thousands
of array keys, without any backend ever considering starting another
primitive index scan. Much of the time, every worker process can
independently advance its own private array keys, without that
requiring any sort of special coordination (nothing beyond the
coordination required by *all* parallel index scans, including those
without any SAOP arrays).

As I've said many times already, the high level charter of this
project is to make scans that have arrays (whether serial or parallel)
as similar as possible to any other type of scan.

--
Peter Geoghegan

Attachment Content-Type Size
v17-0001-Enhance-nbtree-ScalarArrayOp-execution.patch application/x-patch 186.5 KB
