Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-04-01 22:33:22
Message-ID: CAH2-WzkPOyZaMVtPUT19CfGwRndDekEBmGw_rd+UqKyK9n2LdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 26, 2024 at 2:01 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> v17 replaces _bt_parallel_advance_array_keys with a new function,
> called _bt_parallel_primscan_advance, which is now called from
> _bt_advance_array_keys. The new function doesn't need to increment
> btscan->btps_arrayKeyCount (nor does it increment a local copy in
> BTScanOpaqueData.arrayKeyCount). It is called at the specific point
> that array key advancement *tries* to start another primitive index
> scan. The difference (relative to the serial case) is that it might
> not succeed in doing so. It might not be possible to start another
> primitive index scan, since it might turn out that some other backend
> (one page ahead, or even one page behind) already did it for
> everybody.

There was a bug in v17 around how we deal with parallel index scans.
Under the right circumstances, a parallel scan with array keys put all
of its worker processes to sleep, without subsequently waking them up
(so the scan would hang indefinitely). The underlying problem was that
v17 assumed that the backend that scheduled the next primitive index
scan would also be the one that actually performs that same primitive
scan. While it is natural and desirable for the backend that schedules
the primitive scan to be the one that actually calls _bt_search, v17
tacitly relied on that to successfully finish the scan. It's
particularly important that the leader process always be able to stop
participating as a worker immediately, whenever it needs to, and for
as long as it needs to.

Attached is v18, which adds an additional layer of indirection to fix
the bug: worker processes now schedule primitive index scans
explicitly. They store their current array keys in shared memory when
primitive scans are scheduled (during a parallel scan), allowing other
backends to pick things up where the scheduling backend left off. And
so with v18, any other backend can be the one that performs the actual
descent of the index within _bt_search.

The design is essentially the same as before, though. We'd still
prefer that it be the backend that scheduled the primitive index scan.
Other backends might well be busy finishing off their scan of
immediately preceding leaf pages.

Separately, v18 also adds many new regression tests. This greatly
improves the general situation around test coverage, which I'd put off
before now. Now there is coverage for parallel scans with array keys,
as well as coverage for the new array-specific preprocessing code.

Note: v18 doesn't have any adjustments to the costing, as originally
planned. I'll probably need to post a revised patch with improved (or
at least polished) costing in the next few days, so that others will
have the opportunity to comment before I commit the patch.

--
Peter Geoghegan

Attachment Content-Type Size
v18-0001-Enhance-nbtree-ScalarArrayOp-execution.patch application/x-patch 213.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2024-04-01 22:56:54 Re: Reports on obsolete Postgres versions
Previous Message Nathan Bossart 2024-04-01 22:15:12 Re: Popcount optimization using AVX512