Re: optimize lookups in snapshot [sub]xip arrays

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: John Naylor <jcnaylor(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: optimize lookups in snapshot [sub]xip arrays
Date: 2022-07-17 03:59:57
Message-ID: 20220717035957.GA3631071@nathanxps13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andres,

Thanks for taking a look.

On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
> Hm. Are there any contexts where we'd not want the potential for failing due
> to OOM?

I'm not sure about this one.

> I wonder if we additionally / alternatively could use a faster method of
> searching the array linearly, e.g. using SIMD.

I looked into using SIMD. The patch is attached, but it is only intended
for benchmarking purposes and isn't anywhere close to being worth serious
review. There may be a simpler/better way to implement the linear search,
but this seemed to work well. Overall, SIMD provided a decent improvement.
I had to increase the number of writers quite a bit in order to demonstrate
where the hash tables began winning. Here are the numbers:

writers head simd hash
256 663 632 694
512 530 618 637
768 489 544 573
1024 364 508 562
2048 185 306 485
4096 146 197 441

While it is unsurprising that the hash tables perform the best, there are a
couple of advantages to SIMD that might make that approach worth
considering. For one, there's really no overhead (i.e., you don't need to
sort the array or build a hash table), so we can avoid picking an arbitrary
threshold and just have one code path. Also, a SIMD implementation for a
linear search through an array of integers could likely be easily reused
elsewhere.

> Another thing that might be worth looking into is to sort the xip/subxip
> arrays into a binary-search optimized layout. That'd make the binary search
> faster, wouldn't require additional memory (a boolean indicating whether
> sorted somewhere, I guess), and would easily persist across copies of the
> snapshot.

I spent some time looking into this, but I haven't attempted to implement
it. IIUC the most difficult part of this is sorting the array in place to
the special layout.

>> These hash tables are regarded as ephemeral; they only live in
>> process-local memory and are never rewritten, copied, or
>> serialized.
>
> What does rewriting refer to here?

I mean that a hash table created for one snapshot will not be cleared and
reused for another.

> Not convinced that's the right idea in case of copying. I think we often end
> up copying snapshots frequently, and building & allocating the hashed xids
> separately every time seems not great.

Right. My concern with reusing the hash tables is that we'd need to
allocate much more space that would go largely unused in many cases.

>> + snapshot->xiph = NULL;
>> + snapshot->subxiph = NULL;
>
> Do we need separate hashes for these? ISTM that if we're overflowed then we
> don't need ->subxip[h], and if not, then the action for an xid being in ->xip
> and ->subxiph is the same?

Do you mean that we can combine these into one hash table? That might
work.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
xip_search_simd.patch text/x-diff 3.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2022-07-17 06:00:09 Re: Proposal to introduce a shuffle function to intarray extension
Previous Message Tom Lane 2022-07-17 03:37:26 Re: Proposal to introduce a shuffle function to intarray extension