Re: RFC: Improve CPU cache locality of syscache searches

From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-05 20:12:01
Message-ID: 20210805201201.cnc4hagkglxk4pos@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-08-05 12:27:49 -0400, John Naylor wrote:
> On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2021-08-04 12:39:29 -0400, John Naylor wrote:
> > > typedef struct cc_bucket
> > > {
> > >   uint32 hashes[4];
> > >   catctup *ct[4];
> > >   dlist_head;
> > > };
> >
> > I'm not convinced that the above is the right idea though. Even if the hash
> > matches, you're still going to need to fetch at least catctup->keys[0] from
> > a separate cacheline to be able to return the cache entry.
>
> I see your point. It doesn't make sense to inline only part of the
> information needed.

At least not for the unconditionally needed information.

> Although I'm guessing inlining just two values in the 4-key case wouldn't
> buy much.

Not so sure about that. I'd guess that two key comparisons take more cycles
than a cacheline fetch for the further keys (perhaps not if we had inlined key
comparisons). I.e. I'd expect out-of-order + speculative execution to hide the
latency of fetching the second cacheline for the later key values.
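
As a rough sketch of what inlining the unconditionally needed information
could look like: the bucket below keeps the first key of each slot next to its
hash, so the out-of-line entry is only touched once hash and first key both
match. Everything here is hypothetical (cc_bucket_sketch, cc_entry,
bucket_lookup, and a plain-integer stand-in for Datum), and it assumes all keys
are pass-by-value and comparable with ==, which is not how the "eqfast"
functions work today:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

#define BUCKET_SLOTS 4
#define MAX_KEYS 4

/* Hypothetical out-of-line entry; lives in its own cacheline(s). */
typedef struct cc_entry
{
    Datum       keys[MAX_KEYS];
} cc_entry;

/*
 * Hypothetical bucket. On a 64-bit build the hashes and first keys occupy
 * the first 48 bytes, so they can share one 64-byte cacheline if the bucket
 * is suitably aligned; the entry pointers spill into the next line.
 */
typedef struct cc_bucket_sketch
{
    uint32_t    hashes[BUCKET_SLOTS];
    Datum       key0[BUCKET_SLOTS];
    cc_entry   *entries[BUCKET_SLOTS];
} cc_bucket_sketch;

/* Return the matching entry, or NULL if no slot matches. */
static cc_entry *
bucket_lookup(const cc_bucket_sketch *b, uint32_t hash,
              const Datum *keys, int nkeys)
{
    assert(nkeys >= 1 && nkeys <= MAX_KEYS);

    for (int i = 0; i < BUCKET_SLOTS; i++)
    {
        cc_entry   *e = b->entries[i];
        int         k;

        /* Cheap rejects using only data stored in the bucket itself. */
        if (e == NULL || b->hashes[i] != hash || b->key0[i] != keys[0])
            continue;

        /* Only now dereference the entry to compare keys 1..nkeys-1. */
        for (k = 1; k < nkeys; k++)
        {
            if (e->keys[k] != keys[k])
                break;
        }
        if (k == nkeys)
            return e;
    }
    return NULL;
}
```

For single-key caches the inner loop never runs, so a hit is decided without
reading the entry at all; for multi-key caches the entry fetch only happens
for slots that already passed the hash and first-key checks.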

> > If we stuffed four values into one bucket we could potentially SIMD the hash
> > and Datum comparisons ;)
>
> ;-) That's an interesting future direction to consider when we support
> building with x86-64-v2. It'd be pretty easy to compare a vector of hashes
> and quickly get the array index for the key comparisons (ignoring for the
> moment how to handle the rare case of multiple identical hashes).
> However, we currently don't memcmp() the Datums and instead call an
> "eqfast" function, so I don't see how that part would work in a vector
> setting.

It definitely couldn't work unconditionally - we have to deal with comparisons
of text, oidvector and the like, after all. But we could use it for the other
types. However, looking at the syscaches, I don't think it would very often be
applicable to caches with enough columns.
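
For the hash part alone, a minimal sketch of the vector comparison might look
like the following. The function name match_hashes4 is made up. The intrinsics
are plain SSE2 (part of the x86-64 baseline), and there's a scalar fallback
for other platforms. It returns a bitmask rather than a single index, so
multiple identical hashes simply show up as multiple set bits:

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: return a 4-bit mask in which bit i is set when
 * hashes[i] == hash.
 */
static inline unsigned
match_hashes4(const uint32_t hashes[4], uint32_t hash)
{
#if defined(__SSE2__)
    /* Load the four stored hashes and broadcast the probe hash. */
    __m128i     stored = _mm_loadu_si128((const __m128i *) hashes);
    __m128i     probe = _mm_set1_epi32((int) hash);

    /* Each 32-bit lane becomes all-ones on equality, zero otherwise. */
    __m128i     eq = _mm_cmpeq_epi32(stored, probe);

    /* Collect the sign bit of each lane into the low four bits. */
    return (unsigned) _mm_movemask_ps(_mm_castsi128_ps(eq));
#else
    unsigned    mask = 0;

    for (int i = 0; i < 4; i++)
    {
        if (hashes[i] == hash)
            mask |= 1u << i;
    }
    return mask;
#endif
}
```

The caller would walk the set bits and run the key comparison for each
candidate slot, which is where the non-vectorizable types still need their
regular comparison functions.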

I have wondered before whether we should have syscache definitions generate
code specific to each syscache definition. I do think that'd give a good
performance boost. But I don't see a trivial way to get there without
notational overhead.

We could define syscaches in a header using a macro that's defined differently
in syscache.c than everywhere else. The header would declare a set of
functions for each syscache, and syscache.c would define them to call an
always_inline function with the relevant constants.
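
Very roughly, and collapsed into a single file for illustration, the macro
approach could be sketched like this. All names (SYSCACHE, SYSCACHE_DEFINITIONS,
SearchDemoOneKey, SearchDemoTwoKey, search_generic) are invented, the "search"
is reduced to comparing pass-by-value keys, and the always_inline attribute is
assumed to be available only on GCC-compatible compilers:

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

#if defined(__GNUC__)
#define sketch_always_inline __attribute__((always_inline)) inline
#else
#define sketch_always_inline inline
#endif

/* "Header": each cache is listed once, in terms of SYSCACHE(name, nkeys). */
#define SYSCACHE_DEFINITIONS \
    SYSCACHE(DemoOneKey, 1) \
    SYSCACHE(DemoTwoKey, 2)

/* Everywhere else, SYSCACHE() only declares the per-cache function. */
#define SYSCACHE(name, nkeys) \
    int Search##name(const Datum *stored, const Datum *probe);
SYSCACHE_DEFINITIONS
#undef SYSCACHE

/* "syscache.c": one generic body, specialized by constant propagation. */
static sketch_always_inline int
search_generic(const Datum *stored, const Datum *probe, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
    {
        if (stored[i] != probe[i])
            return 0;
    }
    return 1;
}

/* In "syscache.c", SYSCACHE() defines the functions with nkeys hardcoded. */
#define SYSCACHE(name, nkeys) \
    int Search##name(const Datum *stored, const Datum *probe) \
    { return search_generic(stored, probe, nkeys); }
SYSCACHE_DEFINITIONS
#undef SYSCACHE
```

Because nkeys is a compile-time constant at each call site of the inlined
function, the compiler is free to unroll the loop per cache; that's the kind of
specialization meant above, not a claim about what it would actually buy.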

Or perhaps we should move syscache definitions into the pg_*.h headers, and
generate the relevant code as part of their processing. That seems like it
could be nice from a modularity POV alone. And it could do better than the
current approach, because we could hardcode the types for columns in the
syscache definition without increasing notational overhead.

Greetings,

Andres Freund
