Caching index AM working data across aminsert calls

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Caching index AM working data across aminsert calls
Date: 2017-02-07 23:04:40
Message-ID: 27568.1486508680@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

It's always been possible for index AMs to cache data across successive
amgettuple calls within a single SQL command: the IndexScanDesc.opaque
field is meant for precisely that. However, no comparable facility
exists for amortizing setup work across successive aminsert calls.
The attached proposed patch adds such a feature and teaches gin,
gist, and brin to use it. (The other standard index AMs keep everything
they need in the relcache, so there's little to improve there.)

The improvement I see from this is fairly modest in a normal build.
In an example similar to the gin regression test's main insert query,

insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 1000000) g;

the overall insertion speed increases perhaps 10%, which is nice but
not great. gist and brin are less, maybe 5% or so. However, because
most of what happens in the saved work is catalog lookups, the savings
in a CLOBBER_CACHE_ALWAYS test build are pretty substantial: the runtime
of the gin regression script, on my workstation, goes from 40 minutes
to 4 seconds. (Yes, really.) The gist and brin test scripts are less
insert-heavy but still lose several minutes apiece. Since the core
regression tests are run multiple times (twice serially and once in
parallel) in the standard buildfarm cycle, I estimate that this patch
would cut over 1.5 hours from the cycle time for a CLOBBER_CACHE_ALWAYS
critter running on hardware similar to mine. I think that alone makes it
worth doing.

The reason this has been hard up to now is that the aminsert function
is not passed any useful place to cache per-statement data. What I
chose to do in the attached is to add suitable fields to struct IndexInfo
and pass that to aminsert. That's not widening the index AM API very
much because IndexInfo is already within the ken of ambuild. I figured
that this might be a particularly useful way to do it because it means
that aminsert also has access to the other data in the IndexInfo struct,
which might save it having to recompute some things. And it makes the
DDL info available to ambuild and aminsert more similar, which seems good
on general principles.

I also looked into the idea of using the index relcache entry's
rd_amcache field for this purpose, but that fails immediately in a
CLOBBER_CACHE_ALWAYS build, because gininsert (at least, probably the
other ones too) is not robust against its GinState disappearing from under
it mid-insert. Since rd_amcache goes away on a cache flush even if the
index is open, that doesn't work. We could maybe fix that by introducing
some way for AMs to control the lifetime of rd_amcache, but it would
result in a substantially more complex and invasive patch than this one,
and I'm unconvinced that it'd be worth the trouble.

Thoughts, objections?

regards, tom lane

Attachment Content-Type Size
cache-across-inserts-1.patch text/x-diff 28.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2017-02-07 23:40:03 Re: Press Release Draft - 2016-02-09 Cumulative Update
Previous Message Jonathan S. Katz 2017-02-07 22:52:11 Re: Press Release Draft - 2016-02-09 Cumulative Update