Re: hyrax vs. RelationBuildPartitionDesc

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: hyrax vs. RelationBuildPartitionDesc
Date: 2019-04-14 19:29:26
Message-ID: 27380.1555270166@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I agree that copying data isn't great. What I don't agree is that this
>> solution is better.

> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex. You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it, it
> must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."

I did not say that. What I did say is that they're both correct
solutions. Copying large data structures is clearly a potential
performance problem, but that doesn't mean we can take correctness
shortcuts.

> If we want an at-least-somewhat unified solution here, I think we need
> to bite the bullet and make the planner hold a reference to the
> relcache throughout planning. (The executor already keeps it open, I
> believe.) Otherwise, how is the relcache supposed to know when it can
> throw stuff away and when it can't?

The real problem with that is that we *still* won't know whether it's
okay to throw stuff away or not. The issue with these subsidiary
data structures is exactly that we're trying to reduce the lock strength
required for changing one of them, and as soon as you make that lock
strength less than AEL, you have the problem that that value may need
to change in relcache entries despite them being pinned. The code I'm
complaining about is trying to devise a shortcut solution to exactly
that situation ... and it's not a good shortcut.

> The only alternative seems to be to have each subsystem hold its own
> reference count, as I did with the PartitionDirectory stuff, which is
> not something we'd want to scale out.

Well, we clearly don't want to devise a separate solution for every
subsystem, but it doesn't seem out of the question to me that we could
build a "reference counted cache sub-structure" mechanism and apply it
to each of these relcache fields. Maybe it could be unified with the
existing code in the typcache that does a similar thing. Sure, this'd
be a fair amount of work, but we've done it before. Syscache entries
didn't use to have reference counts, for example.

BTW, the problem I have with the PartitionDirectory stuff is exactly
that it *isn't* a reference-counted solution. If it were, we'd not
have this problem of not knowing when we could free rd_pdcxt.

>> I especially don't care for the much-less-than-half-baked kluge of
>> chaining the old rd_pdcxt onto the new one and hoping that it will go away
>> at a suitable time.

> It will go away at a suitable time, but maybe not at the soonest
> suitable time. It wouldn't be hard to improve that, though. The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there. If there already is one there, parent
> the new one under it. When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.

Meh. While it seems likely that that would mask most practical problems,
it still seems like covering up a wart with a dirty bandage. In
particular, that fix doesn't fix anything unless relcache reference counts
go to zero pretty quickly --- which I'll just note is directly contrary
to your enthusiasm for holding relcache pins longer.

I think that what we ought to do for v12 is have PartitionDirectory
copy the data, and then in v13 work on creating real reference-count
infrastructure that would allow eliminating the copy steps with full
safety. The $64 question is whether that really would cause unacceptable
performance problems. To look into that, I made the attached WIP patches.
(These are functionally complete, but I didn't bother for instance with
removing the hunk that 898e5e329 added to relcache.c, and the comments
need work, etc.) The first one just changes the PartitionDirectory
code to do that, and then the second one micro-optimizes
partition_bounds_copy() to make it somewhat less expensive, mostly by
collapsing lots of small palloc's into one big one.

What I get for test cases like [1] is

single-partition SELECT, hash partitioning:

   N     tps, HEAD       tps, patch
   2     11426.243754    11448.615193
   8     11254.833267    11374.278861
  32     11288.329114    11371.942425
 128     11222.329256    11185.845258
 512     11001.177137    10572.917288
1024     10612.456470     9834.172965
4096      8819.110195     7021.864625
8192      7372.611355     5276.130161

single-partition SELECT, range partitioning:

   N     tps, HEAD       tps, patch
   2     11037.855338    11153.595860
   8     11085.218022    11019.132341
  32     10994.348207    10935.719951
 128     10884.417324    10532.685237
 512     10635.583411     9578.108915
1024     10407.286414     8689.585136
4096      8361.463829     5139.084405
8192      7075.880701     3442.542768

Now certainly these numbers suggest that avoiding the copy could be worth
our trouble, but these results are still several orders of magnitude
better than where we were two weeks ago [2]. Plus, this is an extreme
case that's not really representative of real-world usage, since the test
tables have neither indexes nor any data. In practical situations the
baseline would be lower and the dropoff less bad. So I don't feel bad
about shipping v12 with these sorts of numbers and hoping for more
improvement later.

regards, tom lane

[1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us

[2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24

Attachment                                   Content-Type   Size
copy-partdescs.patch                         text/x-diff    3.1 KB
micro-optimize-partition_bounds_copy.patch   text/x-diff    3.5 KB
