Re: hyrax vs. RelationBuildPartitionDesc

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: hyrax vs. RelationBuildPartitionDesc
Date: 2019-03-15 19:45:28
Message-ID: 10797.1552679128@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Mar 15, 2019 at 2:18 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> BTW, after closer study of 898e5e329 I have a theory as to why it
>> made things worse for CCA animals: it causes relcache entries to be
>> held open (via incremented refcounts) throughout planning, which
>> they were not before. This means that, during each of the many
>> RelationCacheInvalidate events that will occur during a planning
>> cycle, we have to rebuild those relcache entries; previously they
>> were just dropped once and that was the end of it till execution.

> Right. So it's just that we turned a bunch of rebuild = false
> situations into rebuild = true situations. I think.

More to the point, we turned *one* rebuild = false situation into
a bunch of rebuild = true situations. I haven't studied it closely,
but I think a CCA animal would probably see O(N^2) rebuild = true
invocations in a query with N partitions, since each time
expand_partitioned_rtentry opens another child table, we'll incur
an AcceptInvalidationMessages call which leads to forced rebuilds
of the previously-pinned rels. In non-CCA situations, almost always
nothing happens with the previously-examined relcache entries.
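To put a rough number on that, here is a standalone back-of-the-envelope sketch (not PostgreSQL code; the counter just stands in for the work a CLOBBER_CACHE_ALWAYS build forces onto already-pinned entries):

#include <stdio.h>

/*
 * Rough model of planning a query over nparts partitions under
 * CLOBBER_CACHE_ALWAYS: opening the i-th child table triggers an
 * AcceptInvalidationMessages call, which forces a rebuild of every
 * relcache entry that is already pinned, i.e. i-1 of them.
 */
int
main(void)
{
	int		nparts = 1000;
	long	rebuilds = 0;

	for (int i = 1; i <= nparts; i++)
		rebuilds += i - 1;		/* previously-pinned rels rebuilt */

	/* Total is nparts*(nparts-1)/2, i.e. O(N^2). */
	printf("%d partitions -> %ld forced rebuilds\n", nparts, rebuilds);
	return 0;
}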

>> I remain of the opinion that we ought not have PartitionDirectories
>> pinning relcache entries ... but this particular effect isn't really
>> significant to that argument, I think.

> Cool. I would still like to hear more about why you think that. I
> agree that the pinning is not great, but copying moderately large
> partitioning structures that are only likely to get more complex in
> future releases also does not seem great, so I think it may be the
> least of evils. You, on the other hand, seem to have a rather
> visceral hatred for it.

I agree that copying data isn't great. What I don't agree is that this
solution is better. In particular, copying data out of the relcache
rather than expecting the relcache to hold still over long periods
is the way we've done things everywhere else, cf RelationGetIndexList,
RelationGetStatExtList, RelationGetIndexExpressions,
RelationGetIndexPredicate, RelationGetIndexAttrBitmap,
RelationGetExclusionInfo, GetRelationPublicationActions. I don't care
for a patch randomly deciding to do things differently on the basis of an
unsupported-by-evidence argument that it might cost too much to copy the
data. If we're going to make a push to reduce the amount of copying of
that sort that we do, it should be a separately (and carefully) designed
thing that applies to all the relcache substructures that have the issue,
not one-off hacks that haven't been reviewed thoroughly.
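For clarity, the pattern all of those functions follow is: keep the authoritative copy in cache-lifetime storage, and hand callers a copy they own, so nothing depends on the cache entry holding still afterwards. A standalone sketch of that shape (hypothetical names, not the actual relcache code):

#include <stdlib.h>
#include <string.h>

/* Hypothetical cached entry: the cache owns "oids"; callers never see it. */
typedef struct CachedRel
{
	int		   *oids;			/* cache-lifetime storage */
	int			noids;
	int			valid;			/* cleared by invalidation */
} CachedRel;

/*
 * Accessor in the style of RelationGetIndexList: rebuild the cached data
 * if it was invalidated, then return a copy allocated for the caller.
 * The caller can keep using its copy even if the cache entry is rebuilt
 * or dropped later; no long-lived pin on the entry is needed.
 */
int *
cached_rel_get_oids(CachedRel *rel, int *noids)
{
	int		   *copy;

	if (!rel->valid)
	{
		/* ... recompute rel->oids / rel->noids from the catalogs ... */
		rel->valid = 1;
	}

	copy = malloc(rel->noids * sizeof(int));
	memcpy(copy, rel->oids, rel->noids * sizeof(int));
	*noids = rel->noids;
	return copy;				/* caller owns this */
}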

I especially don't care for the much-less-than-half-baked kluge of
chaining the old rd_pdcxt onto the new one and hoping that it will go away
at a suitable time. Yeah, that will work all right as long as it's not
stressed very hard, but we've found repeatedly that users will find a way
to overstress places where we're sloppy about resource management.
In this case, since the leak is potentially of session lifespan, I doubt
it's even very hard to cause unbounded growth in relcache space usage ---
you just have to do $whatever over and over. Let's see ...

regression=# create table parent (a text, b int) partition by list (a);
CREATE TABLE
regression=# create table child (a text, b int);
CREATE TABLE
regression=# do $$
regression$# begin
regression$# for i in 1..10000000 loop
regression$# alter table parent attach partition child for values in ('x');
regression$# alter table parent detach partition child;
regression$# end loop;
regression$# end $$;

After letting that run for a while, I've got enough rd_pdcxts to send
MemoryContextStats into a nasty loop.

CacheMemoryContext: 4259904 total in 11 blocks; 1501120 free (19 chunks); 2758784 used
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
partition descriptor: 1024 total in 1 blocks; 400 free (0 chunks); 624 used: parent
partition descriptor: 1024 total in 1 blocks; 680 free (0 chunks); 344 used: parent
... yadda yadda ...

The leak isn't terribly fast, since this is an expensive thing to be
doing, but it's definitely leaking.
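The shape of the accumulation is easy to model standalone (this just mimics the chaining; it is not the relcache code): each rebuild makes a fresh partition-descriptor context and hangs the old one under it instead of freeing it, so nothing goes away until the whole relcache entry does --- which, for a session-lifespan entry, is never.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for a memory context: a node that keeps its predecessor. */
typedef struct Ctx
{
	struct Ctx *old_child;		/* previous descriptor context, kept alive */
} Ctx;

int
main(void)
{
	Ctx		   *rd_pdcxt = NULL;	/* per-relcache-entry descriptor context */
	long		nctx = 0;

	/*
	 * Each ATTACH/DETACH cycle rebuilds the descriptor in a new context
	 * and chains the old context under the new one rather than freeing it.
	 */
	for (long i = 0; i < 1000000; i++)
	{
		Ctx		   *newctx = calloc(1, sizeof(Ctx));

		newctx->old_child = rd_pdcxt;	/* reparent-the-old-context analogue */
		rd_pdcxt = newctx;
		nctx++;
	}

	/* Nothing above ever freed a context: unbounded, session-lifespan growth. */
	printf("contexts still live: %ld\n", nctx);
	return 0;
}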

> That may be based on facts with which I am
> not acquainted or it may be that we have a different view of what good
> design looks like.

I don't think solving the same problem in multiple ways constitutes
good design, barring compelling reasons for the solutions being different,
which you haven't presented. Solving the same problem in multiple ways
some of which are buggy is *definitely* not good design.

regards, tom lane
