Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-10-12 21:01:27
Message-ID: 1389919.1697144487@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I wrote:
> Michael Paquier <michael(at)paquier(dot)xyz> writes:
>> FWIW, I'm having a hard time thinking about a reason that we should
>> not support SearchSysCacheExists()+lookup to be a valid pattern, even
>> if the cache is clobbered. I am pretty sure that there are other code
>> paths in the tree, not mentioned on this thread, that do exactly that
>> (haven't checked, but indexcmds.c is one coming in mind).

> Yeah. As I see it, there are several questions here.

> First, is the SearchSysCacheExists()+lookup pattern actually unsafe,
> that is is there a way to break it without any cache-clobbering debug
> options enabled? If so, there's no question we have to get rid of it.

I came back to this question today, and after more thought I'm going
to cast my vote for "let's not trust SearchSysCacheExists()+lookup".
It just seems like that's too fragile. The case reported in this
thread of it failing in parallel workers but not main should scare
anybody, and the future course of development seems far more likely
to introduce new problems than remove them.

I spent some time looking through existing SearchSysCacheExists calls,
and I could only find two sets of routines where we seem to be
depending on SearchSysCacheExists to protect a subsequent lookup
somewhere else, and there isn't any lock on the object in question.
Those are the has_foo_privilege functions discussed here, and the
foo_is_visible functions near the bottom of namespace.c. I'm not
sure why we've not heard complaints traceable to the foo_is_visible
family. Maybe nobody has tried hard to break them, or maybe they
are just less likely to be used in ways that are at risk.

It turns out to not be that hard to get rid of the problem in the
has_foo_privilege family, as per attached patches. I've not looked
at fixing the foo_is_visible family, but that probably ought to be a
separate patch anyway.

BTW, while nosing around I found what seems like a very nasty related
bug. Suppose that a catalog tuple being loaded into syscache contains
some toasted fields. CatalogCacheCreateEntry will flatten the tuple,
involving fetches from toast tables that will certainly cause
AcceptInvalidationMessages calls. What if one of those should have
invalidated this tuple? We will not notice, because it's not in
the hashtable yet. When we do add it, we will mark it not-dead,
meaning that the stale entry looks fine and could persist for a long
while.

Anyway, as to the attached patches: I split this into two parts
just to make it easier to review. 0001 deals with the functions
that use pg_class_aclcheck and related, while 0002 deals with the
ones that go through object_aclcheck. In both cases, we're
basically extending the API convention somebody added awhile ago
that foo_aclcheck and foo_aclmask should have extended versions
foo_aclcheck_ext and foo_aclmask_ext that take an "is_missing"
argument. That wasn't terribly well documented IMO, so 0001
starts out with massaging the comments to make it clearer.
I also changed some existing places in pg_attribute_aclmask_ext and
pg_attribute_aclcheck_all to conform to the is_missing convention,
rather than hard-wiring a decision that it's okay to ignore lookup
failure.

pg_attribute_aclcheck_all also had a most curious decision that
it could call pg_attribute_aclmask, which not only opens it right
back up to potential lookup failure but is quite inefficient,
requiring in three syscache lookups instead of one for each
column having a non-null attacl. It just takes a few more lines
to call aclmask() directly.

0001's changes in acl.c are straightforward, but it's worth noting
that the has_sequence_privilege functions hadn't gotten the memo
about not failing when a bogus relation OID is passed.

0002 is pretty straightforward as well, just adding an "_ext"
version of object_aclcheck/object_aclmask and using those
where appropriate.

It looks like 0001 could be back-patched as far as v14 without
too much trouble. Before that, there isn't pg_class_aclcheck_ext,
and I'm not sure it's worth the trouble to add. 0002 will only
work in HEAD and v16 because object_aclcheck wasn't there earlier.
I'm not sure whether to bother back-patching at all, given that
we've only seen this problem in test builds.

Comments?

regards, tom lane

Attachment Content-Type Size
v1-0001-fix-table-privilege-checks.patch text/x-diff 14.5 KB
v1-0002-fix-other-privilege-checks.patch text/x-diff 23.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2023-10-13 00:44:33 BUG #18155: Logical Apply Worker Timeout During TableSync Causes Either Stuckness or Data Loss
Previous Message Thomas Munro 2023-10-12 03:36:53 Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows