Re: Shaky coding for vacuuming partitioned relations

From: Noah Misch <noah(at)leadboat(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: pgsql-hackers(at)postgreSQL(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Shaky coding for vacuuming partitioned relations
Date: 2017-09-28 05:31:23
Message-ID: 20170928053123.GD37725@rfd.leadboat.com
Lists: pgsql-hackers

On Fri, Sep 22, 2017 at 03:13:10PM -0400, Tom Lane wrote:
> Somebody inserted this into vacuum.c's get_rel_oids():
>
>     tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
>     if (!HeapTupleIsValid(tuple))
>         elog(ERROR, "cache lookup failed for relation %u", relid);
>
> apparently without having read the very verbose comment two lines above,
> which points out that we're not taking any lock on the target relation.
> So, if that relation is concurrently being dropped, you're likely to
> get "cache lookup failed for relation NNNN" rather than anything more
> user-friendly.
>
> A minimum-change fix would be to replace the elog() with an ereport
> that produces the same "relation does not exist" error you'd have
> gotten from RangeVarGetRelid, had the concurrent DROP TABLE committed
> a few microseconds earlier. But that feels like it's band-aiding
> around the problem.
>
> What I'm wondering about is changing the RangeVarGetRelid call to take
> ShareUpdateExclusiveLock rather than no lock. That would protect the
> syscache lookup, and it would also make the find_all_inheritors call
> a lot more meaningful.
>
> If we're doing a VACUUM, the ShareUpdateExclusiveLock would be dropped
> as soon as we close the caller's transaction, and then we'd acquire
> the same or stronger lock inside vacuum_rel(). So that seems fine.
> If we're doing an ANALYZE, then the lock would continue to be held
> and analyze_rel would merely be acquiring it an extra time, so we'd
> actually be removing a race-condition failure scenario for ANALYZE.
> This would mean a few more cycles in lock management, but since this
> only applies to a manual VACUUM or ANALYZE that specifies a table
> name, I'm not too concerned about that.
>
> Thoughts?
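
For readers following the quoted proposal, the two options can be contrasted with a toy model. This is only a sketch under stated assumptions: ToyRel, toy_concurrent_drop and toy_lookup are hypothetical names, not PostgreSQL APIs, and a boolean flag stands in for ShareUpdateExclusiveLock. In the real system a conflicting DROP TABLE would block and wait on the lock rather than return immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: none of these names are PostgreSQL APIs. */
typedef struct ToyRel
{
    const char *name;
    bool        exists;       /* false once a DROP has committed */
    bool        vacuum_lock;  /* stands in for ShareUpdateExclusiveLock */
} ToyRel;

/*
 * Simulated concurrent DROP TABLE.  It cannot proceed while the
 * stand-in lock is held (the real DROP would wait instead of failing).
 * Returns true if the relation was dropped.
 */
static bool
toy_concurrent_drop(ToyRel *rel)
{
    assert(rel != NULL);
    if (rel->vacuum_lock)
        return false;
    rel->exists = false;
    return true;
}

/*
 * Simulated lookup in the style of the minimum-change fix: a vanished
 * relation yields a user-facing "does not exist" message instead of an
 * internal "cache lookup failed" one.  Returns true if the relation is
 * still there; msg is set to the empty string in that case.
 */
static bool
toy_lookup(const ToyRel *rel, char *msg, size_t msglen)
{
    assert(rel != NULL && msg != NULL && msglen > 0);
    if (!rel->exists)
    {
        snprintf(msg, msglen, "relation \"%s\" does not exist", rel->name);
        return false;
    }
    msg[0] = '\0';
    return true;
}
```

Without the stand-in lock, the drop wins the race and the lookup can only report the error more politely. With the lock taken first, the drop cannot slip in between name resolution and the lookup, which is the point of the second proposal.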

This thread now has two open items, both of them pertaining to VACUUM error
messages involving partitioning. The pair is probably best treated as a
single open item.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
