Re: DISCARD TEMP results in "ERROR: cache lookup failed for type 0"

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manuel Rigger <rigger(dot)manuel(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: DISCARD TEMP results in "ERROR: cache lookup failed for type 0"
Date: 2019-07-05 22:22:26
Message-ID: 4709.1562365346@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Manuel Rigger <rigger(dot)manuel(at)gmail(dot)com> writes:
> CREATE TEMP TABLE t0(c0 INT GENERATED ALWAYS AS IDENTITY) PARTITION BY
> HASH((t0.c0));
> VACUUM FULL;
> DISCARD TEMP; -- unexpected: ERROR: cache lookup failed for type 0

Blech. The proximate cause is fairly obvious from poking around with a
debugger:

#0 errfinish (dummy=0) at elog.c:414
#1 0x00000000008a2694 in elog_finish (elevel=<value optimized out>,
fmt=<value optimized out>) at elog.c:1376
#2 0x00000000008898ae in get_typlenbyvalalign (typid=0, typlen=0x2d42118,
typbyval=0x2d42138, typalign=0x2d42158 "") at lsyscache.c:2057
#3 0x000000000088c2e0 in RelationBuildPartitionKey (relation=0x7f727c583a80)
at partcache.c:233
get_typlenbyvalalign(key->parttypid[i], ...);
#4 0x00000000008957a7 in RelationBuildDesc (
targetRelId=<value optimized out>, insertIt=true) at relcache.c:1189
#5 0x0000000000897ce6 in RelationIdGetRelation (relationId=53072)
at relcache.c:1956
#6 0x00000000004ca6b0 in relation_open (relationId=53072,
lockmode=<value optimized out>) at heapam.c:1135
#7 0x0000000000535c89 in heap_drop_with_catalog (relid=53072) at heap.c:1828
#8 0x0000000000533074 in doDeletion (object=0x2d4dfe4, depRel=0x7ffd1d23b3d8,
flags=<value optimized out>) at dependency.c:1134
#9 deleteOneObject (object=0x2d4dfe4, depRel=0x7ffd1d23b3d8,
flags=<value optimized out>) at dependency.c:1036
#10 0x000000000053310f in deleteObjectsInList (targetObjects=0x2c367b8,
depRel=0x7ffd1d23b3d8, flags=29) at dependency.c:260
#11 0x0000000000533328 in performDeletion (object=0x7ffd1d23b410,
behavior=DROP_CASCADE, flags=29) at dependency.c:341
#12 0x000000000053d283 in RemoveTempRelations () at namespace.c:4140
...
(gdb) f 3
(gdb) p *relation->rd_att
$2 = {natts = 1, tdtypeid = 53074, tdtypmod = -1, tdhasoid = false,
tdrefcount = 1, constr = 0x0, attrs = 0x7f727c6f68e0}
(gdb) p *relation->rd_att->attrs
$3 = {attrelid = 53072, attname = {
data = "........pg.dropped.1........", '\000' <repeats 35 times>},
atttypid = 0, attstattarget = 0, attlen = 4, attnum = 1, attndims = 0,
attcacheoff = 0, atttypmod = -1, attbyval = true, attstorage = 112 'p',
attalign = 105 'i', attnotnull = false, atthasdef = false,
atthasmissing = false, attidentity = 97 'a', attisdropped = true,
attislocal = true, attinhcount = 0, attcollation = 0}

So we're getting the failure because, when we come to delete the whole
temp table, its partitioning column has already been dropped, causing
RelationBuildPartitionKey (and hence any relation-opening operation)
to fail. That means that order-of-operations in dependency.c is at
the root of the issue.

The relevant pg_depend entries are (your OIDs will vary):

# select objid, pg_describe_object(classid,objid,objsubid) as obj,
pg_describe_object(refclassid,refobjid,refobjsubid) as ref, deptype from
pg_depend where ...;

objid | obj | ref | deptype
-------+--------------------+-----------------------+---------
53071 | type t0_c0_seq | sequence t0_c0_seq | i
53070 | sequence t0_c0_seq | schema pg_temp_3 | n
53074 | type t0 | table t0 | i
53073 | type t0[] | type t0 | i
53072 | table t0 | schema pg_temp_3 | n
53070 | sequence t0_c0_seq | column c0 of table t0 | i
(6 rows)

deleteObjectsInList's target list is

(gdb) p *targetObjects
$1 = {refs = 0x2d87450, extras = 0x2d2c8f8, numrefs = 7, maxrefs = 32}
(gdb) p targetObjects->refs[0]
$2 = {classId = 1247, objectId = 53071, objectSubId = 0}
(gdb) p targetObjects->refs[1]
$3 = {classId = 1259, objectId = 53070, objectSubId = 0}
(gdb) p targetObjects->refs[2]
$4 = {classId = 1259, objectId = 53072, objectSubId = 1}
(gdb) p targetObjects->refs[3]
$5 = {classId = 1247, objectId = 53073, objectSubId = 0}
(gdb) p targetObjects->refs[4]
$6 = {classId = 1247, objectId = 53074, objectSubId = 0}
(gdb) p targetObjects->refs[5]
$7 = {classId = 1259, objectId = 53072, objectSubId = 0}
(gdb) p targetObjects->refs[6]
$8 = {classId = 2615, objectId = 16982, objectSubId = 0}

So what's evidently happening is that it lands on the sequence first
and follows that to column c0, making a deletion target entry (refs[2])
for that column, and only afterwards finds the whole table and makes a
deletion target entry for the table (refs[5]). The behavior is dependent
on the order of entries in pg_depend, so you might or might not see it
(and that's why the VACUUM FULL contributes; it's changing the entry
order).

The example doesn't reproduce in v12/HEAD, but I fear that that is only a
cosmetic side-effect of commit f1ad067fc (Sort the dependent objects
before recursing in findDependentObjects()). [ pokes at it... ]
Yup, you can break it in HEAD too, if you arrange for the sequence
and the table to have different relative OIDs:

regression=# create temp table t0(c0 int not null) partition by range(c0);
CREATE TABLE
regression=# alter table t0 alter column c0 add generated always as identity;
ALTER TABLE
regression=# select objid, pg_describe_object(classid,objid,objsubid) as obj, pg_describe_object(refclassid,refobjid,refobjsubid) as ref, deptype from pg_depend where ...
objid | obj | ref | deptype
-------+--------------------+-----------------------+---------
43530 | type t0 | table t0 | i
43529 | type t0[] | type t0 | i
43528 | table t0 | schema pg_temp_3 | n
43534 | type t0_c0_seq | sequence t0_c0_seq | i
43533 | sequence t0_c0_seq | schema pg_temp_3 | n
43533 | sequence t0_c0_seq | column c0 of table t0 | i
(6 rows)

regression=# discard temp;
psql: ERROR: cache lookup failed for type 0

ISTM the real fix here probably involves having some explicit
dependencies between the table and its partitioning columns,
rather than relying on this check:

regression=# alter table t0 drop column c0;
psql: ERROR: cannot drop column named in partition key
regression=# \errverbose
psql: error: ERROR: 42P16: cannot drop column named in partition key
LOCATION: ATExecDropColumn, tablecmds.c:7041

which is evidently not even in the right place to guard this with
any security. With an explicit dependency, we could force a drop
of a partitioning column to be turned into a drop of the whole table.
Without one, well, you can break it really easily:

regression=# create domain d1 as int;
CREATE DOMAIN
regression=# create table t1 (f1 d1) partition by range(f1);
CREATE TABLE
regression=# drop domain d1 cascade;
psql: NOTICE: drop cascades to column f1 of table t1
DROP DOMAIN
regression=# \d t1
psql: ERROR: cache lookup failed for type 0

However, we can only apply a fix like that in HEAD and maybe v12;
it seems too late to be adding such dependencies in v10/v11.
Not sure what to do about those branches.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Manuel Rigger 2019-07-05 22:25:26 Re: VACUUM FULL results in deadlock
Previous Message Peter Geoghegan 2019-07-05 22:14:31 Re: BUG #15896: pg_upgrade from 10-or-earlier: TRAP: FailedAssertion(»!(metad->btm_version >= 3)«