My client has been seeing regular instances of the following sort of problem:
03:06:09.453 exec_simple_query, postgres.c:900
03:06:12.042 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.042 RelationReloadIndexInfo, relcache.c:1740
03:06:12.042 INSERT INTO zzz_k(k) SELECT ...
03:06:12.045 00000: statement: ABORT
03:06:12.045 exec_simple_query, postgres.c:900
03:06:12.045 00000: duration: 0.100 ms
03:06:12.045 exec_simple_query, postgres.c:1128
03:06:12.046 00000: statement: INSERT INTO temp_807
03:06:12.046 exec_simple_query, postgres.c:900
03:06:12.046 XX000: could not find pg_class tuple for index 2662 at character 13
03:06:12.046 RelationReloadIndexInfo, relcache.c:1740
03:06:12.046 INSERT INTO temp_807
03:06:12.096 08P01: unexpected EOF on client connection
03:06:12.096 SocketBackend, postgres.c:348
03:06:12.096 XX000: could not find pg_class tuple for index 2662
03:06:12.096 RelationReloadIndexInfo, relcache.c:1740
03:06:12.121 00000: disconnection: session time: 0:06:08.537 user=ZZZ database=ZZZ_01
03:06:12.121 log_disconnections, postgres.c:4339
The above happens regularly (though not entirely predictably), coinciding
with a daily cron job that checks the catalogs for bloat and runs vacuum full
and/or reindex as needed. Since some of the applications make very heavy
use of temp tables, this usually means pg_class and pg_index get the vacuum
full and reindex treatment.
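For context, the relevant part of the job boils down to something like this
(illustrative only; the exact table list varies with what the bloat check flags):

```sql
-- Sketch of the nightly catalog maintenance; pg_class and pg_index
-- are the usual candidates given the temp-table churn.
VACUUM FULL pg_catalog.pg_class;
VACUUM FULL pg_catalog.pg_index;
REINDEX TABLE pg_catalog.pg_class;
REINDEX TABLE pg_catalog.pg_index;
```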
Sometimes queries will fail because a table's underlying file cannot be
opened. On investigation the file is absent from the filesystem and has no
matching entry in the catalogs, so I don't know what table it refers to:
20:41:19.063 ERROR: could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.063 STATEMENT: insert into r_ar__30
select aid, mid, pid, sum(wdata) as wdata, ...
20:41:19.430 ERROR: could not open file "pg_tblspc/16401/PG_9.0_201008051/16413/1049145092": No such file or directory
20:41:19.430 STATEMENT: SELECT nextval('j_id_seq')
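For reference, such a path decomposes as pg_tblspc/&lt;tablespace OID&gt;/&lt;catalog
version directory&gt;/&lt;database OID&gt;/&lt;relfilenode&gt;[.&lt;segment&gt;], which is what I
matched against pg_class.relfilenode when checking the catalogs. A throwaway
helper along these lines (illustrative only, not part of our tooling):

```python
# Decode a PostgreSQL relation file path into its components so the
# missing file can be matched against pg_class.relfilenode.
# Tablespace layout: pg_tblspc/<tblspc_oid>/<version_dir>/<db_oid>/<relfilenode>[.<segno>]
# Default tablespace: base/<db_oid>/<relfilenode>[.<segno>]

def parse_relation_path(path):
    parts = path.split("/")
    if parts[0] == "pg_tblspc":
        tblspc_oid, _version_dir, db_oid, filename = parts[1], parts[2], parts[3], parts[4]
    elif parts[0] == "base":
        tblspc_oid, db_oid, filename = None, parts[1], parts[2]
    else:
        raise ValueError("unrecognized relation path: %s" % path)
    relfilenode, _, segno = filename.partition(".")
    return {
        "tablespace_oid": int(tblspc_oid) if tblspc_oid else None,  # None = default tablespace
        "database_oid": int(db_oid),
        "relfilenode": int(relfilenode),
        "segment": int(segno) if segno else 0,
    }

info = parse_relation_path("pg_tblspc/16401/PG_9.0_201008051/16413/1049145092")
# info["relfilenode"] is then the value to look up in pg_class.relfilenode;
# in our case that lookup returned no rows.
```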
Finally, I have seen several instances of vacuum full itself failing to
read data:
03:05:45.699 00000: statement: vacuum full pg_catalog.pg_index;
03:05:45.699 exec_simple_query, postgres.c:900
03:05:46.142 XX001: could not read block 65 in file "pg_tblspc/16401/PG_9.0_201008051/16416/1049146489": read only 0 of 8192 bytes
03:05:46.142 mdread, md.c:656
03:05:46.142 vacuum full pg_catalog.pg_index;
This occurs on PostgreSQL 9.0.4 on 32-core, 512GB Dell boxes. We have
identical systems still running 8.4.8 that do not have this issue, so I'm
assuming it is related to the vacuum full rework done for 9.0. Oddly, we don't
see this on the smaller hosts (8 cores, 64GB, slower CPUs) running 9.0.4,
so it may be timing related.
This seems possibly related to the issues discussed in these earlier threads:
Bizarre buildfarm failure on baiji: can't find pg_class_oid_index
Broken HOT chains in system catalogs
As far as I can tell from the logs I have, once a session sees one of these
errors, every subsequent query in that session hits it again until the session
exits. However, it does not seem to harm other sessions or leave any
persistent damage (crossing fingers and hoping here).
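If it helps, a stripped-down reproduction attempt along these lines is what
I'd try (session count, statement mix, and database name are placeholders,
not our actual workload):

```shell
#!/bin/sh
# Hypothetical reproduction sketch: churn temp tables in several sessions
# while VACUUM FULL rewrites pg_class, and watch for failing statements.
DB=ZZZ_01
for i in 1 2 3 4 5 6 7 8; do
  (
    for j in $(seq 1 1000); do
      psql -X -d "$DB" -c \
        "CREATE TEMP TABLE tt(k int);
         INSERT INTO tt SELECT generate_series(1, 100);
         DROP TABLE tt;" >/dev/null 2>&1 || echo "session $i: statement failed"
    done
  ) &
done
# Meanwhile, rewrite the catalog the temp-table churn depends on.
psql -X -d "$DB" -c "VACUUM FULL pg_catalog.pg_class; REINDEX TABLE pg_catalog.pg_class;"
wait
```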
I'm ready to do any testing/investigation/instrumented builds etc that may be
helpful in resolving this.
David Gould daveg(at)sonic(dot)net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.