Re: can't start postgresql

From: Igor Shevchenko <igor(at)carcass(dot)ath(dot)cx>
To: pgsql-general(at)postgresql(dot)org, scotty(at)linuxtime(dot)it, Matthias(dot)Pitzl(at)izb(dot)de
Subject: Re: can't start postgresql
Date: 2006-11-14 19:26:02
Message-ID: 200611142126.02621.igor@carcass.ath.cx
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tuesday 14 November 2006 17:34, Alvaro Herrera wrote:
> Matthias(dot)Pitzl(at)izb(dot)de wrote:
> > He's just trying to do what you told him.
> > But the use of initdb won't correct the problem Igor got. There's
> > something
> >
> > wrong with his database so that he can't start the postmaster process:
> > > PANIC: failed to re-find parent key in "23724"
> >
> > Unfortunately i don't have any idea how to solve this issue :(
>
> I'd think starting a standalone backend and issuing a "reindex database"
> should be enough to get him started. Now, the problem is figuring _how_
> the index got in that state; or even _what_ index is the problematic
> one. (I think it would be possible to find out by setting the
> "log_error_verbosity" parameter to "verbose").

The problem is, it won't start even in the standalone mode. It prints the same
error message and dies. I was trying to start it connected to my main db,
template0, template1 and postgres, no change. The index corruption (assuming
this IS an index corruption) must've happened with system table(s). Is there
any way to diagose that ? Maybe reindex a specific table without starting
postmaster, or remove certain index files physically (e.g. for pg_class,
etc) ?

> If it's a hardware problem, one would think it deserves some diagnosis.
> It could be the btree bug Tom fixed last week, but I'm not sure if this
> is really a consequence of it.

Can't rule out an HW problem. I'd think this IS an HW problem, but we've
started using an updated version of our software on this server a week ago,
and it creates/drops alot of tables, temporary and not (kind of in-house
table partitioning), so this could be an (exceptionally rare) pgsql bug with
pg_class / pg_index / etc handling. A side question: is it safe to
create/drop tables from pl/pgsql functions, called directly or via a
trigger ?

Our previous version of the software was using a light version of what we're
trying to do now, and there was / maybe still a problem:

We're processing data in chunks, and each chunk is completely separate from
others, so using a special table per chunk seems to be a natural thing to do.
We keep these in a separate tablespace "w". There's one non-unique
(integer,varchar) index per each such table; both fields are non-null. Both
table and index are created via a pl/pgsql function. An index is created
after the data import, during the same transaction:

a) begin
b) select create_w_table(jobid,false) -- false means don't create index
c) insert into w.tableX select *....
d) select create_w_table(jobid,true) -- true means create index
e) commit

Both table and index are dropped at the same time, using another pl/pgsql
function.

create_w_table is rarely called from yet another pl/pgsql function.

Everything's working fine, except that sometimes an index can, um, "go off the
rails":

* not listed by "psql"'s \d w.tableX
* queries are not using it
* it's still listed in the pg_class (seen by "select * from pg_class where
relname='tblX_idx'")
* a corresponding "pg_index" row references a non-existing pg_class entry via
pg_index.indrelid

I'm seeing this several times a week, for over 5 months now. Here's my fix
procedure:

* fetch a list of probematic indexes:

select pc.relname,pc.oid,pc_idx.oid from pg_class pc left join pg_class pc_idx
on (pc_idx.relname=pc.relname || '_idx') where pc.relname ~ '^table\\d+$' and
(pc_idx.oid is NULL OR not exists (select 1 from pg_index where
pg_index.indexrelid=pc_idx.oid and indrelid=pc.oid))

* fix each with:

update pg_index set indrelid=${right one from pg_class} where
indexrelid=${index oid from pg_class};
update pg_class set relhasindex='t' where oid=${table oid}; -- AFAICR this is
redundant, as relhasindex is still true

now I can drop this index (usually it's out of sync), vacuum the table and
recreate index:

drop index w.tableX_idx;
vacuum full verbose w.tableX;
create index tableX_idx on w.tableX(...);
analyze w.tableX;

I've seen this on pgsql versions from v8.1.2 till v8.1.4, can't confirm with
v8.1.5 yet. Autovacuum's on; daily "vacuum verbose analyze" and "reindex" are
ran over pg_class, pg_index, a bunch of our own small tables, and recently
over pg_depend, pg_type and pg_statistic.

--
Best Regards,
Igor Shevchenko

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jim C. Nasby 2006-11-14 19:34:50 Re: database dump then restore on another system?
Previous Message Jim C. Nasby 2006-11-14 19:24:15 Re: PostgreSQL Mhash functions