Quick Links

Re: Why copy_relation_data only use wal when WALarchiving is enabled

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Jacky Leng <lengjianquan(at)163(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Why copy_relation_data only use wal when WALarchiving is enabled
Date:	2007-10-17 13:59:43
Message-ID:	4716154F.3030001@phlo.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers pgsql-patches

Simon Riggs wrote:
> On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>>>> Second, suppose that no checkpoint has occured during the upper
>>>> series--authough not quite possible;
>>> That part is irrelevant. It's forced out to disk and doesn't need
>>> recovery, with or without the checkpoint.
>>>
>>> There's no hole that I can see.
>> No, Jacky is right. The same problem exists at least with CLUSTER, and I
>> think there's other commands that rely on immediate fsync as well.
>>
>> Attached is a shell script that demonstrates the problem on CVS HEAD with
>> CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is
>> dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to
>> get the same relfilenode that T1 had. Then we crash the server, forcing a
>> WAL replay. After that, T2 is empty. Oops.
>>
>> Unfortunately I don't see any easy way to fix it.
>
> So, what you are saying is that re-using relfilenodes can cause problems
> during recovery in any command that alters the relfilenode of a relation?

For what I understand, I'd say that creating a relfilenode *and* subsequently
inserting data without WAL-logging causes the problem. If the relfilenode was
recently deleted, the inserts might be effectively undone upon recovery (because
we first replay the delete), but later *not* redone (because we didn't WAL-log
the inserts).

That brings me to another idea from a fix that is less heavyweight than my
previous checkpoint-before-commit suggestion.

We could make relfilenodes globally unique if we added the xid and epoch of the
creating transaction to the filename. Those are 64 bits, so if we encode them
in base 36 (using A-Z,0-9), that'd increase the length of the filenames by 13.

regards, Florian Pflug

In response to

Re: Why copy_relation_data only use wal when WALarchiving is enabled at 2007-10-17 13:42:45 from Simon Riggs

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Heikki Linnakangas	2007-10-17 14:02:17	Re: Why copy_relation_data only use wal whenWALarchiving is enabled
Previous Message	Kevin Grittner	2007-10-17 13:57:04	Re: [HACKERS] CVS and Eclipse

Browse pgsql-patches by date

	From	Date	Subject
Next Message	Heikki Linnakangas	2007-10-17 14:02:17	Re: Why copy_relation_data only use wal whenWALarchiving is enabled
Previous Message	Simon Riggs	2007-10-17 13:42:45	Re: Why copy_relation_data only use wal when WALarchiving is enabled