Re: WAL logging problem in 9.4.3?

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-09 16:52:36
Message-ID: CAHGQGwHGMeqA7PbRj9e-cgA7-Sy09c+Ysyy=6Ts=vLOynpU2Hg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
>>> The COUNT() correctly says 11 rows, but after crash-and-recover,
>>> only the row with -1 is there. This is because the INSERT writes
>>> out an INSERT+INIT WAL record, which we happily replay, clobbering
>>> the data added later by COPY.
>
>> ISTM any WAL logged action that touches a relfilenode essentially needs
>> to disable further optimization based on the knowledge that the relation
>> is new.
>
> After a bit more thought, I think it's not so much "any WAL logged action"
> as "any unconditionally-replayed action". INSERT+INIT breaks this
> example because heap_xlog_insert will unconditionally replay the action,
> even if the page is valid and has same or newer LSN. Similarly, TRUNCATE
> is problematic because we redo it unconditionally (and in that case it's
> hard to see an alternative).
>
>> It'd not be impossible to add more state to the relcache entry for the
>> relation. Whether it's likely that we'd find all the places that'd need
>> updating that state, I'm not sure.
>
> Yeah, the sticking point is mainly being sure that the state is correctly
> tracked, both now and after future changes. We'd need to identify a state
> invariant that we could be pretty confident we'd not break.
>
> One idea I had was to allow the COPY optimization only if the heap file is
> physically zero-length at the time the COPY starts.

This doesn't seem to help in the case where TRUNCATE is executed
before COPY, does it?
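
To spell out the concern (a sketch; per your note upthread that we
redo TRUNCATE unconditionally):

BEGIN
CREATE TABLE
TRUNCATE   -- emits a TRUNCATE record, which is replayed unconditionally
COPY       -- the heap file is zero-length at this point, so the
           -- zero-length check would still permit skipping WAL

After a crash, replaying the TRUNCATE record empties the file again,
losing the COPY'd data that was never WAL-logged.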

> That would still be
> able to optimize in all the cases we care about making COPY fast for.
> Rather than reverting cab9a0656c36739f, which would re-introduce a
> different performance problem, perhaps we could have COPY create a new
> relfilenode when it does this. That should be safe if the table was
> previously empty.

So, if COPY is executed multiple times in the same transaction,
only the first COPY can be optimized?
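
That is, with the zero-length check the second COPY below would have
to be WAL-logged:

BEGIN
CREATE TABLE   -- new relfilenode; heap file is zero-length
COPY           -- file is zero-length, so this COPY can be optimized
COPY           -- file is no longer zero-length, so this one cannot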

On second thought, I'm thinking that we can safely optimize
COPY if no problematic WAL records like INSERT+INIT or TRUNCATE
have been generated since the current REDO location, or the table
was created in the same transaction. That is, even if INSERT or
TRUNCATE is executed after the table creation, we don't need to
WAL-log COPY provided a CHECKPOINT happens in between, because
the subsequent crash recovery will not replay such problematic
WAL records. So the example cases where we can optimize COPY are:

BEGIN
CREATE TABLE
COPY
COPY -- subsequent COPY also can be optimized

BEGIN
CREATE TABLE
TRUNCATE
CHECKPOINT
COPY

BEGIN
CREATE TABLE
INSERT
CHECKPOINT
COPY

Crash recovery can start from the previous REDO location (i.e.,
the REDO location of the last checkpoint record). So we might need
to check whether such problematic WAL records have been generated
since the previous REDO location rather than the current one.
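
Concretely, if recovery were to start from the earlier REDO location,
the third case above would become unsafe:

BEGIN
CREATE TABLE
INSERT       -- INSERT+INIT record
CHECKPOINT   -- the current REDO location now follows the INSERT
COPY         -- looks safe when checked against the current REDO location

If replay begins from the REDO location before that checkpoint, the
INSERT+INIT record is replayed after all, clobbering the COPY'd data.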

Regards,

--
Fujii Masao
