Re: WAL logging problem in 9.4.3?

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 08:50:33
Message-ID: 559F8759.2090401@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/10/2015 02:06 AM, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>> different performance problem, perhaps we could have COPY create a new
>>> relfilenode when it does this. That should be safe if the table was
>>> previously empty.
>
>> I'm not convinced that cab9a0656c36739f needs to survive in that
>> form. To me only allowing one COPY to benefit from the wal_level =
>> minimal optimization has a significantly higher cost than
>> cab9a0656c36739f.
>
> What evidence have you got to base that value judgement on?
>
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction.

Yeah, if we specifically made that case cheap, in response to a
complaint, it would be a regression to make it expensive again. We might
get away with it in a major version, but would hate to backpatch that.

>> My tentative guess is that the best course is to
>> a) Make heap_truncate_one_rel() create a new relfeilnode. That fixes the
>> truncation replay issue.
>> b) Force new pages to be used when using the heap_sync mode in
>> COPY. That avoids the INIT danger you found. It seems rather
>> reasonable to avoid using pages that have already been the target of
>> WAL logging here in general.
>
> And what reason is there to think that this would fix all the problems?
> We know of those two, but we've not exactly looked hard for other cases.

Hmm. Perhaps that could be made to work, but it feels pretty fragile.
For example, you could have an insert trigger on the table that inserts
additional rows to the same table, and those inserts would be intermixed
with the rows inserted by COPY. You'll have to avoid that somehow.
Full-page images in general are a problem. If a checkpoint happens, and
a trigger modifies the page we're COPYing to in any way, you have the
same issue. Even reading a page can cause a full-page image of it to be
written: If you update a hint bit on the page while reading it, and
checksums are enabled, and a checkpoint happened since the page was last
updated, bang. I don't think that's a problem in this case because there
are no hint bits to be set on pages that we're COPYing to, but it's a
whole new subtle assumption.

I think we should
1. reliably and explicitly keep track of whether we've WAL-logged any
TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations
on the relation, and
2. make sure we never skip WAL-logging again if we have.

Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
when a new relfilenode is created, i.e. whenever rd_createSubid or
rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
(including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
only skip WAL-logging if the flag is still set. To deal with the case
that the flag gets cleared in the middle of COPY, also check the flag
whenever we're about to skip WAL-logging in heap_insert, and if it's
been cleared, ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.

Compared to the status quo, that disables the WAL-skipping optimization
in the scenario where you CREATE, INSERT, then COPY to a table in the
same transaction. I think that's acceptable.

(Alternatively, to handle the case that the flag gets cleared in the
middle of COPY, add another flag to RelationData indicating that a
WAL-skipping COPY is in-progress, and refrain from WAL-logging any
FPW-writing operations on the table when it's set (or any operations
whatsoever). That'd be more efficient, but it's such a rare corner case
that it hardly matters.)

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-07-10 08:59:32 Re: WAL logging problem in 9.4.3?
Previous Message Sawada Masahiko 2015-07-10 08:49:10 Re: Freeze avoidance of very large table.