Re: WAL logging problem in 9.4.3?

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 10:38:50
Message-ID: 559FA0BA.3080808@iki.fi
Lists: pgsql-hackers

On 07/10/2015 12:14 PM, Andres Freund wrote:
> On 2015-07-10 11:50:33 +0300, Heikki Linnakangas wrote:
>> On 07/10/2015 02:06 AM, Tom Lane wrote:
>>> cab9a0656c36739f was based on an actual user complaint, so we have good
>>> evidence that there are people out there who care about the cost of
>>> truncating a table many times in one transaction.
>>
>> Yeah, if we specifically made that case cheap, in response to a complaint,
>> it would be a regression to make it expensive again. We might get away with
>> it in a major version, but would hate to backpatch that.
>
> Sure. But making COPY slower would also be one. Of a longer standing
> behaviour, with massively bigger impact if somebody relies on it? I mean
> a new relfilenode includes a couple heap and storage options. Missing
> the skip wal optimization can easily double or triple COPY durations.

Completely disabling the skip-WAL optimization is not acceptable either,
IMO. It's a false dichotomy that we have to choose between those two
options. We'll have to consider the exact scenarios where we'd have to
disable the optimization vs. using a new relfilenode.

>>>> My tentative guess is that the best course is to
>>>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
>>>> truncation replay issue.
>>>> b) Force new pages to be used when using the heap_sync mode in
>>>> COPY. That avoids the INIT danger you found. It seems rather
>>>> reasonable to avoid using pages that have already been the target of
>>>> WAL logging here in general.
>>>
>>> And what reason is there to think that this would fix all the problems?
>>> We know of those two, but we've not exactly looked hard for other cases.
>>
>> Hmm. Perhaps that could be made to work, but it feels pretty fragile.
>
> It does. I'm not very happy about this mess.
>
>> For
>> example, you could have an insert trigger on the table that inserts
>> additional rows to the same table, and those inserts would be intermixed
>> with the rows inserted by COPY.
>
> That should be fine? As long as copy only uses new pages INSERT can use
> the same ones without problem. I think...
>
>> Full-page images in general are a problem.
>
> With the above rules I don't think it'd be. They'd contain the previous
> contents, and we'll not target them again with COPY.

Well, you really have to ensure that COPY never uses a page that any
other operation (INSERT, DELETE, UPDATE, hint-bit update) has ever
touched and created a FPW for. The naive approach, where you just reset
the target block at the beginning of COPY and use the
HEAP_INSERT_SKIP_FSM option, is not enough. Doing it correctly is
possible, but requires a lot more bookkeeping than might seem at first
glance.

>> I think we should
>> 1. reliably and explicitly keep track of whether we've WAL-logged any
>> TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations on
>> the relation, and
>> 2. make sure we never skip WAL-logging again if we have.
>>
>> Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
>> when a new relfilenode is created, i.e. whenever rd_createSubid or
>> rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
>> (including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
>> only skip WAL-logging if the flag is still set. To deal with the case that
>> the flag gets cleared in the middle of COPY, also check the flag whenever
>> we're about to skip WAL-logging in heap_insert, and if it's been cleared,
>> ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.
>
> Am I missing something, or will this break the BEGIN; TRUNCATE; COPY;
> pattern we use ourselves and have suggested a number of times?

Sorry, I was imprecise above. I meant "whenever an XLOG_SMGR_TRUNCATE
record is WAL-logged", rather than "whenever a TRUNCATE [command] is
WAL-logged". TRUNCATE on a table that wasn't created in the same
transaction doesn't emit an XLOG_SMGR_TRUNCATE record, because it
creates a whole new relfilenode. So that's OK.
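
To make the flag rules concrete, here is a minimal standalone sketch of
the proposal as described above. It is not PostgreSQL code: the
SketchRelation struct, the SketchWalEvent enum, and all sketch_*
function names are hypothetical stand-ins, and only rd_skip_wal_safe,
rd_createSubid, rd_newRelfilenodeSubid, XLOG_SMGR_TRUNCATE, and
HEAP_INSERT_SKIP_WAL come from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of RelationData. */
typedef struct SketchRelation
{
    /* true if rd_createSubid or rd_newRelfilenodeSubid would be set */
    bool new_relfilenode_in_xact;
    /* models the proposed rd_skip_wal_safe flag */
    bool skip_wal_safe;
} SketchRelation;

/* Simplified classification of WAL records written for a relation. */
typedef enum SketchWalEvent
{
    SKETCH_WAL_PLAIN,          /* ordinary record: no page image, no INIT */
    SKETCH_WAL_SMGR_TRUNCATE,  /* an XLOG_SMGR_TRUNCATE record */
    SKETCH_WAL_FULL_PAGE       /* full-page image, incl. INSERT/UPDATE+INIT */
} SketchWalEvent;

/* A new relfilenode starts out safe for skipping WAL. */
void
sketch_assign_new_relfilenode(SketchRelation *rel)
{
    assert(rel != NULL);
    rel->new_relfilenode_in_xact = true;
    rel->skip_wal_safe = true;
}

/* Called whenever a record for this relation is WAL-logged. */
void
sketch_note_wal_logged(SketchRelation *rel, SketchWalEvent ev)
{
    assert(rel != NULL);
    if (ev == SKETCH_WAL_SMGR_TRUNCATE || ev == SKETCH_WAL_FULL_PAGE)
        rel->skip_wal_safe = false;   /* never skip WAL again */
}

/*
 * Decision made per insert: even if the caller asked to skip WAL
 * (the HEAP_INSERT_SKIP_WAL case), WAL-log anyway once the flag
 * has been cleared.
 */
bool
sketch_insert_must_wal_log(const SketchRelation *rel, bool skip_wal_requested)
{
    assert(rel != NULL);
    if (!skip_wal_requested)
        return true;
    return !(rel->new_relfilenode_in_xact && rel->skip_wal_safe);
}
```

In this model, BEGIN; TRUNCATE; COPY on a pre-existing table maps to
sketch_assign_new_relfilenode() followed by inserts that may skip WAL,
because no XLOG_SMGR_TRUNCATE record is involved. The check is made per
insert rather than once per COPY so that a flag cleared mid-COPY (for
example by a trigger) takes effect immediately.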

In the long term, I'd like to refactor this whole thing so that we never
WAL-log any operations on a relation that's created in the same
transaction (when wal_level=minimal). Instead, at COMMIT, we'd fsync()
the relation, or if it's smaller than some threshold, WAL-log the
contents of the whole file at that point. That would move all that
more-difficult-than-it-seems-at-first-glance logic from COPY and
the index AMs to a central location, and it would allow the same
optimization for all operations, not just COPY. But that probably isn't
feasible to backpatch.
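
As a rough illustration of that commit-time decision, the following
standalone sketch picks an action per relation. Everything in it is an
assumption made for illustration: the SketchNewRel struct, the
SketchCommitAction enum, sketch_commit_action(), and the threshold
parameter are hypothetical names, and no particular threshold value is
implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-relation state examined at COMMIT. */
typedef struct SketchNewRel
{
    bool   created_in_xact;  /* relation was created in this transaction */
    size_t size_bytes;       /* current on-disk size of the relation */
} SketchNewRel;

typedef enum SketchCommitAction
{
    SKETCH_COMMIT_NOTHING,           /* changes were WAL-logged as usual */
    SKETCH_COMMIT_WAL_LOG_CONTENTS,  /* small file: log its whole contents */
    SKETCH_COMMIT_FSYNC              /* large file: fsync() it instead */
} SketchCommitAction;

/*
 * Choose what to do for one relation at COMMIT. The skip-WAL path only
 * applies to relations created in the same transaction while
 * wal_level=minimal; threshold_bytes is an assumed tuning knob.
 */
SketchCommitAction
sketch_commit_action(const SketchNewRel *rel, bool wal_level_minimal,
                     size_t threshold_bytes)
{
    assert(rel != NULL);

    if (!rel->created_in_xact || !wal_level_minimal)
        return SKETCH_COMMIT_NOTHING;

    if (rel->size_bytes < threshold_bytes)
        return SKETCH_COMMIT_WAL_LOG_CONTENTS;

    return SKETCH_COMMIT_FSYNC;
}
```

The point of centralizing the choice this way is that individual
operations no longer need to know whether WAL was skipped; only the
commit path does.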

- Heikki
