|From:||Heikki Linnakangas <hlinnaka(at)iki(dot)fi>|
|To:||Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|Cc:||Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||Re: WAL logging problem in 9.4.3?|
|Views:||Raw Message | Whole Thread | Download mbox|
On 07/22/2015 11:18 AM, Simon Riggs wrote:
> On 10 July 2015 at 00:06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Andres Freund <andres(at)anarazel(dot)de> writes:
>>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>>> different performance problem, perhaps we could have COPY create a new
>>>> relfilenode when it does this. That should be safe if the table was
>>>> previously empty.
>>> I'm not convinced that cab9a0656c36739f needs to survive in that
>>> form. To me only allowing one COPY to benefit from the wal_level =
>>> minimal optimization has a significantly higher cost than
>> What evidence have you got to base that value judgement on?
>> cab9a0656c36739f was based on an actual user complaint, so we have good
>> evidence that there are people out there who care about the cost of
>> truncating a table many times in one transaction. On the other hand,
>> I know of no evidence that anyone's depending on multiple sequential
>> COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
>> for having this COPY optimization at all was to make restoring pg_dump
>> scripts in a single transaction fast; and that use-case doesn't care
>> about anything but a single COPY into a virgin table.
> We have to backpatch this fix, so it must be both simple and effective.
> Heikki's suggestions may be best, maybe not, but they don't seem
> Tom's suggestion looks good. So does Andres' suggestion. I have coded both.
Thanks. For comparison, I wrote a patch to implement what I had in mind.
When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap
that would normally be WAL-logged, we check if the relation is in the
hash table, and skip WAL-logging if so.
That was a simplified explanation. In reality, when WAL-skipping COPY
begins, we also memorize the current size of the relation. Any actions
on blocks greater than the old size are not WAL-logged, and any actions
on smaller-numbered blocks are. This ensures that if you did any INSERTs
on the table before the COPY, any new actions on the blocks that were
already WAL-logged by the INSERT are also WAL-logged. And likewise if
you perform any INSERTs after (or during, by trigger) the COPY, and they
modify the new pages, those actions are not WAL-logged. So starting a
WAL-skipping COPY splits the relation into two parts: the first part
that is WAL-logged as usual, and the later part that is not WAL-logged.
(there is one loose end marked with XXX in the patch on this, when one
of the pages involved in a cold UPDATE is before the watermark and the
other is after)
The actual fsync() has been moved to the end of transaction, as we are
now skipping WAL-logging of any actions after the COPY as well.
And truncations complicate things further. If we emit a truncation WAL
record in the transaction, we also make an entry in the hash table to
record that. All operations on a relation that has been truncated must
be WAL-logged as usual, because replaying the truncate record will
destroy all data even if we fsync later. But we still optimize for
"BEGIN; CREATE; COPY; TRUNCATE; COPY;" style patterns, because if we
truncate a relation that has already been marked for fsync-at-COMMIT, we
don't need to WAL-log the truncation either.
This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.
>> And what reason is there to think that this would fix all the problems?
> I don't think either suggested fix could be claimed to be a great solution,
> since there is little principle here, only heuristic. Heikki's solution
> would be the only safe way, but is not backpatchable.
I can't get too excited about a half-fix that leaves you with data
corruption in some scenarios.
I wrote a little test script to test all these different scenarios
(attached). Both of your patches fail with the script.
|Next Message||Kevin Grittner||2015-07-22 16:28:28||Re: brin index vacuum versus transaction snapshots|
|Previous Message||Egor Rogov||2015-07-22 16:17:45||Revoke [admin option for] role|