Re: Enabling Checksums

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enabling Checksums
Date: 2013-01-29 02:34:41
Message-ID: CA+TgmoZL5+-s8NJLYvuZE7_qcVc=FzpM8y5S5L1UdKZukwc=4A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 27, 2013 at 5:28 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> There's a maximum of one FPI per page per cycle, and we need the FPI for
> any modified page in this design regardless.
>
> So, deferring the XLOG_HINT WAL record doesn't change the total number
> of FPIs emitted. The only savings would be on the trivial XLOG_HINT wal
> record itself, because we might notice that it's not necessary in the
> case where some other WAL action happened to the page.

OK, I see. So the case where this really hurts is where a page is
updated for hint bits only and then not touched again for the
remainder of the checkpoint cycle.
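
Concretely, the rule seems to be something like the sketch below.
EmitXlogHintFPI() is a made-up placeholder for whatever ends up
writing the XLOG_HINT record with its full-page image; this is just
to pin down the logic, not the patch's actual code:

#include "postgres.h"
#include "access/xlog.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Illustrative only: an FPI for a hint-bit-only change is needed
 * just once per checkpoint cycle, on the first modification of the
 * page after the current redo pointer.
 */
static void
SetHintBitsWithChecksums(Buffer buf)
{
    Page        page = BufferGetPage(buf);
    XLogRecPtr  redo = GetRedoRecPtr();     /* start of current cycle */

    if (PageGetLSN(page) <= redo)
    {
        /*
         * First change since the checkpoint started: log an FPI and
         * bump the page LSN, so later hint-bit-only changes in this
         * cycle skip this branch entirely.
         */
        XLogRecPtr  lsn = EmitXlogHintFPI(buf);     /* hypothetical */

        PageSetLSN(page, lsn);
    }
    /* else: an earlier FPI already covers this page for this cycle */
}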

>> > At first glance, it seems sound as long as the WAL FPI makes it to disk
>> > before the data. But to meet that requirement, it seems like we'd need
>> > to write an FPI and then immediately flush WAL before cleaning a page,
>> > and that doesn't seem like a win. Do you (or Simon) see an opportunity
>> > here that I'm missing?
>>
>> I am not sure that isn't a win. After all, we may need to flush WAL
>> before flushing a buffer anyway, so this is just adding another case -
>
> Right, but if we get the WAL record in earlier, there is a greater
> chance that it goes out with some unrelated WAL flush, and we don't need
> to flush the WAL to clean the buffer at all. Separating WAL insertions
> from WAL flushes seems like a fairly important goal, so I'm a little
> skeptical of a proposal to narrow that gap so drastically.
>
> It's hard to analyze without a specific proposal on the table. But if
> cleaning pages requires a WAL record followed immediately by a flush, it
> seems like that would increase the number of actual WAL flushes we need
> to do by a lot.

Yeah, maybe. I think Simon had a good argument for not pursuing this
route, anyway.
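
For context, the existing WAL-before-data rule we are both leaning on
looks roughly like this simplified version of what FlushBuffer() does
in bufmgr.c (locking, pinning, and error handling omitted):

#include "postgres.h"
#include "access/xlog.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Simplified: before a dirty buffer can go to disk, all WAL up to
 * the page's LSN must already be durable, so flush WAL first.
 */
static void
WriteDirtyBuffer(Buffer buf)
{
    Page        page = BufferGetPage(buf);
    XLogRecPtr  recptr = PageGetLSN(page);

    XLogFlush(recptr);          /* WAL first; may force an fsync */

    /* ... then hand the page to the storage layer (smgrwrite) ... */
}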

>> and the payoff is that when the initial access to a page, setting
>> hint bits, is quickly followed by a write operation, we avoid the
>> need for any extra WAL to cover the hint bit change. I bet that's
>> common, because if you're updating, you'll usually need to look at
>> the tuples on the page and decide whether they are visible to your
>> scan before, say, updating one of them.
>
> That's a good point, I'm just not sure how to avoid that problem
> without a lot of complexity or a big cost. It seems like we want to
> defer the XLOG_HINT WAL record for a short time, but not wait so long
> that we need to clean the buffer or miss a chance to piggyback on
> another WAL flush.
>
>> > By the way, the approach I took was to add the heap buffer to the WAL
>> > chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
>> > It seemed simpler to understand than trying to add a bunch of options to
>> > MarkBufferDirty.
>>
>> Unless I am mistaken, that's going to heavily penalize the case
>> where the user vacuums an insert-only table. It will emit much more
>> WAL than it does currently.
>
> Yes, that's true, but I think that's pretty fundamental to this
> checksums design (and of course it only applies if checksums are
> enabled). We need to make sure an FPI is written and the LSN bumped
> before we write a page.
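
If I understand the approach, it amounts to something like the sketch
below, written against the current XLogRecData API. Details will
surely differ in the real patch (for instance, only chaining the heap
buffer when checksums are enabled), and the header location varies by
branch:

#include "postgres.h"
#include "access/heapam.h"      /* xl_heap_visible lives here today */
#include "access/xlog.h"
#include "storage/bufmgr.h"

/*
 * Sketch: chain the heap buffer into the XLOG_HEAP2_VISIBLE record
 * alongside the visibility-map buffer, so replay can restore an FPI
 * of the heap page when needed.
 */
XLogRecPtr
log_heap_visible_sketch(RelFileNode rnode, Buffer heap_buffer,
                        Buffer vm_buffer, TransactionId cutoff_xid)
{
    xl_heap_visible xlrec;
    XLogRecData rdata[3];

    xlrec.node = rnode;
    xlrec.block = BufferGetBlockNumber(heap_buffer);
    xlrec.cutoff_xid = cutoff_xid;

    rdata[0].data = (char *) &xlrec;
    rdata[0].len = SizeOfHeapVisible;
    rdata[0].buffer = InvalidBuffer;
    rdata[0].next = &rdata[1];

    /* the visibility-map page, as before */
    rdata[1].data = NULL;
    rdata[1].len = 0;
    rdata[1].buffer = vm_buffer;
    rdata[1].buffer_std = false;
    rdata[1].next = &rdata[2];

    /* new: the heap page joins the chain, so it can get an FPI */
    rdata[2].data = NULL;
    rdata[2].len = 0;
    rdata[2].buffer = heap_buffer;
    rdata[2].buffer_std = true;
    rdata[2].next = NULL;

    return XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VISIBLE, rdata);
}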
>
> That's why I was pushing a little on various proposals to either remove
> or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE,
> cut down on the less-important hint bits, etc.). Maybe those aren't
> viable, but that's why I spent time on them.
>
> There are some other options, but I cringe a little bit thinking about
> them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum
> calculation, so that a torn page doesn't cause a problem (though
> obviously that one bit would be vulnerable to corruption). Another is to
> use a double-write buffer, but that didn't seem to go very far. Or, we
> could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN.
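
The first of those options would look something like the following,
with ComputePageChecksum() as a stand-in for whatever checksum
routine we end up with:

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Illustrative only: compute the page checksum with PD_ALL_VISIBLE
 * masked out of pd_flags, so a torn write that flips just that bit
 * cannot make verification fail.
 */
static uint16
PageChecksumIgnoringAllVisible(Page page)
{
    PageHeader  phdr = (PageHeader) page;
    uint16      saved_flags = phdr->pd_flags;
    uint16      checksum;

    phdr->pd_flags &= ~PD_ALL_VISIBLE;      /* exclude the bit */
    checksum = ComputePageChecksum(page);   /* hypothetical */
    phdr->pd_flags = saved_flags;           /* restore */

    return checksum;
}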

I am inclined to think that we shouldn't do any of this stuff for now.
I think it's OK if the first version of checksums is
not-that-flexible and/or not-that-performant. We can optimize for
those things later. Trying to monkey with this at the same time we're
trying to get checksums in risks diverting focus from getting
checksums done at all, and also risks introducing new data corruption
bugs. We have a long-standing reputation for getting it right first
and then getting it to perform well later, so it shouldn't be a total
shock if we take that approach here, too. I see no reason to think
that the performance problems must be solved up front or not at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
