Re: HEAD seems to generate larger WAL regarding GIN index

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HEAD seems to generate larger WAL regarding GIN index
Date: 2014-03-17 14:54:09
Message-ID: 53270C91.3020103@vmware.com
Lists: pgsql-hackers

On 03/17/2014 04:33 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> 2. Instead of storing the new compressed posting list in the WAL record,
>> store only the new item pointers added to the page. WAL replay would
>> then have to duplicate the work done in the main insertion code path:
>> find the right posting lists to insert to, decode them, add the new
>> items, and re-encode.
>
> That sounds fairly dangerous ... is any user-defined code involved in
> those decisions?

No.

>> This record format would be higher-level, in the sense that we would not
>> store the physical copy of the compressed posting list as it was formed
>> originally. The same work would be done at WAL replay. As the code
>> stands, it will produce exactly the same result, but that's not
>> guaranteed if we make bugfixes to the code later, and a master and
>> standby are running different minor version. There's not necessarily
>> anything wrong with that, but it's something to keep in mind.
>
> Version skew would be a hazard too, all right. I think it's important
> that WAL replay be a pretty mechanical, predictable process.

Yeah. One particular point to note is that if in one place we do the
more "high level" thing and have WAL replay re-encode the page as it
sees fit, then we can *not* rely on the page being byte-by-byte
identical in other places. Like, in vacuum, where items are deleted.

Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
the page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items in a different physical order
than the normal codepath does. So for good or bad, there is some
precedent for this.
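
To illustrate the point about physical order, here is a toy sketch in
Python. It is not PostgreSQL code: ToyPage, build_forward and
build_reversed are made-up names, and the page model ignores headers,
alignment and the space taken by the line pointers themselves. It only
shows that two code paths adding the same items in a different order
end up with the same logical contents but different bytes on the page.

```python
class ToyPage:
    """Toy slotted page: line pointers kept in logical order,
    item data packed downward from the end of the page."""

    def __init__(self, size=32):
        self.data = bytearray(size)
        self.upper = size          # start of the used data area
        self.line_pointers = []    # (offset, length), in logical order

    def add_item(self, item, position):
        # Simplified space check: only the data area is accounted for.
        if len(item) > self.upper:
            raise ValueError("item does not fit on page")
        self.upper -= len(item)
        self.data[self.upper:self.upper + len(item)] = item
        self.line_pointers.insert(position, (self.upper, len(item)))

    def items(self):
        """Return the items in logical (line pointer) order."""
        return [bytes(self.data[off:off + n]) for off, n in self.line_pointers]


def build_forward(items):
    """Hypothetical 'normal' path: append items first to last."""
    page = ToyPage()
    for i, item in enumerate(items):
        page.add_item(item, i)
    return page


def build_reversed(items):
    """Hypothetical replay path: add items last to first, each at slot 0."""
    page = ToyPage()
    for item in reversed(items):
        page.add_item(item, 0)
    return page
```

Both builders return a page whose items() list is the same, but
comparing the data arrays byte by byte would report a difference.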

The immediate danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master
server adds an item to a page, and it just fits, but with the
compression logic of the standby's version, it cannot be made to fit. As
an escape hatch for that, we could have the WAL replay code try the
compression again, with a larger maximum posting list size, if it doesn't
fit at first. And/or always leave something like 10 bytes of free space
on every data page to make up for small differences in the logic.
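
A rough sketch of the retry idea, in Python rather than C to keep it
short. Everything here is an assumption made for illustration: items
are plain integers, each posting list costs a fixed 8-byte header that
holds its first item, later items are stored as variable-length deltas,
and the names encoded_size and replay_insert are invented. The real GIN
code differs in the details.

```python
LIST_HEADER_BYTES = 8  # assumed per-list overhead (first item + length)


def varbyte_len(n):
    """Bytes needed to store n at 7 payload bits per byte."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def encoded_size(items, max_list_bytes):
    """Total bytes needed to store sorted items as a series of posting
    lists, each at most max_list_bytes long (header included)."""
    total = 0
    current = 0      # size of the list being built; 0 means none open
    prev = None
    for item in items:
        delta_len = varbyte_len(item - prev) if current else 0
        if current and current + delta_len <= max_list_bytes:
            current += delta_len
        else:
            # Close the open list (if any) and start a new one, whose
            # header stores this item in full.
            total += current
            current = LIST_HEADER_BYTES
        prev = item
    return total + current


def replay_insert(items, new_item, page_bytes, max_list_bytes):
    """Re-encode the page with new_item added. If the result does not
    fit, retry with a doubled list size limit (the escape hatch).
    Returns (encoded size, list size limit that was used)."""
    merged = sorted(set(items) | {new_item})
    limit = max_list_bytes
    while True:
        size = encoded_size(merged, limit)
        if size <= page_bytes:
            return size, limit
        if limit >= page_bytes:
            # Even a single list spanning the whole page is too big.
            raise ValueError("items cannot be made to fit on the page")
        limit *= 2
```

In this model a larger limit means fewer lists and so fewer headers,
which is why the retry can succeed. The second idea, keeping a small
reserve, would amount to the insertion code comparing against
page_bytes minus 10 or so while replay compares against the full
page_bytes.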

- Heikki
