Re: HEAD seems to generate larger WAL regarding GIN index

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HEAD seems to generate larger WAL regarding GIN index
Date: 2014-03-17 14:26:28
Message-ID: 53270614.5050804@vmware.com
Lists: pgsql-hackers

On 03/17/2014 03:20 PM, Fujii Masao wrote:
> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
>> On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:

> I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
> and then found its max seems more than 256B. Am I missing something?
>
> What I observed is
>
> [In HEAD]
> At first, the size of GIN-related WAL is gradually increasing up to about 1400B.
> rmgr: Gin len (rec/tot): 48/ 80, tx: 1813,
> lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
> rmgr: Gin len (rec/tot): 56/ 88, tx: 1813,
> lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
> rmgr: Gin len (rec/tot): 64/ 96, tx: 1813,
> lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
> ...
> rmgr: Gin len (rec/tot): 1376/ 1408, tx: 1813,
> lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
> rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813,
> lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting
> tree, node: 1663/12945/16441 blkno: 4

This corresponds to the stage where the items are stored in-line in the
entry-tree. After it reaches a certain size, a posting tree is created.

> Then the size decreases to about 100B and is gradually increasing
> again up to 320B.
>
> rmgr: Gin len (rec/tot): 116/ 148, tx: 1813,
> lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
> 1372 (compressed)
> rmgr: Gin len (rec/tot): 40/ 72, tx: 1813,
> lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
> ...
> rmgr: Gin len (rec/tot): 118/ 150, tx: 1813,
> lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
> 1374 (compressed)
> ...
> rmgr: Gin len (rec/tot): 288/ 320, tx: 1813,
> lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280
> length: 1544 (compressed)
>
> Then the size decreases to 66B and is gradually increasing again up to 320B.
> This increase and decrease of WAL size seems to continue.

Here the new items are appended to posting tree pages. This is where the
maximum of 256 bytes I mentioned applies. 256 bytes is the max size of
one compressed posting list; the WAL record containing it includes some
other stuff too, which adds up to that 320 bytes.
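
The fixed part of that overhead can be read off the quoted pg_xlogdump
output: in every record shown, tot minus rec is 32 bytes. A minimal Python
sketch, assuming the "len (rec/tot): R/ T" layout seen in the quoted lines
(the helper name record_overhead is made up for illustration):

```python
import re

# Matches the "len (rec/tot): 288/ 320" field as it appears in the quoted output.
LEN_RE = re.compile(r"len \(rec/tot\):\s*(\d+)/\s*(\d+)")

def record_overhead(line):
    """Return (rec, tot, tot - rec) for one pg_xlogdump line, or None if no match."""
    m = LEN_RE.search(line)
    if m is None:
        return None
    rec, tot = int(m.group(1)), int(m.group(2))
    return rec, tot, tot - rec

# Three lines copied from the HEAD output quoted above.
samples = [
    "rmgr: Gin len (rec/tot): 48/ 80, tx: 1813,",
    "rmgr: Gin len (rec/tot): 288/ 320, tx: 1813,",
    "rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813,",
]
overheads = [record_overhead(s)[2] for s in samples]
```

With rec = 288 and a posting list of at most 256 bytes, at least 32 bytes
of the record body are something other than the list itself.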

> [In 9.3]
> At first, the size of GIN-related WAL is gradually increasing up to about 2700B.
>
> rmgr: Gin len (rec/tot): 52/ 84, tx: 1812,
> lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node:
> 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T
> isdelete F updateBlkno:4294967295
> rmgr: Gin len (rec/tot): 60/ 92, tx: 1812,
> lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node:
> 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T
> isdelete T updateBlkno:4294967295
> ...
> rmgr: Gin len (rec/tot): 2740/ 2772, tx: 1812,
> lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node:
> 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T
> isdelete T updateBlkno:4294967295
> rmgr: Gin len (rec/tot): 2714/ 2746, tx: 1812,
> lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting
> tree, node: 1663/12896/16441 blkno: 6
>
> The size decreases to 66B and then is never changed.

Same mechanism on 9.3, but the insertions to the posting tree pages are
constant size.

>>> That could be optimized, but I figured we can live with it, thanks to the
>>> fastupdate feature. Fastupdate allows amortizing that cost over several
>>> insertions. But of course, you explicitly disabled that...
>>
>> Let me know if you want me to write patch addressing this issue.
>
> Yeah, I really want you to address this problem! That's definitely useful
> for every users disabling FASTUPDATE option for some reasons.

Ok, let's think about it a little bit. I think there are three fairly
simple ways to address this:

1. The GIN data leaf "recompress" record contains an offset called
"unmodifiedlength", and the data that comes after that offset.
Currently, the record is written so that unmodifiedlength points to the
end of the last unmodified compressed posting list on the page, and the
data after it consists of all the modified lists. The straightforward way to
cut down the WAL record size would be to be more fine-grained than that,
and for the posting lists that were modified, only store the difference
between the old and new version.

To make this approach work well for random insertions, not just
appending to the end, we would also need to make the logic in
leafRepackItems a bit smarter so that it would not re-encode all the
posting lists, after the first modified one.
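
To illustrate the idea of logging only the difference, here is a rough
Python sketch. It is a byte-level stand-in, not the GIN code: it finds the
common prefix and suffix of the old and new data and keeps only the middle
part. The function names diff_region and apply_diff are hypothetical, and a
real implementation would presumably work per posting list rather than on
raw bytes.

```python
def diff_region(old: bytes, new: bytes):
    """Return (prefix, suffix, middle) such that
    new == old[:prefix] + middle + old[len(old) - suffix:]."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    # The suffix must not overlap the prefix in either buffer.
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1
    middle = new[prefix:len(new) - suffix]
    return prefix, suffix, middle

def apply_diff(old: bytes, prefix: int, suffix: int, middle: bytes) -> bytes:
    """What replay would do: rebuild the new data from the old data plus the diff."""
    return old[:prefix] + middle + old[len(old) - suffix:]

# An insertion in the middle: only the changed bytes need to be logged.
old_data = b"AAAA" + b"BBBB" + b"CCCC"
new_data = b"AAAA" + b"BxBBB" + b"CCCC"
prefix, suffix, middle = diff_region(old_data, new_data)
rebuilt = apply_diff(old_data, prefix, suffix, middle)
```

This only pays off if the unmodified posting lists stay byte-for-byte
identical, which is why the repacking logic must not re-encode everything
after the first modified list.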

2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would
then have to duplicate the work done in the main insertion code path:
find the right posting lists to insert to, decode them, add the new
items, and re-encode.

The upside of that would be that the WAL format would be very compact.
It would be quite simple to implement - you just need to call the same
functions we use in the main insertion codepath to insert the new items.
It could be more expensive, CPU-wise, to replay the records, however.

This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code
stands, it will produce exactly the same result, but that's not
guaranteed if we make bugfixes to the code later, and a master and
standby are running different minor versions. There's not necessarily
anything wrong with that, but it's something to keep in mind.
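
A toy Python sketch of what option 2 means, under heavy simplification:
item pointers are modeled as plain integers, a posting list is a
varbyte-encoded sequence of deltas, and the names encode, decode and
insert_items are invented for this example. The point is only that the WAL
record carries the new items, and that replay calls the same function as
the original insertion.

```python
def encode(items):
    """Varbyte-encode the deltas between sorted non-negative integers."""
    out = bytearray()
    prev = 0
    for item in items:
        delta = item - prev
        prev = item
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode(data):
    """Inverse of encode(): rebuild the sorted integers from the deltas."""
    items, prev, val, shift = [], 0, 0, 0
    for b in data:
        val |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            prev += val
            items.append(prev)
            val, shift = 0, 0
    return items

def insert_items(segment, new_items):
    """Decode, merge in the new items, re-encode. Used by insertion and by replay."""
    merged = sorted(set(decode(segment)) | set(new_items))
    return encode(merged)

page = encode([10, 20, 300])
wal_record = [25, 1000]          # only the new items are logged

primary_page = insert_items(page, wal_record)   # main insertion path
standby_page = insert_items(page, wal_record)   # WAL replay repeats the work
```

The two results match here because both sides run identical code; if
insert_items behaved differently on the standby, the pages could differ
byte-wise while still holding the same items.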

3. Just reduce the GinPostingListSegmentMaxSize constant from 256 to,
say, 128. That would halve the typical size of a WAL record that appends
to the end. However, it would not help with insertions in the middle of
a posting list, only appends to the end, and it would bloat the pages
somewhat, as you would waste more space on the posting list headers.
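
The header bloat in option 3 is easy to estimate. A back-of-the-envelope
Python sketch, assuming an 8-byte header per posting list segment and
assuming the size limit covers header plus data (both are assumptions made
for illustration, as is the 2000-byte figure):

```python
import math

HEADER_BYTES = 8  # assumed per-segment header size, for illustration only

def segment_stats(encoded_bytes, max_segment):
    """Return (number of segments, total header bytes) needed to store
    encoded_bytes of item data in segments of at most max_segment bytes."""
    per_segment_data = max_segment - HEADER_BYTES
    segments = math.ceil(encoded_bytes / per_segment_data)
    return segments, segments * HEADER_BYTES

with_256 = segment_stats(2000, 256)
with_128 = segment_stats(2000, 128)
```

Under these assumptions, halving the segment size roughly doubles the
number of segments and therefore the space spent on headers, while an
append only needs to rewrite a segment about half as large.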

I'm leaning towards option 2. Alexander, what do you think?

- Heikki
