Re: GIN improvements part 1: additional information

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part 1: additional information
Date: 2014-01-22 08:30:22
Message-ID: 52DF819E.4070507@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01/22/2014 09:25 AM, Alexander Korotkov wrote:
> On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
>>
>>> Oh, I see what's going on. I had assumed that there cannot be duplicate
>>> insertions into the posting tree, but that's dead wrong. The fast
>>> insertion mechanism depends on a duplicate insertion to do nothing.
>>
>> Ok, this turned out to be a much bigger change than I thought...
>>
>> In principle, it's not difficult to eliminate duplicates. However, to
>> detect a duplicate, you have to check if the item is present in the
>> uncompressed items array, or in the compressed lists. That requires
>> decoding the segment where it should be.
>>
>> But if we decode the segment, what's the purpose of the uncompressed items
>> array? Its purpose was to speed up insertions, by buffering them so that we
>> don't need to decode and re-encode the page/segment on every inserted item.
>> But if we're paying the price of decoding it anyway, we might as well
>> include the new item and re-encode the segment. The uncompressed area saves
>> some effort in WAL-logging, as the record of inserting an entry into the
>> uncompressed area is much smaller than that of re-encoding part of the
>> page, but if that really is a concern, we could track more carefully which
>> parts of the page is modified, and only WAL-log the required parts. And
>> hopefully, the fast-update lists buffer inserts so that you insert many
>> items at a time to the posting tree, and the price of re-encoding is only
>> paid once.
>>
>> So, now that I think about this once more, I don't think the uncompressed
>> area in every leaf page is a good idea.
>
> I didn't get why insertion of duplicate TIDs to uncompressed area and
> eliminate them of re-encoding? It would be somewhat more work during
> updates, more work during scan, more WAL records. But is it really
> significant for real-life workloads?

Hmm, so you would merrily insert duplicate TIDs into the uncompressed
area, and weed them out when reading or recompressing the page? I had
not thought of that. Yeah, it might be a good trade-off, duplicates are
not expected to happen very often.

> I refactored the way the recompression routine works again. It is now more
>> clearly a multi-step process. First, the existing page is "disassembled"
>> into an in-memory struct, which is a linked list of all the segments.
>> In-memory each segment can be represented as an array of item pointers, or
>> in the compressed format. In the next phase, all the new items are added to
>> the segments where they belong. This naturally can lead to overly large
>> segments; in the third phase, the items are redistributed among the
>> segments, and compressed back to the compressed format. Finally, the
>> finished segments are written back to the page, or pages if it had to be
>> split.
>>
>> The same subroutines to disassemble and recompress are also used in vacuum.
>>
>> Attached is what I've got now. This is again quite heavily-changed from
>> the previous version - sorry for repeatedly rewriting this. I'll continue
>> testing and re-reviewing this tomorrow.
>
>
> Let's clarify where we are. We had quite debugged and tested version with
> hard-wired varbyte encoding. Now we're reimplementing and debugging
> segmented version of varbyte encoding in a hope that one day we can easily
> replace it with something better that meets our restrictions (but we didn't
> find it already). Is it right?

The segmentation was a sensible change on code-readability grounds
alone. Yes, it made it easier to experiment with different encodings,
and will make it easier to replace the encoding in the future, but that
wasn't the only reason for doing it. It's nice to have the
encoding/decoding stuff in ginpostinglists.c, so that the rest of the
code just passes around opaque GinPostingList structs (previously known
as PostingListSegment).

One thing I have been pondering, though: Instead of storing the posting
lists one after each other on the leaf page, it might be better to store
them as Items on the page, with normal ItemIds pointing to each. So the
page layout would be more standard, and you could use
PageAddItem/PageIndexMultiDelete to add/remove posting lists to page.
The immediate advantage of that would be that it would make it possible
to do a binary search of the segments, to quickly locate the right
segment where a tuple belongs to. That might not be significant in
practice - linearly scanning 32 items is not slow either. And it would
add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per
page with the current segment size of 256 bytes. But then again, it
might be a good idea just because it would make the pages look more like
any other page, which is generally a good thing.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message KONDO Mitsumasa 2014-01-22 08:32:56 Re: Add min and max execute statement time in pg_stat_statement
Previous Message Kyotaro HORIGUCHI 2014-01-22 08:28:27 Re: Trigger information for auto_explain.