Re: FSM versus GIN pending list bloat

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FSM versus GIN pending list bloat
Date: 2015-08-10 17:16:55
Message-ID: CAMkU=1y6mSSJXkOS-mRxor4mLHOHairbmhg_o9HCoSJj1m2EHw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 4, 2015 at 12:38 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Tue, Aug 4, 2015 at 1:39 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>> On 4 August 2015 at 06:03, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>>
>>> The attached proof of concept patch greatly improves the bloat for both
>>> the insert and the update cases. You need to turn on both features: adding
>>> the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
>>> The first of those two things could probably be adopted for real, but the
>>> second probably is not acceptable. What is the right way to do this?
>>> Could a variant of RecordFreeIndexPage bubble the free space up the map
>>> immediately rather than waiting for a vacuum? It would only have to move
>>> up until it found a page with freespace already recorded in it, which the
>>> vast majority of the time would mean observing up one level and then not
>>> writing to it, assuming the pending list pages remain well clustered.
>>>
>>
>> You make a good case for action here since insert only tables with GIN
>> indexes on text are a common use case for GIN.
>>
>> Why would vacuuming the FSM be unacceptable? With a
>> large gin_pending_list_limit it makes sense.
>>
>
> But with a smallish gin_pending_list_limit (like the default 4MB) this
> could be called a lot (multiple times a second during some spurts), and
> would read the entire fsm each time.
>
>
>>
>> If it is unacceptable, perhaps we can avoid calling it every time, or
>> simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
>> 80/20 heuristic for this case.
>>
>
> Or maybe it could be passed a range of blocks which need vacuuming, so it
> concentrated on that range.
>
> But from the README file, it sounds like it is already supposed to be
> bubbling up. I'll have to see just what's going on there when I get a
> chance.
>

Before making changes to the FSM code to make immediate summarization
possible, I decided to quantify the effect of vacuuming the entire FSM.
Going up to 5 GB of index size, the time taken to vacuum the entire FSM one
time for each GIN_NDELETE_AT_ONCE was undetectable.
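
For readers unfamiliar with the two options being compared, the following is a minimal sketch of the idea, not PostgreSQL's FSM code. It assumes a toy max-tree in which each parent stores the largest free-space value of its children. The class and method names (ToyFreeSpaceTree, record_free, full_vacuum) are made up for illustration. record_free models the "bubble up until a parent already records the space" variant discussed earlier in the thread, and full_vacuum models recomputing the whole map.

```python
class ToyFreeSpaceTree:
    """Toy max-tree: leaves hold per-page free space, parents hold the max
    of their two children. Hypothetical model, not the real FSM layout."""

    def __init__(self, n_leaves):
        # n_leaves is assumed to be a power of two.
        self.n_leaves = n_leaves
        # Heap layout: index 1 is the root, leaves live at n_leaves..2*n_leaves-1.
        self.nodes = [0] * (2 * n_leaves)

    def record_free(self, page, value):
        """Mark a page as having `value` free space and bubble it upward.

        Stops at the first ancestor that already records at least `value`.
        Only valid when the leaf value increases (a page being freed).
        Returns the number of parent nodes that had to be written.
        """
        i = self.n_leaves + page
        self.nodes[i] = value
        writes = 0
        i //= 2
        while i >= 1 and self.nodes[i] < value:
            self.nodes[i] = value
            writes += 1
            i //= 2
        return writes

    def full_vacuum(self):
        """Recompute every parent from its children, bottom up.

        Returns the number of parent nodes visited, which grows with the
        size of the map regardless of how many leaves changed.
        """
        for i in range(self.n_leaves - 1, 0, -1):
            self.nodes[i] = max(self.nodes[2 * i], self.nodes[2 * i + 1])
        return self.n_leaves - 1
```

In this toy, freeing page 0 of an empty 8-leaf tree writes three parents, while freeing the neighbouring page 1 right afterwards writes none, because the shared parent already records the space. That is the "look up one level and then not write" behaviour expected when freed pending-list pages are well clustered.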

Based on that, I made this patch which vacuums it one time per completed
ginInsertCleanup, which should be far less than once per
GIN_NDELETE_AT_ONCE.
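
To put rough numbers on "far less", here is a back-of-the-envelope sketch. It assumes the default 8 kB block size, assumes GIN_NDELETE_AT_ONCE is 16 (the value is not stated in this message), and assumes the pending list is exactly at the 4MB default limit with every page deleted in full batches. The function name is hypothetical.

```python
import math


def fsm_vacuums_per_cleanup(pending_list_bytes, block_size=8192, ndelete_at_once=16):
    """Estimate FSM vacuum calls for one pending-list cleanup.

    block_size and ndelete_at_once are assumptions, not values taken
    from the patch. Returns (pages, calls_if_per_batch, calls_if_per_cleanup).
    """
    pages = math.ceil(pending_list_bytes / block_size)
    calls_if_per_batch = math.ceil(pages / ndelete_at_once)
    calls_if_per_cleanup = 1
    return pages, calls_if_per_batch, calls_if_per_cleanup
```

Under those assumptions, a 4MB pending list is 512 pages, so vacuuming the FSM once per batch would mean 32 calls per cleanup, against a single call when it is done once per completed ginInsertCleanup.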

I would be interested in hearing what people with very large GIN indexes
think of it. It does seem like at some point the time needed must become
large, but from what I can tell that point is way beyond what someone is
likely to have for an index on an unpartitioned table.

I have a simple test case that inserts an array of 101 md5 digests into
each row. With 10_000 of these rows inserted into an already indexed
table, I get 40 MB for the table and 80 MB for the index unpatched. With
the patch, I get 7.3 MB for the index.
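
The attached gin_freespace2.pl is not reproduced here. As a rough illustration of the kind of row described above, the following sketch builds an array of 101 md5 digests and formats it as a PostgreSQL array literal. What the real script feeds into md5 is not stated, so hashing random numbers is an assumption, and both function names are hypothetical.

```python
import hashlib
import random


def md5_array(n=101, rng=None):
    """Return n md5 hex digests of random numbers (assumed input data)."""
    rng = rng or random.Random()
    return [hashlib.md5(str(rng.random()).encode()).hexdigest() for _ in range(n)]


def to_pg_array_literal(items):
    """Format strings as a PostgreSQL array literal such as {a,b,c}.

    No quoting is applied; that is safe here only because md5 hex
    digests contain nothing but the characters 0-9 and a-f.
    """
    return "{" + ",".join(items) + "}"
```

Each value produced this way could be passed as a text[] parameter to an INSERT into a table with a GIN index on that column, repeated 10_000 times to approximate the test.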

Cheers,

Jeff

Attachment Content-Type Size
gin_fast_freespace_v001.patch application/octet-stream 2.9 KB
gin_freespace2.pl application/octet-stream 1.3 KB
