Re: FSM versus GIN pending list bloat

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FSM versus GIN pending list bloat
Date: 2015-08-04 20:04:29
Message-ID: CAMkU=1wiVvwbOKFa4pT0f0hbLwz1k0T3CJLs4W67pTDNdgW0KQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 4, 2015 at 6:35 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 4 August 2015 at 09:39, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>> On 4 August 2015 at 06:03, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>>
>>> The attached proof of concept patch greatly improves the bloat for both
>>> the insert and the update cases. You need to turn on both features: adding
>>> the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
>>> The first of those two things could probably be adopted for real, but the
>>> second probably is not acceptable. What is the right way to do this?
>>> Could a variant of RecordFreeIndexPage bubble the free space up the map
>>> immediately rather than waiting for a vacuum? It would only have to move
>>> up until it found a page with freespace already recorded in it, which the
>>> vast majority of the time would mean observing up one level and then not
>>> writing to it, assuming the pending list pages remain well clustered.
>>>
>>
>> You make a good case for action here since insert only tables with GIN
>> indexes on text are a common use case for GIN.
>>
>> Why would vacuuming the FSM be unacceptable? With a
>> large gin_pending_list_limit it makes sense.
>>
>> If it is unacceptable, perhaps we can avoid calling it every time, or
>> simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
>> 80/20 heuristic for this case.
>>
>
> Couple of questions here...
>
> * the docs say "it's desirable to have pending-list cleanup occur in the
> background", but there is no way to invoke that, except via VACUUM. I
> think we need a separate function to be able to call this as a background
> action. If we had that, we wouldn't need much else, would we?
>

I thought maybe the new bgworker framework would be a way to have a backend
signal a bgworker to do the cleanup when it notices the pending list is
getting large. But that wouldn't directly fix this issue, because the
bgworker still wouldn't recycle that space (without further changes); only
vacuum workers do that currently.

But I don't think this could be implemented as an extension, because the
signalling code has to be in core, so (not having studied the matter at
all) I don't know if it is a good fit for bgworker.
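
To make the "bubble the free space up the map immediately" idea from the
quoted proposal concrete, here is a toy model. It is only a sketch under
stated assumptions: the class ToyFreeSpaceTree and its methods are
hypothetical names, the tree is a plain in-memory binary max-tree rather
than PostgreSQL's on-disk FSM, and it only handles values being raised
(pages becoming free), not lowered.

```python
class ToyFreeSpaceTree:
    """Toy max-tree (hypothetical, not PostgreSQL's FSM layout).

    Node i has children 2i+1 and 2i+2; the leaves hold per-page free space
    and every inner node holds the maximum of its children.
    """

    def __init__(self, n_leaves):
        # A power-of-two leaf count keeps the array layout simple.
        assert n_leaves > 0 and n_leaves & (n_leaves - 1) == 0
        self.n_leaves = n_leaves
        self.nodes = [0] * (2 * n_leaves - 1)

    def record_free(self, page, value):
        """Set a leaf, bubble the value up, return how many parents were written."""
        idx = self.n_leaves - 1 + page
        self.nodes[idx] = value
        writes = 0
        while idx > 0:
            idx = (idx - 1) // 2
            if self.nodes[idx] >= value:
                break  # an ancestor already advertises this much space
            self.nodes[idx] = value
            writes += 1
        return writes

    def find_free(self, needed):
        """Descend from the root to a leaf with at least `needed` free, else None."""
        if self.nodes[0] < needed:
            return None
        idx = 0
        while idx < self.n_leaves - 1:
            left = 2 * idx + 1
            idx = left if self.nodes[left] >= needed else left + 1
        return idx - (self.n_leaves - 1)


tree = ToyFreeSpaceTree(8)
first = tree.record_free(3, 1)   # nothing recorded yet: writes every ancestor
second = tree.record_free(2, 1)  # sibling page: parent already set, no writes
```

In this toy, the first freed page writes all three ancestors, while the
neighbouring page looks up one level and writes nothing, which is the
cheap case expected when pending list pages stay well clustered.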

> * why do we have two parameters: gin_pending_list_limit and fastupdate?
> What happens if we set gin_pending_list_limit but don't set fastupdate?
>

Fastupdate is on by default. If it were turned off, then
gin_pending_list_limit would be mostly irrelevant for those tables.
Fastupdate could have been implemented as a magic value (0 or -1) for
gin_pending_list_limit but that would break backwards compatibility (and
arguably would not be a better way of doing things, anyway).

> * how do we know how to set that parameter? Is there a way of knowing
> gin_pending_list_limit has been reached?
>

I don't think there is an easy answer to that. The trade-offs are
complex and depend on things like how well cached the parts of the index
needing insertions are, how many lexemes/array elements are in an average
document, and how many documents inserted near the same time as each other
share lexemes in common. And of course on what you need to optimize for,
latency or throughput, and, if latency, whether search latency or insert
latency.

> This and the OP seem like 9.5 open items to me.
>

I don't think so. Freeing gin_pending_list_limit from being forcibly tied
to work_mem is a good thing. Even if I don't know exactly how to set
gin_pending_list_limit, I know I don't want it to be 4GB just because
work_mem was set there for some temporary reason. I'm happy to leave it at
its default and let its fine tuning be a topic for people who really care
about every microsecond of performance.

Cheers,

Jeff
