Re: FSM versus GIN pending list bloat

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FSM versus GIN pending list bloat
Date: 2015-08-04 20:50:21
Message-ID: CANP8+jJsjj8HOzVKbLB4+Bc+B1tkzymJf3O3K5BFS=zpXbTX1Q@mail.gmail.com
Lists: pgsql-hackers

On 4 August 2015 at 21:04, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

>> Couple of questions here...
>>
>> * the docs say "it's desirable to have pending-list cleanup occur in the
>> background", but there is no way to invoke that, except via VACUUM. I
>> think we need a separate function to be able to call this as a background
>> action. If we had that, we wouldn't need much else, would we?
>>
>
> I thought maybe the new bgworker framework would be a way to have a
> backend signal a bgworker to do the cleanup when it notices the pending
> list is getting large. But that wouldn't directly fix this issue, because
> the bgworker still wouldn't recycle that space (without further changes),
> only vacuum workers do that currently.
>
> But I don't think this could be implemented as an extension, because the
> signalling code has to be in core, so (not having studied the matter at
> all) I don't know if it is a good fit for bgworker.
>

We need to expose 2 functions:

1. a function to perform the recycling directly (BRIN has an equivalent
function)

2. a function to see how big the pending list is for a particular index,
i.e. do we need to run function 1?

We can then build a bgworker that polls the pending list and issues a
recycle if and when needed - which is how autovac started.
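
A minimal sketch of that polling pass, in Python for brevity. The two
callables stand in for functions 1 and 2 proposed above; their names, the
page-count threshold, and the in-memory fake used to exercise them are
assumptions for illustration, not existing PostgreSQL APIs.

```python
from typing import Callable, Iterable, List


def poll_once(
    indexes: Iterable[str],
    pending_pages: Callable[[str], int],   # hypothetical "function 2"
    clean_pending: Callable[[str], None],  # hypothetical "function 1"
    threshold_pages: int,
) -> List[str]:
    """Run one polling pass and return the indexes that were cleaned."""
    cleaned = []
    for index in indexes:
        # Only pay for a cleanup when the pending list is actually large.
        if pending_pages(index) >= threshold_pages:
            clean_pending(index)
            cleaned.append(index)
    return cleaned


# Usage with an in-memory stand-in for the two functions.
fake_sizes = {"idx_a": 10, "idx_b": 900}


def fake_clean(index: str) -> None:
    fake_sizes[index] = 0  # a cleanup empties the pending list


result = poll_once(list(fake_sizes), fake_sizes.__getitem__, fake_clean,
                   threshold_pages=512)
```

A real worker would repeat this pass on a timer; the threshold would
presumably be derived from gin_pending_list_limit rather than hard-coded.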

>> * why do we have two parameters: gin_pending_list_limit and fastupdate?
>> What happens if we set gin_pending_list_limit but don't set fastupdate?
>>
>
> Fastupdate is on by default. If it were turned off, then
> gin_pending_list_limit would be mostly irrelevant for those tables.
> Fastupdate could have been implemented as a magic value (0 or -1) for
> gin_pending_list_limit but that would break backwards compatibility (and
> arguably would not be a better way of doing things, anyway).
>
>
>> * how do we know how to set that parameter? Is there a way of knowing
>> gin_pending_list_limit has been reached?
>>
>
> I don't think there is an easy answer to that. The trade-offs are
> complex and depend on things like how well cached the parts of the index
> needing insertions are, how many lexemes/array elements are in an average
> document, and how many documents inserted near the same time as each other
> share lexemes in common. And of course what you need to optimize for,
> latency or throughput, and, if latency, whether search latency or insert
> latency.
>

So we also need a way to count the number of times the pending list is
flushed. Perhaps record that on the metapage, so we can see how often it
has happened, plus another function to view the stats on that.
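
A rough sketch of how such a counter could be used, again in Python. The
n_flushes field and the helper below are hypothetical; they only illustrate
that comparing two snapshots of a monotonically increasing counter gives a
flush rate.

```python
from dataclasses import dataclass


@dataclass
class MetaPageStats:
    """Hypothetical snapshot of counters kept on the GIN metapage."""
    n_pending_pages: int  # current pending-list size, in pages
    n_flushes: int        # assumed new counter: flushes since index creation


def flushes_per_hour(earlier: MetaPageStats, later: MetaPageStats,
                     elapsed_seconds: float) -> float:
    """Flush rate between two snapshots taken elapsed_seconds apart."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (later.n_flushes - earlier.n_flushes) * 3600.0 / elapsed_seconds


# Two snapshots taken 30 minutes apart: 30 flushes in between.
rate = flushes_per_hour(MetaPageStats(40, 100), MetaPageStats(12, 130), 1800.0)
```

A persistently high rate might suggest that gin_pending_list_limit is small
relative to the insert volume, though what counts as "high" would depend on
the workload.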

>> This and the OP seem like 9.5 open items to me.
>>
>
> I don't think so. Freeing gin_pending_list_limit from being forcibly tied
> to work_mem is a good thing. Even if I don't know exactly how to set
> gin_pending_list_limit, I know I don't want it to be 4GB just because work_mem
> was set there for some temporary reason. I'm happy to leave it at its
> default and let its fine tuning be a topic for people who really care about
> every microsecond of performance.
>

OK, I accept this.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
