Re: Autovacuum fails to keep visibility map up-to-date in mostly-insert-only-tables

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alexey Bashtanov <bashtanov(at)imap(dot)cc>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Autovacuum fails to keep visibility map up-to-date in mostly-insert-only-tables
Date: 2014-10-20 01:43:29
Message-ID: 544468C1.6050101@BlueTreble.com
Lists: pgsql-hackers

On 10/19/14, 11:41 AM, Andres Freund wrote:
> On 2014-10-18 21:36:48 -0500, Jim Nasby wrote:
>> On 10/9/14, 4:19 PM, Andres Freund wrote:
>>> On 2014-10-09 18:16:46 -0300, Alvaro Herrera wrote:
>>>> Andres Freund wrote:
>>>>> On 2014-10-09 18:03:00 -0300, Alvaro Herrera wrote:
>>>>>> Bruce Momjian wrote:
>>>>>>
>>>>>>> I agree this is a serious problem. We have discussed various options,
>>>>>>> but have not decided on anything. The TODO list has:
>>>>>>>
>>>>>>> https://wiki.postgresql.org/wiki/Todo
>>>>>>>
>>>>>>> Improve setting of visibility map bits for read-only and insert-only
>>>>>>> workloads
>>>>>>>
>>>>>>> http://www.postgresql.org/message-id/20130906001437.GA29264@momjian.us
>>>>>>
>>>>>> I hate to repeat myself, but I think autovacuum could be modified to run
>>>>>> actions other than vacuum and analyze. In this specific case we could
>>>>>> be running a table scan that checks only pages that don't have the
>>>>>> all-visible bit set, and see if it can be set.
>>>>>
>>>>> Isn't that *precisely* what a plain vacuum run does?
>>>>
>>>> Well, it also scans for dead tuples, removes them, and needs to go
>>>> through indexes to remove their references.
>>>
>>> IIRC it doesn't do most of that if there's no need, and if it's an
>>> insert-only table without rollbacks. I *do* think there are some
>>> optimizations we could make in general.
>>
>> No, it always attempts dead tuple removal.
>
> I said some steps, not all steps. Check it out:
>
>     /* If any tuples need to be deleted, perform final vacuum cycle */
>     /* XXX put a threshold on min number of tuples here? */
>     if (vacrelstats->num_dead_tuples > 0)
>     {
>         /* Log cleanup info before we touch indexes */
>         vacuum_log_cleanup_info(onerel, vacrelstats);
>
>         /* Remove index entries */
>         for (i = 0; i < nindexes; i++)
>             lazy_vacuum_index(Irel[i],
>                               &indstats[i],
>                               vacrelstats);
>         /* Remove tuples from heap */
>         lazy_vacuum_heap(onerel, vacrelstats);
>         vacrelstats->num_index_scans++;
>     }
>
> There's a rub here though. We unconditionally do:
>
>     /* Do post-vacuum cleanup and statistics update for each index */
>     for (i = 0; i < nindexes; i++)
>         lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
>
> and that's not particularly cheap. Maybe we should make that conditional
> when there's been no lazy_vacuum_index/heap calls at all?

We could possibly pass lazy_cleanup_index a flag indicating whether we actually removed any tuples.
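
Roughly like this (the extra boolean parameter is hypothetical, not existing vacuumlazy.c code; the rest follows the current signatures):

    /* Hypothetical: let lazy_cleanup_index() skip the expensive
     * index_vacuum_cleanup() pass when this vacuum never ran a
     * removal cycle. */
    static void
    lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats,
                       LVRelStats *vacrelstats, bool removed_tuples)
    {
        if (!removed_tuples)
            return;             /* no lazy_vacuum_index/heap calls happened */

        /* existing cleanup and stats-update logic goes here */
    }

    /* Caller side, in lazy_scan_heap(): */
    for (i = 0; i < nindexes; i++)
        lazy_cleanup_index(Irel[i], indstats[i], vacrelstats,
                           vacrelstats->num_index_scans > 0);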

>> The "weird" part is that if it's not doing a freeze it will just punt
>> on a page if it can't get the cleanup lock.
>
> I don't think that's particularly weird. Otherwise vacuum can get stuck
> behind a single very hot page - leading to much, much more bloat.
>
>> I have to believe that could seriously screw up autovacuum scheduling.
>
> Why?

I'm worried there could be some pathological cases where we'd skip a large number of pages, perhaps if a vacuum scan and a seqscan ended up running alongside each other.

Perhaps this is just paranoia, but we have no idea how bad things might actually be, because we have no logging of how many pages get skipped due to failure to acquire the cleanup lock.
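
Even something this simple would tell us (pinskipped_pages is a hypothetical counter; ConditionalLockBufferForCleanup is what lazy_scan_heap already calls):

    /* In lazy_scan_heap(), where we currently give up on a page we
     * can't cleanup-lock: */
    if (!ConditionalLockBufferForCleanup(buf))
    {
        vacrelstats->pinskipped_pages++;    /* hypothetical new counter */
        if (!scan_all)
        {
            ReleaseBuffer(buf);
            continue;                       /* existing skip behavior */
        }
    }

    /* ...and in the completion ereport: */
    ereport(elevel,
            (errmsg("\"%s\": skipped %u pages due to buffer pins",
                    RelationGetRelationName(onerel),
                    vacrelstats->pinskipped_pages)));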

Also, if this really is that big a deal for heap pages, how come we don't get screwed by it on Btree index pages, where we mandate that we acquire a cleanup lock?

>> Now that we have forks, I'm wondering if it would be best to come up
>> with a per-page system that could be used to determine when a table
>> needs background work to be done. The visibility map could serve a lot
>> of this purpose, but I'm not sure if it would work for getting hint
>> bits set in the background.
>
> It would. By definition, all tuples that are 'all visible' need to be
> fully hint bitted.
>
>> I think it would also be a win if we had a way to advance relfrozenxid
>> and relminmxid. Perhaps something that simply remembered the last XID
>> that touched each page...
>
> Not sure what you're getting at here?

What I'm getting at is that our current method for determining when and what to vacuum is rather crude, and likely results in wasted effort during scans as well as not firing autovacuum often enough. Keep in mind that autovacuum started as a user-space utility, and the best it could possibly do was keep a table of stats counters.

The visibility map obviously helps cut down on extra work during a scan, but it only goes so far in that regard.

Instead of relying on these crude methods, if we reliably tracked certain txids on a per-block basis in a fork, we could cheaply scan the fork and make an extremely well-informed decision about how much a vacuum would gain us, and exactly which blocks it should hit.

Let me use freezing as an example. If we had a reliable list of the lowest txid for each block of a relation, we could do a freeze scan that hits only the blocks whose minimum txid falls within our freeze range. The same could be done for multixacts.
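
In sketch form (blockinfo, min_xid and freeze_block are all hypothetical; TransactionIdPrecedes and FreezeLimit already exist):

    /* Hypothetical targeted freeze scan: consult the per-block txid fork
     * and visit only blocks whose oldest xid is behind the freeze cutoff. */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if (TransactionIdPrecedes(blockinfo[blkno].min_xid, FreezeLimit))
            freeze_block(onerel, blkno);    /* hypothetical helper */
    }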

That's just one example. We could do something similar for background hinting (oldest xmin/xmax of all non-hinted tuples), and whether it's worth trying to vacuum (I think that'd be a combination of oldest non-locking xmax and seeing how much room the FSM has listed for the page).

If we stored 3 txids for each heap block in a fork (12 bytes per block), we could fit information for ~680 heap blocks in each 8K fork block. So in a database with 680GB of heap data, we could fully determine every *block* (not table) we needed to vacuum by scanning just 1GB of fork data. That would allow for far better autovacuum scheduling than what we do today.
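
To show the arithmetic, here's a standalone sketch (BlockVacInfo is a hypothetical layout; BLCKSZ is PostgreSQL's default 8K block size):

    #include <stdio.h>
    #include <stdint.h>

    #define BLCKSZ 8192                 /* default PostgreSQL block size */

    typedef uint32_t TransactionId;     /* 32-bit xids, as in PostgreSQL */

    /* Hypothetical per-heap-block entry in the proposed fork. */
    typedef struct
    {
        TransactionId min_xid;          /* oldest xid on the heap block */
        TransactionId min_mxid;         /* oldest multixact on the block */
        TransactionId oldest_unhinted;  /* oldest non-hinted xmin/xmax */
    } BlockVacInfo;

    int
    main(void)
    {
        /* 8192 / 12 = 682, i.e. ~680 heap blocks per fork block */
        int     per_fork_block = BLCKSZ / sizeof(BlockVacInfo);

        /* so a 680GB heap needs roughly 680GB / 682 ~= 1GB of fork */
        double  heap_gb = 680.0;

        printf("heap blocks covered per fork block: %d\n", per_fork_block);
        printf("fork size for %.0fGB of heap: %.2fGB\n",
               heap_gb, heap_gb / per_fork_block);
        return 0;
    }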

> I think the big missing piece lest something like Heikki's xid lsn
> ranges thing gets finished is a freeze map.

The problem with a simple freeze map is deciding when to actually set the bit. If you set it while the transaction that created all the tuples is still running, then any attempt to use the map before those tuples become all-visible is pointless. Presumably this is why pd_prune_xid stores a txid and isn't just a boolean.
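
So the sketch I'd trust is a cutoff xid rather than a bare bit (freeze_hint_usable and hint_xid are hypothetical; the TransactionId macros are existing ones):

    /* Hypothetical: record a cutoff xid when the page is marked, as
     * pd_prune_xid does, and only honor the mark once that xid is
     * behind every running transaction. */
    static bool
    freeze_hint_usable(TransactionId hint_xid, TransactionId OldestXmin)
    {
        return TransactionIdIsValid(hint_xid) &&
               TransactionIdPrecedes(hint_xid, OldestXmin);
    }
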
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
