Re: Freeze avoidance of very large table.

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Freeze avoidance of very large table.
Date: 2015-04-23 15:04:33
Message-ID: 55390A01.3090200@BlueTreble.com
Lists: pgsql-hackers

On 4/23/15 8:42 AM, Robert Haas wrote:
> On Thu, Apr 23, 2015 at 4:19 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> We were talking about having an incremental backup map also. Which sounds a
>> lot like the freeze map.
>
> Yeah, possibly. I think we should try to set things up so that the
> backup map can be updated asynchronously by a background worker, so
> that we're not adding more work to the foreground path just for the
> benefit of maintenance operations. That might make the logic for
> autovacuum to use it a little bit more complex, but it seems
> manageable.

I'm not sure an actual map makes sense... for incremental backups you
need some kind of stream that tells you not only what changed but when
it changed. A simple freeze map won't work for that because the
operation of freezing itself writes data (and the same can be true for
the VM). Though, if the backup utility were comparing live data against
an existing backup, maybe this would work...

>> We only need a freeze/backup map for larger relations. So if we map 1000
>> blocks per map page, we skip having a map at all when size < 1000.
>
> Agreed. We might also want to map multiple blocks per map slot - e.g.
> one slot per 32 blocks. That would keep the map quite small even for
> very large relations, and would not compromise efficiency that much
> since reading 256kB sequentially probably takes only a little longer
> than reading 8kB.

The problem with mapping a range of pages per bit is dealing with
locking when you set the bit. Currently that's easy because we're
holding the cleanup lock on the page, but you can't do that for a whole
range of pages. Though, if each 'slot' weren't a simple binary value we
could have a third state indicating that we're in the process of marking
that slot as all visible/frozen, with readers still treating the slot
as cleared.
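
Just to illustrate what I mean, here's a rough sketch of a two-bit-per-slot
encoding where each slot covers a range of heap blocks. The names and layout
are made up for this example, not anything in the tree:

/*
 * Hypothetical sketch only -- not PostgreSQL source. Each map "slot"
 * covers a range of heap blocks and uses two bits so it can hold three
 * states instead of a plain set/clear flag.
 */
#include <stdint.h>

#define SLOT_CLEAR        0x0   /* some pages in the range may need freezing */
#define SLOT_IN_PROGRESS  0x1   /* range is being frozen; readers treat as CLEAR */
#define SLOT_ALL_FROZEN   0x2   /* every page in the range is frozen */

#define BITS_PER_SLOT     2
#define SLOTS_PER_BYTE    (8 / BITS_PER_SLOT)

static uint8_t
slot_get(const uint8_t *map, uint32_t slotno)
{
    uint8_t byte = map[slotno / SLOTS_PER_BYTE];
    int     shift = (slotno % SLOTS_PER_BYTE) * BITS_PER_SLOT;

    return (byte >> shift) & 0x3;
}

static void
slot_set(uint8_t *map, uint32_t slotno, uint8_t state)
{
    int shift = (slotno % SLOTS_PER_BYTE) * BITS_PER_SLOT;

    map[slotno / SLOTS_PER_BYTE] &= ~(0x3 << shift);
    map[slotno / SLOTS_PER_BYTE] |= (state & 0x3) << shift;
}

/* A scan may only skip the range when the slot is fully frozen. */
static int
range_is_all_frozen(const uint8_t *map, uint32_t slotno)
{
    return slot_get(map, slotno) == SLOT_ALL_FROZEN;
}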

Honestly though, I think concerns about the size of the map are a bit
overblown. Even if we double its size, it's still 32,000 times smaller
than the heap is with 8k pages. I suspect that if you have tables large
enough that you'll care, you'll also be using 32k pages, which means
it'd be 128,000 times smaller than the heap. I have a hard time
believing that's going to be even a faint blip on the performance radar.
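
For what it's worth, the arithmetic behind those ratios, as a throwaway
snippet (the block sizes and two map bits per heap page are just the
assumptions above):

/* Back-of-the-envelope check of the heap-to-map size ratios. */
#include <stdio.h>

int
main(void)
{
    long blcksz_8k = 8192;
    long blcksz_32k = 32768;
    int  bits_per_page = 2;     /* all-visible + all-frozen */

    printf("8k pages:  heap/map ratio = %ld\n",
           blcksz_8k * 8 / bits_per_page);    /* 32768, i.e. ~32,000x */
    printf("32k pages: heap/map ratio = %ld\n",
           blcksz_32k * 8 / bits_per_page);   /* 131072, i.e. ~128,000x */
    return 0;
}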
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
