Re: Freezing without write I/O

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Freezing without write I/O
Date: 2013-09-19 11:42:19
Message-ID: 523AE31B.5@vmware.com
Lists: pgsql-hackers

On 18.09.2013 16:22, Andres Freund wrote:
> On 2013-09-16 16:59:28 +0300, Heikki Linnakangas wrote:
>> Here's a rebased version of the patch, including the above-mentioned fixes.
>> Nothing else new.
>
> * We need some higher-level description of the algorithm somewhere in the
> source. I don't think I've understood the concept from the patch alone
> without having read the thread previously.
> * why do we need to do the PageUpdateNeedsFreezing() dance in
> heap_page_prune? No xids should change during it.

Ah, but its LSN does change. Advancing the LSN can move the page from
one xid-lsn range to another, and any XIDs on the page that fall
outside the XID range of the new xid-lsn range must then be frozen.
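
To illustrate (names and representation made up here, not the patch's
actual code): each range essentially promises "pages with LSN >= minlsn
contain only XIDs >= minxid", so when pruning bumps the page LSN across
a range boundary, the older XIDs on the page have to be frozen first,
even though no XIDs on the page changed:

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

typedef struct
{
    XLogRecPtr    minlsn;   /* pages with LSN >= minlsn ... */
    TransactionId minxid;   /* ... may only contain XIDs >= minxid */
} XidLSNRange;

/*
 * Returns true if advancing the page LSN from old_lsn to new_lsn moves
 * the page into a newer range, i.e. XIDs below the new range's minxid
 * must be frozen before the LSN is bumped.  Ranges are in ascending
 * minlsn order.
 */
static bool
page_crosses_range(const XidLSNRange *ranges, int nranges,
                   XLogRecPtr old_lsn, XLogRecPtr new_lsn)
{
    int old_r = 0;
    int new_r = 0;
    int i;

    for (i = 0; i < nranges; i++)
    {
        if (ranges[i].minlsn <= old_lsn)
            old_r = i;
        if (ranges[i].minlsn <= new_lsn)
            new_r = i;
    }
    return new_r != old_r;
}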

> * Why can we do a GetOldestXmin(allDbs = false) in
> BeginXidLSNRangeSwitch()?

Looks like a bug. I think I got the arguments backwards; it was supposed
to be allDbs = true and ignoreVacuum = false. I'm not sure whether
vacuums could safely be ignored, but 'false' is the safe option.
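
In other words, what I meant to write there was something like (variable
name aside):

/* consider all databases, and don't ignore vacuums */
oldestXmin = GetOldestXmin(true, false);    /* allDbs, ignoreVacuum */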

> * Is there any concrete reasoning behind the current values for
> XID_LSN_RANGE_INTERVAL and NUM_XID_LSN_RANGES or just gut feeling?

The values in the patch are:

#define NUM_XID_LSN_RANGES 100
#define XID_LSN_RANGE_INTERVAL 1000000

The interval is probably way too small for production, but a small value
is handy for testing, as you can get through ranges faster. Something
like 100 million would probably be a more suitable value for production.
With a smaller interval, you need to freeze more often, while with a
large interval, you can't truncate the clog as early.
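
To put some rough numbers on that (clog uses two bits per transaction):
with an interval of 100 million XIDs, each range corresponds to about
25 MB of clog, so holding on to, say, ten unfrozen ranges pins roughly
250 MB of clog before it can be truncated. With the test value of 1
million it's only ~250 kB per range, but you switch ranges all the time.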

NUM_XID_LSN_RANGES is also quite arbitrary. I don't have a good feel for
what an appropriate size would be; something like 5 or 10 would probably
be enough. At the moment the patch doesn't handle running out of slots
very well, though. You could merge some old ranges to make room, but
ATM it just stops creating new ones.
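
If we did want to merge, the simplest thing (in terms of the illustrative
struct above) would be to drop the boundary between the two oldest
ranges; the result is still correct, just less precise about when old
pages can be considered frozen:

/*
 * Free a slot by merging the two oldest ranges (illustrative only):
 * ranges[0] keeps its older, more conservative minlsn/minxid and now
 * also covers everything ranges[1] used to cover.
 */
static void
merge_oldest_ranges(XidLSNRange *ranges, int *nranges)
{
    int i;

    for (i = 1; i < *nranges - 1; i++)
        ranges[i] = ranges[i + 1];
    (*nranges)--;
}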

> * the lsn ranges file can possibly become bigger than 512bytes (the size
> we assume to be written atomically) and you write it inplace. If we
> fail halfway through writing, we seem to be able to recover by using
> the pageMatureLSN from the last checkpoint, but it seems better to
> do the fsync(),rename(),fsync() dance either way.

The lsn-range array is also written to the xlog in toto whenever it
changes, so on recovery, the ranges will be read from the WAL, and the
ranges file will be recreated at the end-of-recovery checkpoint.
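
(For the record, the dance you're referring to would be roughly the
below; I just don't think it's needed here, since the file contents can
always be reconstructed from the WAL.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* write-to-temp-then-rename, sketched; error checking omitted */
static void
durable_replace(const char *tmppath, const char *path, const char *dirpath,
                const char *buf, size_t len)
{
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    write(fd, buf, len);
    fsync(fd);                  /* make the new contents durable */
    close(fd);

    rename(tmppath, path);      /* atomically replace the old file */

    fd = open(dirpath, O_RDONLY);
    fsync(fd);                  /* make the rename itself durable */
    close(fd);
}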

> * Should we preemptively freeze tuples on a page in lazy_scan_heap if we
> already have dirtied the page? That would make future modifications
> cheaper.

Hmm, dunno. Any future modification will also dirty the page, so you're
not actually saving any I/O by freezing it earlier. You're just choosing
to do the CPU work and WAL insertion earlier than you have to. And if
the page is never modified again, it's a waste of cycles. That said,
maybe it would indeed be better to do it in vacuum anyway, since that
takes the work off the critical path of foreground updates, even if it
is sometimes wasted.

> * lazy_scan_heap now blocks acquiring a cleanup lock on every buffer
> that contains dead tuples. Shouldn't we use some kind of cutoff xid
> there? That might block progress too heavily. Also the comment above
> it still refers to the old logic.

Hmm, you mean like skip the page even if there are some dead tuples on
it, as long as the dead tuples are not older than X xids? I guess we
could do that. Or the other thing mentioned in the comments, i.e.
remember the page and come back to it later.
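
Roughly like this, I mean (illustrative only; there's no such cutoff or
oldest-dead-XID bookkeeping in the patch as it stands):

if (!ConditionalLockBufferForCleanup(buf))
{
    /*
     * Couldn't get the cleanup lock without waiting.  If the dead
     * tuples are all newer than the cutoff (determined e.g. by peeking
     * at the page with just a share lock), skip the page for now;
     * otherwise wait for the cleanup lock as before.
     */
    if (TransactionIdFollowsOrEquals(oldest_dead_xid_on_page,
                                     skip_cutoff_xid))
        continue;
    LockBufferForCleanup(buf);
}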

For now though, I think it's good enough as it is.

> * There's no way to force a full table vacuum anymore, that seems
> problematic to me.

Yeah, it probably would be good to have that. An "ignore visibility map"
option.

> * I wonder if CheckPointVarsup() doesn't need to update
> minRecoveryPoint.

Hmm, I guess it should.

> StartupVarsup() should be ok, because we should only
> read one from the future during a basebackup?

You might read one from the future during recovery, whether it's crash
or archive recovery, but as soon as you've replayed up to the min
recovery point, or the end of WAL on crash recovery, the XID ranges
array in memory should be consistent with the rest of the system,
because changes to the array are WAL logged.

> * xidlsnranges_recently[_dirtied] are not obvious on a first glance. Why
> can't we just reset dirty before the WriteXidLSNRangesFile() call?
> There's only one process doing the writeout. Just because the
> checkpointing process could be killed?

Right. If the write fails, the dirty flag needs to stay set so that the
write is retried on the next checkpoint.
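
Roughly, the idea is this (simplified, ignoring locking, and assuming
the write function reports success or failure):

xidlsnranges_recently_dirtied = false;  /* changes made from here on set
                                         * this flag (and dirty) again */
if (xidlsnranges_dirty)
{
    if (WriteXidLSNRangesFile())        /* can fail, e.g. out of disk space */
    {
        /*
         * Clear the dirty flag only if nothing changed while we were
         * writing.  On failure, leave it set so that the next checkpoint
         * retries the write.
         */
        if (!xidlsnranges_recently_dirtied)
            xidlsnranges_dirty = false;
    }
}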

> * I think we should either not require consuming an multixactid or use a
> function that doesn't need MultiXactIdSetOldestMember(). If the
> transaction doing so lives for long it will unnecessarily prevent
> truncation of mxacts.

Agreed, that was just a quick kludge to get it working.

> * switchFinishXmin and nextSwitchXid should probably be either volatile
> or have a compiler barrier between accessing shared memory and
> checking them. The compiler very well could optimize them away and
> access shmem all the time which could lead to weird results.

Hmm, that would be a weird "optimization". Is that really legal for the
compiler to do? Better safe than sorry, I guess.
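
Something like this, then (sketch; where exactly the fields live in
shared memory isn't the point):

/*
 * Copy the shared values into locals once.  The barrier keeps the
 * compiler from discarding the locals and re-reading shared memory at
 * every use, which could make two checks see two different values.
 */
TransactionId switchFinishXmin = xidlsnShared->switchFinishXmin;
TransactionId nextSwitchXid = xidlsnShared->nextSwitchXid;

pg_compiler_barrier();          /* or make the shared variables volatile */

if (TransactionIdIsValid(switchFinishXmin) &&
    TransactionIdPrecedes(oldestXmin, switchFinishXmin))
{
    /* ... only use the local copies from here on ... */
}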

> * I wonder whether the fact that we're doing the range switches after
> acquiring an xid could be problematic if we're preventing xid
> allocation due to the checks earlier in that function?

You mean that you might get into a situation where you cannot assign a
new XID because you've reached xidStopLimit, and you cannot advance
xidStopLimit because you can't switch to a new xid-lsn range, because no
new XIDs are being assigned? Yeah, that would be nasty. The range
management stuff should really be moved out of GetNewTransactionId();
maybe do it in the walwriter or bgwriter, as Alvaro suggested.

> * I think heap_lock_tuple() needs to unset all-visible, otherwise we
> won't vacuum that page again which can lead to problems since we
> don't do full-table vacuums again?

It's OK if the page is never vacuumed again; the whole point of the
patch is that old XIDs can just be left lingering in the table. The next
time the page is updated, perhaps to lock a tuple again, the page will
be frozen and the old xmax will be cleared.

> So, I think that's enough for a first look. Will think about general
> issues a bit more.

Thanks!

- Heikki
