Re: checkpointer continuous flushing

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-08-10 15:02:36
Message-ID: 20150810150236.GA11513@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2015-08-08 20:49:03 +0300, Heikki Linnakangas wrote:
> I ripped out the "flushing" part, keeping only the sorting. I refactored the
> logic in BufferSync() a bit. There's now a separate function,
> nextCheckpointBuffer(), that returns the next buffer ID from the sorted
> list. The tablespace-parallelization behaviour in encapsulated there,
> keeping the code in BufferSync() much simpler. See attached. Needs some
> minor cleanup and commenting still before committing, and I haven't done any
> testing besides a simple "make check".

Thought it'd be useful to review the current version as well. Some of
what I'm commenting on you'll probably already have thought of under the
label of "minor cleanup".

> /*
> + * Array of buffer ids of all buffers to checkpoint.
> + */
> +static int *CheckpointBufferIds = NULL;
> +
> +/* Compare checkpoint buffers
> + */

Should be at the beginning of the file. There's a bunch more cases of that.

> +/* Compare checkpoint buffers
> + */
> +static int bufcmp(const int * pa, const int * pb)
> +{
> + BufferDesc
> + *a = GetBufferDescriptor(*pa),
> + *b = GetBufferDescriptor(*pb);
> +
> + /* tag: rnode, forkNum (different files), blockNum
> + * rnode: { spcNode (ignore: not really needed),
> + * dbNode (ignore: this is a directory), relNode }
> + * spcNode: table space oid, not that there are at least two
> + * (pg_global and pg_default).
> + */
> + /* compare relation */
> + if (a->tag.rnode.spcNode < b->tag.rnode.spcNode)
> + return -1;
> + else if (a->tag.rnode.spcNode > b->tag.rnode.spcNode)
> + return 1;
> + if (a->tag.rnode.relNode < b->tag.rnode.relNode)
> + return -1;
> + else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
> + return 1;
> + /* same relation, compare fork */
> + else if (a->tag.forkNum < b->tag.forkNum)
> + return -1;
> + else if (a->tag.forkNum > b->tag.forkNum)
> + return 1;
> + /* same relation/fork, so same segmented "file", compare block number
> + * which are mapped on different segments depending on the number.
> + */
> + else if (a->tag.blockNum < b->tag.blockNum)
> + return -1;
> + else /* should not be the same block anyway... */
> + return 1;
> +}

This definitely needs comments about ignoring the normal buffer header
locking.

Why are we ignoring the database directory? I doubt it'll make a huge
difference, but grouping metadata-affecting operations by directory
helps.
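
I.e. presumably something like this between the spcNode and relNode
comparisons (untested sketch):

	/* group by database directory */
	if (a->tag.rnode.dbNode < b->tag.rnode.dbNode)
		return -1;
	else if (a->tag.rnode.dbNode > b->tag.rnode.dbNode)
		return 1;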

> +
> +static void
> +AllocateCheckpointBufferIds(void)
> +{
> + /* Safe worst case allocation, all buffers belong to the checkpoint...
> + * that is pretty unlikely.
> + */
> + CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
> +}

(wrong comment style...)

Heikki, you were concerned about the size of this allocation, right? I
don't think it's a problem - we used to allocate an array of that size
for the backend's private buffer pin array until 9.5, so in theory we
should be safe against that. NBuffers is limited to INT_MAX/2 in
guc.c, which ought to be sufficient?
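
For some rough perspective on the numbers (assuming 4-byte ints): 1GB
of shared_buffers is 131072 8kB pages, i.e. about 512kB for the array;
even at the INT_MAX/2 cap, which corresponds to ~8TB of shared_buffers,
the array is only around 4GB.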

> + /*
> + * Lazy allocation: this function is called through the checkpointer,
> + * but also by initdb. Maybe the allocation could be moved to the callers.
> + */
> + if (CheckpointBufferIds == NULL)
> + AllocateCheckpointBufferIds();
> +
>

I don't think it's a good idea to allocate this on every round. That
just means a lot of page table entries have to be built and torn down
regularly. It's not like checkpoints only run for 1% of the time or
such.

FWIW, I still think it's a much better idea to allocate the memory once
in shared memory. It's not like that makes us need more memory overall,
and it'll use huge pages if those are configured. I also think that
sooner rather than later we're going to need more than one process
flushing buffers, and then it'll need to be moved there anyway.
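
Roughly, i.e. a one-time allocation at shared memory creation (just a
sketch, exact placement/naming to be decided, and the relevant
*ShmemSize() function would have to account for it):

	bool		found;

	CheckpointBufferIds = (int *)
		ShmemInitStruct("Checkpoint BufferIds",
						mul_size(NBuffers, sizeof(int)),
						&found);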

> + /*
> + * Sort buffer ids to help find sequential writes.
> + *
> + * Note: buffers are not locked in anyway, but that does not matter,
> + * this sorting is really advisory, if some buffer changes status during
> + * this pass it will be filtered out later. The only necessary property
> + * is that marked buffers do not move elsewhere.
> + */

That reasoning makes it impossible to move the fsyncing of files into
the loop (whenever we move to a new file). That's not nice. The
formulation with "necessary property" doesn't seem very clear to me?

How about:
/*
* Note: Buffers are not locked in any way during sorting, but that's ok:
* A change in the buffer header is only relevant when it changes the
* buffer's identity. If the identity has changed it'll have been
* written out by BufferAlloc(), so there's no need for checkpointer to
* write it out anymore. The buffer might also get written out by a
* backend or bgwriter, but that's equally harmless.
*/
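
To expand on the fsync remark above: with a sort order we can rely on,
the write loop could eventually sync each file as soon as we're done
with its buffers, roughly like this (sketch only, ScheduleFileSync()
and SyncOneCheckpointBuffer() are made-up placeholders):

	RelFileNode prev_rnode = {InvalidOid, InvalidOid, InvalidOid};
	ForkNumber	prev_fork = InvalidForkNumber;

	for (i = 0; i < num_to_write; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(CheckpointBufferIds[i]);

		/* crossed into a new file: sync the previous one right away */
		if (i > 0 &&
			(!RelFileNodeEquals(bufHdr->tag.rnode, prev_rnode) ||
			 bufHdr->tag.forkNum != prev_fork))
			ScheduleFileSync(prev_rnode, prev_fork);

		SyncOneCheckpointBuffer(bufHdr);

		prev_rnode = bufHdr->tag.rnode;
		prev_fork = bufHdr->tag.forkNum;
	}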

> Also, qsort implementation
> + * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
> + * because of these possible concurrent changes.

Hm. Is that actually the case for our qsort implementation? If the pivot
element changes its identity won't the result be pretty much random?
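
FWIW, the whole problem of the comparator looking at data that can
change under it could be sidestepped by copying the relevant tag fields
into a private array before sorting, so qsort() only ever sees stable
values. Something like (just a sketch, struct/field names made up):

	typedef struct CkptSortItem
	{
		Oid			tsId;
		Oid			relNode;
		ForkNumber	forkNum;
		BlockNumber	blockNum;
		int			buf_id;
	} CkptSortItem;

filled in while scanning for BM_CHECKPOINT_NEEDED buffers, and sorted
instead of the raw buffer ids. That'd also get rid of the comparator
poking into buffer headers without any locking.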

> +
> + if (checkpoint_sort && num_to_write > 1 && false)
> + {

&& false - Huh?

> + qsort(CheckpointBufferIds, num_to_write, sizeof(int),
> + (int(*)(const void *, const void *)) bufcmp);
> +

Ick, I'd rather move the typecasts to the comparator.
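
I.e. something like (sketch):

static int
bufcmp(const void *pa, const void *pb)
{
	BufferDesc *a = GetBufferDescriptor(*(const int *) pa);
	BufferDesc *b = GetBufferDescriptor(*(const int *) pb);
	...
}

and then pass bufcmp to qsort() without any cast at the call site.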

> + for (i = 1; i < num_to_write; i++)
> + {
> + bufHdr = GetBufferDescriptor(CheckpointBufferIds[i]);
> +
> + spc = bufHdr->tag.rnode.spcNode;
> + if (spc != lastspc && (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0)
> + {
> + if (allocatedSpc <= j)
> + {
> + allocatedSpc = j + 5;
> + spcStatus = (TableSpaceCheckpointStatus *)
> + repalloc(spcStatus, sizeof(TableSpaceCheckpointStatus) * allocatedSpc);
> + }
> +
> + spcStatus[j].index_end = spcStatus[j + 1].index = i;
> + j++;
> + lastspc = spc;
> + }
> + }
> + spcStatus[j].index_end = num_to_write;

This really deserves some explanation.
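
My best guess at what it's doing, which could become the missing
comment:

/*
 * Split the sorted CheckpointBufferIds array into per-tablespace
 * ranges [index, index_end), growing spcStatus whenever a new
 * tablespace is encountered, so the writes can later be interleaved
 * across tablespaces.
 */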

Regards,

Andres Freund
