Re: checkpointer continuous flushing - V18

From: Andres Freund <andres(at)anarazel(dot)de>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing - V18
Date: 2016-02-21 20:22:24
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote:
> * CpktSortItem:
> I think that allocating 20 bytes per buffer in shared memory is a little on
> the heavy side. Some compression can be achieved: sizeof(ForlNum) is 4 bytes
> to hold 4 values, could be one byte or even 2 bits somewhere. Also, there
> are very few tablespaces, they could be given a small number and this number
> could be used instead of the Oid, so the space requirement could be reduced
> to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping
> 4 bytes alignement and also getting 8 byte alignement... If this is too
> much, I have shown that it can work with only 4 bytes per buffer, as the
> sorting is really just a performance optimisation and is not broken if some
> stuff changes between sorting & writeback, but you did not like the idea. If
> the amount of shared memory required is a significant concern, it could be
> resurrected, though.

This is less than 0.2 % of memory related to shared buffers. We have the
same amount of memory allocated in CheckpointerShmemSize(), and nobody
has complained so far. And sorry, going back to the previous approach
isn't going to fly, and I've no desire to discuss that *again*.

> ISTM that "progress" and "progress_slice" only depend on num_scanned and
> per-tablespace num_to_scan and total num_to_scan, so they are somehow
> redundant and the progress could be recomputed from the initial figures
> when needed.

They don't cause much space usage, and we access the values
frequently. So why not store them?

> If these fields are kept, I think that a comment should justify why float8
> precision is okay for the purpose. I think it is quite certainly fine in the
> worst case with 32 bits buffer_ids, but it would not be if this size is
> changed someday.

That seems pretty much unrelated to having the fields - the question of
accuracy plays a role regardless, no? Given realistic amounts of memory
the max potential "skew" seems fairly small with float8. If we ever
flush one buffer "too much" for a tablespace it's pretty much harmless.

> ISTM that nearly all of the collected data on the second sweep could be
> collected on the first sweep, so that this second sweep could be avoided
> altogether. The only missing data is the index of the first buffer in the
> array, which can be computed by considering tablespaces only, sweeping over
> buffers is not needed. That would suggest creating the heap or using a hash
> in the initial buffer sweep to keep this information. This would also
> provide a point where to number tablespaces for compressing the CkptSortItem
> struct.

Doesn't seem worth the complexity to me.

> I'm wondering about calling CheckpointWriteDelay on each round, maybe
> a minimum amount of write would make sense.

Why? There's not really much benefit of doing more work than needed. I
think we should sleep far shorter in many cases, but that's indeed a
separate issue.

> I see a binary_heap_allocate but no corresponding deallocation, this
> looks like a memory leak... or is there some magic involved?

Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...

> >I think this patch primarily needs:
> >* Benchmarking on FreeBSD/OSX to see whether we should enable the
> > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
> > inclined to leave it off till then.
> I do not have that. As "msync" seems available on Linux, it is possible to
> force using it with a "ifdef 0" to skip sync_file_range and check whether it
> does some good there.

Unfortunately it doesn't work well on linux:
* On many OSs msync() on a mmap'ed file triggers writeback. On linux
* it only does so when MS_SYNC is specified, but then it does the
* writeback synchronously. Luckily all common linux systems have
* sync_file_range(). This is preferrable over FADV_DONTNEED because
* it doesn't flush out clean data.

I've verified beforehand, with a simple demo program, that
msync(MS_ASYNC) does something reasonable of freebsd...

> Idem for the "posix_fadvise" stuff. I can try to do
> that, but it takes time to do so, if someone can test on other OS it would
> be much better. I think that if it works it should be kept in, so it is just
> a matter of testing it.

I'm not arguing for ripping it out, what I mean is that we don't set a
nondefault value for the GUCs on platforms with just posix_fadivise


Andres Freund

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Thom Brown 2016-02-21 20:40:08 Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Previous Message Heikki Linnakangas 2016-02-21 20:15:54 Re: about google summer of code 2016