Re: checkpointer continuous flushing - V18

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing - V18
Date: 2016-02-21 09:52:45
Message-ID: alpine.DEB.2.10.1602211037560.3927@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hallo Andres,

Here is a review for the second patch.

> For 0002 I've recently changed:
> * Removed the sort timing information, we've proven sufficiently that
> it doesn't take a lot of time.

I put it there initialy to demonstrate that there was no cache performance
issue when sorting on just buffer indexes. As it is always small, I agree
that it is not needed. Well, it could be still be in seconds on a very
large shared buffers setting with a very large checkpoint, but then the
checkpoint would be tremendously huge...

> * Minor comment polishing.

Patch applies and checks on Linux.

* CpktSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little
on the heavy side. Some compression can be achieved: sizeof(ForlNum) is 4
bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also,
there are very few tablespaces, they could be given a small number and
this number could be used instead of the Oid, so the space requirement
could be reduced to say 16 bytes per buffer by combining space & fork in 2
shorts and keeping 4 bytes alignement and also getting 8 byte
alignement... If this is too much, I have shown that it can work with only
4 bytes per buffer, as the sorting is really just a performance
optimisation and is not broken if some stuff changes between sorting &
writeback, but you did not like the idea. If the amount of shared memory
required is a significant concern, it could be resurrected, though.

* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.

If these fields are kept, I think that a comment should justify why float8
precision is okay for the purpose. I think it is quite certainly fine in
the worst case with 32 bits buffer_ids, but it would not be if this size
is changed someday.

* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then
there those buffers are swept again to compute some per tablespace data
and organise a heap.

ISTM that nearly all of the collected data on the second sweep could be
collected on the first sweep, so that this second sweep could be avoided
altogether. The only missing data is the index of the first buffer in the
array, which can be computed by considering tablespaces only, sweeping
over buffers is not needed. That would suggest creating the heap or using
a hash in the initial buffer sweep to keep this information. This would
also provide a point where to number tablespaces for compressing the
CkptSortItem struct.

I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of write would make sense. This remark is independent of
this patch. Probably it works fine because after a sleep the checkpointer
is behind enough so that it will write a bunch of buffers before sleeping

I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?

There are some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.

> I think this patch primarily needs:
> * Benchmarking on FreeBSD/OSX to see whether we should enable the
> mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
> inclined to leave it off till then.

I do not have that. As "msync" seems available on Linux, it is possible to
force using it with a "ifdef 0" to skip sync_file_range and check whether
it does some good there. Idem for the "posix_fadvise" stuff. I can try to
do that, but it takes time to do so, if someone can test on other OS it
would be much better. I think that if it works it should be kept in, so it
is just a matter of testing it.


In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2016-02-21 11:32:36 Re: Relaxing SSL key permission checks
Previous Message Robert Haas 2016-02-21 08:58:16 Re: Speed up Clog Access by increasing CLOG buffers