Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-09-06 14:05:01
Message-ID: alpine.DEB.2.10.1509061516560.3687@sto
Lists: pgsql-hackers

Hello Andres,

> Here's a bunch of comments on this (hopefully the latest?)

Who knows?! :-)

> version of the patch:
> * I'm not sure I like the FileWrite & FlushBuffer API changes. Do you
> forsee other callsites needing similar logic?

I foresee that the bgwriter should also do something more sensible than
generating random I/Os over HDDs, and this is also true for workers... But
this is for another time, maybe.

> Wouldn't it be just as easy to put this logic into the checkpointing
> code?

Not sure it would simplify anything, because the checkpointer currently
knows about buffers, but flushing is about files, which are hidden from
it.

Doing it with this API change means that the code does not have to compute
twice which file a buffer belongs to: the buffer/file boundary has to be
broken somewhere anyway so that flushing can be done when needed, and the
solution I took seems the simplest way to do it, without making the
checkpointer too file-conscious.

> * We don't do one-line ifs;

Ok, I'll expand them onto separate lines.

> function parameters are always in the same line as the function name

Ok, I'll try to improve.

> * Wouldn't a binary heap over the tablespaces + progress be nicer?

I'm not sure where it would fit exactly.

Anyway, I think it would complicate the code significantly (compared to
the straightforward array), so I would not do anything like that without a
strong incentive, such as an actual failing case.

Moreover, such a data structure would probably require some kind of
pointer (probably 8 more bytes per node, maybe more). The amount of memory
used is already a concern, at least to me, and the structure would have to
reside in shared memory, which does not simplify allocating tree data
structures.

> If you make the sorting criterion include the tablespace id you wouldn't
> need the lookahead loop in NextBufferToWrite().

Yep, I thought of it. It would mean 4 more bytes per buffer, plus a
bsearch to find the boundaries, so significantly less simple code. I think
the current approach is ok as the number of tablespaces should be small.

It may be improved upon later if there is a motivation to do so.

> Isn't the current approach O(NBuffers^2) in the worst case?

ISTM that the overall lookahead complexity is NBuffers * Ntablespaces:
buffers are scanned once for each tablespace. I assume that the number of
tablespaces is kept low, and having simpler code which uses less memory
seems a good idea.

ISTM that including the tablespace in the sort key would reduce the
complexity to ln(NBuffers) * Ntablespaces for finding the boundaries, and
then Ntablespaces * (NBuffers/Ntablespaces) = NBuffers for scanning, at
the expense of more memory and code complexity.

So this is a voluntary design decision.

