Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-08-10 18:24:21
Message-ID: alpine.DEB.2.10.1508101942030.13396@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Andres,

> You can't allocate 4GB with palloc(), it has a builtin limit against
> allocating more than 1GB.

Argh, too bad, I assumed very naively that palloc was malloc in disguise.

>> [...]
> Well, then everytime the checkpointer is restarted.

Hm...

> The point is that it's done at postmaster startup, and we're pretty much
> guaranteed that the memory will availabl.e.

Ok ok, I stop resisting... I'll have a look.

Would it also fix the 1 GB palloc limit on the same go? I guess so...

>>> That reasoning makes it impossible to move the fsyncing of files into the
>>> loop (whenever we move to a new file). That's not nice.
>>
>> I do not see why.
>
> Because it means that the sorting isn't necessarily correct. I.e. we
> can't rely on it to determine whether a file has already been fsynced.

Ok, I understand your point.

Then the file would be fsynced twice: if the fsync is done properly (data
have already been flushed to disk) then it would not cost much, and doing
it sometimes twice on some file would not be a big issue. The code could
also detect such event and log a warning, which would give a hint about
how often it occurs in practice.

>>> Hm. Is that actually the case for our qsort implementation?
>>
>> I think that it is hard to write a qsort which would fail that. That would
>> mean that it would compare the same items twice, which would be inefficient.
>
> What? The same two elements aren't frequently compared pairwise with
> each other, but of course an individual element is frequently compared
> with other elements.

Sure.

> Consider what happens when the chosen pivot element changes its identity
> after already dividing half. The two partitions will not be divided in
> any meaning full way anymore. I don't see how this will results in a
> meaningful sort.

It would be partly meaningful, which is enough for performance, and does
not matter for correctness: currently buffers are not sorted at all and it
works, even if it does not work well.

>>> If the pivot element changes its identity won't the result be pretty much
>>> random?
>>
>> That would be a very unlikely event, given the short time spent in
>> qsort.
>
> Meh, we don't want to rely on "likeliness" on such things.

My main argument is that even if it occurs, and the qsort result is partly
wrong, it does not change correctness, it just mean that the actual writes
will be less in order than wished. If it occurs, one pivot separation
would be quite strange, but then others would be right, so the buffers
would be "partly sorted".

Another issue I see is that even if buffers are locked within cmp, the
status may change between two cmp... I do not think that locking all
buffers for sorting them is an option. So on the whole, I think that
locking buffers for sorting is probably not possible with the simple (and
efficient) lightweight approach used in the patch.

The good news, as I argued before, is that the order is only advisory to
help with performance, but the correctness is really that all checkpoint
buffers are written and fsync is called in the end, and does not depend on
the buffer order. That is how it currently works anyway.

If you block on this then I'll put a heavy weight approach, but that would
be a waste of memory in my opinion, hence my argumentation for the
lightweight approach.

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Waldemar Brodkorb 2015-08-10 18:48:45 linux sparc compile issue
Previous Message Josh Berkus 2015-08-10 18:21:26 Re: Summary of plans to avoid the annoyance of Freezing