Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-09-05 06:29:21
Message-ID: alpine.DEB.2.10.1509050740290.429@sto
Lists: pgsql-hackers


Hello Amit,

>> Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??
>
> Yes, there is no SSD in system. I have confirmed the same. There are RAID
> spinning drives.

Ok...

I guess that there is some kind of cache to explain these great tps
figures, probably on the RAID controller. What does "lspci" say? Does
"hdparm" suggest that the write cache is enabled? It would be fine if the
I/O system has a BBU, but that could also hide some of the patch
benefits...

A tentative explanation for the similar figures with and without sorting
could be that, depending on the controller cache size (maybe 1 GB or more)
and firmware, the I/O system reorders disk writes so that they are
basically sequential, and the fact that pg sorts them beforehand has little
or no impact. This may also be helped by the fact that buffers are not
really in random order to begin with, as the warmup phase does an initial
"select stuff from table".

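To make the reordering argument concrete, here is a minimal sketch. It is
not taken from the patch: the block numbers, the counts and the "seek
distance" metric are all made up for illustration. It compares writing a set
of dirty blocks in arbitrary order with writing them sorted. If a controller
cache does a similar reordering on its own, sorting beforehand in pg would
add little.

```python
import random


def seek_distance(blocks):
    """Sum of absolute jumps between consecutively written block numbers."""
    return sum(abs(b - a) for a, b in zip(blocks, blocks[1:]))


# Hypothetical workload: 10,000 distinct dirty blocks out of 100,000.
rng = random.Random(42)
dirty = rng.sample(range(100_000), 10_000)  # arbitrary write order
sorted_dirty = sorted(dirty)                # order after sorting

unsorted_cost = seek_distance(dirty)
sorted_cost = seek_distance(sorted_dirty)

print("unsorted seek distance:", unsorted_cost)
print("sorted seek distance:  ", sorted_cost)
```
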
There could be other possible factors such as the file system details,
"WAFL" hacks... the tricks are endless:-)

Checking for the right explanation would involve removing the
unconditional select warmup to use only a long and random warmup, and
probably trying a database much larger than the cache, and/or disabling the
write cache, reading the hardware documentation in detail... But this is
also a lot of bother and time.

Maybe the simplest approach would be to disable the write cache for the
test. Is that possible?

>> Woops, 1.6 GB/s write... same questions, "rotating plates"??
>
> One thing to notice is that if I don't remove the output file
> (output.img) the speed is much slower, see the below output. I think
> this means in our case we will get ~320 MB/s

I would say that the OS was doing something here, and 320 MB/s looks more
like the actual sequential write performance of an HDD RAID system.
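
One rough way to see how much of such a figure comes from OS caching is to
time a sequential write with and without a final fsync. The sketch below is
only an illustration under assumptions: the function name, the chunk size and
the tiny file size are mine, and a real test would need a file much larger
than the available memory. Note also that fsync only gets data out of the OS
page cache. A controller or drive write cache may still absorb it.

```python
import os
import tempfile
import time


def write_throughput_mb_s(path, total_bytes, chunk_size=1 << 20, sync=True):
    """Write total_bytes of zeros sequentially to path and return MB/s.

    With sync=True the timing includes flush + fsync, so the data has at
    least left the OS page cache before the clock stops.
    """
    chunk = b"\0" * chunk_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            written += n
        if sync:
            f.flush()
            os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6


# Illustrative size only (8 MB); far too small for a meaningful benchmark.
SIZE = 8 * 1024 * 1024
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "output.img")
    cached = write_throughput_mb_s(target, SIZE, sync=False)
    synced = write_throughput_mb_s(target, SIZE, sync=True)
    final_size = os.path.getsize(target)

print(f"without fsync: {cached:.0f} MB/s, with fsync: {synced:.0f} MB/s")
```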

>> If these are SSD, or if there is some SSD cache on top of the HDD, I would
>> not expect the patch to do much, because the SSD random I/O writes are
>> pretty comparable to sequential I/O writes.
>>
>> I would be curious whether flushing helps, though.
>
> Yes, me too. I think we should try to reach on consensus for exact
> scenarios and configuration where this patch('es) can give benefit or we
> want to verify if there is any regression as I have access to this m/c
> for a very-very limited time. This m/c might get formatted soon for
> some other purpose.

Yep, it would be great if you have time for a flush test before it
disappears... I think it is advisable to disable the write cache as it may
also hide the impact of flushing.

>> So whether the database fits in 8 GB shared buffer during the 2 hours of
>> the pgbench run is an open question.
>
> With this kind of configuration, I have noticed that more than 80%
> of updates are HOT updates, not much bloat, so I think it won't
> cross 8GB limit, but still I can keep it to 32GB if you have any doubts.

The problem with performance tests is that you want to test one thing, but
there are many factors that intervene and you may end up testing something
else, such as lock contention or the process scheduler or whatever, rather
than what you were trying to demonstrate. So I would suggest staying on
the safe side and using the larger value.

--
Fabien.
