Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-09-01 12:00:41
Message-ID: alpine.DEB.2.10.1509011002470.9763@sto
Lists: pgsql-hackers


Hello Amit,

>> About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
>
> Speed of Reads -
> Timing cached reads: 27790 MB in 1.98 seconds = 14001.86 MB/sec
> Timing buffered disk reads: 3830 MB in 3.00 seconds = 1276.55 MB/sec

Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??
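One quick way to answer the HDD question from the shell: the kernel exports a "rotational" flag per block device. A sketch, where "sda" and the /sys layout are assumptions -- substitute the device backing the data directory (SYSFS is overridable only so the helper can be exercised without touching /sys):

```shell
# Prints "HDD" if the kernel flags the device as rotating media,
# "SSD/unknown" otherwise. Usage: is_rotational sda
is_rotational() {
    flag=$(cat "${SYSFS:-/sys}/block/$1/queue/rotational" 2>/dev/null)
    if [ "$flag" = "1" ]; then echo "HDD"; else echo "SSD/unknown"; fi
}

is_rotational sda
```

`hdparm -W /dev/sda` would additionally report whether the drive's write cache is enabled.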

> Copy speed -
>
> dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
> 262144+0 records in
> 262144+0 records out
> 2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s

Woops, 1.6 GB/s write... same questions, "rotating plates"?? This looks
more like several SSDs... Or the file is kept in the page cache and not
yet committed to disk? Try a "sync" afterwards, and time it?
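An alternative to a separate "sync" is to make dd itself wait for the data to reach stable storage before reporting a rate -- without this, the 1.6 GB/s above may only be measuring the page cache:

```shell
# Same copy test, but conv=fdatasync makes dd call fdatasync() before it
# exits, so the reported throughput includes flushing the page cache to
# disk. On a genuine HDD this usually lands far below 1.6 GB/s.
dd if=/dev/zero of=/tmp/output.img bs=8k count=256k conv=fdatasync
rm -f /tmp/output.img
```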

If these are SSD, or if there is some SSD cache on top of the HDD, I would
not expect the patch to do much, because the SSD random I/O writes are
pretty comparable to sequential I/O writes.

I would be curious whether flushing helps, though.

>>> max_wal_size=5GB
>>
>> Hmmm... Maybe quite small given the average performance?
>
> We can check with larger value, but do you expect some different
> results and why?

Because checkpoints are either xlog-triggered (which depends on
max_wal_size) or time-triggered (which depends on checkpoint_timeout).
Given the high tps, I expect the WAL to fill very quickly, hence
checkpoints may be triggered every... well, how often exactly is the
question.
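A back-of-the-envelope sketch of the WAL fill rate. The ~500 bytes of WAL per pgbench transaction is an assumption (the real figure can be measured with pg_current_xlog_location() before and after a run); the 8500 tps is the number reported in this thread:

```shell
# Estimate how long 5GB of WAL lasts at the reported tps, assuming
# ~500 bytes of WAL generated per transaction.
awk -v tps=8500 -v bytes=500 -v max_wal=5368709120 'BEGIN {
    rate = tps * bytes                          # WAL bytes per second
    printf "WAL rate: %.1f MB/s\n", rate / 1048576
    printf "5GB fills in about %.0f s (~%.0f min)\n", \
        max_wal / rate, max_wal / rate / 60
}'
```

So with max_wal_size=5GB a checkpoint would be xlog-triggered roughly every twenty minutes at best -- sooner in practice, since the trigger fires before the limit is actually reached.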

>>> checkpoint_timeout=2min
>>
>> This seems rather small. Are the checkpoints xlog or time triggered?
>
> I wanted to test by triggering more checkpoints, but I can test with
> larger checkpoint interval as well, like 5 or 10 mins. Any suggestions?

For a +2 hours test, I would suggest 10 or 15 minutes.

It would be useful to know about checkpoint stats before suggesting values
for max_wal_size and checkpoint_timeout.
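For the record, the easiest way to collect those stats is to set, in postgresql.conf:

```
log_checkpoints = on
```

Each completed checkpoint then logs one line with the number of buffers written, the write/sync durations, and whether it was time- or xlog-triggered; cumulative counts are also available in the pg_stat_bgwriter view (checkpoints_timed / checkpoints_req).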

> [...] The value used in your script was 0.8 for
> checkpoint_completion_target which I have not changed during tests.

Ok.

>>> parallelism - 128 clients, 128 threads [...]
> In next run, I can use it with 64 threads, let's settle on other parameters
> first for which you expect there could be a clear win with the first patch.

Ok.

>> Given the hardware, I would suggest to raise checkpoint_timeout,
>> shared_buffers and max_wal_size, [...]. I would expect that it should
>> improve performance both with and without sorting.
>
> I don't think increasing shared_buffers would have any impact, because
> 8GB is sufficient for 300 scale factor data,

It fits at the beginning, but as updates and inserts are performed
Postgres adds new pages (an update is essentially a delete plus an
insert), and the deleted space is only reclaimed later by vacuum.

Now, if space is available within a page it is reused, so what really
happens is not that simple...

At 8500 tps the on-disk extension of the tables may be up to 3 MB/s at
the beginning; it would evolve over the run, but should average at least
about 0.6 MB/s (inserts into the history table, assuming updates are
performed in-page).
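A quick sanity check on that lower bound, where ~74 bytes per pgbench_history tuple (24-byte heap header plus the columns) is an assumption:

```shell
# History-table growth at the reported tps, assuming ~74 bytes per row.
awk -v tps=8500 -v row=74 'BEGIN {
    printf "history growth: %.2f MB/s\n", tps * row / 1048576
    printf "over a 2h run : %.1f GB\n", tps * row * 7200 / 1073741824
}'
```

That is about 4 GB of history alone over the two hours, before counting newly extended pages of the other tables, which is why an 8 GB shared_buffers run may well stop fitting.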

So whether the database fits in 8 GB shared buffer during the 2 hours of
the pgbench run is an open question.

> checkpoint_completion_target is already 0.8 in my previous tests. Lets
> try with checkpoint_timeout = 10 min and max_wal_size = 15GB, do you
> have any other suggestion?

Maybe shared_buffers = 32GB, to ensure that it is an "in-buffer" run?
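Putting the suggestions from this thread together, the next run could use something like the following (values specific to this discussion, not general-purpose recommendations):

```
shared_buffers = 32GB
max_wal_size = 15GB
checkpoint_timeout = 10min
checkpoint_completion_target = 0.8
```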

>> It would be interesting to have informations from checkpoint logs
>> (especially how many buffers written in how long, whether checkpoints
>> are time or xlog triggered, ...).

Information still welcome.

> Hmm.. nothing like that, this was based on couple of tests done by
> me and I am open to do some more if you or anybody feels that the
> first patch (checkpoint-continuous-flush-10-a) can alone gives benefit,
> in-fact I have started these tests with the intention to see if first
> patch gives benefit, then that could be evaluated and eventually
> committed separately.

Ok.

My initial question remains: is the setup using HDDs? On SSDs there
should probably be no significant benefit from sorting, although it
should not harm, and I'm not sure about flushing.

> True, let us try to find conditions/scenarios where you think it can give
> big boost, suggestions are welcome.

HDDs?

> I think we can leave this for committer to take a call or if anybody
> else has any opinion, because there is nothing wrong in what you
> have done, but I am not clear if there is a clear need for the same.

I may have an old box available with two disks, so that I can run some
tests with table spaces, but with very few cores.

--
Fabien.
