Re: checkpointer continuous flushing

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-09-05 03:14:34
Message-ID: CAA4eK1+uDKCEzeOLzT5Sok3ukMjzy-ov-=QnZaOY0o3bCm9=Yw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 1, 2015 at 5:30 PM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:

>
> Hello Amit,
>
> About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
>>>
>>
>> Speed of Reads -
>> Timing cached reads: 27790 MB in 1.98 seconds = 14001.86 MB/sec
>> Timing buffered disk reads: 3830 MB in 3.00 seconds = 1276.55 MB/sec
>>
>
> Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??

Yes, there is no SSD in the system; I have confirmed that. The storage is
spinning drives in a RAID array.
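
For what it's worth, the kernel's rotational flag is a quick way to
cross-check that from the OS side (just a sketch; "sda" is a placeholder
for whatever device actually backs the data directory):

# ROTA=1 means the kernel sees a rotational (spinning) device, 0 means SSD
lsblk -d -o NAME,ROTA,SIZE,MODEL
cat /sys/block/sda/queue/rotational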

>
>
> Copy speed -
>>
>> dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
>> 262144+0 records in
>> 262144+0 records out
>> 2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s
>>
>
> Woops, 1.6 GB/s write... same questions, "rotating plates"??

One thing to notice is that if I don't remove the output file (output.img)
between runs, the speed is much slower; see the output below. I think this
means that in our case we will get ~320 MB/s.

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.28086 s, 1.7 GB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.72301 s, 319 MB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.73963 s, 319 MB/s

If I remove the file each time:

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.2855 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27725 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27417 s, 1.7 GB/s

rm /data/akapila/output.img
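
Note that these 1.7 GB/s numbers include the OS page cache; a variant that
forces the data out to disk before dd reports the rate would look something
like the following (a sketch using standard GNU dd options, not something I
have rerun on this machine):

# flush the file data before dd computes the throughput
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k conv=fdatasync
rm /data/akapila/output.img

# or bypass the page cache entirely with direct I/O
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k oflag=direct
rm /data/akapila/output.img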

> Looks more like several SSDs... Or is the file kept in memory and not
> committed to disk yet? Try a "sync" afterwards??
>
>

> If these are SSD, or if there is some SSD cache on top of the HDD, I would
> not expect the patch to do much, because the SSD random I/O writes are
> pretty comparable to sequential I/O writes.
>
> I would be curious whether flushing helps, though.
>
>
Yes, me too. I think we should try to reach a consensus on the exact
scenarios and configurations where this patch (or patches) can give a
benefit, or where we want to verify whether there is any regression, as I
have access to this machine for a very limited time. The machine might get
formatted soon for some other purpose.

> max_wal_size=5GB
>>>>
>>>
>>> Hmmm... Maybe quite small given the average performance?
>>>
>>
>> We can check with a larger value, but do you expect different results,
>> and why?
>>
>
> Because checkpoints are xlog triggered (which depends on max_wal_size) or
> time triggered (which depends on checkpoint_timeout). Given the large tps,
> I expect that the WAL fills very quickly and hence may trigger checkpoints
> every ... that is the question.
>
> checkpoint_timeout=2min
>>>>
>>>
>>> This seems rather small. Are the checkpoints xlog or time triggered?
>>>
>>
>> I wanted to test by triggering more checkpoints, but I can test with a
>> larger checkpoint interval as well, like 5 or 10 mins. Any suggestions?
>>
>
> For a +2 hours test, I would suggest 10 or 15 minutes.
>
>
Okay, let's keep it at 10 minutes.
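
So the next run would use something along the lines of the settings below;
with log_checkpoints enabled, the server log also answers the question of
whether checkpoints end up being xlog- or time-triggered (a sketch of the
configuration being discussed, not the exact file from this machine):

# postgresql.conf (values from this thread)
shared_buffers = 8GB
max_wal_size = 5GB
checkpoint_timeout = 10min
log_checkpoints = on   # log shows "checkpoint starting: xlog" vs "... time"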

>> I don't think increasing shared_buffers would have any impact, because
>> 8GB is sufficient for 300 scale factor data,
>>
>
> It fits at the beginning, but when updates and inserts are performed
> postgres adds new pages (update = delete + insert), and the deleted space
> is eventually reclaimed by vacuum later on.
>
> Now if space is available in the page it is reused, so what really happens
> is not that simple...
>
> At 8500 tps the disk space extension for tables may be up to 3 MB/s at the
> beginning, and would evolve but should be at least about 0.6 MB/s (inserts
> into history, assuming updates are performed in-page), on average.
>
> So whether the database fits in 8 GB shared buffer during the 2 hours of
> the pgbench run is an open question.
>
>
With this kind of configuration, I have noticed that more than 80% of the
updates are HOT updates and there is not much bloat, so I think it won't
cross the 8GB limit, but I can still set it to 32GB if you have any doubts.
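
For reference, the HOT percentage I am quoting comes from the stats views;
something along these lines shows it per table (a sketch, assuming the
standard pgbench table names and that the run used the "postgres" database):

psql -d postgres -c "
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
  FROM pg_stat_user_tables
 WHERE relname LIKE 'pgbench%'
 ORDER BY relname;"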

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
