Re: Enable data checksums by default

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Christoph Berg <myon(at)debian(dot)org>, Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enable data checksums by default
Date: 2019-03-30 19:25:43
Message-ID: 20190330192543.GH4719@development
Lists: pgsql-hackers

On Fri, Mar 29, 2019 at 08:35:26PM +0100, Christoph Berg wrote:
>Re: Bernd Helmle 2019-03-29 <3586bb9345a59bfc8d13a50a7c729be1ee6759fd(dot)camel(at)oopsware(dot)de>
>> Am Freitag, den 29.03.2019, 23:10 +0900 schrieb Michael Paquier:
>> >
>> > I can't really believe that many people set up shared_buffers at
>> > 128kB
>> > which would cause such a large number of page evictions, but I can
>> > believe that many users have shared_buffers set to its default value
>> > and that we are going to get complaints about "performance drop after
>> > upgrade to v12" if we switch data checksums to on by default.
>>
>> Yeah, I think Christoph's benchmark is based on this thinking. I assume
>> this very unrealistic scenario should emulate the worst case (many
>> buffer_reads, high checksum calculation load).
>
>It's not unrealistic to have large seqscans that are all buffer
>misses, the table just has to be big enough. The idea in my benchmark
>was that if I make shared buffers really small, and the table still
>fits in to RAM, I should be seeing only buffer misses, but without any
>delay for actually reading from disk.
>
>Christoph
>

FWIW I think it's a mistake to focus solely on CPU utilization, which is
what all the benchmarks performed on this thread do, because they look at
tps of in-memory read-only workloads. Checksums have other costs too, not
just the additional CPU time. Most importantly, they require wal_log_hints
to be set (which people may or may not want anyway).

I've done a simple benchmark that does read-only (-S) and read-write
(-N) pgbench runs at different scales, and also measures the duration of
the pgbench init and the amount of WAL produced during the tests.
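
For reference, a single round (per scale and config) looked roughly like
the sketch below. The client count, run duration and the LSN-based WAL
measurement are illustrative - this only shows the shape of the test, not
the exact invocation behind the numbers.

    # sketch of one round; client count and duration are illustrative
    createdb bench
    time pgbench -i -s 200 bench                  # the "init" column

    before=$(psql -Atc "SELECT pg_current_wal_lsn()" bench)
    pgbench -S -T 300 -c 16 -j 4 bench            # read-only run (use -N for read-write)
    after=$(psql -Atc "SELECT pg_current_wal_lsn()" bench)

    # WAL produced during the run (the "wal" column)
    psql -Atc "SELECT pg_size_pretty(pg_wal_lsn_diff('$after', '$before'))" bench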

On a small machine (i5, 8GB RAM, SSD RAID) the results are these:

     scale   config     |   init      tps     wal
========================|========================
 ro     10   no-hints   |      2   117038     130
             hints      |      2   116378     146
             checksums  |      2   115619     147
    --------------------|------------------------
       200   no-hints   |     32    88340    2407
             hints      |     37    86154    2628
             checksums  |     36    83336    2624
    --------------------|------------------------
      2000   no-hints   |    365    38680    1967
             hints      |    423    38670    2123
             checksums  |    504    37510    2046
------------------------|------------------------
 rw     10   no-hints   |      2    19691     437
             hints      |      2    19712     437
             checksums  |      2    19654     437
    --------------------|------------------------
       200   no-hints   |     32    15839    2745
             hints      |     37    15735    2783
             checksums  |     36    15646    2775
    --------------------|------------------------
      2000   no-hints   |    365     5371    3721
             hints      |    423     5270    3671
             checksums  |    504     5094    3574

The no-hints config is default (wal_log_hints=off, data_checksums=off),
hints sets wal_log_hints=on and checksums enables data checksums. All
the configs were somewhat tuned (1GB shared buffers, max_wal_size high
enough not to hit checkpoints very often, etc.).
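
To make the three configs explicit, the setup was along these lines - the
exact max_wal_size / checkpoint_timeout values below are illustrative, the
point is just that checkpoints were infrequent:

    # each config runs in its own freshly initialized data directory

    # no-hints: plain initdb (wal_log_hints=off, data_checksums=off)
    initdb -D $PGDATA

    # hints: plain initdb, plus
    echo "wal_log_hints = on" >> $PGDATA/postgresql.conf

    # checksums: initdb with data checksums enabled
    initdb --data-checksums -D $PGDATA

    # tuning applied to all three (the wal/checkpoint values are illustrative)
    echo "shared_buffers = 1GB"       >> $PGDATA/postgresql.conf
    echo "max_wal_size = 64GB"        >> $PGDATA/postgresql.conf
    echo "checkpoint_timeout = 30min" >> $PGDATA/postgresql.conf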

I've also done the tests on a larger machine (2x E5-2620 v4, 32GB of
RAM, NVMe SSD), and the general pattern is about the same - while the
tps and the amount of WAL (not counting the init) do not change, the time
for initialization increases significantly (by 20-40%).

This effect is even clearer when using slower storage (SATA-based RAID).
The results then look like this:

     scale   config     |   init      tps     wal
========================|========================
 ro    100   no-hints   |     49   229459     122
             hints      |    101   167983     190
             checksums  |    103   156307     190
    --------------------|------------------------
      1000   no-hints   |    580   152167     109
             hints      |   1047   122814     142
             checksums  |   1080   118586     141
    --------------------|------------------------
      6000   no-hints   |   4035      508       1
             hints      |  11193      502       1
             checksums  |  11376      506       1
------------------------|------------------------
 rw    100   no-hints   |     49      279     192
             hints      |    101      275     190
             checksums  |    103      275     190
    --------------------|------------------------
      1000   no-hints   |    580      237     210
             hints      |   1047      225     201
             checksums  |   1080      224     200
    --------------------|------------------------
      6000   no-hints   |   4035      135     123
             hints      |  11193      133     122
             checksums  |  11376      132     121

And here are the same results, expressed relative to the no-hints config:

     scale   config     |   init    tps    wal
========================|=====================
 ro    100   hints      |   206%    73%   155%
             checksums  |   210%    68%   155%
    --------------------|---------------------
      1000   hints      |   181%    81%   131%
             checksums  |   186%    78%   129%
    --------------------|---------------------
      6000   hints      |   277%    99%   100%
             checksums  |   282%   100%   104%
------------------------|---------------------
 rw    100   hints      |   206%    99%    99%
             checksums  |   210%    99%    99%
    --------------------|---------------------
      1000   hints      |   181%    95%    96%
             checksums  |   186%    95%    95%
    --------------------|---------------------
      6000   hints      |   277%    99%    99%
             checksums  |   282%    98%    98%

I have not investigated the exact reasons, but my hypothesis is that it's
due to the amount of WAL generated during the initial CREATE INDEX (which
probably ends up setting the hint bits), which puts additional pressure
on the storage.
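
One way to check that would be to split the pgbench init into its steps
(PostgreSQL 11+ supports pgbench -I) and measure the WAL generated by the
data load and the index build separately - a rough sketch:

    # measure WAL for "load data" vs. "create primary keys" separately
    s0=$(psql -Atc "SELECT pg_current_wal_lsn()" bench)
    pgbench -i -I dtg -s 2000 bench    # drop, create tables, generate data
    s1=$(psql -Atc "SELECT pg_current_wal_lsn()" bench)
    pgbench -i -I p bench              # create primary keys only
    s2=$(psql -Atc "SELECT pg_current_wal_lsn()" bench)

    psql -Atc "SELECT pg_size_pretty(pg_wal_lsn_diff('$s1', '$s0')) AS load_wal,
                      pg_size_pretty(pg_wal_lsn_diff('$s2', '$s1')) AS index_wal" bench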

Unfortunately, this additional cost is unlikely to go away :-(

Now, maybe we want to enable checksums by default anyway, but we should
not pretend that the only cost related to checksums is CPU usage.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
