Re: checkpointer continuous flushing - V16

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing - V16
Date: 2016-02-19 09:16:41
Message-ID: alpine.DEB.2.10.1602191012160.1254@sto
Lists: pgsql-hackers


Hello Andres,

> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

Below are the results of a lot of pgbench tests exercising checkpoints
on the above branch (fetched as sketched below).
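
For reference, the branch was fetched with something along these lines
(a sketch, any recent git will do; the URL and branch name are taken from
the quote above):

  git clone git://git.postgresql.org/git/users/andresfreund/postgres.git
  cd postgres
  git checkout checkpoint-flush
  # then the usual ./configure && make && make install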

Overall comments:
- sorting & flushing is basically always a winner
- benchmarking with short runs on large databases is a bad idea:
  the results are very different when a longer run is used
  (see andres00b vs andres00c)

# HOST/SOFT

16 GB 2 cpu 8 cores
200 GB RAID1 HDD, ext4 FS
Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS

tps: the "excluding connections" tps reported by pgbench, the higher the better
1-sec tps: average of the measured per-second tps
   note - it should match the previous figure, but due to various
   hazards in the trace, especially when things go badly and pg gets
   stuck, it may differ. Such hazards also explain why non-integer
   tps may be reported for some seconds.
stddev: standard deviation of the per-second tps, the lower the better
   the five figures in brackets give a feel for the distribution:
   - min: minimal per-second tps seen in the trace
   - q1: first quartile of the per-second tps seen in the trace
   - med: median per-second tps seen in the trace
   - q3: third quartile of the per-second tps seen in the trace
   - max: maximal per-second tps seen in the trace
the last percentage, dubbed "<=10.0", is the percentage of seconds where
performance was at or below 10 tps: it measures how unresponsive pg was
during the run
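
For instance, these figures can be extracted from the -P 1 progress lines
of the pgbench output with a small pipeline along these lines (a sketch,
not the actual scripts used; it assumes the run output was kept in run.log
and that field 4 of each "progress:" line is the per-second tps):

  # keep only the per-second tps values, sorted for quartile extraction
  grep '^progress:' run.log | awk '{ print $4 }' | sort -n | awk '
    function pct(p,  i) { i = int(p * NR); if (i < 1) i = 1; return v[i] }
    { v[NR] = $1; sum += $1; sumsq += $1 * $1; if ($1 <= 10.0) low++ }
    END {
      avg = sum / NR                        # 1-sec tps
      std = sqrt(sumsq / NR - avg * avg)    # stddev
      printf "%.1f ± %.1f [%.1f, %.1f, %.1f, %.1f, %.1f] <=10.0: %.1f%%\n",
        avg, std, v[1], pct(0.25), pct(0.5), pct(0.75), v[NR],
        100 * low / NR
    }'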

###### TINY2

pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
with scale = 10 (~ 200 MB)

postgresql.conf:
shared_buffers = 1GB
max_wal_size = 1GB
checkpoint_timeout = 300s
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64 }
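
A run for this scenario could be driven roughly as follows (a hypothetical
wrapper, not the actual scripts; it assumes a cluster in $PGDATA with the
settings above and a "bench" database, and that checkpoint_flush_after is
switched in postgresql.conf, or the unpatched head binaries are used, for
the "none" case):

  for setting in none 0 32 64 ; do
    for run in 0 1 2 ; do
      # set checkpoint_flush_after = $setting in postgresql.conf
      # (or run the unpatched head binaries for "none"), then restart
      pg_ctl restart -D "$PGDATA" -w
      pgbench -i -s 10 bench
      pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4 bench \
        > tiny2-$setting-$run.log 2>&1
    done
  done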

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
1 | 2575.0 / 2575.1 ± 359.3 [ 1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%

0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%

32 0 | 2625.5 / 2625.5 ± 250.5 [ 1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%

64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
2 | 2047.8 / 2046.4 ± 925.8 [ 0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?

Pretty small setup, all data fit in buffers. Good tps performance all around
(best with checkpoint_flush_after = 32), and flushing shows a noticeable
(360 -> 240) reduction in the tps stddev.

###### SMALL

pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
with scale = 120 (~ 2 GB)

postgresql.conf:
shared_buffers = 2GB
checkpoint_timeout = 300s
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64 }

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 209.2 / 204.2 ± 516.5 [0.0, 0.0, 4.0, 5.0, 2251.0] 82.3%
1 | 207.4 / 204.2 ± 518.7 [0.0, 0.0, 4.0, 5.0, 2245.1] 82.3%
2 | 217.5 / 211.0 ± 530.3 [0.0, 0.0, 3.0, 5.0, 2255.0] 82.0%
3 | 217.8 / 213.2 ± 531.7 [0.0, 0.0, 4.0, 6.0, 2261.9] 81.7%
4 | 230.7 / 223.9 ± 542.7 [0.0, 0.0, 4.0, 7.0, 2282.0] 80.7%

0 0 | 734.8 / 735.5 ± 879.9 [0.0, 1.0, 16.5, 1748.3, 2281.1] 47.0%
1 | 694.9 / 693.0 ± 849.0 [0.0, 1.0, 29.5, 1545.7, 2428.0] 46.4%
2 | 735.3 / 735.5 ± 888.4 [0.0, 0.0, 12.0, 1781.2, 2312.1] 47.9%
3 | 736.0 / 737.5 ± 887.1 [0.0, 1.0, 16.0, 1794.3, 2317.0] 47.5%
4 | 734.9 / 735.1 ± 885.1 [0.0, 1.0, 15.5, 1781.0, 2297.1] 47.2%

32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0, 753.0, 2312.1] 0.2%
1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0, 744.0, 2319.0] 0.1%
2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0, 756.0, 2331.0] 0.2%
3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2, 758.0, 2285.1] 0.1%
4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0, 759.0, 2308.1] 0.1%

64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0, 759.0, 2353.0] 0.1%
1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0, 758.2, 2296.0] 0.1%
2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1, 757.0, 2310.0] 0.1%
3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0, 701.0, 2380.9] 1.5% ??
4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0, 754.0, 2298.0] 0.1%

Sorting brings a * 3.3 tps improvement, and flushing significantly reduces
the tps stddev. Pg goes from 80% unresponsive to nearly always responsive.

###### MEDIUM

pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
with scale = 250 (~ 3.8 GB)

postgresql.conf:
shared_buffers = 4GB
max_wal_size = 4GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64 }

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 214.8 / 211.8 ± 513.7 [0.0, 1.0, 4.0, 5.0, 2344.0] 82.4%
1 | 219.2 / 215.0 ± 524.1 [0.0, 0.0, 4.0, 5.0, 2316.0] 82.2%
2 | 240.9 / 234.6 ± 550.8 [0.0, 0.0, 4.0, 6.0, 2320.2] 81.0%

0 0 | 1064.7 / 1065.3 ± 888.2 [0.0, 11.0, 1089.0, 2017.7, 2461.9] 24.7%
1 | 1060.2 / 1061.2 ± 889.9 [0.0, 10.0, 1056.7, 2022.0, 2444.9] 25.1%
2 | 1060.2 / 1061.4 ± 889.1 [0.0, 9.0, 1085.8, 2002.8, 2473.0] 25.6%

32 0 | 1059.4 / 1059.4 ± 476.3 [3.0, 804.9, 980.0, 1123.0, 2448.1] 0.1%
1 | 1062.5 / 1062.6 ± 475.6 [0.0, 807.0, 988.0, 1132.0, 2441.0] 0.1%
2 | 1063.7 / 1063.7 ± 475.4 [0.0, 814.0, 987.0, 1131.2, 2432.1] 0.1%

64 0 | 1052.6 / 1052.6 ± 475.3 [0.0, 793.0, 974.0, 1118.1, 2445.1] 0.1%
1 | 1059.8 / 1059.8 ± 475.1 [0.0, 799.0, 987.5, 1131.0, 2457.1] 0.1%
2 | 1058.5 / 1058.5 ± 472.8 [0.0, 807.0, 985.0, 1127.7, 2442.0] 0.1%

Sorting brings a * 4.8 tps improvement, and flushing significantly reduces
the tps stddev. Pg goes from over 80% unresponsive to nearly always responsive.

Performance is significantly better than "small" above, probably thanks to
the longer checkpoint timeout.

###### LARGE

pgbench -M prepared -N -P 1 -T 7500 -j 2 -c 4
with scale = 1000 (~ 15 GB)

postgresql.conf:
shared_buffers = 4GB
max_wal_size = 2GB
checkpoint_timeout = 40min
checkpoint_completion_target = 0.8
checkpoint_flush_after = { none, 0, 32, 64}

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 68.7 / 65.3 ± 78.6 [0.0, 3.0, 6.0, 136.0, 291.0] 53.1%
1 | 70.6 / 70.3 ± 80.1 [0.0, 4.0, 10.0, 151.0, 282.0] 50.1%
2 | 74.3 / 75.8 ± 84.9 [0.0, 4.0, 9.0, 162.0, 311.2] 50.3%

0 0 | 117.2 / 116.9 ± 83.8 [0.0, 14.0, 139.0, 193.0, 372.4] 24.0%
1 | 117.3 / 117.8 ± 83.8 [0.0, 16.0, 140.0, 193.0, 279.0] 23.9%
2 | 117.6 / 118.2 ± 84.1 [0.0, 16.0, 141.0, 194.0, 297.8] 23.7%

32 0 | 114.2 / 114.2 ± 45.7 [0.0, 84.0, 100.0, 131.0, 613.6] 0.4%
1 | 112.5 / 112.6 ± 44.0 [0.0, 83.0, 98.0, 130.0, 293.0] 0.2%
2 | 108.0 / 108.0 ± 44.7 [0.0, 79.0, 94.0, 124.0, 303.6] 0.3%

64 0 | 113.0 / 113.0 ± 45.5 [0.0, 83.0, 99.0, 131.0, 289.0] 0.4%
1 | 80.0 / 80.3 ± 39.1 [0.0, 56.0, 72.0, 95.0, 281.0] 0.8% ??
2 | 112.2 / 112.3 ± 44.5 [0.0, 82.0, 99.0, 129.0, 282.0] 0.3%

Data do not fit in the available memory, so there are plenty of read accesses.
Sorting still has some impact on tps performance (* 1.6), and flushing
greatly improves responsiveness.

###### ANDRES00

pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16
with scale = 800 (~ 13 GB)

postgresql.conf:
shared_buffers = 2GB
max_wal_size = 100GB
wal_level = hot_standby
maintenance_work_mem = 2GB
checkpoint_timeout = 30s
checkpoint_completion_target = 0.8
synchronous_commit = off
checkpoint_flush_after = { none, 0, 32, 64 }

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 328.7 / 329.9 ± 716.9 [0.0, 0.0, 0.0, 0.0, 3221.2] 77.7%
1 | 338.2 / 338.7 ± 728.6 [0.0, 0.0, 0.0, 17.0, 3296.3] 75.0%
2 | 304.5 / 304.3 ± 705.5 [0.0, 0.0, 0.0, 0.0, 3463.4] 79.3%

0 0 | 425.6 / 464.0 ± 724.0 [0.0, 0.0, 0.0, 1000.6, 3363.7] 61.0%
1 | 461.5 / 463.1 ± 735.8 [0.0, 0.0, 0.0, 1011.2, 3490.9] 58.7%
2 | 452.4 / 452.6 ± 744.3 [0.0, 0.0, 0.0, 1078.9, 3631.9] 63.3%

32 0 | 514.4 / 515.8 ± 651.8 [0.0, 0.0, 337.4, 808.3, 2876.0] 40.7%
1 | 512.0 / 514.6 ± 661.6 [0.0, 0.0, 317.6, 690.8, 3315.8] 35.0%
2 | 529.5 / 530.3 ± 673.0 [0.0, 0.0, 321.1, 906.4, 3360.8] 40.3%

64 0 | 529.6 / 530.9 ± 668.2 [0.0, 0.0, 322.1, 786.1, 3538.0] 33.3%
1 | 496.4 / 498.0 ± 606.6 [0.0, 0.0, 321.4, 746.0, 2629.6] 36.3%
2 | 521.0 / 521.7 ± 657.0 [0.0, 0.0, 328.4, 737.9, 3262.9] 34.3%

Data just about fit in memory, maybe. The run is very short and the settings
are low: this is not representative of a sane installation, but is meant to
exercise a lot of checkpoints in a difficult situation. Sorting and flushing
do bring significant benefits.

###### ANDRES00b (same as ANDRES00 but scale 800->1000)

pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16
with scale = 1000 (~ 15 GB)

postgresql.conf:
shared_buffers = 2GB
max_wal_size = 100GB
wal_level = hot_standby
maintenance_work_mem = 2GB
checkpoint_timeout = 30s
checkpoint_completion_target = 0.8
synchronous_commit = off
checkpoint_flush_after = { none, 0, 32, 64 }

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 150.2 / 150.3 ± 401.6 [0.0, 0.0, 0.0, 0.0, 2199.4] 75.1%
1 | 139.2 / 139.2 ± 372.2 [0.0, 0.0, 0.0, 0.0, 2111.4] 78.3% ***
2 | 127.3 / 127.1 ± 341.2 [0.0, 0.0, 0.0, 53.0, 2144.3] 74.7% ***

0 0 | 199.0 / 209.2 ± 400.4 [0.0, 0.0, 0.0, 243.6, 1846.0] 65.7%
1 | 220.4 / 226.7 ± 423.2 [0.0, 0.0, 0.0, 264.0, 1777.0] 63.5% *
2 | 195.5 / 205.3 ± 337.9 [0.0, 0.0, 123.0, 212.0, 1721.9] 43.2%

32 0 | 362.3 / 359.0 ± 308.4 [0.0, 200.0, 265.0, 416.4, 1816.6] 5.0%
1 | 323.6 / 321.2 ± 327.1 [0.0, 142.9, 210.0, 353.4, 1907.0] 4.0%
2 | 309.0 / 310.7 ± 381.3 [0.0, 122.0, 175.5, 298.0, 2090.4] 5.0%

64 0 | 342.7 / 343.6 ± 331.1 [0.0, 143.0, 239.5, 409.9, 1623.6] 5.3%
1 | 333.8 / 328.2 ± 356.3 [0.0, 132.9, 211.5, 358.1, 1629.1] 10.7% ??
2 | 352.0 / 352.0 ± 332.3 [0.0, 163.5, 239.9, 400.1, 1643.4] 5.3%

A little bit larger than the previous scenario, so that the data do not
really fit in memory. The performance impact is significant compared to the
previous run. Sorting and flushing bring a * 2 tps improvement, and
unresponsiveness drops from 75% to a much better 5%.

###### ANDRES00c (same as ANDRES00b but time 300 -> 4000)

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 115.2 / 114.3 ± 256.4 [0.0, 0.0, 75.0, 131.1, 3389.0] 46.5%
1 | 118.4 / 117.9 ± 248.3 [0.0, 0.0, 87.0, 151.0, 3603.6] 46.7%
2 | 120.1 / 119.2 ± 254.4 [0.0, 0.0, 91.0, 143.0, 3307.8] 43.8%

0 0 | 217.4 / 211.0 ± 237.1 [0.0, 139.0, 193.0, 239.0, 3115.4] 16.8%
1 | 216.2 / 209.6 ± 244.9 [0.0, 138.9, 188.0, 231.0, 3331.3] 16.3%
2 | 218.6 / 213.8 ± 246.7 [0.0, 137.0, 187.0, 232.0, 3229.6] 16.2%

32 0 | 146.6 / 142.5 ± 234.5 [0.0, 59.0, 93.0, 151.1, 3294.7] 17.5%
1 | 148.0 / 142.6 ± 239.2 [0.0, 64.0, 95.9, 144.0, 3361.8] 16.0%
2 | 147.6 / 140.4 ± 233.2 [0.0, 59.4, 94.0, 148.0, 3108.4] 18.0%

64 0 | 145.3 / 140.5 ± 233.6 [0.0, 61.0, 93.0, 147.7, 3212.6] 16.5%
1 | 145.6 / 140.3 ± 233.3 [0.0, 58.0, 93.0, 146.0, 3351.8] 17.3%
2 | 147.7 / 142.2 ± 233.2 [0.0, 61.0, 97.0, 148.4, 3616.3] 17.0%

The only difference between ANDRES00B and ANDRES00C is the duration, from
5 minutes to 66 minutes. This shows that short runs can be wildly misleading:
in particular, the longer run shows less than half the tps for some settings
(e.g. ~147 tps here vs ~310-360 with checkpoint_flush_after = 32), and the
relative comparison of head vs sort vs sort+flush is different.

###### ANDRES00d (same as ANDRES00b but wal_level hot_standby->minimal)

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 191.6 / 195.1 ± 439.3 [0.0, 0.0, 0.0, 0.0, 2540.2] 76.3%
1 | 211.3 / 213.6 ± 461.9 [0.0, 0.0, 0.0, 13.0, 3203.7] 75.0%
2 | 152.4 / 154.9 ± 217.6 [0.0, 0.0, 58.0, 235.6, 995.9] 39.3% ???

0 0 | 247.2 / 251.7 ± 454.0 [0.0, 0.0, 0.0, 375.3, 2592.4] 67.7%
1 | 215.4 / 232.7 ± 446.5 [0.0, 0.0, 0.0, 103.0, 3046.7] 72.3%
2 | 160.6 / 160.8 ± 222.1 [0.0, 0.0, 80.0, 209.6, 885.3] 42.0% ???

32 0 | 399.9 / 397.0 ± 356.6 [0.0, 67.0, 348.0, 572.8, 2604.2] 21.0%
1 | 391.8 / 392.5 ± 371.7 [0.0, 85.5, 314.4, 549.3, 2590.3] 20.7%
2 | 406.1 / 404.8 ± 380.6 [0.0, 95.0, 348.5, 569.0, 3383.7] 21.3%

64 0 | 395.9 / 396.1 ± 352.4 [0.0, 89.5, 342.5, 556.0, 2366.9] 17.7%
1 | 355.1 / 351.9 ± 296.7 [0.0, 172.5, 306.1, 468.1, 1663.5] 16.0%
2 | 403.6 / 401.8 ± 390.5 [0.0, 0.0, 337.0, 636.1, 2591.3] 26.7% ???

###### ANDRES00e (same as ANDRES00b but maintenance_work_mem=2GB->64MB)

opts # | tps / 1-sec tps ± stddev [ min q1 med q3 max ] <=10.0

head 0 | 153.5 / 161.3 ± 401.3 [0.0, 0.0, 0.0, 0.0, 2546.0] 82.0%
1 | 170.7 / 175.9 ± 399.9 [0.0, 0.0, 0.0, 14.0, 2537.4] 74.7%
2 | 184.7 / 190.4 ± 389.2 [0.0, 0.0, 0.0, 158.5, 2544.6] 69.3%

0 0 | 211.2 / 227.8 ± 418.8 [0.0, 0.0, 0.0, 334.6, 2589.3] 65.7%
1 | 221.7 / 226.0 ± 415.7 [0.0, 0.0, 0.0, 276.8, 2588.2] 68.4%
2 | 232.5 / 233.2 ± 403.5 [0.0, 0.0, 0.0, 377.0, 2260.2] 62.0%

32 0 | 373.2 / 374.4 ± 309.2 [0.0, 180.6, 321.8, 475.2, 2596.5] 11.3%
1 | 348.7 / 348.1 ± 328.4 [0.0, 127.0, 284.1, 451.9, 2595.1] 17.3%
2 | 376.3 / 375.3 ± 315.5 [0.0, 186.5, 329.6, 487.1, 2365.4] 15.3%

64 0 | 388.9 / 387.8 ± 348.7 [0.0, 164.0, 305.9, 546.5, 2587.2] 15.0%
1 | 380.3 / 378.7 ± 338.8 [0.0, 171.1, 317.4, 524.8, 2592.4] 16.7%
2 | 369.8 / 367.4 ± 340.5 [0.0, 77.4, 320.6, 525.5, 2484.7] 20.7%

Hmmm, interesting: maintenance_work_mem seems to have some influence on
performance, although it is not very consistent between settings, probably
because when memory is used close to its limit, performance becomes quite
sensitive to the amount available.

--
Fabien.
