Re: design for parallel backup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-05-03 02:36:43
Message-ID: 20200503023643.x2o2244sa4vkikyb@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-05-01 16:32:15 -0400, Robert Haas wrote:
> On Thu, Apr 30, 2020 at 6:06 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Why 8kb? That's smaller than what we currently do in pg_basebackup,
> > > afaict, and you're actually going to be bottlenecked by syscall
> > > overhead at that point (unless you disable / don't have the whole intel
> > > security mitigation stuff).
> >
> > I just picked something. Could easily try other things.
>
> I tried changing the write size to 64kB, keeping the rest the same.
> Here are the results:
>
> filesystem  media   1@64GB  2@32GB  4@16GB  8@8GB  16@4GB
> xfs         mag         65      53      64     74      79
> ext4        mag         96      68      75    303     437
> xfs         ssd         75      43      29     33      38
> ext4        ssd         96      68      63    214     254
> spread      spread     n/a     n/a      43     38      40
>
> And here again are the previous results with an 8kB write size:
>
> filesystem  media   1@64GB  2@32GB  4@16GB  8@8GB  16@4GB
> xfs         mag         97      53      60     67      71
> ext4        mag         94      68      66    335     549
> xfs         ssd         97      55      33     27      25
> ext4        ssd        116      70      66    227     450
> spread      spread     n/a     n/a      48     42      44
>
> Generally, those numbers look better than the previous numbers, but
> parallelism still looks fairly appealing on the SSD storage - less so
> on magnetic disks, at least in this test.

I spent a fair bit of time analyzing this, and my conclusion is that you
might largely be seeing numa effects. Yay.

I don't have as large a numa machine at hand, but here's what I'm
seeing on my local machine during a run writing out 400GiB (this run
had other noise on the machine; the benchmarks below do not). The
machine has 192GiB of RAM, evenly distributed across two sockets /
numa domains.

At the start I see (values in MB):
numastat -m | grep -E 'MemFree|MemUsed|Dirty|Writeback|Active\(file\)|Inactive\(file\)'
                    Node 0     Node 1      Total
MemFree           91908.20   92209.85  184118.05
MemUsed            3463.05    4553.33    8016.38
Active(file)        105.46     328.52     433.98
Inactive(file)       68.29     190.14     258.43
Dirty                 0.86       0.90       1.76
Writeback             0.00       0.00       0.00
WritebackTmp          0.00       0.00       0.00

For a while there's pretty decent IO throughput (all 10s samples):
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1955.67 2299.32 0.00 0.00 42.48 1203.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 82.10 89.33

Then it starts to be slower on a sustained basis:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1593.33 1987.85 0.00 0.00 42.90 1277.55 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 67.55 76.53

And then performance tanks completely:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 646.33 781.85 0.00 0.00 132.68 1238.70 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85.43 58.63

That amount of degradation confused me for a while, especially because
I couldn't reproduce it the more controlled I made the setup. In
particular, I stopped seeing issues of the same magnitude after pinning
processes to one numa socket (both execution and memory).
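
For reference, that kind of pinning can be done with numactl, e.g. to
restrict both CPU scheduling and memory allocation to node 0 (the node
number obviously depends on the machine):

numactl --cpunodebind=0 --membind=0 /tmp/write_and_fsync ...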

Going through the unpinned run in more detail - after a few seconds:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1882.00 2320.07 0.00 0.00 42.50 1262.35 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 79.05 88.07

                    Node 0     Node 1      Total
MemFree           35356.50   80986.46  116342.96
MemUsed           60014.75   15776.72   75791.47
Active(file)        179.44     163.28     342.72
Inactive(file)    58293.18   13385.15   71678.33
Dirty             18407.50     882.00   19289.50
Writeback           235.78     335.43     571.21
WritebackTmp          0.00       0.00       0.00

A bit later IO starts to get slower:

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1556.30 1898.70 0.00 0.00 40.92 1249.29 0.00 0.00 0.00 0.00 0.00 0.00 0.20 24.00 62.90 72.01

                    Node 0     Node 1      Total
MemFree             519.56   36086.14   36605.70
MemUsed           94851.69   60677.04  155528.73
Active(file)        303.84     212.96     516.80
Inactive(file)    92776.70   58133.28  150909.97
Dirty             10913.20    5374.07   16287.27
Writeback           812.94     331.96    1144.90
WritebackTmp          0.00       0.00       0.00

And then later it gets worse:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1384.70 1671.25 0.00 0.00 40.87 1235.91 0.00 0.00 0.00 0.00 0.00 0.00 0.20 7.00 55.89 63.45

                    Node 0     Node 1      Total
MemFree             519.54     242.98     762.52
MemUsed           94851.71   96520.20  191371.91
Active(file)        175.82     246.03     421.85
Inactive(file)    92820.19   93985.79  186805.98
Dirty             10482.75    4140.72   14623.47
Writeback             0.00       0.00       0.00
WritebackTmp          0.00       0.00       0.00

When using a 1s iostat interval instead of 10s, it's noticeable that
performance swings widely between very low (<100MB/s) and very high
(>2500MB/s) throughput.
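
(The iostat samples above are from something along the lines of
iostat -xm nvme1n1 10; the finer-grained view is just
iostat -xm nvme1n1 1.)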

It's clearly visible that performance degrades substantially first when
all of one numa node's free memory is exhausted, and then again when
the second node's is.

Looking at a profile I see a lot of cacheline bouncing between the
kernel threads that "reclaim" pages (i.e. make them available for
reuse), the kernel threads that write out dirty pages, the kernel
threads where the IO completes (i.e. where the dirty bit can be flipped
/ locks get released), and the writing process.

I think there's a lot that could be improved on the kernel side - but
it's not too surprising that letting the kernel cache everything /
forcing it to make caching decisions for a large streaming write has
substantial costs.

I changed Robert's test program to optionally fallocate() the file, and
to issue sync_file_range(SYNC_FILE_RANGE_WRITE) and
posix_fadvise(POSIX_FADV_DONTNEED) calls along the way, to avoid a
large footprint in the page cache. The performance differences are
quite substantial:

gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
/tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
[/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220

Comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1:
running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
[/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
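
The core of the change amounts to a loop along the following lines (a
condensed sketch, not the attached write_and_fsync.c; block size, flush
interval and file size are illustrative, error handling is minimal):

#define _GNU_SOURCE             /* for fallocate(), sync_file_range() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCKSZ (128 * 1024)            /* per-write(2) size */
#define FLUSHSZ (16 * 1024 * 1024)      /* flush interval */

int main(int argc, char **argv)
{
    off_t filesize = 1024L * 1024 * 1024;   /* 1GiB, for the sketch */
    off_t written = 0, flushed = 0, evicted = 0;
    static char buf[BLOCKSZ];
    int fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* reserve the target size up front, helps contiguous extent allocation */
    if (fallocate(fd, 0, 0, filesize) != 0) { perror("fallocate"); return 1; }

    memset(buf, 'x', sizeof(buf));

    while (written < filesize)
    {
        if (write(fd, buf, BLOCKSZ) != BLOCKSZ) { perror("write"); return 1; }
        written += BLOCKSZ;

        if (written - flushed >= FLUSHSZ)
        {
            /* start writeback of the just-written range, without blocking */
            if (sync_file_range(fd, flushed, written - flushed,
                                SYNC_FILE_RANGE_WRITE) != 0)
                perror("sync_file_range");

            /*
             * Evict the previous interval from the page cache. Its
             * writeback was initiated one interval ago and has likely
             * finished, so the pages can actually be freed, keeping the
             * cache footprint roughly constant.
             */
            if (flushed > evicted)
            {
                int rc = posix_fadvise(fd, evicted, flushed - evicted,
                                       POSIX_FADV_DONTNEED);
                if (rc != 0)    /* returns an errno value, doesn't set errno */
                    fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
            }
            evicted = flushed;
            flushed = written;
        }
    }

    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    close(fd);
    return 0;
}

The important property is that DONTNEED is only issued for ranges whose
writeback was started a full interval earlier, so it mostly just drops
clean pages rather than forcing synchronous writeout.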

Below are the results of running the program with a variation of
parameters (both the program and the parsed results are attached).

I used perf stat in this run to measure the difference in CPU
usage.

ref_cycles is the number of CPU cycles, across all 20 cores / 40
threads, during which the CPUs were doing *something*. It is not
affected by CPU frequency scaling, only by the time the CPUs were not
"halted". cycles, in contrast, is affected by frequency scaling.

A high ref_cycles_sec combined with a decent number of total
instructions/cycles is *good*, because it indicates that fewer CPUs
were needed. A very high ref_cycles_tot, in contrast, means that more
CPUs were busy doing something for the duration of the benchmark.
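
(Gathered with perf stat system-wide, i.e. something along the lines of
the following - the exact event list may have differed:

perf stat -a -e ref-cycles,cycles,instructions /tmp/write_and_fsync ...)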

The run-to-run variation between the runs without cache control is
pretty large, so these are probably not the be-all-end-all numbers. But
I think the trends are pretty clear.

test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0 243.609959554 1,802,385,405,203 184.970M/sec 1,449,560,513,247 0.149GHz 855,426,288,031 0.56
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0 230.880100449 1,328,417,418,799 143.846M/sec 1,148,924,667,393 0.124GHz 723,158,246,628 0.63
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1 253.591234992 1,548,485,571,798 152.658M/sec 1,229,926,994,613 0.121GHz 1,117,352,436,324 0.95
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0 421.580487642 2,756,486,952,728 163.449M/sec 1,387,708,033,752 0.082GHz 990,478,650,874 0.72
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0 169.854206542 1,333,619,626,854 196.282M/sec 1,036,261,531,134 0.153GHz 666,052,333,591 0.64
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1 305.078100578 1,970,042,289,192 161.445M/sec 1,505,706,462,812 0.123GHz 954,963,240,648 0.62
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0 334.932246893 2,366,388,662,460 176.628M/sec 1,216,049,589,993 0.091GHz 796,698,831,717 0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0 161.697270285 1,866,036,713,483 288.576M/sec 1,068,181,502,433 0.165GHz 739,559,279,008 0.70
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1 231.440889430 1,965,389,749,057 212.391M/sec 1,407,927,406,358 0.152GHz 997,199,361,968 0.72
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0 644.521613661 3,688,449,404,537 143.079M/sec 2,020,128,131,309 0.078GHz 961,486,630,359 0.48
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0 243.830464632 1,499,608,983,445 153.756M/sec 1,227,468,439,403 0.126GHz 691,534,661,654 0.59
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1 292.866419420 1,753,376,415,877 149.677M/sec 1,483,169,463,392 0.127GHz 860,035,914,148 0.56
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 162.152397194 925,643,754,128 142.719M/sec 743,208,501,601 0.115GHz 554,462,585,110 0.70
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0 211.369510165 1,558,996,898,599 184.401M/sec 1,359,343,408,200 0.161GHz 766,769,036,524 0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0 233.315094908 1,427,133,080,540 152.927M/sec 1,166,000,868,597 0.125GHz 743,027,329,074 0.64
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1 290.698155820 1,732,849,079,701 149.032M/sec 1,441,508,612,326 0.124GHz 835,039,426,282 0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 159.945462440 850,162,390,626 132.892M/sec 724,286,281,548 0.113GHz 670,069,573,150 0.90
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0 163.244592275 1,524,807,507,173 233.531M/sec 1,398,319,581,978 0.214GHz 689,514,058,243 0.46
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0 231.795934322 1,731,030,267,153 186.686M/sec 1,124,935,745,020 0.121GHz 736,084,922,669 0.70
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1 315.564163702 1,958,199,733,216 155.128M/sec 1,405,115,546,716 0.111GHz 1,000,595,890,394 0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 210.945487961 1,527,169,148,899 180.990M/sec 906,023,518,692 0.107GHz 700,166,552,207 0.80
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0 161.759094088 1,468,321,054,671 226.934M/sec 1,221,167,105,510 0.189GHz 735,855,415,612 0.59
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0 158.578248952 1,354,770,825,277 213.586M/sec 936,436,363,752 0.148GHz 654,823,079,884 0.68
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1 274.628500801 1,792,841,068,080 163.209M/sec 1,343,398,055,199 0.122GHz 996,073,874,051 0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 179.140070123 1,383,595,004,328 193.095M/sec 850,299,722,091 0.119GHz 706,959,617,654 0.83
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0 445.496787199 2,663,914,572,687 149.495M/sec 1,267,340,496,930 0.071GHz 787,469,552,454 0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0 261.866083604 2,325,884,820,091 222.043M/sec 1,094,814,208,219 0.105GHz 649,479,233,453 0.57
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1 172.963505544 1,717,387,683,260 248.228M/sec 1,356,381,335,831 0.196GHz 822,256,638,370 0.58
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 157.934678897 1,650,503,807,778 261.266M/sec 970,705,561,971 0.154GHz 637,953,927,131 0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0 225.623143601 1,804,402,820,599 199.938M/sec 1,086,394,788,362 0.120GHz 656,392,112,807 0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0 157.930900998 1,797,506,082,342 284.548M/sec 1,001,509,813,741 0.159GHz 644,107,150,289 0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1 165.772265335 1,805,895,001,689 272.353M/sec 1,514,173,918,970 0.228GHz 823,435,044,810 0.54
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 187.664764448 1,964,118,348,429 261.660M/sec 978,060,510,880 0.130GHz 668,316,194,988 0.67

Greetings,

Andres Freund

Attachment Content-Type Size
write_and_fsync.c text/x-csrc 5.5 KB
diskbench_parsed.tsv text/tab-separated-values 7.9 KB
