Re: design for parallel backup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-05-03 17:49:22
Message-ID: 20200503174922.mfzzdafa5g4rlhez@alap3.anarazel.de

Hi,

On 2020-05-03 09:12:59 -0400, Robert Haas wrote:
> On Sat, May 2, 2020 at 10:36 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I changed Robert's test program to optionally use fallocate,
> > sync_file_range(WRITE), and posix_fadvise(DONTNEED), to avoid a large
> > footprint in the page cache. The performance differences are quite
> > substantial:
> >
> > gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
> > rm -f /srv/dev/bench/test* && echo 3 |sudo tee /proc/sys/vm/drop_caches && \
> > /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
> >
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
> > [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
> >
> > comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
> > [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
>
> Ah, nice.

Btw, I forgot to include the result for 0 / 0 / 0 earlier (off-by-one
error in a script :)):

numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68

> > The run-to-run variations between the runs without cache control are
> > pretty large. So this is probably not the end-all-be-all numbers. But I
> > think the trends are pretty clear.
>
> Could you be explicit about what you think those clear trends are?

Largely that concurrency can help a bit, but can also hurt
tremendously. Below is some more detailed analysis; it'll be a bit
long...

Taking no concurrency / no cache management as a baseline:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68

and comparing cache management with using some concurrency:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51

we can see very similar timings, which makes sense, because that's
roughly the device's max speed. But going to higher concurrency, there
are clearly regressions:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66

And I think it is instructive to look at the
ref_cycles_tot/cycles_tot/instructions_tot vs
ref_cycles_sec/cycles_sec/ipc. The units are confusing because they are
across all cores and most are idle. But it's pretty obvious that
numprocs=1 sfr=1 fadvise=1 has cores running for a lot shorter time
(reference cycles basically count the time cores were running on an
absolute time scale). Compared to numprocs=2 sfr=0 fadvise=0, which has
the same resulting performance, it's clear that cores were busier, but
less efficient (lower ipc).

With cache management there's very little benefit, and some risk (1->2
regression) in this workload with increasing concurrency:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84

And there's good benefit, but tremendous risk, of concurrency in the no
cache control case:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66

sync_file_range without fadvise isn't a benefit at low concurrency, but prevents bad regressions at high concurrency:
> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65

fadvise alone is similar:
> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68

There does not appear to be a huge benefit from fallocate in this
workload; the OS's delayed allocation works well. Compare:

numprocs=1
> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0 243.609959554 1,802,385,405,203 184.970M/sec 1,449,560,513,247 0.149GHz 855,426,288,031 0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0 230.880100449 1,328,417,418,799 143.846M/sec 1,148,924,667,393 0.124GHz 723,158,246,628 0.63

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1 253.591234992 1,548,485,571,798 152.658M/sec 1,229,926,994,613 0.121GHz 1,117,352,436,324 0.95

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84

numprocs=2
> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0 421.580487642 2,756,486,952,728 163.449M/sec 1,387,708,033,752 0.082GHz 990,478,650,874 0.72

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0 169.854206542 1,333,619,626,854 196.282M/sec 1,036,261,531,134 0.153GHz 666,052,333,591 0.64

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1 305.078100578 1,970,042,289,192 161.445M/sec 1,505,706,462,812 0.123GHz 954,963,240,648 0.62

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89

numprocs=4
> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0 334.932246893 2,366,388,662,460 176.628M/sec 1,216,049,589,993 0.091GHz 796,698,831,717 0.68

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0 161.697270285 1,866,036,713,483 288.576M/sec 1,068,181,502,433 0.165GHz 739,559,279,008 0.70

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1 231.440889430 1,965,389,749,057 212.391M/sec 1,407,927,406,358 0.152GHz 997,199,361,968 0.72

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80

I would say that fallocate seems to help concurrent cases without cache
control, but not particularly reliably so. At higher concurrency it
seems to hurt with cache control; I'm not sure I understand why.

I was at first confused about why 128kB write sizes hurt (128kB is
probably on the higher end of useful, but I wanted to see a more
extreme difference):

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0 644.521613661 3,688,449,404,537 143.079M/sec 2,020,128,131,309 0.078GHz 961,486,630,359 0.48

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0 243.830464632 1,499,608,983,445 153.756M/sec 1,227,468,439,403 0.126GHz 691,534,661,654 0.59

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1 292.866419420 1,753,376,415,877 149.677M/sec 1,483,169,463,392 0.127GHz 860,035,914,148 0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 162.152397194 925,643,754,128 142.719M/sec 743,208,501,601 0.115GHz 554,462,585,110 0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0 243.609959554 1,802,385,405,203 184.970M/sec 1,449,560,513,247 0.149GHz 855,426,288,031 0.56
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0 211.369510165 1,558,996,898,599 184.401M/sec 1,359,343,408,200 0.161GHz 766,769,036,524 0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0 230.880100449 1,328,417,418,799 143.846M/sec 1,148,924,667,393 0.124GHz 723,158,246,628 0.63
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0 233.315094908 1,427,133,080,540 152.927M/sec 1,166,000,868,597 0.125GHz 743,027,329,074 0.64

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1 253.591234992 1,548,485,571,798 152.658M/sec 1,229,926,994,613 0.121GHz 1,117,352,436,324 0.95
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1 290.698155820 1,732,849,079,701 149.032M/sec 1,441,508,612,326 0.124GHz 835,039,426,282 0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 159.945462440 850,162,390,626 132.892M/sec 724,286,281,548 0.113GHz 670,069,573,150 0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0 163.244592275 1,524,807,507,173 233.531M/sec 1,398,319,581,978 0.214GHz 689,514,058,243 0.46

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0 231.795934322 1,731,030,267,153 186.686M/sec 1,124,935,745,020 0.121GHz 736,084,922,669 0.70

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1 315.564163702 1,958,199,733,216 155.128M/sec 1,405,115,546,716 0.111GHz 1,000,595,890,394 0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 210.945487961 1,527,169,148,899 180.990M/sec 906,023,518,692 0.107GHz 700,166,552,207 0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0 421.580487642 2,756,486,952,728 163.449M/sec 1,387,708,033,752 0.082GHz 990,478,650,874 0.72
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0 161.759094088 1,468,321,054,671 226.934M/sec 1,221,167,105,510 0.189GHz 735,855,415,612 0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0 169.854206542 1,333,619,626,854 196.282M/sec 1,036,261,531,134 0.153GHz 666,052,333,591 0.64
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0 158.578248952 1,354,770,825,277 213.586M/sec 936,436,363,752 0.148GHz 654,823,079,884 0.68

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1 305.078100578 1,970,042,289,192 161.445M/sec 1,505,706,462,812 0.123GHz 954,963,240,648 0.62
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1 274.628500801 1,792,841,068,080 163.209M/sec 1,343,398,055,199 0.122GHz 996,073,874,051 0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 179.140070123 1,383,595,004,328 193.095M/sec 850,299,722,091 0.119GHz 706,959,617,654 0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0 445.496787199 2,663,914,572,687 149.495M/sec 1,267,340,496,930 0.071GHz 787,469,552,454 0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0 261.866083604 2,325,884,820,091 222.043M/sec 1,094,814,208,219 0.105GHz 649,479,233,453 0.57

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1 172.963505544 1,717,387,683,260 248.228M/sec 1,356,381,335,831 0.196GHz 822,256,638,370 0.58

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 157.934678897 1,650,503,807,778 261.266M/sec 970,705,561,971 0.154GHz 637,953,927,131 0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0 334.932246893 2,366,388,662,460 176.628M/sec 1,216,049,589,993 0.091GHz 796,698,831,717 0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0 225.623143601 1,804,402,820,599 199.938M/sec 1,086,394,788,362 0.120GHz 656,392,112,807 0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0 161.697270285 1,866,036,713,483 288.576M/sec 1,068,181,502,433 0.165GHz 739,559,279,008 0.70
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0 157.930900998 1,797,506,082,342 284.548M/sec 1,001,509,813,741 0.159GHz 644,107,150,289 0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1 231.440889430 1,965,389,749,057 212.391M/sec 1,407,927,406,358 0.152GHz 997,199,361,968 0.72
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1 165.772265335 1,805,895,001,689 272.353M/sec 1,514,173,918,970 0.228GHz 823,435,044,810 0.54

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 187.664764448 1,964,118,348,429 261.660M/sec 978,060,510,880 0.130GHz 668,316,194,988 0.67

It's pretty clear that the larger write block size can hurt quite
badly. I was somewhat confused by this at first, but after thinking
about it for a while longer it actually makes sense: for the OS to
finish an 8k write it needs to find two free pagecache pages; for a
128k write it needs to find 32. That makes it much more likely that
kernel threads and the writes are going to fight over locks /
cachelines: in the 8k case it's quite likely that the kernel threads
will often do so while the memcpy() from userland is happening, but
that's less the case when 32 pages need to be acquired before the
memcpy() can happen.

With cache control that problem doesn't exist, which is why the larger
block size is beneficial:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 162.152397194 925,643,754,128 142.719M/sec 743,208,501,601 0.115GHz 554,462,585,110 0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 159.945462440 850,162,390,626 132.892M/sec 724,286,281,548 0.113GHz 670,069,573,150 0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 210.945487961 1,527,169,148,899 180.990M/sec 906,023,518,692 0.107GHz 700,166,552,207 0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 179.140070123 1,383,595,004,328 193.095M/sec 850,299,722,091 0.119GHz 706,959,617,654 0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 157.934678897 1,650,503,807,778 261.266M/sec 970,705,561,971 0.154GHz 637,953,927,131 0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 187.664764448 1,964,118,348,429 261.660M/sec 978,060,510,880 0.130GHz 668,316,194,988 0.67

Note how, especially in the first few cases, the total number of
instructions required is reduced (although due to the way I ran
perf stat the sampling error is pretty large).

I haven't run that test yet, but after looking at all this I would bet
that reducing the block size to 4kB (i.e. a single OS/HW page) would
help the no cache control case significantly, in particular in the
concurrent case.

And conversely, I'd expect that the CPU efficiency will be improved by
larger block size for the cache control case for just about any
realistic block size.

I'd love to have faster storage available (faster NVMes, or multiple
ones I can use for benchmarking) to see what the cutoff point for
actually benefiting from concurrency is.

Also worth noting that even the "best case" here from a CPU usage point
of view absolutely *pales* against using direct IO. It's not an
apples-to-apples comparison, but compare buffered IO using
write_and_fsync with unbuffered IO using fio:

128KiB blocksize:

write_and_fsync:
echo 3 |sudo tee /proc/sys/vm/drop_caches && /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions /tmp/write_and_fsync --blocksize $((128*1024)) --sync_file_range=1 --fallocate=1 --fadvise=1 --sequential=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

Performance counter stats for 'system wide':

6,377,903.65 msec cpu-clock # 39.999 CPUs utilized
628,014,590,200 ref-cycles # 98.467 M/sec
634,468,623,514 cycles # 0.099 GHz
795,771,756,320 instructions # 1.25 insn per cycle

159.451492209 seconds time elapsed

fio:
rm -f /srv/dev/bench/test* && echo 3 |sudo tee /proc/sys/vm/drop_caches && /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions fio --name=test --iodepth=512 --iodepth_low=8 --iodepth_batch_submit=8 --iodepth_batch_complete_min=8 --iodepth_batch_complete_max=128 --ioengine=libaio --rw=write --bs=128k --filesize=$((400*1024*1024*1024)) --direct=1 --numjobs=1

Performance counter stats for 'system wide':

6,313,522.71 msec cpu-clock # 39.999 CPUs utilized
458,476,185,800 ref-cycles # 72.618 M/sec
196,148,015,054 cycles # 0.031 GHz
158,921,457,853 instructions # 0.81 insn per cycle

157.842080440 seconds time elapsed

CPU usage most of the time was around 98% for write_and_fsync and 40%
for fio.

I.e. system-wide CPUs were active 0.73x the time, and 0.2x as many
instructions had to be executed in the DIO case.

Greetings,

Andres Freund
