Re: Vacuum: allow usage of more than 1GB of work mem

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Vacuum: allow usage of more than 1GB of work mem
Date: 2016-09-06 18:25:10
Message-ID: CAGTBQpbayY-t5-ySW19yQs1dBqvV6dm8dmdpTv_FWXmDC0A0cQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 5 September 2016 at 21:58, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>
>>>>> How long does that part ever take? Is there any substantial gain from this?
>
>> Btw, without a further patch to prefetch pages on the backward scan
>> for truncate, however, my patience ran out before it finished
>> truncating. I haven't submitted that patch because there was an
>> identical patch in an older thread that was discussed and more or less
>> rejected since it slightly penalized SSDs.
>
> OK, that's enough context. Sorry for being forgetful on that point.
>
> Please post that new patch also.

Attached.
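
For context, here's roughly the kind of thing the attached patch does
(an illustrative sketch only, not the patch verbatim; the function
name and the 32-block window are made up): the backward scan that
decides how far the relation can be truncated (count_nondeletable_pages
in vacuumlazy.c) issues PrefetchBuffer() hints for the blocks it is
about to visit, so a spinning disk isn't stuck doing one synchronous
random read per block.

#include "postgres.h"
#include "common/relpath.h"     /* MAIN_FORKNUM */
#include "storage/bufmgr.h"     /* PrefetchBuffer() */
#include "utils/rel.h"          /* Relation */

#define TRUNCATE_PREFETCH_SIZE 32   /* illustrative window, not tuned */

/*
 * Hint the next batch of blocks the backward truncation scan will read.
 * 'blkno' is one past the next block to be read; the scan visits blocks
 * in descending order, so we prefetch downward from there.  A caller
 * would invoke this once every TRUNCATE_PREFETCH_SIZE blocks inside the
 * backward scan loop.
 */
static void
prefetch_truncate_window(Relation onerel, BlockNumber blkno)
{
    BlockNumber n = Min(blkno, (BlockNumber) TRUNCATE_PREFETCH_SIZE);
    BlockNumber p;

    for (p = blkno; p > blkno - n; p--)
        PrefetchBuffer(onerel, MAIN_FORKNUM, p - 1);
}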

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 5 September 2016 at 15:50, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>> On 3 September 2016 at 04:25, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>>>>> The patch also makes vacuum free the dead_tuples before starting
>>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>>> point, and it might help give the OS more cache, especially if work
>>>>> mem is configured very high to avoid multiple index scans.
>>>>
>>>> How long does that part ever take? Is there any substantial gain from this?
>>>>
>>>> Let's discuss that as a potential second patch.
>>>
>>> In the test case I mentioned, it takes longer than the vacuum part itself.
>>
>> Please provide a test case and timings so we can see what's happening.

Robert made a strong case for a change in approach, so the information
below applies only to the old patch (which will be rewritten).

I'm sending this merely to document the testing done; it will be a
while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned on the OP:
>
> - createdb pgbench
> - pgbench -i -s 4000 pgbench
> - psql pgbench -c 'delete from pgbench_accounts;'
> - vacuumdb -v -t pgbench_accounts pgbench
>
> fsync=off, autovacuum=off, maintenance_work_mem=4GB
>
> From what I remember, it used ~2.7GB of RAM up until the truncate
> phase, where it freed it. It performed a single index scan over the
> PK.
>
> I don't remember timings, and I didn't take them, so I'll have to
> repeat the test to get them. It takes all day and makes my laptop
> unusably slow, so I'll post them later, but they're not very
> interesting. The only interesting bit is that it does a single index
> scan instead of several, which is kinda nice on TB-or-larger tables.

So, the test results below:

During setup, for context, the delete took 52m 50s of real time
(measured with time psql pgbench -c 'delete from pgbench_accounts;').

During the delete, my I/O looked on average like the following, which
should give an indication of what my I/O subsystem is capable of (not
much, granted):

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda          0.47     5.27    35.53   77.42   17.58   42.95   1097.51    145.22  1295.23    33.47   1874.36   8.85  100.00

Since it's a 5.4k RPM laptop drive, it's rather slow on IOPS, and
since I'm using the defaults for shared buffers and checkpoints, write
throughput isn't stellar either. But that's not the point of the test
anyway; it's just for context.

The hardware is an HP Envy laptop with a 1TB 5.4k RPM hard drive, 12GB
of RAM and a Core i7-4722HQ, with no performance tweaking of any kind
(i.e. CPU frequency scaling left intact). Being a laptop, the system
was of course not dedicated, but it had little else going on while the
test was running. Given the size of the test, I don't believe there's
any chance concurrent activity could have invalidated the results.

The setup timing was comparable for both versions (patched and
unpatched), so I'm reporting the patched times only.

The vacuum phase:

patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 400000000 row versions
DETAIL: CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO: "pgbench_accounts": removed 400000000 row versions in 6557378 pages
DETAIL: CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in
1096762 pages
DETAIL: 400000000 index row versions were removed.
1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable
row versions in 6557378 out of 6557378 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 129.24s/127.24u sec elapsed 3877.13 sec.
INFO: "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL: CPU 34.88s/7.91u sec elapsed 1645.90 sec.

Total elapsed time: ~92 minutes

I/O during initial heap scan:

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda          1.52    99.78    72.63   62.47   31.94   33.22    987.80    146.71  1096.29    25.39   2341.48   7.40  100.00

Index scan:

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda          7.08     3.87    55.18   59.87   17.06   31.83    870.33    146.61  1243.34    31.42   2360.44   8.69  100.00

Final heap scan:

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda          0.78     8.65    65.32   57.32   31.50   32.96   1076.56    152.22  1928.67  1410.63   2519.01   8.15  100.00

Truncate (with prefetch):

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda        159.67     0.87  3720.03    2.82   30.31    0.12     16.74     19.11     5.13     4.95    242.60   0.27   99.23

Without prefetch, read throughput during truncation varies between
4MB/s and 6MB/s, about 6 times slower on average, meaning truncation
alone would have taken over 3 hours.

Peak memory used: 2369MB RSS, 4260MB VIRT (source: top)

Unpatched + prefetch (same config; effective work mem capped at 1GB by
the non-huge palloc allocation limit):

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 5.88s/53.77u sec elapsed 263.63 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.28s/12.94u sec elapsed 757.45 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 7.44s/31.28u sec elapsed 282.41 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.24s/13.30u sec elapsed 806.54 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 42086526 row versions
DETAIL: CPU 4.30s/5.83u sec elapsed 170.30 sec.
INFO: "pgbench_accounts": removed 42086526 row versions in 689944 pages
DETAIL: CPU 3.35s/3.23u sec elapsed 126.22 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in
1096762 pages
DETAIL: 400000000 index row versions were removed.
1096351 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.40 sec.
INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable
row versions in 6557378 out of 6557378 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 123.82s/183.76u sec elapsed 4071.54 sec.
INFO: "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL: CPU 40.36s/7.72u sec elapsed 1648.22 sec.

Total elapsed time: ~95 minutes

I/O during initial heap scan:

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda          1.48    32.53    66.10   60.50   31.95   34.88   1081.06    149.20  1175.78    25.44   2432.59   8.02  101.59

First index scan:

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda          1.17    14.95    43.85   70.07   19.65   40.18   1075.57    145.98  1278.39    31.86   2058.51   8.78  100.00

Final index scan:

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda         17.12     1.50   169.85    2.28   68.33    0.67    820.95    158.32   312.00    28.14  21426.95   5.81  100.00

Truncation:

Device:    rrqm/s   wrqm/s      r/s     w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm   %util
sda        142.93     1.23  3444.70    4.65   28.16    0.36     16.93     18.52     5.37     5.25     91.17   0.29   99.22

Peak memory used is 1135MB RSS and 1188MB VIRT
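
The ~1.1GB peak, rather than the configured 4GB, is the allocation cap
at work: in stock 9.6 the dead-TID array is a single palloc() chunk,
so it is bounded by MaxAllocSize (just under 1GB) no matter how high
maintenance_work_mem is set. Roughly the sizing arithmetic below (a
sketch with a hypothetical helper name; the real logic lives in
lazy_space_alloc() in vacuumlazy.c):

#include "postgres.h"
#include "storage/itemptr.h"    /* ItemPointerData */
#include "utils/memutils.h"     /* MaxAllocSize */

/* Sketch: how many dead TIDs a single vacuum pass can remember. */
static long
max_dead_tuples(long vac_work_mem_kb)
{
    long    maxtuples = (vac_work_mem_kb * 1024L) / sizeof(ItemPointerData);

    maxtuples = Min(maxtuples, INT_MAX);
    /* the palloc() cap: just under 1GB / 6 bytes per TID */
    maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));

    return maxtuples;
}

With 6-byte ItemPointers that caps out at roughly 179 million TIDs per
pass, which is why the unpatched run above needed three index-scan
passes to get through 400 million dead tuples.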

Comparing:

Time reportedly spent scanning indexes: 716.34 sec unpatched, 566.47 sec patched
Time reportedly spent scanning heap: 1690.21 sec unpatched, 1872.76 sec patched
Total vacuum scan as reported: 4071.54 sec unpatched, 3877.13 sec patched

I certainly didn't expect it to be such a close call. I believe the
key reason is the speedup the unpatched run got during its final index
scan from not having to delete so many tuples. Having to interleave
reads and writes clearly stresses my hard drive, and the last index
scan, with less left to write, was thus faster.

I don't believe this would have happened if the index hadn't been
pristine and in almost physical (heap) order, so I'd expect real-world
cases (with properly aged, shuffled and bloated indexes) to show a
more pronounced difference. The same goes for vacuums run under a cost
limit, which artificially caps the I/O rate vacuum can reach.

Clearly the patch is of use when I/O is the limiting factor, either
because of vacuum cost limits or because the I/O subsystem is the
bottleneck, as it was in the test above. Since more work mem means a
slower lookup of the dead_tuples array, not only due to the extra
comparisons but also due to poorer cache locality, I don't believe it
will improve the runtime of CPU-bound cases. It should at least
generate less WAL, though, since that's another benefit of scanning
the indexes fewer times (increased WAL rates during vacuum are another
problem we regularly face in our production setup).
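
To make the CPU-cost point concrete, the per-index-tuple check done
during the index scans is essentially a bsearch over the sorted
dead-TID array, along the lines of the sketch below (simplified and
with hypothetical names; the stock callback is lazy_tid_reaped() in
vacuumlazy.c). With ~179 million entries that is roughly 28
comparisons per index tuple, most of them likely cache misses.

#include "postgres.h"
#include "storage/itemptr.h"

/* TID ordering: by block number, then by offset within the block */
static int
cmp_itemptr(const void *left, const void *right)
{
    BlockNumber lblk = ItemPointerGetBlockNumber((ItemPointer) left);
    BlockNumber rblk = ItemPointerGetBlockNumber((ItemPointer) right);
    OffsetNumber loff,
                 roff;

    if (lblk != rblk)
        return (lblk < rblk) ? -1 : 1;

    loff = ItemPointerGetOffsetNumber((ItemPointer) left);
    roff = ItemPointerGetOffsetNumber((ItemPointer) right);

    if (loff != roff)
        return (loff < roff) ? -1 : 1;
    return 0;
}

/* Is this index tuple's heap TID in the sorted dead-TID array? */
static bool
tid_is_dead(ItemPointer itemptr, ItemPointer dead_tuples, int num_dead_tuples)
{
    return bsearch(itemptr, dead_tuples, num_dead_tuples,
                   sizeof(ItemPointerData), cmp_itemptr) != NULL;
}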

Given that the I/O subsystem on my test machine can't produce a
CPU-bound test case with enough dead_tuples to really stress the
patch, I cannot confirm the CPU-bound part of that claim
experimentally, but it should be evident from the implementation.

Attachment Content-Type Size
0002-Vacuum-prefetch-buffers-on-backward-scan.patch text/x-patch 2.5 KB
