Re: Vacuum: allow usage of more than 1GB of work mem

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Vacuum: allow usage of more than 1GB of work mem
Date: 2016-10-26 20:25:23
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Thu, Sep 15, 2016 at 1:16 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> I am kind of doubtful about this whole line of investigation because
>>> we're basically trying pretty hard to fix something that I'm not sure
>>> is broken. I do agree that, all other things being equal, the TID
>>> lookups will probably be faster with a bitmap than with a binary
>>> search, but maybe not if the table is large and the number of dead
>>> TIDs is small, because cache efficiency is pretty important. But even
>>> if it's always faster, does TID lookup speed even really matter to
>>> overall VACUUM performance? Claudio's early results suggest that it
>>> might, but maybe that's just a question of some optimization that
>>> hasn't been done yet.
>> FYI, the reported impact was on CPU time, not runtime. There was no
>> significant difference in runtime (real time), because my test is
>> heavily I/O bound.
>> I tested with a few small tables and there was no significant
>> difference either, but small tables don't stress the array lookup
>> anyway so that's expected.
>> But on the assumption that some systems may be CPU bound during vacuum
>> (particularly those able to do more than 300-400MB/s sequential I/O),
>> in those cases the increased or decreased cost of lazy_tid_reaped will
>> directly correlate to runtime. It's just none of my systems, which all
>> run on amazon and is heavily bandwidth constrained (fastest I/O
>> subsystem I can get my hands on does 200MB/s).
> Attached is the patch with the multiarray version.
> The tests are weird. Best case comparison over several runs, to remove
> the impact of concurrent activity on this host (I couldn't remove all
> background activity even when running the tests overnight, the distro
> adds tons of crons and background cleanup tasks it would seem),
> there's only very mild CPU impact. I'd say insignificant, as it's well
> below the mean variance.

I reran the tests on a really dedicated system, and with a script that
captured a lot more details about the runs.

The system isn't impressive, an i5 with a single consumer HD and 8GB
RAM, but it did the job.

These tests make more sense, so I bet it was the previous tests that
were spoiled by concurrent activity on the host.

Attached is the raw output of the test, the script used to create it,
and just in case the patch set used. I believe it's the same as the
last one I posted, just rebased.

In the results archive, the .vacuum prefix is the patched version with
both patch 1 and 2, .git.ref is just patch 1 (without which the
truncate takes unholily long).

Grepping the results a bit, picking an average run out of all runs on
each scale:



s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.


s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.

Total I/O (in MB)


s100: R:2.4 - W:5862
s400: R:1337.4 - W:29385.6
s4000: R:318631 - W:370154


s100: R:1412.4 - W:7644.6
s400: R:3180.6 - W:36281.4
s4000: R:330683 - W:370391

So, in essence, CPU time didn't get adversely affected. If anything,
it got improved by about 20% on the biggest case (scale 4000).

While total runtime didn't change much, I believe this is only due to
the fact that the index is perfectly correlated (clustered?) since
it's a pristine index, so index scans either remove or skip full
pages, never leaving things half-way. A bloated index would probably
show a substantially different behavior, I'll try to get a script that
does it by running pgbench a while before the vacuum or something like

However, the total I/O metric already shows remarkable improvement.
This metric is measuring all the I/O including pgbench init, the
initial vacuum pgbench init does, the delete and the final vacuum. So
it's not just the I/O for the vacuum itself, but the whole run. We can
see the patched version reading a lot less (less scans over the
indexes will do that), and in some cases writing less too (again, less
index scans may be performing less redundant writes when cleaning
up/reclaiming index pages).

I'll post when I get the results for the bloated case, but I believe
this already shows substantial improvement as is.

This approach can later be improved upon by turning tid segments into
bitmaps if they're packed densely enough, but I believe this patch
represents a sensible first step before attempting that.

Attachment Content-Type Size
vacuum_bench_results.tar.gz application/x-gzip 498.8 KB
0001-Vacuum-prefetch-buffers-on-backward-scan.patch text/x-patch 2.5 KB
0002-Vacuum-allow-using-more-than-1GB-work-mem.patch text/x-patch 18.9 KB application/x-sh 684 bytes

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2016-10-26 20:45:43 Re: emergency outage requiring database restart
Previous Message Tom Lane 2016-10-26 20:11:58 Re: pg_hba_file_settings view patch