Re: Block level parallel vacuum WIP

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block level parallel vacuum WIP
Date: 2016-08-23 13:50:55
Message-ID: CA+TgmobV6+ZPTNE3Z+08D9Xp7UK+mSq-rztOW+=RGsr5-pKiUA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 23, 2016 at 7:02 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> I'd like to propose block-level parallel VACUUM.
> This feature makes it possible for VACUUM to use multiple CPU cores.

Great. This is something that I have thought about, too. Andres and
Heikki recommended it as a project to me a few PGCons ago.

> As a PoC, I implemented parallel vacuum so that each worker
> processes both phases 1 and 2 for a particular block range.
> Suppose we vacuum a 1000-block table with 4 workers: each worker
> processes 250 consecutive blocks in phase 1 and then reclaims dead
> tuples from the heap and indexes (phase 2).
> To use the visibility map efficiently, each worker scans a particular
> block range of the relation and collects dead tuple locations.
> After each worker finishes its task, the leader process gathers the
> vacuum statistics and updates relfrozenxid if possible.

This doesn't seem like a good design, because it adds a lot of extra
index scanning work. What I think you should do is:

1. Use a parallel heap scan (heap_beginscan_parallel) to let all
workers scan in parallel. Allocate a DSM segment to store the control
structure for this parallel scan plus an array for the dead tuple IDs
and a lock to protect the array.

2. When you finish the heap scan, or when the array of dead tuple IDs
is full (or very nearly full?), perform a cycle of index vacuuming.
For now, have each worker process a separate index; extra workers just
wait. Perhaps use the condition variable patch that I posted
previously to make the workers wait. Then resume the parallel heap
scan, if not yet done. (Rough sketches of both steps follow.)
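
To make that concrete, the shared state might look something like
this. ParallelHeapScanDescData and heap_parallelscan_initialize() are
the existing parallel-scan pieces; the struct itself and its field
names are just illustrative, and an LWLock would do as well as a
spinlock:

    #include "access/relscan.h"     /* ParallelHeapScanDescData */
    #include "storage/itemptr.h"    /* ItemPointerData */
    #include "storage/spin.h"       /* slock_t */

    /* Illustrative only: control structure placed in the DSM segment. */
    typedef struct ParallelVacuumShared
    {
        ParallelHeapScanDescData heapscan;  /* set up by the leader with
                                             * heap_parallelscan_initialize() */
        slock_t     mutex;                  /* protects the fields below */
        int         max_dead_tuples;        /* allocated size of dead_tuples[] */
        int         num_dead_tuples;        /* entries currently in use */
        ItemPointerData dead_tuples[FLEXIBLE_ARRAY_MEMBER];
    } ParallelVacuumShared;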
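
Each worker's control flow would then be roughly the following.
heap_beginscan_parallel() and heap_getnext() exist today;
record_dead_tuple() and perform_index_vacuum_cycle() are hypothetical
helpers standing in for the real work:

    /* Illustrative only: what each vacuum worker would run. */
    static void
    parallel_vacuum_worker(Relation onerel, ParallelVacuumShared *shared)
    {
        HeapScanDesc scan = heap_beginscan_parallel(onerel, &shared->heapscan);
        HeapTuple   tuple;

        while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
        {
            /* If the tuple is dead, append its TID to the shared array,
             * holding shared->mutex (hypothetical helper). */
            record_dead_tuple(shared, &tuple->t_self);

            if (shared->num_dead_tuples >= shared->max_dead_tuples)
            {
                /*
                 * Rendezvous with the other workers: each vacuums a
                 * separate index, extras sleep on a condition variable
                 * until the cycle is done, and then the heap scan
                 * resumes (hypothetical helper).
                 */
                perform_index_vacuum_cycle(shared);
            }
        }

        heap_endscan(scan);
    }

The point of keeping one TID array in the DSM segment is that each
index gets scanned once per cycle no matter how many workers there
are, instead of once per worker.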

Later, we can try to see if there's a way to have multiple workers
work together to vacuum a single index. But the above seems like a
good place to start.

> I also changed the buffer lock infrastructure so that multiple
> processes can wait for a cleanup lock on a buffer.

You won't need this if you proceed as above, which is probably a good thing.

> And the new GUC parameter vacuum_parallel_workers controls the number
> of vacuum workers.

I suspect that for autovacuum there is little reason to use parallel
vacuum, since most of the time we are trying to slow vacuum down, not
speed it up. I'd be inclined, for starters, to just add a PARALLEL
option to the VACUUM command, for when people want to speed up a
manually-issued VACUUM. Perhaps

VACUUM (PARALLEL 4) relation;

...could mean to vacuum the relation with the given number of workers, and:

VACUUM (PARALLEL) relation;

...could mean to vacuum the relation in parallel with the system
choosing the number of workers - 1 worker per index is probably a good
starting formula, though it might need some refinement.
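
For the no-argument form, choosing the worker count could be as
simple as something like the following; the function is hypothetical,
and capping at max_parallel_workers_per_gather is just one arbitrary
possibility:

    /* Illustrative only: decide how many workers VACUUM should use. */
    static int
    choose_vacuum_workers(Relation onerel, int requested)
    {
        int     nindexes = list_length(RelationGetIndexList(onerel));

        if (requested > 0)
            return requested;           /* explicit VACUUM (PARALLEL n) */

        /* bare VACUUM (PARALLEL): one worker per index, with a cap */
        return Min(Max(nindexes, 1), max_parallel_workers_per_gather);
    }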

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
