From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block level parallel vacuum WIP
Date: 2016-08-23 15:56:18
Message-ID: CAD21AoDn6YUya9ar0=s92Li9N=Zmiq+dhWtkD8UuEOV3xLn8gw@mail.gmail.com
Lists: pgsql-hackers
On Tue, Aug 23, 2016 at 10:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Aug 23, 2016 at 7:02 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> I'd like to propose block level parallel VACUUM.
>> This feature makes VACUUM possible to use multiple CPU cores.
>
> Great. This is something that I have thought about, too. Andres and
> Heikki recommended it as a project to me a few PGCons ago.
>
>> As for a PoC, I implemented parallel vacuum so that each worker
>> processes both phase 1 and phase 2 for a particular block range.
>> Suppose we vacuum a 1000-block table with 4 workers: each worker
>> processes 250 consecutive blocks in phase 1 and then reclaims dead
>> tuples from the heap and indexes (phase 2).
>> To use the visibility map efficiently, each worker scans a particular
>> block range of the relation and collects dead tuple locations.
>> After each worker has finished its task, the leader process gathers the
>> vacuum statistics and updates relfrozenxid if possible.
>
> This doesn't seem like a good design, because it adds a lot of extra
> index scanning work. What I think you should do is:
>
> 1. Use a parallel heap scan (heap_beginscan_parallel) to let all
> workers scan in parallel. Allocate a DSM segment to store the control
> structure for this parallel scan plus an array for the dead tuple IDs
> and a lock to protect the array.
>
> 2. When you finish the heap scan, or when the array of dead tuple IDs
> is full (or very nearly full?), perform a cycle of index vacuuming.
> For now, have each worker process a separate index; extra workers just
> wait. Perhaps use the condition variable patch that I posted
> previously to make the workers wait. Then resume the parallel heap
> scan, if not yet done.
>
> Later, we can try to see if there's a way to have multiple workers
> work together to vacuum a single index. But the above seems like a
> good place to start.
Thank you for the advice.
That's the alternative design I had considered; I will change the patch
to this design.
>> I also changed the buffer lock infrastructure so that multiple
>> processes can wait for cleanup lock on a buffer.
>
> You won't need this if you proceed as above, which is probably a good thing.
Right.
>
>> And the new GUC parameter vacuum_parallel_workers controls the number
>> of vacuum workers.
>
> I suspect that for autovacuum there is little reason to use parallel
> vacuum, since most of the time we are trying to slow vacuum down, not
> speed it up. I'd be inclined, for starters, to just add a PARALLEL
> option to the VACUUM command, for when people want to speed up
> parallel vacuums. Perhaps
>
> VACUUM (PARALLEL 4) relation;
>
> ...could mean to vacuum the relation with the given number of workers, and:
>
> VACUUM (PARALLEL) relation;
>
> ...could mean to vacuum the relation in parallel with the system
> choosing the number of workers - 1 worker per index is probably a good
> starting formula, though it might need some refinement.
That looks convenient.
I was also thinking that for autovacuum we could manage the number of
parallel workers per table with a storage parameter, like:
ALTER TABLE relation SET (parallel_vacuum_workers = 2);
Regards,
--
Masahiko Sawada