Re: cost based vacuum (parallel)

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: cost based vacuum (parallel)
Date: 2019-11-08 04:09:31
Message-ID: CAFiTN-tFLN=vdu5Ra-23E9_7Z1JXkk5MkRY3Bkj2zAoWK7fULA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 8, 2019 at 8:37 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Nov 8, 2019 at 8:18 AM Masahiko Sawada
> <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> >
> > On Wed, 6 Nov 2019 at 15:45, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > >
> > > > >
> > > > > > The two approaches to solve this problem being discussed in that
> > > > > > thread [1] are as follows:
> > > > > > (a) Allow the parallel workers and master backend to have a shared
> > > > > > view of vacuum cost related parameters (mainly VacuumCostBalance) and
> > > > > > allow each worker to update it and then based on that decide whether
> > > > > > it needs to sleep. Sawada-San has done the POC for this approach.
> > > > > > See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One
> > > > > > drawback of this approach could be that we allow the worker to sleep
> > > > > > even though the I/O has been performed by some other worker.
> > > > >
> > > > > I don't understand this drawback.
> > > > >
> > > >
> > > > I think the problem could be that the system is not properly throttled
> > > > when it is supposed to be. Let me try a simple example: say we have
> > > > two workers, w-1 and w-2. w-2 is primarily doing the I/O and w-1 is
> > > > doing very little I/O, but unfortunately whenever w-1 checks, it
> > > > finds that the cost_limit has been exceeded and it goes to sleep, but
> > > > w-1 still continues.
> > > >
> > >
> > > Typo in the above sentence. /but w-1 still continues/but w-2 still continues.
> > >
> > > > Now in such a situation, even though we have made one of the workers
> > > > sleep for the required time, ideally the worker which was doing the
> > > > I/O should have slept. The aim is to make the system stop doing I/O
> > > > whenever the limit is exceeded, so that might not work in the above
> > > > situation.
> > > >
> > >
> > > One idea to fix this drawback is that if we somehow avoid letting the
> > > workers sleep which have done less or no I/O compared to other workers,
> > > then we can, to a good extent, ensure that workers which are doing more
> > > I/O will be throttled more. What we can do is allow a worker to sleep
> > > only if it has performed I/O above a certain threshold and the overall
> > > balance is more than the cost_limit set by the system. Then we allow
> > > the worker to sleep in proportion to the work done by it and reduce
> > > VacuumSharedCostBalance by the amount consumed by the current worker.
> > > Something like:
> > >
> > > if (VacuumSharedCostBalance >= VacuumCostLimit &&
> > >     MyCostBalance > threshold * VacuumCostLimit / workers)
> > > {
> > >     VacuumSharedCostBalance -= MyCostBalance;
> > >     Sleep(delay * MyCostBalance / VacuumSharedCostBalance);
> > > }
> > >
> > > Assume the threshold is 0.5; what that means is, if the worker has done
> > > more than 50% of the work expected from it and the overall shared cost
> > > balance is exceeded, then we will consider this worker for sleeping.
> > >
> > > What do you guys think?
> >
> > I think the idea that workers consuming more I/O sleep for a longer time
> > seems good. It doesn't seem to have the drawback of approach (b), which
> > is to unnecessarily delay vacuum if some indexes are very small or
> > bulk-deletion of indexes does almost nothing, such as for brin. But on
> > the other hand, it's possible that workers don't sleep even if the
> > shared cost balance already exceeds the limit, because sleeping requires
> > that the local balance exceed the worker's limit divided by the number
> > of workers. For example, a worker is scheduled, does I/O, and exceeds
> > the limit substantially while the other 2 workers do less I/O. Then the
> > 2 workers are scheduled and consume I/O. The total cost balance already
> > exceeds the limit, but those workers will not sleep as long as their
> > local balance is less than (limit / # of workers).
> >
>
> Right, this is the reason I suggested keeping some threshold for the
> local balance (say 50% of (limit / # of workers)). I think we need to do
> some experiments to see what is the best thing to do.
>
I have done some experiments along this line. I first produced a case
that shows the problem with the existing shared costing patch (a worker
which is doing less I/O might pay the penalty on behalf of a worker
which is doing more I/O). I have also hacked Sawada-san's shared costing
patch so that a worker only goes to sleep if the shared balance has
crossed the limit and its local balance has crossed some threshold.
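
To make that condition concrete, the check I experimented with is
roughly of the following shape (a minimal standalone sketch, not the
actual patch code; the function and parameter names are mine, and
"threshold" is the knob varied between patch 2 and patch 3 below):

#include <stdbool.h>

/*
 * Sketch of the tweaked delay decision: a worker is considered for
 * sleeping only when the shared cost balance has crossed the cost limit
 * AND its own (local) balance has crossed a per-worker threshold.
 */
static bool
worker_should_delay(double shared_balance,  /* VacuumSharedCostBalance */
                    double local_balance,   /* this worker's own balance */
                    double cost_limit,      /* VacuumCostLimit */
                    int nworkers,           /* number of parallel workers */
                    double threshold)       /* fraction of the per-worker share */
{
    return shared_balance > cost_limit &&
           local_balance > threshold * cost_limit / nworkers;
}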

Test setup: I have created 4 indexes on the table. 3 of the indexes
have a lot of pages to process but need to dirty only a few pages,
whereas the 4th index has to process very few pages but needs to dirty
all of them. I have attached the test script along with the mail. For
each worker I have shown its total delay time, its total I/O[1], and its
page hit, page miss, and page dirty counts.
[1] total I/O = _nhit * VacuumCostPageHit + _nmiss *
VacuumCostPageMiss + _ndirty * VacuumCostPageDirty
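
(Assuming the default cost parameters, i.e. VacuumCostPageHit = 1,
VacuumCostPageMiss = 10 and VacuumCostPageDirty = 20, worker 3's figure
below works out as 4318 * 1 + 0 * 10 + 603 * 20 = 16378, and workers
0-2 as 17891 * 1 + 0 * 10 + 2 * 20 = 17931.)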

patch 1: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit)
worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603

Observation 1: I think it is clearly visible here that worker 3 is doing
the least total I/O but is delaying for the maximum amount of time.
OTOH, worker 1 is delaying for very little time compared to how much I/O
it is doing. To solve this problem, I added a small tweak to the patch,
wherein a worker will only sleep if its local balance has also crossed
some threshold. And we can see that with this change the problem is
solved to quite an extent.

patch 2: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance >
VacuumCostLimit/number of workers)
worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603

Observation 2: This patch solves the problem discussed with patch 1, but
in some extreme cases there is a possibility that the shared balance can
become twice as much as the local limit and still no worker goes for the
delay. There could be multiple ideas for solving that:
a) Set a max limit on the shared balance, e.g. 1.5 * VacuumCostLimit;
once that is exceeded, we delay whichever worker next tries to do I/O,
irrespective of its local balance (sketched below).
b) Set a somewhat lower value for the local threshold, e.g. 50% of the
local limit.
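
For illustration only, option (a) could sit on top of the earlier
worker_should_delay() sketch as a hard cap (the 1.5 factor is just the
example number from above):

/*
 * Hypothetical shape of option (a), building on the sketch above: once
 * the shared balance exceeds a hard cap, delay regardless of the local
 * balance; otherwise fall back to the thresholded check.
 */
static bool
worker_should_delay_capped(double shared_balance, double local_balance,
                           double cost_limit, int nworkers, double threshold)
{
    if (shared_balance > 1.5 * cost_limit)   /* example cap from (a) */
        return true;
    return worker_should_delay(shared_balance, local_balance,
                               cost_limit, nworkers, threshold);
}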

Here I have changed patch 2 as per (b): if the local balance reaches 50%
of the local limit and the shared balance hits the vacuum cost limit,
then go for the delay.
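In terms of the worker_should_delay() sketch above, patch 2 corresponds
to threshold = 1.0 and this variant to threshold = 0.5.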

patch 3: Shared costing patch: (delay condition ->
VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5
* VacuumCostLimit/number of workers)
worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603

Observation 3: I think patch 3 doesn't completely solve the issue
discussed with patch 1, but it's far better than patch 1. However,
patch 2 might have another problem, as discussed in observation 2.

I think I need to do some more analysis and experiments before we can
reach a conclusion. But one point is clear: we need to do something to
solve the problem observed with patch 1 if we are going with the shared
costing approach.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
test.sh text/x-sh 429 bytes
test.sql application/octet-stream 460 bytes
