From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Dmitry Shulga <d(dot)shulga(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce the time required for a database recovery from archive.
Date: 2020-11-09 16:31:59
Message-ID: 20201109163159.GH16415@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Dmitry Shulga (d(dot)shulga(at)postgrespro(dot)ru) wrote:
> > On 19 Oct 2020, at 23:25, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> >>>> Implementation of this approach assumes running several background processes (bgworkers),
> >>>> each of which runs the shell command specified by the parameter restore_command
> >>>> to deliver an archived WAL file. The number of parallel processes is limited
> >>>> by the new parameter max_restore_command_workers. If this parameter has the value 0,
> >>>> then WAL file delivery is performed using the original algorithm, that is, in a
> >>>> one-by-one manner. If this parameter has a value greater than 0, then the database
> >>>> server starts several bgworker processes, up to the limit specified by
> >>>> the parameter max_restore_command_workers, and passes to every process a
> >>>> WAL file name to deliver. Active processes start prefetching the specified
> >>>> WAL files and store the received files in the directory pg_wal/pgsql_tmp. After
> >>>> a bgworker process finishes receiving a file, it marks itself as free
> >>>> and waits for a new request to receive the next WAL file. The main process
> >>>> performing database recovery still handles WAL files in a one-by-one manner,
> >>>> but instead of waiting for the next required WAL file to become available, it checks for
> >>>> that file in the prefetch directory. If the file is present there,
> >>>> the main process starts processing it.
> >>>
> >>> I'm a bit confused about this description- surely it makes sense for the
> >> OK. The description I originally provided was probably pretty misleading, so I will try to clarify it a bit.
> >>
> >> So, as soon as a bgworker process finishes delivering a WAL file, it marks itself as free.
> >>
> >> The WAL records applier works in parallel, processing the WAL files in sequential order.
> >> Once it finishes handling the current WAL file, it checks whether it is possible to run extra bgworker processes
> >> to deliver WAL files which will be required a bit later. If there are free bgworker processes, the applier requests
> >> them to start downloading one or more extra WAL files. After that the applier determines the name of the next WAL file to handle
> >> and checks whether it exists in the prefetching directory. If it does exist, the applier starts handling it and
> >> the processing loop is repeated.
> >
> > Ok- so the idea is that each time the applying process finishes with a
> > WAL file then it'll see if there's an available worker and, if so, will
> > give it the next file to go get (which would presumably be some number
> > in the future and the actual next file the applying process needs is
> > already available). That sounds better, at least, though I'm not sure
> > why we're making it the job of the applying process to push the workers
> > each time..?
> Every bgworker serves as a task to deliver a WAL file. Considering a task as an active entity is a well-known approach in software design.
> So I don't see any issues with such an implementation. Moreover, implementation of this approach is probably simpler than any other alternative
> and still provides a positive performance impact compared with the current (non-optimized) implementation.

I don't think we look only at whether something is an improvement over
the current situation when we consider changes.

The relatively simple approach I was thinking of was that a couple of
workers would be started and they'd have some prefetch amount that needs
to be kept out ahead of the applying process, which they could
potentially calculate themselves without needing to be pushed forward by
the applying process.
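
To make that a bit more concrete, here's a very rough sketch of what a
self-driving prefetch worker loop could look like- this is not code from
the patch, every helper function here is hypothetical, and it assumes a
wal_prefetch_amount-style knob of the kind discussed further down:

    /* Illustrative sketch only -- the helper functions are hypothetical. */
    for (;;)
    {
        XLogSegNo   applied = GetLastAppliedSegment();      /* applier's position, from shared memory */
        XLogSegNo   fetched = GetLastPrefetchedSegment();   /* highest segment already requested */

        /* Keep a window of wal_prefetch_amount worth of segments ahead of the applier. */
        if (fetched < applied + wal_prefetch_amount / wal_segment_size)
        {
            XLogSegNo   next = ClaimNextSegmentToFetch();    /* atomically reserve the next segment */

            RunRestoreCommand(next, "pg_wal/pgsql_tmp");     /* run restore_command into the prefetch dir */
        }
        else
            WaitForApplierToAdvance();                       /* window is full; sleep until woken */
    }

The point being that the workers can keep the window filled on their own,
rather than being handed one file at a time by the applying process.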

> > Also, I'm not sure about the interface- wouldn't it make
> > more sense to have a "pre-fetch this amount of WAL" kind of parameter
> > directly instead of tying that to the number of background workers?
> This approach was originally considered and closely discussed.
> Finally, it was decided that introducing an extra GUC parameter to control the pre-fetch limit is not practical, since it shifts responsibility for tuning the
> prefetching mechanism from the postgres server to the user.
> From my point of view, the fewer parameters needed to set up a feature, the better.

I agree in general that it's better to have fewer parameters, but I
disagree that this isn't an important option for users to be able to
tune- the rate of fetching WAL and of applying WAL varies quite a bit
from system to system. Being able to tune the pre-fetch seems like it'd
actually be more important to a user than the number of processes
required to keep up with that amount of pre-fetching, which is something
we could actually figure out on our own...
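
Just to illustrate with made-up numbers: if restoring one segment from the
archive takes ~3s while applying it takes ~1s, then something like

    workers needed ~= ceil(restore time per segment / apply time per segment) = ceil(3 / 1) = 3

is enough to keep the applier continuously fed, and that's a calculation the
server could make (or adapt to) itself, given a target prefetch amount.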

> > You
> > might only need one or two processes doing WAL fetching to be able to
> > fetch faster than the applying process is able to apply it, but you
> > probably want to pre-fetch more than just one or two 16 MB WAL files.
>
> Every time prefetching is started, the number of files that may be prefetched is calculated by the expression
> PREFETCH_RATION * max_restore_command_workers - 'number of already prefetched files'
> where PREFETCH_RATION is a compiled-in constant with the value 16.
>
> After that, a task to deliver the next WAL file is handed to a currently free bgworker process, until there are no more free bgworker processes.

Ah, it wasn't mentioned that we've got a multiplier in here, but it
still ends up meaning that if a user actually wants to tune the amount
of pre-fetching being done, they're going to end up having to tune the,
pretty much entirely unrelated, value of max_restore_command_workers.
That really seems entirely backwards to me from what I would think the
user would actually want to tune.
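
To put rough numbers on that (assuming I've read the description above
correctly, and assuming the default 16MB segment size):

    prefetch target = PREFETCH_RATION * max_restore_command_workers * wal_segment_size
                    = 16 * 2 * 16MB ~= 512MB   (2 workers)
                    = 16 * 4 * 16MB ~= 1GB     (4 workers)

so a user who simply wants a bigger prefetch window has to double the number
of workers, even if one worker is already keeping up.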

> > In other words, I would have thought we'd have:
> >
> > wal_prefetch_amount = 1GB
> > max_restore_command_workers = 2
> >
> > and then you'd have up to 2 worker processes running and they'd be
> > keeping 1GB of WAL pre-fetched at all times. If we have just
> > 'max_restore_command_workers' and you want to pre-fetch 1GB of WAL then
> > you'd have to have a pretty high value there and you'd end up with
> > a bunch of threads that all spike to go do work each time the applying
> Sorry, I don't see how we can end up with a bunch of threads?
> max_restore_command_workers has the value 2 in your example, meaning that no more than 2 bgworkers could be run concurrently for the sake of WAL file prefetching.

If you don't give the user the option to configure the prefetch amount,
except indirectly by changing the number of max restore workers, then to
get a higher prefetch amount they have to increase the number of
workers. That's what I'm referring to above, and previously, here.

> > process finishes a WAL file but then just sit around doing nothing while
> > waiting for the applying process to finish another segment.
>
> I believe that for a typical set-up the parameter max_restore_command_workers would have a value of 2 or 3, in order to supply
> a delivered WAL file on time, just before its processing starts.
>
> This use case is for environments where the time required to deliver a WAL file from the archive is greater than the time required to apply the records it contains.
> If the time required to deliver a WAL file is less than the time required to handle the records it contains, then max_restore_command_workers shouldn't be specified at all.

That's certainly not correct at all- the two aren't really all that
related, because any time spent waiting for a WAL file to be delivered
is time that the applying process *could* be working to apply WAL
instead of waiting. At a minimum, I'd expect us to want to have, by
default, at least one worker process running out in front of the
applying process to hopefully eliminate most, if not all, time where the
applying process is waiting for a WAL to show up. In cases where the
applying process is faster than a single fetching process, a user might
want to have two or more restore workers, though ultimately I still
contend that what they really want is as many workers as needed to make
sure that the applying process doesn't ever need to wait- up to some
limit based on the amount of space that's available.

And back to the configuration side of this- have you considered the
challenge that a user who is using very large WAL files might run
into with the proposed approach that doesn't allow them to control the
amount of space used? If I'm using 1G WAL files, then I need to have
16G available to have *any* pre-fetching done with this proposed
approach, right? That doesn't seem great.
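
That is, again assuming I'm reading the formula above correctly:

    prefetch target = 16 * max_restore_command_workers * wal_segment_size
                    = 16 * 1 * 1GB = 16GB   even with a single worker

with no knob to cap the space consumed by pg_wal/pgsql_tmp.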

Thanks,

Stephen
