Re: restoration after crash slowness, any way to improve?

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: restoration after crash slowness, any way to improve?
Date: 2016-09-02 01:01:21
Message-ID: CAMkU=1zEJUBFi=PrhqgiMebrimYDmsKF2uz+dxQEdc+C17FoSQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 31, 2016 at 6:26 PM, Joshua D. Drake <jd(at)commandprompt(dot)com>
wrote:

> -hackers,
>
> So this is more of a spit balling thread than anything. As I understand
> it, if we have a long running transaction or a large number of wal logs and
> we crash, we then have to restore those logs on restart to the last known
> good transaction. No problem.
>

It only has to replay from the start of the last successful checkpoint. It
doesn't matter whether there was a long-running transaction or not. If a
long transaction spans many checkpoints, replay still only has to go back
to the start of the last successful checkpoint. Maybe you just had
checkpoint_segments or max_wal_size set way too high, on the assumption
that checkpoint_timeout would always kick in first and be the limiting
factor, and then your long-running transaction invalidated that
assumption?
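
Roughly, if max_wal_size is set so high that checkpoint_timeout is the
only trigger that ever fires, the worst case crash recovery has to replay
is on the order of one timeout interval's worth of WAL. A back-of-envelope
sketch in Python (every number below is made up for illustration, not
measured from your machine):

    # Rough worst-case crash-recovery estimate when checkpoint_timeout is
    # the only checkpoint trigger that ever fires.  All inputs are assumed
    # values for illustration only.
    wal_write_rate_mb_per_s = 80      # sustained WAL generation (assumed)
    checkpoint_timeout_s = 30 * 60    # checkpoint_timeout = 30min (assumed)
    replay_rate_mb_per_s = 90         # single-process redo speed (assumed)

    # WAL accumulated since the start of the last checkpoint, worst case.
    wal_to_replay_mb = wal_write_rate_mb_per_s * checkpoint_timeout_s

    # Time recovery would spend replaying it on restart.
    replay_time_s = wal_to_replay_mb / replay_rate_mb_per_s

    print("worst-case WAL to replay: %.0f MB" % wal_to_replay_mb)
    print("estimated replay time: %.0f s" % replay_time_s)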

> I recently ran a very long transaction. I was building up a large number
> of rows into a two column table to test index performance. I ended up
> having to kill the connection and thus the transaction after I realized I
> had an extra zero in my generate_series(). (Side note: Amazing the
> difference a single zero can make ;)). When coming back up, I watched the
> machine and I was averaging anywhere from 60 MB/s to 97 MB/s on writes.

Was it IO limited?

Killing a session/connection/transaction should not take down the entire
server, so there should be no recovery taking place in the first place.
Are you sure you are seeing recovery, and not just the vacuuming of the
aborted tuples?
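
An easy way to tell the two apart from SQL is to check whether the
postmaster actually restarted, and whether autovacuum has been chewing on
the table. A sketch using psycopg2; the connection string and the table
name ("test_idx") are placeholders, not anything from your setup:

    # Did the server really go through crash recovery, or is the write
    # traffic just (auto)vacuum cleaning up rows from the aborted
    # transaction?  Connection string and table name are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")
    cur = conn.cursor()

    # If this is older than the moment you killed the session, the server
    # never went down and no WAL replay happened.
    cur.execute("SELECT pg_postmaster_start_time()")
    print("postmaster started at:", cur.fetchone()[0])

    # Dead tuples and recent autovacuum activity point at vacuum, not
    # recovery, as the source of the writes.
    cur.execute("""
        SELECT n_dead_tup, last_autovacuum, autovacuum_count
        FROM pg_stat_user_tables
        WHERE relname = %s
    """, ("test_idx",))
    print(cur.fetchone())

    cur.close()
    conn.close()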

> However, since I know this machine can get well over 400MBs when using
> multiple connections I can't help but wonder if there is anything we can do
> to make restoration more efficient without sacrificing the purpose of what
> it is doing?
>
> Can we have multiple readers pull transaction logs into shared_buffers (on
> recovery only), sort the good transactions and then push them back to the
> walwriter or bgwriter to the pages?
>

I don't see how that could work. Whether a page is consistent or not is
orthogonal to whether the transactions on that page have committed or
aborted. Redo just makes the pages physically consistent again; whether
the transaction that wrote a given record ultimately committed or aborted
is sorted out afterwards by the normal visibility rules, not by filtering
records during replay.

There are two possibilities that I've considered though for long-running
PITR, which could also apply to crash recovery, and which I think have been
discussed here before. One is to have a leading recovery process which
would identify pages that will be recovered from an FPI, and send word
back to the lagging process not to bother applying incremental WAL to those
pages. The other would be for a leading process to asynchronously read
into memory (either FS cache or shared_buffers) pages which it sees the
lagging process will need to write to.

In the first case, you would want the leading process to be leading by a
lot, so that it has the broadest scope to detect FPIs. Basically you would
want it to read all the way to the end of the replay, provided it had
enough memory to store the list of FPI pages. For the second one, you
would not want it to run so far ahead that the pages it read in would get
pushed out again before the lagging process got to them. Controlling how
far ahead it should run seems like it would be hard.
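
To make the first idea concrete, here is a toy, in-memory sketch of the
"skip incremental WAL for pages that a later FPI will overwrite anyway"
scheme. The (lsn, page, is_fpi) record tuples are invented for
illustration and bear no resemblance to the real WAL format:

    # Toy model of the leading/lagging idea: a first pass notes, for each
    # page, the LSN of the last full-page image (FPI) touching it; the
    # replay pass then skips incremental records for a page whenever a
    # later FPI is going to overwrite that page anyway.
    def scan_ahead(records):
        """Leading pass: page -> LSN of the last FPI seen for that page."""
        last_fpi_lsn = {}
        for lsn, page, is_fpi in records:
            if is_fpi:
                last_fpi_lsn[page] = lsn
        return last_fpi_lsn

    def replay(records, last_fpi_lsn, apply_record):
        """Lagging pass: apply records, skipping work a later FPI moots."""
        skipped = 0
        for lsn, page, is_fpi in records:
            if not is_fpi and last_fpi_lsn.get(page, -1) > lsn:
                skipped += 1      # a later FPI will rewrite this page
                continue
            apply_record(lsn, page, is_fpi)
        return skipped

    if __name__ == "__main__":
        wal = [
            (100, "rel1/blk7", False),  # incremental update
            (200, "rel1/blk7", False),  # incremental update
            (300, "rel1/blk7", True),   # FPI: makes the two above moot
            (400, "rel2/blk3", False),  # no later FPI, must be applied
        ]
        fpi_map = scan_ahead(wal)
        n = replay(wal, fpi_map,
                   lambda lsn, page, fpi: print("apply", page, "at", lsn))
        print("skipped", n, "incremental records")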

Cheers,

Jeff
