Re: Speedup twophase transactions

From: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Speedup twophase transactions
Date: 2017-01-24 13:01:30
Message-ID: 06A44F22-B58D-4FF1-BEED-5447A63D2A11@postgrespro.ru
Lists: pgsql-hackers


> On 24 Jan 2017, at 09:42, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>
> On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke
> <nikhils(at)2ndquadrant(dot)com> wrote:
>> Speeding up recovery or failover activity via a faster promote is a
>> desirable thing. So, maybe, we should look at teaching the relevant
>> code about using "KnownPreparedList"? I know that would increase the
>> size of this patch and would mean more testing, but this seems to be the
>> last remaining optimization in this code path.
>
> That's a good idea, worth having in this patch. Actually we may not
> want to call KnownPreparedRecreateFiles() here, as promotion has not
> been synonymous with an end-of-recovery checkpoint for a couple of
> releases now.

Thanks for the review, Nikhil and Michael.

I don't follow here. We move the data out of WAL into files at checkpoint time because,
after a checkpoint, there is no guarantee that the WAL segment holding our prepared
transaction will still be available.
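
To make that intent concrete, here is a self-contained toy sketch of the idea (the type,
function and path names are illustrative only, not the actual patch code): prepared
transactions replayed from WAL are tracked in memory, and a checkpoint dumps them to
per-transaction state files so nothing is lost once the older WAL segments get recycled.

#include <stdio.h>
#include <stdlib.h>

typedef struct KnownPrepared
{
	unsigned int xid;
	char gid[200];                       /* global transaction identifier */
	struct KnownPrepared *next;
} KnownPrepared;

static KnownPrepared *known_prepared_list = NULL;

/* Replay of a PREPARE record: remember the transaction in memory only. */
static void
remember_prepared(unsigned int xid, const char *gid)
{
	KnownPrepared *kp = malloc(sizeof(KnownPrepared));

	if (kp == NULL)
		return;
	kp->xid = xid;
	snprintf(kp->gid, sizeof(kp->gid), "%s", gid);
	kp->next = known_prepared_list;
	known_prepared_list = kp;
}

/* Checkpoint: move the in-memory entries into durable per-xid files. */
static void
flush_known_prepared_at_checkpoint(void)
{
	KnownPrepared *kp;

	while ((kp = known_prepared_list) != NULL)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "twophase_toy/%08X", kp->xid);
		f = fopen(path, "w");
		if (f)
		{
			fprintf(f, "%s\n", kp->gid);
			fclose(f);
		}
		known_prepared_list = kp->next;
		free(kp);
	}
}

int
main(void)
{
	remember_prepared(1000, "toy-gid-1");
	remember_prepared(1001, "toy-gid-2");
	flush_known_prepared_at_checkpoint();	/* files survive WAL recycling */
	return 0;
}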

> The difference between those two is likely noise.
>
> By the way, in those measurements the OS cache is still filled with
> the past WAL segments, which is rather a best case, no? What happens
> if you do the same kind of tests on a box where memory is busy doing
> something else and replayed WAL segments get evicted from the OS cache
> more aggressively once the startup process switches to a new segment?
> This could be tested for example on a VM with little memory (say 386MB
> or less), so that when the startup process needs to access the past
> WAL segments again to recover the 2PC information, it has to read them
> back directly from disk... One trick you could use here would be to
> tweak the startup process so that it drops the OS cache once a segment
> has finished replaying, and see the effect of aggressive OS cache
> eviction. This patch shows really nice improvements with the OS cache
> backing the data; still, it would make sense to test a worse case and
> see if things could be done better. The startup process currently
> reads records only sequentially, never randomly; random access is a
> concept that this patch introduces.
>
> Anyway, perhaps this does not matter much: the non-recovery code path
> does the same thing as this patch, and the improvement is too big to
> be ignored. So for consistency's sake we could go with the proposed
> approach, which has the advantage of not putting any restriction on
> the size of the 2PC files, contrary to what an implementation saving
> the contents of the 2PC files into memory would need to do.

Maybe I'm missing something, but I don't see how the OS cache can affect anything here.

Total WAL size was 0x44 (68) segments * 16 MB = 1088 MB, and recovery time is about 20s.
Sequentially reading 1 GB of data is an order of magnitude faster than that even on an
old HDD, not to mention an SSD. Also, you can take a look at the flame graphs attached
to the previous message: the majority of recovery time is spent in pg_qsort while
replaying PageRepairFragmentation, while the whole of xact_redo_commit() takes about 1%
of the time. That share can grow with uncached disk reads, but given the total recovery
time it should not matter much.

If you are talking about uncached access only during checkpoints, then we are restricted
by max_prepared_transactions, so at most we will read about a hundred small files (each
usually fitting into one filesystem page), which will also be barely noticeable compared
to the recovery time between checkpoints. Also, eviction of WAL segments from the cache
during replay doesn't seem like a standard scenario to me.

Anyway, I took a machine with an HDD to slow down read speed and ran the tests again.
During one of the runs I launched in parallel a bash loop that dropped the OS cache
every second (replaying one WAL segment also takes about one second).
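
For reference, the cache dropping was just a one-second loop writing to
/proc/sys/vm/drop_caches; a C equivalent of that bash one-liner would look roughly like
this (Linux-only, needs root, shown only to document the test setup):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	for (;;)
	{
		FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

		if (f)
		{
			sync();             /* flush dirty pages first */
			fputs("3\n", f);    /* 3 = drop page cache, dentries and inodes */
			fclose(f);
		}
		sleep(1);
	}
}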

1.5M transactions
start segment: 0x06
last segment: 0x47

patched, with constant cache_drop:
total recovery time: 86s

patched, without constant cache_drop:
total recovery time: 68s

(while the difference is significant, I bet it happens mostly because database file segments have to be re-read after each cache drop)

master, without constant cache_drop:
time to recover 35 segments: 2h 25m (after that I got tired of waiting)
expected total recovery time: 4.5 hours

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
