Re: Logical replication prefetch

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication prefetch
Date: 2025-07-08 18:38:01
Message-ID: 84ed36b8-7d06-4945-9a6b-3826b3f999a6@garret.ru
Lists: pgsql-hackers


On 08/07/2025 2:51 pm, Amit Kapila wrote:
> On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik<knizhnik(at)garret(dot)ru> wrote:
>> It is possible to enforce parallel apply of short
>> transactions using `debug_logical_replication_streaming` but then
>> performance is ~2x slower than in the case of sequential apply by a
>> single worker.
>>
> What is the reason for such a large slowdown? Is it because the amount
> of network transfer has increased without giving any significant
> advantage because of the serialization of commits?

No, I do not think that network traffic is increased.
If I remove the locks (just by commenting out the bodies of the
`pa_lock_stream` and `pa_unlock_stream` functions and the call of
`pa_wait_for_xact_finish`), I get a 3x speed improvement (with 4
parallel apply workers) compared with normal mode (when transactions are
applied by the main logical replication worker). So the main reason is
lock overhead/contention and de-facto serialization of transactions (in
`top` I see that only one worker is active most of the time).
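
For reference, a minimal sketch of what that experiment looks like,
assuming the PG 16+ layout of
src/backend/replication/logical/applyparallelworker.c (the exact bodies
may differ between versions; the point is only that the stream lock
becomes a no-op):

    void
    pa_lock_stream(TransactionId xid, LOCKMODE lockmode)
    {
    #if 0                       /* disabled to measure raw apply speed */
        LockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
                                       PARALLEL_APPLY_LOCK_STREAM, lockmode);
    #endif
    }

    void
    pa_unlock_stream(TransactionId xid, LOCKMODE lockmode)
    {
    #if 0
        UnlockApplyTransactionForSession(MyLogicalRepWorker->subid, xid,
                                         PARALLEL_APPLY_LOCK_STREAM, lockmode);
    #endif
    }

With the bodies disabled (and the `pa_wait_for_xact_finish` call skipped
in the leader), commit ordering is no longer enforced, which of course
is only safe when the applied transactions are truly independent.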

Even with a simulated 0.1 msec read delay, the results of the update
tests are the following:

normal mode: 7:40
forced parallel mode: 8:30
forced parallel mode (no locks): 1:45

>> By removing serialization by commits, it is possible to
>> speed up apply 3x and make the subscriber apply changes faster than the
>> producer can produce them even with multiple clients. But it is possible
>> only if transactions are independent and it can be enforced only by
>> tracking dependencies which seems to be very non-trivial and invasive.
>>
>> I still do not completely give up with tracking dependencies approach,
>> but decided first to try more simple solution - prefetching.
>>
> Sounds reasonable, but in the long run, we should track transaction
> dependencies and allow parallel apply of all the transactions.

I agree.
I see two different approaches:

1. Build a dependency graph: track dependencies between xids while
transactions are executed at the publisher, and then include this graph
in the commit record.
2. Calculate a hash of the replica identity key and check that the data
sets of the transactions do not intersect; this certainly will not work
if there are triggers (a sketch of this check follows below).
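
To illustrate the second approach, here is a hypothetical, minimal
sketch (none of these names exist in PostgreSQL; real code would use a
proper hash table and handle set overflow):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define MAX_KEYS 1024

    /* Hashes of the replica identity keys touched by one transaction. */
    typedef struct TxnKeySet
    {
        uint32_t    hashes[MAX_KEYS];
        size_t      nkeys;
    } TxnKeySet;

    /* Record one more replica-identity-key hash for this transaction. */
    static void
    keyset_add(TxnKeySet *set, uint32_t key_hash)
    {
        if (set->nkeys < MAX_KEYS)
            set->hashes[set->nkeys++] = key_hash;
    }

    /* True if the two transactions touch at least one common key. */
    static bool
    keyset_intersects(const TxnKeySet *a, const TxnKeySet *b)
    {
        for (size_t i = 0; i < a->nkeys; i++)
            for (size_t j = 0; j < b->nkeys; j++)
                if (a->hashes[i] == b->hashes[j])
                    return true;    /* dependent: must serialize */
        return false;               /* independent: may apply in parallel */
    }

A transaction would be dispatched to a parallel apply worker only if its
key set does not intersect the key set of any transaction currently in
flight. Hash collisions are harmless here: they can only produce false
dependencies (needless serialization), never missed ones.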

>
>> It is
>> already used for physical replication. Certainly in the case of physical
>> replication it is much simpler, because each WAL record contains a list
>> of accessed blocks.
>>
>> In the case of logical replication, prefetching can be done either by
>> prefetching access to the replica identity index (usually the primary key),
>> or by executing the replication command in some background worker.
>> Certainly the first case is much easier.
>>
> It seems there is only one case described, so what exactly are you
> referring to first and second?

First: perform a lookup in the replica identity index (primary key). It
will prefetch the index pages and the referenced heap page.
Second: execute the LR operation (insert/update) in a prefetch worker
and then roll back the transaction.
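
A rough sketch of the first option (the helper below is hypothetical;
`RelationFindReplTupleByIndex()` is the existing routine the apply
worker uses to locate a row by its replica identity, and calling it has
the side effect of reading the index and heap pages into shared
buffers):

    /* Hypothetical prefetch-worker helper: look the tuple up through
     * the replica identity index purely for the caching side effect. */
    static void
    prefetch_replica_identity_lookup(Relation rel, Oid idxoid,
                                     TupleTableSlot *searchslot)
    {
        TupleTableSlot *outslot = table_slot_create(rel, NULL);

        /* The lookup itself is the prefetch; the result is discarded.
         * A real implementation would want a non-locking variant, since
         * taking a tuple lock in a prefetch worker could block the
         * actual apply. */
        (void) RelationFindReplTupleByIndex(rel, idxoid,
                                            LockTupleNoKeyExclusive,
                                            searchslot, outslot);

        ExecDropSingleTupleTableSlot(outslot);
    }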

> But what about worst cases where these additional pre-fetches could
> lead to removing some pages from shared_buffers, which are required by
> the workload on the subscriber? I think you should try such workloads
> as well.
>
It is the common problem of all prefetch algorithms: if the cache where
prefetch results are stored (shared buffers, OS file cache, ...) is not
large enough to keep the prefetched result until it is used, then
prefetch will not provide any performance improvement and may even cause
some degradation.
So it is a really challenging task to choose the optimal time for the
prefetch operation: too early, and its results will be thrown away
before they are requested; too late, and the executor has to wait for
prefetch completion or load the page itself. Certainly there is some
kind of autotuning: a worker performing prefetch has to wait for IO
completion, while the executor, which picks up pages from the cache,
processes requests faster and so should catch up with the prefetch
workers. Then it has to perform IO itself and starts to fall behind the
prefetch workers again.

> I understand that it is just a POC, so you haven't figured out all the
> details, but it would be good to know the reason for these deadlocks.
>
Will investigate it.

>> I wonder if such an LR prefetching approach is considered useful?
>> Or is it better to investigate other ways to improve LR apply speed
>> (parallel apply)?
>>
> I think it could be a good intermediate step till we are able to find
> a solution for tracking the dependencies. Do you think this work will
> be useful once we have parallel apply, and if so how?

I think that if we have parallel apply, prefetch is not needed.
At least that is what I see now in Neon (open source serverless Postgres
which separates compute and storage).
We have implemented prefetch for seqscan and indexscan because of the
relatively large access latency to the page server. It can really
improve performance significantly: 4 or even more times.
But the same effect can be achieved by forcing a parallel plan with a
larger number of parallel workers. Unfortunately the effects of these
two optimizations do not multiply, so a parallel plan + prefetch shows
almost the same speed as either optimization alone.
