Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2019-08-29 18:48:24
Message-ID: 20190829184824.kmrbchrk2ged6vjw@development
Lists: pgsql-hackers

On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
>On 28.08.2019 22:06, Tomas Vondra wrote:
>>>>>>>Interesting. Any idea where does the extra overhead in
>>>>>>>this particular
>>>>>>>case come from? It's hard to deduce that from the single
>>>>>>>flame graph,
>>>>>>>when I don't have anything to compare it with (i.e. the
>>>>>>>flame graph for
>>>>>>>the "normal" case).
>>>>>>I guess that bottleneck is in disk operations. You can check
>>>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>>>writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>>>please, see attached flame graph for the following transaction:
>>>>>>INSERT INTO large_text
>>>>>>SELECT (SELECT string_agg('x', ',')
>>>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>>Execution Time: 44519.816 ms
>>>>>>Time: 98333,642 ms (01:38,334)
>>>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>>>~x4-5 performance drop here. JFYI, I am using a machine with
>>>>>>SSD for tests.
>>>>>>Therefore, probably you may write changes on receiver in
>>>>>>bigger chunks,
>>>>>>not each change separately.
>>>>>Possibly, I/O is certainly a possible culprit, although we should be
>>>>>using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>>>not sure why would it be cheaper to do the writes in batches.
>>>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>>>running this on a single machine, and it's difficult to decide?
>>>>I run this on a single machine, but walsender and worker are
>>>>utilizing almost 100% of CPU per each process all the time, and
>>>>at apply side I/O syscalls take about 1/3 of CPU time. Though I
>>>>am still not sure, but for me this result somehow links
>>>>performance drop with problems at receiver side.
>>>>Writing in batches was just a hypothesis and to validate it I
>>>>have performed test with large txn, but consisting of a smaller
>>>>number of wide rows. This test does not exhibit any significant
>>>>performance drop, while it was streamed too. So it seems to be
>>>>valid. Anyway, I do not have other reasonable ideas beside that
>>>>right now.
>>>It seems that overhead added by synchronous replica is lower by
>>>2-3 times compared with Postgres master and streaming with
>>>spilling. Therefore, the original patch eliminated delay before
>>>large transaction processing start by sender, while this
>>>additional patch speeds up the applier side.
>>>Although the overall speed up is surely measurable, there is still
>>>room for improvement:
>>>1) Currently bgworkers are only spawned on demand without some
>>>initial pool and never stopped. Maybe we should create a small
>>>pool on replication start and offload some of idle bgworkers if
>>>they exceed some limit?
>>>2) Probably we can track somehow that incoming change has
>>>conflicts with some of being processed xacts, so we can wait for
>>>specific bgworkers only in that case?
>>>3) Since the communication between main logical apply worker and
>>>each bgworker from the pool is a 'single producer --- single
>>>consumer' problem, then probably it is possible to wait and
>>>set/check flags without locks, but using just atomics.
>>>What do you think about this concept in general? Any concerns and
>>>criticism are welcome!
>Hi Tomas,
>Thank you for a quick response.
>>I don't think it matters very much whether the workers are started at the
>>beginning or allocated ad hoc, that's IMO a minor implementation detail.
>OK, I had the same vision about this point. Any minor differences here
>will be negligible for a sufficiently large transaction.
>>There's one huge challenge that I however don't see mentioned in your
>>message or in the patch (after cursory reading) - ensuring the same
>>order, and introducing deadlocks that would not exist in single-process
>Probably I haven't explained well this part, sorry for that. In my
>patch I don't use workers pool for a concurrent transaction apply, but
>rather for a fast context switch between long-lived streamed
>transactions. In other words we apply all changes arrived from the
>sender in a completely serial manner. Being written step-by-step it
>looks like:
>1) Read STREAM START message and figure out the target worker by xid.
>2) Put all changes that belong to this xact to the selected worker
>one by one via shm_mq_send.
>3) Read STREAM STOP message and wait until our worker has applied all
>changes in the queue.
>4) Process all other chunks of streamed xacts in the same manner.
>5) Process all non-streamed xacts immediately in the main apply worker loop.
>6) If we read STREAMED COMMIT/ABORT we again wait until selected
>worker either commits or aborts.
>Thus, it automatically guarantees the same commit order on replica as
>on master. Yes, we lose some performance here, since we don't apply
>transactions concurrently, but doing so would bring all those problems
>you have described.

OK, so it's apply in multiple processes, but at any moment only a single
apply process is active.
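
To make sure we mean the same thing, here is a rough toy model of the dispatch loop described in steps 1-6 above. It is only a sketch under my own assumptions, not code from the patch: the message tuples, the Worker class and the apply_stream function are all hypothetical names, a plain deque stands in for the shm_mq queue, and "waiting for the worker" is modeled as a synchronous drain() call.

```python
from collections import deque


class Worker:
    """Hypothetical stand-in for one bgworker that owns one streamed xact."""

    def __init__(self, xid):
        self.xid = xid
        self.queue = deque()  # stands in for the shm_mq queue (assumption)
        self.applied = []     # changes applied inside the still-open xact

    def drain(self):
        # Apply everything queued so far; the main loop blocks on this call,
        # so only one process is ever applying changes at a time.
        while self.queue:
            self.applied.append(self.queue.popleft())


def apply_stream(messages):
    """Process messages strictly serially; return (commit_order, committed)."""
    workers = {}       # xid -> Worker, spawned on demand
    commit_order = []  # xids in the order they commit on the replica
    committed = {}     # xid -> list of committed changes
    current = None     # worker selected by the open STREAM START block

    for msg in messages:
        kind = msg[0]
        if kind == "STREAM_START":      # step 1: pick the worker by xid
            current = workers.setdefault(msg[1], Worker(msg[1]))
        elif kind == "CHANGE":          # step 2: queue changes one by one
            current.queue.append(msg[1])
        elif kind == "STREAM_STOP":     # step 3: wait until the queue is empty
            current.drain()
            current = None
        elif kind == "XACT":            # step 5: non-streamed xact, main loop
            committed[msg[1]] = list(msg[2])
            commit_order.append(msg[1])
        elif kind == "STREAM_COMMIT":   # step 6: wait for the worker to commit
            worker = workers.pop(msg[1])
            worker.drain()
            committed[worker.xid] = worker.applied
            commit_order.append(worker.xid)
        elif kind == "STREAM_ABORT":    # step 6: worker aborts, nothing kept
            workers.pop(msg[1])
        else:
            raise ValueError(f"unknown message kind: {kind!r}")
    return commit_order, committed


# Two streamed xacts (100, 102) interleaved with a small non-streamed one (101).
stream = [
    ("STREAM_START", 100), ("CHANGE", "a1"), ("CHANGE", "a2"), ("STREAM_STOP",),
    ("XACT", 101, ["b1"]),
    ("STREAM_START", 102), ("CHANGE", "c1"), ("STREAM_STOP",),
    ("STREAM_START", 100), ("CHANGE", "a3"), ("STREAM_STOP",),
    ("STREAM_COMMIT", 102),
    ("STREAM_COMMIT", 100),
]
order, committed = apply_stream(stream)
```

In this model the blocks of xacts 100 and 102 interleave, but the commit order on the replica is simply the order in which the commit messages arrive, because the main loop never moves on before the selected worker has finished.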

>However, you helped me to figure out another point I have forgotten.
>Although we ensure commit order automatically, the beginning of
>streamed xacts may reorder. It happens if some small xacts have been
>committed on master since the streamed one started, because we do not
>start streaming immediately, but only after logical_work_mem is hit. I
>have performed some tests with conflicting xacts and it seems that
>it's not a problem, since the locking mechanism in Postgres guarantees
>that if there were any deadlocks, they would happen earlier on
>master. So if some records hit the WAL, it is safe to apply them
>sequentially. Am I wrong?

I think you're right: the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I don't
think reordering the blocks of streamed transactions matters, as long
as the commit order is ensured in this case.

>Anyway, I'm going to double check the safety of this part later.


FWIW my understanding is that the speedup comes mostly from elimination of
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
