Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Alexey Kondratov <a(dot)kondratov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Subject: Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date: 2019-08-29 14:37:45
Message-ID: bb84dcf8-0a7d-5c8b-9cea-7186dc3acc3c@postgrespro.ru
Lists: pgsql-hackers

On 28.08.2019 22:06, Tomas Vondra wrote:
>
>>
>>>>>> Interesting. Any idea where the extra overhead in this particular
>>>>>> case comes from? It's hard to deduce that from the single flame
>>>>>> graph, when I don't have anything to compare it with (i.e. the
>>>>>> flame graph for the "normal" case).
>>>>> I guess the bottleneck is in disk operations. You can check the
>>>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>> writes (~26%) take around 35% of CPU time in total. For comparison,
>>>>> please see the attached flame graph for the following transaction:
>>>>>
>>>>> INSERT INTO large_text
>>>>> SELECT (SELECT string_agg('x', ',')
>>>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>
>>>>> Execution Time: 44519.816 ms
>>>>> Time: 98333,642 ms (01:38,334)
>>>>>
>>>>> where disk IO is only ~7-8% in total. So we get very roughly the
>>>>> same ~x4-5 performance drop here. JFYI, I am using a machine with
>>>>> an SSD for these tests.
>>>>>
>>>>> Therefore, it may help to write changes on the receiver in bigger
>>>>> chunks, rather than each change separately.
>>>>>
>>>> Possibly. I/O is certainly a possible culprit, although we should be
>>>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>> not sure why it would be cheaper to do the writes in batches.
>>>>
>>>> BTW does this mean you see the overhead on the apply side? Or are you
>>>> running this on a single machine, and it's difficult to decide?
>>>
>>> I ran this on a single machine, but the walsender and the worker are
>>> each utilizing almost 100% of a CPU all the time, and on the apply
>>> side I/O syscalls take about 1/3 of CPU time. I am still not sure,
>>> but to me this result links the performance drop to problems on the
>>> receiver side.
>>>
>>> Writing in batches was just a hypothesis; to validate it, I performed
>>> a test with a large txn consisting of a smaller number of wide rows.
>>> That test does not exhibit any significant performance drop, even
>>> though it was streamed too, so the hypothesis seems to hold. Anyway,
>>> I do not have other reasonable ideas besides that right now.
>>
>> It seems that the overhead added by a synchronous replica is 2-3 times
>> lower compared with Postgres master and streaming with spilling.
>> Therefore, the original patch eliminates the delay before the sender
>> starts processing a large transaction, while this additional patch
>> speeds up the apply side.
>>
>> Although the overall speed up is surely measurable, there is still
>> room for improvement:
>>
>> 1) Currently, bgworkers are only spawned on demand, without any
>> initial pool, and they are never stopped. Maybe we should create a
>> small pool at replication start and shut down idle bgworkers if they
>> exceed some limit?
>>
>> 2) Probably we can somehow track that an incoming change conflicts
>> with one of the xacts currently being processed, so that we wait for
>> specific bgworkers only in that case?
>>
>> 3) Since the communication between the main logical apply worker and
>> each bgworker from the pool is a 'single producer --- single consumer'
>> problem, it should be possible to wait and set/check flags without
>> locks, using just atomics.
>>
>> What do you think about this concept in general? Any concerns and
>> criticism are welcome!
>>
>

Hi Tomas,

Thank you for a quick response.

> I don't think it matters very much whether the workers are started at the
> beginning or allocated ad hoc, that's IMO a minor implementation detail.

OK, I had the same view on this point. Any minor differences here
will be negligible for a sufficiently large transaction.

>
> There's one huge challenge that I don't see mentioned in your message
> or in the patch (after a cursory reading) - ensuring the same commit
> order, and avoiding deadlocks that would not exist in single-process
> apply.

Probably I haven't explained this part well, sorry for that. In my patch
I don't use the worker pool for concurrent transaction apply, but rather
for fast context switching between long-lived streamed transactions. In
other words, we apply all changes arriving from the sender in a
completely serial manner. Written out step by step, it looks like this:

1) Read the STREAM START message and figure out the target worker by xid.

2) Put all changes belonging to this xact into the selected worker's
queue one by one via shm_mq_send (see the sketch below).

3) Read the STREAM STOP message and wait until our worker has applied
all changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) On STREAM COMMIT/ABORT, again wait until the selected worker either
commits or aborts.

Thus, it automatically guarantees the same commit order on the replica
as on the master. Yes, we lose some performance here, since we don't
apply transactions concurrently, but doing so would bring in all those
problems you have described.
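
To make steps 1-3 above a bit more concrete, here is a rough sketch of
the dispatch done by the main apply worker. Only shm_mq_send() is the
real shared memory queue API here; the StreamWorker struct and the two
helpers are invented for this email (the lookup by xid could be done
like the get_stream_worker() sketch earlier), so please treat it as an
illustration of the control flow rather than the patch itself:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/pg_list.h"
#include "storage/shm_mq.h"

typedef struct StreamWorker
{
    TransactionId  xid;     /* remote xid this worker is assigned to */
    shm_mq_handle *mq;      /* single producer -> single consumer queue */
} StreamWorker;

/* hypothetical: find (or spawn) the worker assigned to this xid */
extern StreamWorker *stream_worker_for_xid(TransactionId xid);
/* hypothetical: returns once the worker has applied everything queued */
extern void wait_for_worker_idle(StreamWorker *w);

/* Handle one streamed chunk, i.e. everything between STREAM START and STOP */
static void
apply_handle_stream_chunk(TransactionId xid, List *changes)
{
    StreamWorker *w = stream_worker_for_xid(xid);       /* step 1 */
    ListCell     *lc;

    foreach(lc, changes)
    {
        StringInfo  change = (StringInfo) lfirst(lc);

        /* step 2: ship each change to the selected worker, one by one */
        if (shm_mq_send(w->mq, change->len, change->data, false) !=
            SHM_MQ_SUCCESS)
            ereport(ERROR,
                    (errmsg("streaming apply worker exited unexpectedly")));
    }

    /* step 3: on STREAM STOP, block until the queue is fully applied */
    wait_for_worker_idle(w);
}

Since the main apply worker always waits at STREAM STOP and at STREAM
COMMIT/ABORT, changes are never applied out of the order in which they
were decoded.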

However, you helped me figure out another point I had forgotten.
Although we ensure the commit order automatically, the beginnings of
streamed xacts may be reordered. This happens if some small xacts have
been committed on the master since the streamed one started, because we
do not start streaming immediately, but only after the logical_work_mem
limit is hit. I have performed some tests with conflicting xacts, and it
seems that this is not a problem: the locking mechanism in Postgres
guarantees that any deadlock would have already happened on the master.
For example, if a streamed xact and a small xact update the same row on
the master, the second one simply waits for the first there, so their
changes reach the WAL in an already compatible order. So if some records
hit the WAL, it is safe to apply them sequentially. Am I wrong?

Anyway, I'm going to double check the safety of this part later.
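
To illustrate what I mean in point 3 of my earlier list: with exactly
one producer (the main apply worker) and one consumer (the bgworker) per
queue, an atomic counter should be enough for the apply worker to notice
when the queue has been drained, without taking any locks. Here is a
rough sketch of the consumer side, again with invented names apart from
the shm_mq and pg_atomic_* APIs themselves:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/shm_mq.h"

/* hypothetical shared state between the apply worker and one bgworker */
typedef struct StreamWorkerShared
{
    pg_atomic_uint32 pending;   /* changes queued but not yet applied */
} StreamWorkerShared;

/* hypothetical: apply one serialized change in the bgworker */
extern void apply_one_change(const char *data, Size len);

/* Main loop of one streaming apply bgworker */
static void
stream_apply_loop(shm_mq_handle *mqh, StreamWorkerShared *shared)
{
    for (;;)
    {
        Size          len;
        void         *data;
        shm_mq_result res;

        /* blocking receive from the single producer */
        res = shm_mq_receive(mqh, &len, &data, false);
        if (res != SHM_MQ_SUCCESS)
            break;              /* the apply worker detached, shut down */

        apply_one_change((const char *) data, len);

        /*
         * Lock-free progress report: the producer increments this counter
         * for every change it queues and waits (e.g. on its latch) for it
         * to drop back to zero at STREAM STOP, so no spinlock or LWLock is
         * needed on this path.
         */
        pg_atomic_fetch_sub_u32(&shared->pending, 1);
    }
}

Whether this actually saves anything compared to the signalling shm_mq
already does internally would of course need to be measured.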

Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
