| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Report bytes and transactions actually sent downtream |
| Date: | 2025-12-19 01:53:53 |
| Message-ID: | 914399D0-F2A1-430E-8EB9-0EDE106D1040@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Dec 18, 2025, at 20:52, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> On Thu, Dec 18, 2025 at 7:56 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>>
>>
>>
>>> On Dec 17, 2025, at 13:55, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>>>
>>> Thanks for pointing this out. I have fixed it my code. However, at
>>> this point I am looking for a design review, especially to verify that
>>> the new implementation addresses Andres's concern raised in [1] while
>>> not introducing any design issues raised earlier e.g. those raised in
>>> threads [2], [3] and [4]
>>>
>>> [1] https://www.postgresql.org/message-id/zzidfgaowvlv4opptrcdlw57vmulnh7gnes4aerl6u35mirelm@tj2vzseptkjk
>>>>> [2] https://www.postgresql.org/message-id/CAA4eK1KzYaq9dcaa20Pv44ewomUPj_PbbeLfEnvzuXYMZtNw0A%40mail.gmail.com
>>>>> [3] https://www.postgresql.org/message-id/aNZ1T5vYC1BtKs4M@ip-10-97-1-34.eu-west-3.compute.internal
>>>>> [4] https://www.postgresql.org/message-id/CAExHW5tfVHABuv1moL_shp7oPrWmg8ha7T8CqwZxiMrKror7iw%40mail.gmail.com
>>>
>>> --
>>> Best Wishes,
>>> Ashutosh Bapat
>>
>>
>> Hi Ashutosh,
>>
>> Yeah, I owe you a review. I committed to review this patch but I forgot, sorry about that.
>>
>> From design perspective, I agree increasing counters should belong to the core, plugin should return properly values following the contract. And I got some more comments:
>>
>> 1. I just feel a bool return value might not be clear enough. For example:
>>
>> ```
>> - ctx->callbacks.change_cb(ctx, txn, relation, change);
>> + if (!ctx->callbacks.change_cb(ctx, txn, relation, change))
>> + cache->filteredBytes += ReorderBufferChangeSize(change);
>> ```
>>
>> You increase filteredBytes when change_cb returns false. But if we look at pgoutput_change(), there are many reasons to return false. Counting all the cases to filteredBytes seems wrong.
>
> I am not able to understand this. Every "return false" from
> pgoutput_change() indicates that the change was filtered out and hence
> the size of corresponding change is being added to filteredBytes by
> the caller. Which "return false" does not indicate a filtered out
> change?
I think the confusion comes from the counter name “filteredBytes”, what does “filtered” mean? There are 3 types of data not steaming out:
a. WAL data of tables that doesn’t belong to the publication
b. table belong to the publication, but action doesn’t. For example, FOR ALL TABLES (INSERT), then update/delete will not be streamed out
c. Filtered by row filter (WHERE)
I thought only c should be counted to filteredBytes; thinking over again, maybe b should also be counted. But I still don’t think a should be counted.
IMO, sentBytes + filteredBytes == supposedToSendBytes. If a table doesn’t belong to a publication, then it should not be counted into supposedToSendBytes, so it should not be counted into filteredBytes.
The other point is that, if we count a into filteredBytes, then ends up totalBytes == sendBytes + filteredBytes, if that’s true, why don’t compute such a number by (totalBytes-sendBytes) in client side?
If we insist to count a, then maybe we need to consider a better counter name.
>>
>> 2.
>> ```
>> - ctx->callbacks.truncate_cb(ctx, txn, nrelations, relations, change);
>> + if (!ctx->callbacks.truncate_cb(ctx, txn, nrelations, relations, change))
>> + cache->filteredBytes += ReorderBufferChangeSize(change);
>> ```
>>
>> Row filter doesn’t impact TRUNCATE, why increase filteredBytes after truncate_cb()?
>
> A TRUNCATE of a relation which is not part of the publication will be
> filtered out.
Same as 1.
>
>>
>> 3.
>> ```
>> - ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn);
>> + if (ctx->callbacks.prepare_cb(ctx, txn, prepare_lsn))
>> + cache->sentTxns++;
>> ```
>>
>> For 2-phase commit, it increase sentTxns after prepare_cb, and
>> ```
>> + if (ctx->callbacks.stream_abort_cb(ctx, txn, abort_lsn))
>> + cache->sentTxns++;
>> ```
>>
>> If the transaction is aborted, sentTxns is increased again, which is confusing. Though for aborting there is some data (a notification) is streamed, but I don’t think that should be counted as a transaction.
>>
>> After commit, sentTxns is also increased, so that, a 2-phase commit is counted as two transactions, which feels also confusing. IMO, a 2-phase commit should still be counted as one transaction.
>
> stream_commit/abort_cb is called after stream_prepare_cb not after prepare_cb.
That’s my typo, but the problem is still there. Should we count a 2-phase-commit as 2 transactions?
>
>>
>> 4. You add sentBytes and filteredBytes. I am thinking if it makes sense to also add sentRows and filteredRows. Because tables could be big or small, bytes + rows could show a more clear picture to users.
>
> We don't have corresponding total_rows and streamed_rows counts. I
> think that's because we haven't come across a use case for them. Do
> you have a use case in mind?
>
That’s still related to 1. totalBytes includes tables don’t belong to the publication, thus totalRows doesn’t make much sense. But sendRows will only include those rows belonging to the publication. For filterRows, if we exclude a, then I believe filterRows also makes sense.
If you argue that “rows” request should be treated in a separate thread, I’ll be okay with that.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Peter Smith | 2025-12-19 02:19:47 | Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 |
| Previous Message | Michael Paquier | 2025-12-19 01:43:10 | Switch buffile.c/h to use pgoff_t |