Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

From: Önder Kalacı <onderkalaci(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Marco Slot <marco(dot)slot(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date: 2023-03-02 13:20:09
Message-ID: CACawEhUDwAqFx4T6XhxBuKJKqWTFfy_iUhf22JVcnK9m6mHCVA@mail.gmail.com
Lists: pgsql-hackers

Hi Amit, Shi Yu

> >
> > b. Executed SQL.
> > I executed TRUNCATE and INSERT before each UPDATE. I am not sure if you
> > did the same, or just executed 50 consecutive UPDATEs. If the latter one,
> > there would be lots of old tuples and this might have a bigger impact on
> > sequential scan. I tried this case (which executes 50 consecutive
> > UPDATEs) and also saw that the overhead is smaller than before.
>

Alright, I'll do the same and execute TRUNCATE/INSERT before each UPDATE.
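
To make sure we're measuring the same thing, here is a minimal sketch of the
two variants I have in mind (table/column names are made up; Shi Yu's actual
case-1 script may well differ):

-- Variant A: reset the table before each UPDATE, so the subscriber only
-- sees freshly inserted (live) tuples when the change is applied.
TRUNCATE tbl;
INSERT INTO tbl SELECT i, 1 FROM generate_series(1, 50000) i;
UPDATE tbl SET value = value + 1;

-- Variant B: 50 consecutive UPDATEs without resetting; each one leaves
-- another generation of dead tuple versions behind on the subscriber.
UPDATE tbl SET value = value + 1;   -- repeated 50 times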

> In the above profile number of calls to index_fetch_heap(),
> heapam_index_fetch_tuple() explains the reason for the regression you
> are seeing with the index scan. Because the update will generate dead
> tuples in the same transaction and those dead tuples won't be removed,
> we get those from the index and then need to perform
> index_fetch_heap() to find out whether the tuple is dead or not. Now,
> for sequence scan also we need to scan those dead tuples but there we
> don't need to do back-and-forth between index and heap.

Thanks for the insights, I think what you describe makes a lot of sense.
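
For what it's worth, one way to observe this back-and-forth on the subscriber
(assuming a subscriber table called tbl, just for illustration) is to watch
the per-table statistics while the changes are being applied: idx_tup_fetch
counts the heap fetches driven by index scans, and n_dead_tup shows how many
dead versions the apply worker has to wade through.

SELECT seq_scan, seq_tup_read, idx_scan, idx_tup_fetch, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'tbl';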

> I think we can
> once check with more number of tuples (say with 20000, 50000, etc.)
> for case-1.
>
>
As we'd expect, this test made the performance regression more visible.

I quickly ran case-1 50 times with 50000 tuples, as Shi Yu did, and got the
following results. I'm measuring the end-to-end time for running the whole
set of commands:

seq_scan:   00 hr 24 min 42 sec
index_scan: 01 hr 04 min 54 sec

But I'm still not sure whether we should focus on this regression too much.
In the end, what we are talking about is a case (e.g., all or many rows are
duplicated) where using an index is not a good idea anyway, so I doubt users
would have such indexes in practice.
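
To be concrete, the kind of index I have in mind is something like the
following (purely an illustration with made-up names, not taken from the
actual test script):

-- On the publisher:
ALTER TABLE tbl REPLICA IDENTITY FULL;

-- On the subscriber: an index whose key is (nearly) the same for every row,
-- e.g. a flag column where all 50000 rows carry the value 1. The apply
-- worker still has to visit most of the table for each UPDATE, only now
-- through the index, with an extra heap fetch per index entry.
CREATE INDEX tbl_value_idx ON tbl (value);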

> The quadratic apply performance the sequential scans cause, are a much
> bigger hazard for users than some apply performance regression.

Quoting Andres' note, I personally think that the regression for this case
is not a big concern.

> I'd prefer not having an option, because we figure out the cause of the
> performance regression (reducing it to be small enough to not care). After
> that an option defaulting to using indexes. I don't think an option
> defaulting to false makes sense.

I think we have figured out the cause of the performance regression, but it
is not small enough to ignore for some scenarios like the above. Those
scenarios seem like synthetic test cases, though, without much user-facing
impact. Still, I think you are better suited to comment on this.

If you consider this a significant issue, we could also pick up the second
patch, so that users can disable index scans for this unlikely scenario.

Thanks,
Onder
