| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | vignesh C <vignesh21(at)gmail(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Duncan Sands <duncan(dot)sands(at)deepbluecap(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5 |
| Date: | 2025-06-13 23:52:00 |
| Message-ID: | CAD21AoCHqKXVhUZbxPKBve-uEHJWKxV9SVC5deiB5VUONTH=6w@mail.gmail.com |
| Lists: | pgsql-bugs |
On Wed, Jun 11, 2025 at 11:47 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Thu, 12 Jun 2025 at 03:33, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Jun 10, 2025 at 12:19 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Sun, Jun 8, 2025 at 6:43 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > > > On Sat, 7 Jun 2025 at 12:47, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Fri, 6 Jun 2025 at 22:51, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > On Thu, Jun 5, 2025 at 8:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > >
> > > > > > > On Thu, Jun 5, 2025 at 11:14 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 4, 2025 at 11:36 PM Hayato Kuroda (Fujitsu)
> > > > > > > > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > > > > > > > >
> > > > > > > > > Dear Sawada-san,
> > > > > > > > >
> > > > > > > > > > Alternative idea is to lower the constant value when using an
> > > > > > > > > > assertion build. That way, we don't need to rely on injection points
> > > > > > > > > > being enabled.
> > > > > > > > >
> > > > > > > > > Hmm, possible, but I prefer the current one. Two concerns:
> > > > > > > > >
> > > > > > > > > 1.
> > > > > > > > > USE_ASSERT_CHECKING has not been used to change such a value yet. Its main
> > > > > > > > > usage is to call debug functions in debug builds.
> > > > > > > >
> > > > > > > > I think we have a similar precedent, such as MT_NRELS_HASH, to improve
> > > > > > > > the test coverage.
> > > > > > > >
> > > > > > > > > 2.
> > > > > > > > > If we add tests which are usable only in debug builds, they must be run
> > > > > > > > > only when that is enabled. IIUC such tests do not exist yet.
> > > > > > > >
> > > > > > > > I think we need test cases not to check whether we reach a specific code
> > > > > > > > point but to check whether we get the correct results even after various
> > > > > > > > code paths have been executed. As for this bug, it is better to check
> > > > > > > > that it works properly in a variety of cases. That way, test cases added
> > > > > > > > in the future would also cover both the overflow and non-overflow cases,
> > > > > > > > further improving the test coverage.
> > > > > > > >
> > > > > > >
> > > > > > > This makes sense, but we should check whether, after lowering the limit,
> > > > > > > some existing tests already cover the overflow code path. One minor
> > > > > > > point to consider is that when the value MT_NRELS_HASH was used to cover
> > > > > > > cases in a debug build, we didn't yet have the injection_point framework.
> > > > > >
> > > > > > True.
> > > > > >
> > > > > > After thinking about it more, perhaps my proposal would not be a good
> > > > > > idea for this case. I think that the cases where we selectively
> > > > > > invalidate caches are more complex and error-prone than the cases where
> > > > > > we invalidate the complete cache. If we invalidated all caches after
> > > > > > decoding each transaction, we wouldn't have had the original data-loss
> > > > > > issue. Having a lower MAX_DISTR_INVAL_MSG_PER_TXN value when using an
> > > > > > assertion build means that we would test the cases using the simpler
> > > > > > invalidation mechanism, while production systems, which have a higher
> > > > > > MAX_DISTR_INVAL_MSG_PER_TXN value, would end up executing the more
> > > > > > complex cases, which is not great. What do you think?
> > > > > >
> > > > > > BTW, as for a new test case, it might be worth having a case I
> > > > > > mentioned before[1]:
> > > > > >
> > > > > > 1) S1: CREATE TABLE d (data text not null);
> > > > > > 2) S1: INSERT INTO d VALUES ('d1');
> > > > > > 3) S2: BEGIN; INSERT INTO d VALUES ('d2');
> > > > > > 4) S3: BEGIN; INSERT INTO d VALUES ('d3');
> > > > > > 5) S1: ALTER PUBLICATION pb ADD TABLE d;
> > > > > > 6) S2: INSERT INTO d VALUES ('d4');
> > > > > > 7) S2: COMMIT;
> > > > > > 8) S3: COMMIT;
> > > > > > 9) S2: INSERT INTO d VALUES('d5');
> > > > > > 10) S1: INSERT INTO d VALUES ('d6');
> > > > > >
> > > > > > With this case, we can test if we need to execute the distributed
> > > > > > invalidations as well in the non-error path in
> > > > > > ReorderBufferProcessTXN().
> > > > >
> > > > > The attached v13 version patch has the change to include this test case.
> > > >
> > > > Attached are the patches, including those required for the back branches.
> > >
> > > Thank you for updating the patches! I'll review them and share comments if any.
> >
> > Thank you for updating the patch. I have one comment on the newly added test:
> >
> > +session "s3"
> > +step "s3i1" { INSERT INTO tbl1 (val1, val2) VALUES (1, 1);}
> > +step "s3a" { ALTER PUBLICATION pub ADD TABLE tbl1; }
> > +step "s3i2" { INSERT INTO tbl1 (val1, val2) VALUES (6, 6); }
> > +step "s3_get_binary_changes" { SELECT count(data) FROM
> > pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL,
> > 'proto_version', '4', 'publication_names', 'pub') WHERE get_byte(data,
> > 0) = 73; }
> > +
> > +session "s4"
> > +step "s4b" { BEGIN; }
> > +step "s4i1" { INSERT INTO tbl1 (val1, val2) VALUES (2, 2);}
> > +step "s4i2" { INSERT INTO tbl1 (val1, val2) VALUES (4, 4); }
> > +step "s4c" { COMMIT; }
> > +step "s4i3" { INSERT INTO tbl1 (val1, val2) VALUES (5, 5); }
> > +
> > +session "s5"
> > +step "s5b" { BEGIN; }
> > +step "s5i1" { INSERT INTO tbl1 (val1, val2) VALUES (3, 3); }
> > +step "s5c" { COMMIT; }
> >
> > I think we don't necessarily need to add sessions "s4" and "s5". Let's
> > reuse "s1" and "s2" instead of adding them. I've attached a patch to
> > change that.
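To be concrete, the extra steps could simply be folded into the existing
"s1" and "s2" session blocks, roughly like this (the step names are only
illustrative; the attached patch may do it differently):

session "s1"
# ... existing "s1" steps from the spec ...
step "s1_begin"    { BEGIN; }
step "s1_insert_2" { INSERT INTO tbl1 (val1, val2) VALUES (2, 2); }
step "s1_insert_4" { INSERT INTO tbl1 (val1, val2) VALUES (4, 4); }
step "s1_commit"   { COMMIT; }
step "s1_insert_5" { INSERT INTO tbl1 (val1, val2) VALUES (5, 5); }

session "s2"
# ... existing "s2" steps from the spec ...
step "s2_begin"    { BEGIN; }
step "s2_insert_3" { INSERT INTO tbl1 (val1, val2) VALUES (3, 3); }
step "s2_commit"   { COMMIT; }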
>
> Thanks, this is better.
> In the case of PG13, I have slightly changed the test to do the insert
> after the commit, because queuing invalidation messages into the
> reorder buffer queue is not supported in PG13. This limitation is due
> to the absence of REORDER_BUFFER_CHANGE_INVALIDATION, which is present
> in PG14 and later.
> The attached v14 version patch has the changes for the same.
Hmm, but the modified test is essentially the same as what we already
have in invalidation_distribution.spec. I think it's good to have the
same test case in all branches, even though we get a different result
in v13, as we currently don't have a test to check such differences.
I've attached the updated patches accordingly, along with the commit
messages. I'm going to push them to all supported branches early next
week, barring any objections or further review comments.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size |
|---|---|---|
| REL15_v15-0001-Fix-re-distributing-previously-distributed-inval.patch | application/octet-stream | 19.1 KB |
| master_v15-0001-Fix-re-distributing-previously-distributed-inval.patch | application/octet-stream | 18.4 KB |
| REL16_v15-0001-Fix-re-distributing-previously-distributed-inval.patch | application/octet-stream | 19.2 KB |
| REL14_v15-0001-Fix-re-distributing-previously-distributed-inval.patch | application/octet-stream | 19.0 KB |
| REL17_v15-0001-Fix-re-distributing-previously-distributed-inval.patch | application/octet-stream | 19.1 KB |
| REL13_v15-0001-Fix-re-distributing-previously-distributed-inval.patch | application/octet-stream | 17.8 KB |