Quick Links

Re: "unexpected duplicate for tablespace" problem in logical replication

From:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To:	vignesh C <vignesh21(at)gmail(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, wangsh(dot)fnst(at)fujitsu(dot)com, osumi(dot)takamichi(at)fujitsu(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>
Subject:	Re: "unexpected duplicate for tablespace" problem in logical replication
Date:	2025-06-19 08:47:40
Message-ID:	CAExHW5vaHu=s6eopJ2jxyGNRG--uEsJwYk92yxa2L_80+Y=1YA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On Tue, Jun 17, 2025 at 6:29 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Thu, 25 Jul 2024 at 03:21, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Sun, Mar 17, 2024 at 11:35 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > >
> > > On Fri, Feb 02, 2024 at 10:49:17AM -0500, Robert Haas wrote:
> > > > Andres, what do you think about this idea? I wonder if you just
> > > > momentarily forgot about temporary relations when coding
> > > > RelidByRelfilenumber -- because for that function to give well-defined
> > > > answers with temporary relations included, it would need the backend
> > > > ID as an additional argument.
> > >
> > >
> > > Ignoring temporary relations entirely makes sense: one cannot get a
> > > regclass from only a tablespace and a relfilenode, the persistence, as
> > > well as a backend ID would also be required. I've not checked the
> > > patch in details, but it's to say that the idea to cut temporary
> > > relations sounds rather right here.
> >
> > That makes sense to me too.
> >
> > Regarding the patch, filtering by the relpersistence in
> > systable_getnext() loop seems to be good to me. Alternatively we can
> > add "relpersistence == RELPERSISTENCE_TEMP" to the scan key.

Do you mean relpersistence != RELPERSISTENCE_TEMP? I tried that. It's
not possible to add this check to the scankey since indexes do not
work with != conditions. We will need to use a sequential scan to do
so, but that will affect the performance. I think what Vignesh has
done in his patch is good enough.

>
> The attached patch adds a new test and resolves an existing test
> failure. However, a downside is that we can no longer verify the
> mapping of the temporary tables.

Yes, but I think the current way of verification wasn't correct anyway
because it couldn't match the temporary table's relation exactly. We
will need to devise another way to do that, maybe creating a version
of pg_filenode_relation() for temporary tables.

Some more comments, some of which I have applied in the attached
patches. Please review those. Let me know what you think.

I feel that we should mention permanent tables in the prologue of
pg_filenode_relation() for somebody who just looks at
pg_filenode_relation(). Also in its pg_proc.dat description for one
who looks at \df+ output. Attached patch does that.

- * Returns InvalidOid if no relation matching the criteria could be found.
+ * Returns InvalidOid if no permanent relation matching the criteria could be
+ * found.

The relpersistence enum has values
#define RELPERSISTENCE_PERMANENT 'p' /* regular table */
#define RELPERSISTENCE_UNLOGGED 'u' /* unlogged permanent table */
#define RELPERSISTENCE_TEMP 't' /* temporary table */

The description of UNLOGGED mentions "permanent" so it looks like your
use of "permanent" covers both regular and unlogged tables. However
looking purely at the RELPERSISTENCE_ names, it' s possible that one
will associate the word "permanent" with RELPERSISTENCE_PERMANENT
only, and not with RELPERSISTENCE_UNLOGGED. Should we use
"non-temporary" instead to avoid confusion?

+ /*
+ * Temporary relations should be excluded. This exclusion is
+ * necessary because they can lead to the wrong result; the
+ * relfilenumber space is shared with persistent
+ * relations. Additionally, excluding them is possible since they
+ * are not the focus in this context.
+ */
+ if (classform->relpersistence == RELPERSISTENCE_TEMP)
+ continue;
+

I slightly rephrased the comment and moved it to the function prologue
since it defines the function's scope.

WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
+WHERE relkind IN ('r', 'i', 'S', 'm') AND mapped_oid IS DISTINCT FROM oid;

Why do we need to remove permanent toast tables from the query?
Instead adjustment below

SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
-WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
+WHERE (c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL) AND
c.relpersistence != 't';

seems enough. Actually, we shouldn't pass temporary tables to
pg_filenode_relation(), since it doesn't map temporary relations now.
Adjusted that way in the attached patch.

While reviewing the patch, I found something else. The
RelidByRelfilenumber() enters negative cache entries.
RelfilenumberMapInvalidateCallback() treats the negative entries
specifically which indicates that it's intentional. But if somebody
calls pg_filenode_relation() repeatedly with invalid relfilenodes,
that would bloat the cache with "kinda useless entries". It's a way to
make a backend consume a lot of memory quickly and thus perform a DOS.
For example, on my laptop, I could make a backend consume almost 3GB
of memory in 7 minutes.

#SELECT pg_size_pretty(total_bytes) total_bytes,
pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
where name = 'RelfilenumberMap cache';
total_bytes | used_bytes
-------------+------------
(0 rows)

#\timing
Timing is on.
#select r is null, count(*) from (select pg_filenode_relation(i, j) r
from generate_series(1, 1000) i, generate_series(1, 1000) j) q group
by r is null;
?column? | count
----------+---------
t | 1000000
(1 row)

Time: 4705.483 ms (00:04.705)
#SELECT pg_size_pretty(total_bytes) total_bytes,
pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
where name = 'RelfilenumberMap cache';
total_bytes | used_bytes
-------------+------------
40 MB | 39 MB
(1 row)

Time: 2.351 ms
#select r is null, count(*) from (select pg_filenode_relation(i, j) r
from generate_series(1001, 10000) i, generate_series(1001, 10000) j) q
group by r is null;
?column? | count
----------+----------
f | 153215
t | 80846785
(2 rows)

Time: 421998.039 ms (07:01.998)
#SELECT pg_size_pretty(total_bytes) total_bytes,
pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
where name = 'RelfilenumberMap cache';
total_bytes | used_bytes
-------------+------------
3180 MB | 3176 MB
(1 row)

Time: 132.187 ms

Logical replication and autoprewarm may not cause such a large bloat
but there is no limit to passing invalid combinations of reltablespace
and relfilenumber to pg_filenode_relation(). Do we want to prohibit
that case by passing a flag from logical pg_filenode_relation() to not
cache invalid entries?

I have moved the CF entry to the July commitfest.

--
Best Wishes,
Ashutosh Bapat

Attachment	Content-Type	Size
0001-Ignore-temporary-relations-in-RelidByRelfil-20250619.patch	text/x-patch	11.8 KB

In response to

Re: "unexpected duplicate for tablespace" problem in logical replication at 2025-03-25 10:01:01 from vignesh C

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	PG Bug reporting form	2025-06-19 12:21:15	BUG #18962: Type Conversion incorrect when performing UNION of queries.
Previous Message	PG Bug reporting form	2025-06-19 00:27:27	BUG #18961: Race scenario where max_standby_streaming_delay is not honored