Weird planner issue on a standby

From: Guillaume Lelarge <guillaume(at)lelarge(dot)info>
To: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Weird planner issue on a standby
Date: 2022-10-11 15:51:46
Message-ID: CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

My customer has a really weird planner issue on a standby.

First, the context. There's a primary, and two standbys, all on a 11.8
release. (I know this isn't the latest release for the v11 branch.) The
cluster is 5.5TB. The PostgreSQL configuration is the same on all servers.
pg_db_role_setting is almost empty, and there's nothing specific to
planning and replication.

Here is the issue. Queries on both standbys take a lot more time than on
the primary. We eventually found that the queries take a lot of time to be
planned, not to be executed. For example:

On the primary:
Planning Time: 1806.574 ms
Execution Time: 771.888 ms
On any of the standbys:
Planning Time: 41566.439 ms
Execution Time: 1125.352 ms

A simple EXPLAIN already shows this difference in duration.

The query in itself isn't really that big. 8 tables (without partitions), a
few predicates. Nothing fancy. Nothing huge to plan. Here it is, a bit
anonymized:

select ...
from t1
left outer join t2 on ...
left outer join t3 on ...
left outer join t4 on ...
left outer join t5 on ...
left outer join t6 on ...
left outer join t7 on ...
left outer join t8 on ...
where c1='...' and c2='...'
and c3>='...' and c4<='...' ;

Join conditions are really simple. There's no function called anywhere.

Plans on the three servers are exactly the same. Same nodes, same
statistics, same calculations' results.

Moreover, we've looked into what the planner was doing with strace, and
we've been surprised that it involved massive reading on tables (more than
130,000 calls to read() on 15 files). We found that most of these files are
the files for one of the tables on this query. Also, this massive reading
happens only on standbys, not on the primary.

Our customer took the time to drop and create the standbys yesterday night,
but the new standbys still have the issue.

And finally, we discovered that promoting a standby brings back the old
performance. We have the same performance between the primary and the
promoted standby.

To be honest, I'm lost, I don't know what to do next. But I have some
questions :)

* The planner seems to read tables to generate the plans and/or select the
right plan. Why does it do that? I thought it only reads indexes if needed,
but it is quite clear it reads tables also.
* How can the replication have an impact on the planner ? (I'm not sure I'm
asking the right question... I mean, why do I see a difference in behaviour
between a standby and an old standby, new autonomous server?)
* Do you have any idea on how to fix this?

Thank you.

Regards.

--
Guillaume.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2022-10-11 16:17:28 Re: Weird planner issue on a standby
Previous Message Bruce Momjian 2022-10-11 14:30:53 Re: Different execution plan between PostgreSQL 8.4 and 12.11