Re: Postgres is not able to handle more than 4k tables!?

From: Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Postgres is not able to handle more than 4k tables!?
Date: 2020-07-09 16:19:34
Message-ID: CANNMO++NMpDHD0Z2B0oVtaUeYmd+D=Hvd0u=WvGYhFJaOZoXtw@mail.gmail.com
Lists: pgsql-hackers

Hi Konstantin, a silly question: do you consider your workload to be
well-optimized? Can it be optimized further? Reading this thread, I have a
strong feeling that a very basic set of regular optimization actions is
missing here (or is not explained): query analysis and optimization based on
pg_stat_statements (and maybe pg_stat_kcache), some method of analyzing the
overall state of the server, resource consumption, and so on.

Do you have some monitoring that covers pg_stat_statements?

Before looking under the hood, I would use multiple pg_stat_statements
snapshots (which can be analyzed using, say, postgres-checkup or pgCenter) to
understand the workload and identify the heaviest queries -- first of all,
in terms of total_time, calls, shared buffer reads/hits, and temporary file
generation. Which query groups are in the Top-N for each category -- have
you looked at that?
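
For illustration, assuming the pg_stat_statements extension is installed and
using its pre-PG13 column names, one snapshot of the heaviest query groups
could look like this (sort the same query by calls, shared_blks_read, or
temp_blks_written to get the other Top-N lists):

    -- Top 10 query groups by total execution time (pre-PG13 column names)
    SELECT
        queryid,
        calls,
        round(total_time::numeric, 2)           AS total_time_ms,
        round((total_time / calls)::numeric, 2) AS mean_time_ms,
        shared_blks_hit,
        shared_blks_read,
        temp_blks_written,
        left(query, 60)                         AS query_sample
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;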

You mentioned some crazy numbers for the planning time, but why not analyze
the picture holistically and look at the overall numbers? For the queries
with increased planning time, what is their share of the overall total_time,
in %? (Unfortunately, we cannot see the Top-N by planning time in
pg_stat_statements until PG13, but that doesn't mean we cannot get a good
understanding of the overall picture today; it just requires more work.)
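
(Once PG13 is an option, the planning-time share becomes directly visible; a
rough sketch, which also assumes pg_stat_statements.track_planning is
enabled:)

    -- PG13+ only: Top 10 query groups by total planning time and its share
    -- of the combined planning + execution time
    SELECT
        queryid,
        calls,
        round(total_plan_time::numeric, 2) AS total_plan_time_ms,
        round(total_exec_time::numeric, 2) AS total_exec_time_ms,
        round((100 * total_plan_time
               / nullif(total_plan_time + total_exec_time, 0))::numeric, 1)
                                           AS plan_time_pct
    FROM pg_stat_statements
    ORDER BY total_plan_time DESC
    LIMIT 10;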

If workload analysis & optimization has already been done holistically, or is
not possible for some reason, then pardon me. But if not, and if your primary
goal is to improve this particular setup ASAP, then the topic could be
started in the -performance mailing list first, discussing the workload and
its aspects, and only after that is done, raised in -hackers. No?

On Thu, Jul 9, 2020 at 8:57 AM Konstantin Knizhnik <
k(dot)knizhnik(at)postgrespro(dot)ru> wrote:

> Hi Stephen,
>
> Thank you for supporting the opinion that the problem lies not only in the
> design of the client system (I agree it is not such a good idea to have
> thousands of tables and thousands of active backends) but also in
> Postgres.
>
> We have done more investigation and found one more problem in
> Postgres causing the "invalidation storm".
> There are some long-living transactions which prevent autovacuum from doing
> its work and removing dead tuples.
> So autovacuum is started again and again, and each time it makes no
> progress but updates statistics and therefore sends invalidation messages.
> autovacuum_naptime was set to 30 seconds, so every 30 seconds autovacuum
> processed a huge number of tables and generated a large number of
> invalidation messages, which quite soon caused overflow of the invalidation
> message buffers for backends performing long OLAP queries.
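
(A side note: such xmin-holding sessions are usually easy to spot in
pg_stat_activity; a minimal sketch, with an arbitrary 10-minute threshold:)

    -- Sessions holding back the xmin horizon for more than 10 minutes
    SELECT pid,
           state,
           backend_xmin,
           now() - xact_start AS xact_age,
           left(query, 60)    AS query_sample
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
      AND xact_start < now() - interval '10 minutes'
    ORDER BY xact_start;
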
>
> This makes me think about two possible optimizations:
>
> 1. Provide separate invalidation messages for relation metadata and its
> statistics, so that an update of statistics does not invalidate the
> relation cache.
> The main problem with this proposal is that pg_class contains the relpages
> and reltuples columns, which are conceptually part of the relation
> statistics but are stored in the relation cache. If the relation statistics
> are updated, then most likely these fields are changed as well, so we have
> to remove the relation from the relation cache in any case.
>
> 2. Remember in the relation info the XID of the oldest active transaction
> at the moment of the last autovacuum.
> At the next autovacuum iteration we first compare this stored XID with the
> current oldest active transaction XID and bypass vacuuming this relation if
> the XID has not changed.
>
> Thoughts?
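
(For what it's worth, a minimal standalone sketch of the idea behind
optimization 2; every name below is hypothetical and does not correspond to
actual PostgreSQL identifiers:)

    /* Hypothetical sketch only -- none of these names exist in PostgreSQL.
     * Idea: skip a relation if the oldest active XID has not advanced since
     * the previous autovacuum pass, because no additional dead tuples can
     * have become removable in the meantime. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct RelAutovacState
    {
        TransactionId oldest_xid_at_last_vacuum;  /* remembered per relation */
    } RelAutovacState;

    static bool
    should_autovacuum_relation(RelAutovacState *state,
                               TransactionId current_oldest_xid)
    {
        if (state->oldest_xid_at_last_vacuum == current_oldest_xid)
            return false;   /* horizon unchanged: nothing new to reclaim */

        state->oldest_xid_at_last_vacuum = current_oldest_xid;
        return true;
    }
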
>
> >> So, that's really the core of your problem. We don't promise that
> >> you can run several thousand backends at once. Usually it's recommended
> >> that you stick a connection pooler in front of a server with (at most)
> >> a few hundred backends.
> > Sure, but that doesn't mean things should completely fall over when we
> > do get up to larger numbers of backends, which is definitely pretty
> > common in larger systems. I'm pretty sure we all agree that using a
> > connection pooler is recommended, but if there's things we can do to
> > make the system work at least a bit better when folks do use lots of
> > connections, provided we don't materially damage other cases, that's
> > probably worthwhile.
>
> I also think that Postgres performance should degrade gradually with an
> increasing number of active backends. Actually, further investigation of
> this particular case shows that such a large number of database connections
> was caused by ... the Postgres slowdown itself.
> During normal operation the number of active backends is a few hundred.
> But the "invalidation storm" caused queries to hang, so the user
> application had to initiate more and more new connections to perform the
> required actions.
> Yes, this may not be the best behavior of the application in this case; at
> the very least it should first terminate the current connection using
> pg_terminate_backend. I just want to note that the large number of
> backends was not the core of the problem.
>
> >
> > Making them GUCs does seem like it's a few steps too far... but it'd be
> > nice if we could arrange to have values that don't result in the system
> > falling over with large numbers of backends and large numbers of tables.
> > To get a lot of backends, you'd have to set max_connections up pretty
> > high to begin with- perhaps we should contemplate allowing these values
> > to vary based on what max_connections is set to?
>
> I think that the optimal number of lock partitions should depend not on
> the number of connections but on the number of available CPU cores, and
> hence on the expected level of concurrency.
> It is hard to propose a portable way to obtain this number.
> This is why I think that a GUC is the better solution.
> Certainly I realize that it is a very dangerous parameter which should be
> changed with special care, and not only because of MAX_SIMUL_LWLOCKS.
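
(For context, the relevant compile-time constants, quoted from memory for the
PG12/13 era, so please double-check against storage/lwlock.h and lwlock.c:)

    /* src/include/storage/lwlock.h (approximate): the lock manager's shared
     * hash table is split into a fixed number of partitions. */
    #define LOG2_NUM_LOCK_PARTITIONS   4
    #define NUM_LOCK_PARTITIONS        (1 << LOG2_NUM_LOCK_PARTITIONS)  /* 16 */

    /* src/backend/storage/lmgr/lwlock.c (approximate): the cap on LWLocks one
     * backend may hold simultaneously, one reason a much larger
     * NUM_LOCK_PARTITIONS value needs special care. */
    #define MAX_SIMUL_LWLOCKS          200
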
>
> There are a few places in Postgres where it tries to lock all partitions
> (the deadlock detector, logical replication, ...).
> If there are thousands of partitions, then such a lock becomes too
> expensive and we get yet another well-known Postgres problem: the "deadlock
> detection storm", when, due to high contention between backends, a lock
> cannot be obtained within deadlock_timeout and so deadlock detection is
> initiated. Simultaneous deadlock detection performed by all backends (each
> of which tries to take ALL partition locks) paralyzes the system (TPS drops
> to 0).
> A proposed patch for this problem was also rejected (once again, the
> problem can be reproduced only on a powerful server with a large number of
> cores).
>
>
>
