Possible important data point on stats collection, wondering about possible improvement

From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Possible important data point on stats collection, wondering about possible improvement
Date: 2018-10-04 12:11:12
Message-ID: CAN-RpxDpdup84Sz_rH_BdjPd9fZkKufGpsSshV6dVnby=uRZ_w@mail.gmail.com

Hi;

System is PostgreSQL 10.5, all partitioning done the old way (via
inheritance).

Last month we had some performance issues caused by statistics being out of
date and the planner choosing the wrong index for a large number of
queries. The immediate fix was to increase the stats target from 1000 to
10000 and re-run ANALYZE, which has kept the problem from recurring.
Looking back at the graphs, though, I noticed a number of things which make
me think there might be ways of improving the situation without increasing
stats targets.
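
For reference, the fix was along these lines (the table and column names
"events" and "created_at" are placeholders for the real schema; the target
could equally be raised cluster-wide via default_statistics_target):

    ALTER TABLE events ALTER COLUMN created_at SET STATISTICS 10000;
    ANALYZE events;  -- the new target only takes effect at the next ANALYZE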

I wanted to bring up the question here and see if there were opportunities
to work together on improving the situation.

What I measured was the difference between the maximum value in the
statistics histogram and the maximum value in the table. The relevant
field is a timestamp, so calculating the lag is straightforward and also
gives us the percentage of the table that falls outside the stats.
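
Roughly, the measurement looks like this (again with the placeholder names
"events"/"created_at"; with inheritance partitioning you may also want to
filter on pg_stats.inherited to pick the whole-tree statistics):

    WITH stats AS (
        SELECT max(b) AS stats_max
        FROM pg_stats,
             unnest(histogram_bounds::text::timestamptz[]) AS b
        WHERE schemaname = 'public'
          AND tablename  = 'events'
          AND attname    = 'created_at'
    )
    SELECT s.stats_max,
           max(e.created_at)                AS table_max,
           max(e.created_at) - s.stats_max  AS lag,
           round(100.0 * count(*) FILTER (WHERE e.created_at > s.stats_max)
                 / count(*), 2)             AS pct_outside_stats
    FROM events e CROSS JOIN stats s
    GROUP BY s.stats_max;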

What I noticed was that the major analytics tables seem to fall into two
groups:
1. One group has clear sampling issues, as evidenced by the fact that the
difference in values swung wildly. One of these tables had 52 million rows
spread between two partitions (1M and 51M respectively). For this group I
understand the need to raise the stats targets.

2. A second group saw a milder increase in the lag between the max value
recorded in the stats and the max value in the db. However, what struck me
was that the lag seemed fairly linear: the fluctuations stayed within a
relatively narrow range, and the lag grew roughly linearly with time. These
tables were actually larger (one typical one was 60M rows split between a
partition of 51M rows and one of 9M rows).

The workload for the database in question is heavily update-driven, so I
would expect fewer sampling-bias problems than might occur with insert-only
workloads.

The second case puzzles me. I have been looking carefully into how the
stats collector works and I cannot find anything that would account for a
near-linear growth in the amount of recent data missing from the
statistics. What starts out as a 60M-row table with the default stats
target (1000) ends up drifting about 10% off over time.

What I am wondering is whether it would make any sense to expand the stats
to include the min and max values found in a scan, or whether it would make
more sense to help the planner extrapolate from existing stats more
effectively.
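
To illustrate the symptom with the placeholder schema above: as I
understand the selectivity code, once the histogram's upper bound falls
behind the actual max, a query against the newest data, e.g.

    EXPLAIN
    SELECT * FROM events
    WHERE created_at > now() - interval '1 hour';

gets a row estimate that collapses toward zero, and in our case that seems
to be what tips the planner onto the wrong index.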

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
