Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.

From: David Gould <daveg(at)sonic(dot)net>
To: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Alina Alexeeva <alexeeva(at)adobe(dot)com>, Ullas Lakkur Raghavendra <lakkurra(at)adobe(dot)com>
Subject: Re: [patch] BUG #15005: ANALYZE can make pg_class.reltuples inaccurate.
Date: 2018-03-01 22:27:26
Message-ID: 20180301142726.0f276e7a@engels
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 1 Mar 2018 17:25:09 +0300
Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru> wrote:

> Well, that sounds reasonable. But the problem with the moving average
> calculation remains. Suppose you run vacuum and not analyze. If the
> updates are random enough, vacuum won't be able to reclaim all the
> pages, so the number of pages will grow. Again, we'll have the same
> thing where the number of pages grows, the real number of live tuples
> stays constant, and the estimated reltuples grows after each vacuum run.

I agree VACUUM's moving average may be imperfect, but the rationale makes
sense and I don't have a plan to improve it now. This patch only intends to
improve the behavior of ANALYZE by using the estimated row density time
relpages to get reltuples. It does not change VACUUM.

The problem with the moving average for ANALYZE is that it prevents ANALYZE
from changing the reltuples estimate enough for large tables.

Consider this based on the test setup from the patch:

create table big as select id*p, ten, hun, thou, tenk, lahk, meg, padding
from reltuples_test,
generate_series(0,9) g(p);
-- SELECT 100000000
alter table big set (autovacuum_enabled=false);

select count(*) from big;
-- count
-- 100000000
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 0 | 0

analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 1950000 live rows and 0 dead rows;
-- 30000 rows in sample, 100000030 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 100000032 | 1538462

delete from big where ten > 1;
-- DELETE 80000000
select count(*) from big;
-- count
-- 20000000
select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 100000032 | 1538462

analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 388775 live rows and 1561225 dead rows;
-- 30000 rows in sample, 98438807 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
reltuples | relpages
98438808 | 1538462
select count(*) from big;
-- count
-- 20000000

analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 390885 live rows and 1559115 dead rows;
-- 30000 rows in sample, 96910137 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
reltuples | relpages
96910136 | 1538462

Table big has 1.5 million pages. ANALYZE samples 30 thousand. No matter how
many rows we change in T, ANALYZE can only change the reltuples estimate
by old_estimate + new_estimate * (30000/1538462), ie about 1.9 percent.

With the patch on this same table we get:

select count(*) from big;
-- count
-- 20000000
select reltuples::int, relpages from pg_class where relname = 'big';
reltuples | relpages
96910136 | 1538462

analyze verbose big;
-- INFO: analyzing "public.big"
-- INFO: "big": scanned 30000 of 1538462 pages, containing 390745 live rows and 1559255 dead rows;
-- 30000 rows in sample, 20038211 estimated total rows

select reltuples::int, relpages from pg_class where relname = 'big';
-- reltuples | relpages
-- 20038212 | 1538462

-dg

--
David Gould daveg(at)sonic(dot)net
If simplicity worked, the world would be overrun with insects.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-03-01 22:29:58 Re: [PATCH] GET DIAGNOSTICS FUNCTION_NAME
Previous Message Dean Rasheed 2018-03-01 22:26:48 Re: MCV lists for highly skewed distributions