From: | Jehan-Guillaume de Rorthais <ioguix(at)free(dot)fr> |
---|---|
To: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Strange avg value size for index on expression in pg_stats |
Date: | 2014-11-10 10:52:54 |
Message-ID: | 20141110115254.2997aaf8@erg |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello,
I'm not sure if I should post here or on pgsql-hackers.
While investigating about a wrong result with my btree bloat estimation query,
I found a strange stat deviation between the average size of a value in its
table and its average size in one index on the "lower" expression.
Take the following scenario:
postgres(at)test=# create table test as
test-# select md5(t::text) AS t
test-# from generate_series(1, 1000000) t;
SELECT 1000000
postgres(at)test=# create index ON test (lower(t));
CREATE INDEX
postgres(at)test=# analyze test;
ANALYZE
postgres(at)test=# select tablename, attname, avg_width
test-# from pg_stats
test-# where schemaname = 'public';
tablename | attname | avg_width
----------------+---------+-----------
test | t | 33
test_lower_idx | lower | 36
Md5 values are always 32 bytes long. In the table, pg_stats reports 33 because
of the text header. In the index, the reported value is 36!
Looking at the page layout documentation and in the index using hexdump, I can
not find any answer about this 3 bytes. PFA the "hexdump -C" output from the
index. For each row, we clearly see a 8 bytes row header followed by a ONE byte
value header (43 'C'), the 32 bytes of the md5 and 7 bytes of padding (00).
A wild guess would be that ANALYZE is considering a text field from an
expression has always a 4 bytes header whatever its actual size (text field
header size is one if the value is < 127 bytes long, 4 in other situations).
This tiny difference is the source of a very bad estimation with the Btree bloat
estimation query when values are around an alignement boundary. As instance,
here is the use case that lead me to this:
tablename | attname | avg_width
--------------------+---------+-----------
customer | email | 23
customer_lower_idx | lower | 26
We have an index on email, and another one on lower(index). The first one is
aligned on 24, the second one on 32. Leading to bloat estimation of 17% for
the first one and -1% for the second one (the estimated index size is bigger
than the real one).
Any information about this from a hacker? Do anyone have an explanation about
this? Is it something that's worth posting on pgsql-hackers?
Regards,
Attachment | Content-Type | Size |
---|---|---|
index-lower-hexdump-C.txt | text/plain | 25.8 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Jehan-Guillaume de Rorthais | 2014-11-10 11:00:45 | Re: Strange avg value size for index on expression in pg_stats |
Previous Message | Ilya Ashchepkov | 2014-11-10 09:07:48 | justify_interval: days in year |