From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
To: Dean Rasheed <dean.a.rasheed@gmail.com>
Cc: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>, David Rowley <david.rowley@2ndquadrant.com>, Alvaro Herrera <alvherre@2ndquadrant.com>, Michael Paquier <michael@paquier.xyz>, Andres Freund <andres@anarazel.de>, Thomas Munro <thomas.munro@enterprisedb.com>, Bruce Momjian <bruce@momjian.us>, Mark Dilger <hornschnorter@gmail.com>, PostgreSQL Hackers <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-03-17 13:18:21
Message-ID: c08dd2ce-14f0-463b-763c-13af49926d03@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 3/17/19 12:47 PM, Dean Rasheed wrote:
> On Sat, 16 Mar 2019 at 23:44, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>>> 28). I just spotted the 1MB limit on the serialised MCV list size. I
>>> think this is going to be too limiting. For example, if the stats
>>> target is at its maximum of 10000, that only leaves around 100 bytes
>>> for each item's values, which is easily exceeded. In fact, I think
>>> this approach for limiting the MCV list size isn't a good one --
>>> consider what would happen if there were lots of very large values.
>>> Would it run out of memory before getting to that test? Even if not,
>>> it would likely take an excessive amount of time.
>>>
>>
>> True. I don't have a very good argument for a specific value, or even
>> for having an explicit limit at all. I initially added it mostly as a
>> safety measure during development, but I think you're right that we
>> can just get rid of it. I don't think it'd run out of memory before
>> hitting the limit, but I haven't tried very hard (though I do recall
>> running into the 1MB limit in the past).
>>
>
> I've just been playing around a little with this and found that it
> isn't safely dealing with toasted values. For example, consider the
> following test:
>
> create or replace function random_string(x int) returns text
> as $$
> select substr(string_agg(md5(random()::text), ''), 1, x)
> from generate_series(1,(x+31)/32);
> $$ language sql;
>
> drop table if exists t;
> create table t(a int, b text);
> insert into t values (1, random_string(10000000));
> create statistics s (mcv) on a,b from t;
> analyse t;
>
> select length(b), left(b,5), right(b,5) from t;
> select length(stxmcv), length((m.values::text[])[2]),
> left((m.values::text[])[2], 5), right((m.values::text[])[2],5)
> from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
> where stxrelid = 't'::regclass;
>
> The final query returns the following:
>
>  length |  length  | left  | right
> --------+----------+-------+-------
>     250 | 10000000 | c2667 | 71492
> (1 row)
>
> suggesting that there's something odd about the stxmcv value. Note,
> also, that it doesn't hit the 1MB limit, even though the value is much
> bigger than that.
>
> If I then delete the value from the table, without changing the stats,
> and repeat the final query, it falls over:
>
> delete from t where a=1;
> select length(stxmcv), length((m.values::text[])[2]),
> left((m.values::text[])[2], 5), right((m.values::text[])[2],5)
> from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
> where stxrelid = 't'::regclass;
>
> ERROR: unexpected chunk number 5008 (expected 0) for toast value
> 16486 in pg_toast_16480
>
> So I suspect it was using the toast data from the table t, although
> I've not tried to investigate further.
>

Yes, it was using the toasted value directly. The attached patch
detoasts the value explicitly, similarly to the per-column stats, and it
also removes the 1MB limit.
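
To make that concrete: the per-column stats code in analyze.c handles
this by skipping over-wide values and explicitly detoasting everything
else before it is stored. A sketch of the same pattern as it might apply
while collecting the multivariate sample follows -- note this is only a
sketch, and the surrounding variable names (value, isnull, attr,
toowide_cnt) are illustrative, not taken from the actual patch:

    /*
     * Sketch only, mirroring the per-column logic in analyze.c: never
     * store a raw TOAST pointer in the stats, because the underlying
     * toast rows may be gone by the time the stats are read.
     */
    if (!isnull && attr->attlen == -1)  /* varlena column */
    {
        if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
        {
            toowide_cnt++;      /* too wide: exclude from the MCV list */
            continue;
        }

        /* replace the TOAST pointer with an in-line, detoasted copy */
        value = PointerGetDatum(PG_DETOAST_DATUM(value));
    }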

>
>>> I think this part of the patch needs a bit of a rethink. My first
>>> thought is to do something similar to what happens for per-column
>>> MCVs, and set an upper limit on the size of each value that is ever
> considered for inclusion in the stats (cf. WIDTH_THRESHOLD and
> toowide_cnt in analyze.c). Over-wide values should be excluded early
>>> on, and it will need to track whether or not any such values were
>>> excluded, because then it wouldn't be appropriate to treat the stats
>>> as complete and keep the entire list, without calling
>>> get_mincount_for_mcv_list().
>>>
>> Which part? Serialization / deserialization? Or how we handle long
>> values when building the MCV list?
>>
>
> I was thinking (roughly) of something like the following:
>
> * When building the values array for the MCV list, strip out rows with
> values wider than some threshold (probably something like the
> WIDTH_THRESHOLD = 1024 from analyze.c would be reasonable).
>
> * When building the MCV list, if some over-wide values were previously
> stripped out, always go into the get_mincount_for_mcv_list() block,
> even if nitems == ngroups (for the same reason a similar thing happens
> for per-column stats -- if some items were stripped out, we're already
> saying that not all items will go in the MCV list, and it's not safe
> to assume that the remaining items are common enough to give accurate
> estimates).
>

Yes, that makes sense, I guess.
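
For the archives, a sketch of how the final selection might then look.
Again, this is illustrative only -- toowide_cnt, nkeep, groups, numrows
and totalrows are made-up names, though get_mincount_for_mcv_list() is
the real function from analyze.c:

    /*
     * Sketch of the final selection.  If every distinct combination in
     * the sample fits in the list (nitems == ngroups) and nothing was
     * excluded for being over-wide, the MCV list is complete and all
     * groups can be kept.  Otherwise the list is incomplete, so apply
     * the usual mincount cut-off, as the per-column stats code does.
     */
    if (nitems == ngroups && toowide_cnt == 0)
        nkeep = ngroups;
    else
    {
        double  mincount = get_mincount_for_mcv_list(numrows, totalrows);

        /* groups are assumed sorted by descending count */
        nkeep = 0;
        while (nkeep < ngroups && groups[nkeep].count >= mincount)
            nkeep++;
    }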

> * In the serialisation code, remove the size limit entirely. We know
> that each value is now at most 1024 bytes, and there are at most 10000
> items, and at most 8 columns, so the total size is already reasonably
> well bounded. In the worst case, it might be around 80MB, but in
> practice, it's always likely to be much much smaller than that.
>

Yep, I've already removed the limit from the current patch.
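
(For the record, that worst case is 10000 items x 8 columns x 1024 bytes
per value = 81,920,000 bytes, i.e. roughly 78MB -- hence "around 80MB".)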

cheers

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment                                                        Content-Type       Size
0001-retain-reltuples-relpages-on-ALTER-TABLE-20190317b.patch.gz  application/gzip   1.7 KB
0002-multivariate-MCV-lists-20190317b.patch.gz                    application/gzip  39.3 KB
0003-multivariate-histograms-20190317b.patch.gz                   application/gzip  48.5 KB
