Re: Collect frequency statistics for arrays

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-04 01:38:35
Message-ID: 17172.1330825115@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> [ array statistics patch ]

I've committed this after a fair amount of editorialization. There are
still some loose ends to deal with, but I felt it was ready to go into
the tree for wider testing.

The main thing I changed that wasn't in the nature of cleanup/bugfixing
was that I revised the effort-limiting logic in
mcelem_array_contained_selec. The submitted code basically just punted
if the estimated work was too much, but as was already noted in
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01349.php
that can result in really bad estimates. What I did instead is
something closer to Robert's original suggestion: trim the number of
element values taken into consideration from the array constant to a
value that fits within the desired effort limit. If we consider just
the N most common values from the array constant, we still get a pretty
good estimate (since the trimmed N will still be close to 100 for the
values we're talking about).

I redid the tests in the above-mentioned message and see no cases where
the estimate is off by more than a factor of 2, and very few where it's
off by more than 20%, so this seems to work pretty well now.

The remaining loose ends IMO are:

1. I'm still unhappy about the loop that fills the count histogram,
as I noted earlier today. It at least needs a decent comment and some
overflow protection, and I'm not entirely convinced that it doesn't have
more bugs than the overflow issue.

2. The tests in the above-mentioned message show that in most cases
where mcelem_array_contained_selec falls through to the "rough
estimate", the resulting rowcount estimate is just 1, ie we are coming
out with very small selectivities. Although that path will now only be
taken when there are no stats, it seems like we'd be better off to
return DEFAULT_CONTAIN_SEL instead of what it's doing. I think there
must be something wrong with the "rough estimate" logic. Could you
recheck that?

3. As I mentioned yesterday, I think it'd be a good idea to make some
provisions to reduce the width of pg_statistic rows for array columns
by not storing the scalar-style and/or array-style stats, if the DBA
knows that they're not going to be useful for a particular column.
I have not done anything about that.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-03-04 02:00:26 Re: poll: CHECK TRIGGER?
Previous Message Jeff Janes 2012-03-04 01:20:01 Re: archive_keepalive_command