Re: Multi-column distinctness.

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multi-column distinctness.
Date: 2015-09-06 08:31:25
Message-ID: 55EBF9DD.9070605@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Horiguchi-san,

On 08/28/2015 10:33 AM, Kyotaro HORIGUCHI wrote:
> Hello, this patch enables planner to be couscious of inter-column
> correlation.
>
> Sometimes two or more columns in a table has some correlation
> which brings underestimate, which leads to wrong join method and
> ends with slow execution.
>
> Tomas Vondra is now working on heavily-equipped multivariate
> statistics for OLAP usage. In contrast, this is a lightly
> implemented solution which calculates only the ratio between a
> rows estimated by current method and a actual row number. I think
> this doesn't conflict with his work except the grammar part.

I'd like to sign up as a reviewer of this patch, if you're OK with that.
I don't see the patch in any of the commitfests, though :-(

Now, a few comments based on reading the patches (will try testing them
in the next few days, hopefully).

1) catalog design
-----------------
I see you've invented a catalog with a slightly different structure than
I use in the multivariate stats patch, storing the coefficient for each
combination of columns in a separate row. The idea in the multivariate
patch is to generate all possible combinations of columns and store the
coefficients packed in a single row, which requires implementing a more
complicated data structure. But your patch does not do that (only
computes coefficient for the one combination).

This is another potential incompatibility with the multivariate patch,
although it's rather in the background and should not be difficult to
rework if needed.

2) find_mv_coeffeicient
---------------------
I don't quite see why this is the right thing

/* Prefer smaller one */
if (mvc->mvccoefficient > 0 && mvc->mvccoefficient < mv_coef)
mv_coef = mvc->mvccoefficient;

i.e. why it's correct to choose the record with the lower coefficient
and not the one with most columns (the largest subset). Maybe there's
some rule that those are in fact the same thing (seems like adding a
column can only lower the coefficient), but it's not quite obvious and
would deserve a comment.

3) single statistics limitation
-------------------------------
What annoys me a bit is that this patch only applies a single statistics
- it won't try "combining" multiple smaller ones (even if they don't
overlap, which shouldn't be that difficult).

4) no sub-combinations
----------------------
Another slightly annoying thing is that the patch computes statistics
only for the specified combination of columns, and no subsets. So if you
have enabled stats on [a,b,c] you won't be able to use that for
conditions on [a,b], for example. Sure, the user can create the stats
manually, but IMO the less burden we place on the user the better. It's
enough that we force users to create the stats at all, we shouldn't
really force them co create all possible combinations.

5) syntax
---------
The syntax might be one of the pain points if we eventually decide to
commit the multivariate stats patch. I have no intention in blocking
this patch for that reasons, but if we might design the syntax to make
it compatible with the multivariate patch, that'd be nice. But it seems
to me the syntax is pretty much the same, no?

I.e. it uses

ADD STATISTICS (options) ON (columns)

just like the multivariate patch, no? Well, it doesn't really check the
stattypes in ATExecAddDropMvStatistics except for checking there's a
single entry, but the syntax seems OK.

BTW mixing ADD and DROP in ATExecAddDropMvStatistics seems a bit
confusing. Maybe two separate methods would be better?

6) GROUP BY
-----------
One thing I'd really love to see is improvement of GROUP BY clauses.
That's one of the main sources of pain in analytical queries, IMHO,
resulting in either OOM issues in Hash Aggregate, or choice of sort
paths where hash would be more efficient.

regards
Tomas

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2015-09-06 08:33:04 Re: Outdated documentation for PageHeaderData
Previous Message Simon Riggs 2015-09-06 08:26:47 Re: Multi-column distinctness.