Re: Improving N-Distinct estimation by ANALYZE

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-16 20:24:38
Message-ID: 50dns19bc4cs20tgg7aq4d3l80ph1f6vjq@4ax.com
Lists: pgsql-hackers

On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs
<simon(at)2ndquadrant(dot)com> wrote:
>I enclose a patch for checking out block sampling.

Can't comment on the merits of block sampling and your implementation
thereof. Just some nitpicking:

|! * Row Sampling: As of May 2004, we use the Vitter algorithm to create

Linking the use of the Vitter algorithm to May 2004 is not quite
appropriate.  We introduced two-stage sampling at that time.

| * a random sample of targrows rows (or less, if there are less in the
|! * sample of blocks). In this case, targblocks is always the same as
|! * targrows, so we always read one row per block.

This is just wrong, unless you add "on average". Even then it is a
bit misleading, because in most cases we *read* more tuples than we
use.

| * Although every row has an equal chance of ending up in the final
| * sample, this sampling method is not perfect: not every possible
| * sample has an equal chance of being selected. For large relations
| * the number of different blocks represented by the sample tends to be
|! * too small. In that case, block sampling should be used.

Is the last sentence a fact or personal opinion?

| * block. The previous sampling method put too much credence in the row
| * density near the start of the table.

FYI, "previous" refers to the date mentioned above:
previous == before May 2004 == before two stage sampling.

|+ /* Assume that we will have rows at least 64 bytes wide
|+ * Currently we're very unlikely to overflow availMem here
|+ */
|+ if ((allocrows * sizeof(HeapTuple)) > (allowedMem >> 4))

This is a funny way of saying
if (allocrows * (sizeof(HeapTuple) + 60) > allowedMem)

It doesn't match the comment above, and it computes something different
if a pointer is not four bytes wide.

|- if (bs.m > 0)
|+ if (bs.m > 0 )

Oops.

|+ ereport(DEBUG2,
|+ (errmsg("ANALYZE attr %u sample: n=%u nmultiple=%u f1=%u d=%u",
|+ stats->tupattnum,samplerows, nmultiple, f1, d)));
^
missing space here and in some more places

I haven't been following the discussion too closely, but didn't you say
that a block sampling algorithm somehow compensates for the fact that
the sample is not random?

Servus
Manfred
