Re: O(samplesize) tuple sampling, proof of concept

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: O(samplesize) tuple sampling, proof of concept
Date: 2004-04-05 19:37:07
Message-ID: 14645.1081193827@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
> The old sampling method is still there. I have added a GUC variable
> sampling_method to make testing and benchmarking easier.

I wouldn't bother with a GUC variable for the production patch.

> Once a block is selected for inspection, all tuples of this
> block are accessed to get a good estimation of the live : dead row
> ratio.

Why are you bothering to compute the live:dead ratio? The statistics
model doesn't have any place for a dead-tuples count, so there's no need
to think about anything but live tuples. (This is a historical
artifact, perhaps, arising from the fact that originally analysis
happened after VACUUM FULL and so the number of dead tuples was
guaranteed to be zero at the time. But still, I'm not sure how we'd
factor a dead-tuples count into the estimates if we had it.)

> Because I was afraid that method 1 might be too expensive in terms of
> CPU cycles, I implemented a small variation that skips tuples without
> checking them for visibility; this is sampling_method 2.

And? Does it matter? (I'd guess not in practice, as checking a tuple
for visibility is cheap if someone's already marked it good.)

> At 1000 pages there is only a difference of approximately 1%,
> at 3000 pages the difference has its maximum of ca. 15%. We can sell
> this by pointing to the better quality of the statistics.

If that's as bad as it gets I think we are OK. You should redo the test
with larger sample size though (try stats target = 100) to see if the
answer changes.

> Are there any spots in the documentation that
> should be updated?

AFAIR there is noplace that specifically needs to be changed.

> diff -ruN ../base/src/backend/commands/analyze.c src/backend/commands/analyze.c

I find -u diffs close to unreadable for reviewing purposes. Please
submit diffs in -c format in future.

> + /*
> + * If we ran out of tuples then we're done. No sort is needed,
> + * since they're already in order.
> + *
> + * Otherwise we need to sort the collected tuples by position
> + * (itempointer). I don't care for corner cases where the tuples
> + * are already sorted.
> + */
> + if (numrows == targrows)
> + qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

AFAICS the rows will *always* be sorted already, and so this qsort is an
utter waste of cycles. No?

regards, tom lane

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Manfred Koizar 2004-04-05 22:23:19 Re: O(samplesize) tuple sampling, proof of concept
Previous Message Manfred Koizar 2004-04-05 17:50:09 O(samplesize) tuple sampling, proof of concept