Re: O(samplesize) tuple sampling, proof of concept

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: O(samplesize) tuple sampling, proof of concept
Date: 2004-04-05 22:30:29
Message-ID: 16177.1081204229@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Manfred Koizar <mkoi-pg(at)aon(dot)at> writes:
>>> Because I was afraid that method 1 might be too expensive in terms of
>>> CPU cycles, I implemented a small variation that skips tuples without
>>> checking them for visibility; this is sampling_method 2.
>>
>> And? Does it matter?

> There's a clearly visible difference for mid-size relations. I'm not
> sure whether this can be attributed to visibility bit updating or other
> noise-contributing factors.

I think it would have to be visibility-bit updates. Can you try it with
a pre-vacuumed relation, so that there is no slowdown for updates?

The update cost will have to be paid sooner or later by some xact, and
on the whole it's better that it be done by a maintenance xact than by
a foreground client; so even if there is a noticeable slowdown here,
that doesn't seem like a reason not to inspect all the tuples.

>>> I find -u diffs close to unreadable for reviewing purposes. Please
>>> submit diffs in -c format in future.

> De gustibus non est disputandum :-)
> Fortunately this patch wouldn't look much different. There is just a
> bunch of "+" lines.

Yeah, so I managed to read it anyway ;-). It's the ones with
intricately intermixed "-" and "+" that I find difficult to follow.

>> AFAICS the rows will *always* be sorted already, and so this qsort is an
>> utter waste of cycles. No?

> I don't think so. We get the *blocks* in the correct order. But tuples
> are still sampled by the Vitter method which starts to replace random
> tuples after the pool is filled.

Duh, you're right --- I was thinking that the old code doesn't need a
qsort, but it does. This seems a tad annoying considering that we know
the tuples were inserted into the pool in increasing order. I wonder if
it's worth using a more complex data structure to keep track of both
orders at once? I suppose that could easily wind up costing more than
the qsort though ...

regards, tom lane

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Manfred Koizar 2004-04-06 00:00:51 Re: O(samplesize) tuple sampling, proof of concept
Previous Message Manfred Koizar 2004-04-05 22:23:19 Re: O(samplesize) tuple sampling, proof of concept