Re: TABLESAMPLE patch

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-01-23 00:09:35
Message-ID: 54C1913F.50705@2ndquadrant.com
Lists: pgsql-hackers

On 19/01/15 07:08, Amit Kapila wrote:
> On Sun, Jan 18, 2015 at 12:46 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com
> <mailto:petr(at)2ndquadrant(dot)com>> wrote:
> No issues, but it seems we should check other paths where
> different handling could be required for tablesample scan.
> In set_rel_size(), it uses normal path for heapscan (set_plain_rel_size())
> for rel size estimates where it checks the presence of partial indexes
> and that might affect the size estimates, and that doesn't seem to
> be required when we have to perform scan based on TableSample
> method.

I think that's actually good to have, because we still do costing, and
the partial index might help produce a better estimate of the number of
returned rows for the tablesample scan as well.

>
> 4.
> SampleNext()
> a.
> Isn't it better to code SampleNext() similar to SeqNext() and
> IndexNext(), which encapsulate the logic to get the tuple in
> another function(tbs_next() or something like that)?
>
>
> Possibly, my thinking was that unlike the index_getnext() and
> heap_getnext(), this function would not be called from any other
> place so adding one more layer of abstraction didn't seem useful.
> And it would mean creating new ScanDesc struct, etc.
>
>
> I think adding additional abstraction would simplify the function
> and reduce the chance of discrepancy; I think we otherwise need to
> create an almost duplicate version of the code in heapgetpage().
> Yeah, I agree that we need to build structure like ScanDesc, but
> still it will be worth to keep code consistent.

Well, similar, but not the same, as we are not always fetching the whole
page or doing visibility checks on all tuples in the page, etc. But I
don't see why it can't live inside nodeSamplescan. If you look at bitmap
heap scan, that one is also essentially a somewhat modified sequential
scan, and it does everything within nodeBitmapHeapscan because the
algorithm is not used anywhere else, just like the sample scan.

>
> Another one is PageIsAllVisible() check.
>

Done.

> Another thing is, don't we want to do anything about sync scans
> for these methods, as they are doing a heap scan?
>
>
> I don't follow this one tbh.
>
>
> Refer parameter (HeapScanDesc->rs_syncscan) and syncscan.c.
> Basically, during a sequential scan on the same table by different
> backends, we attempt to keep them synchronized to reduce the I/O.
>

Ah, this one. Yes, it makes sense for bernoulli, though not for system. I
guess it should be used for sampling methods that use the SAS_SEQUENTIAL
strategy.
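For what it's worth, the idea behind syncscan.c fits in a few lines. Here is a toy model in Python, not the actual C code; ss_get_location()/ss_report_location() are the real entry points, everything else is made up for illustration:

```python
# Toy model of synchronized sequential scans (NOT PostgreSQL code).
# Concurrent scans of the same relation publish their current block so
# a newly started scan can begin there and share the same buffer-cache
# window instead of starting over from block 0.

scan_locations = {}  # relid -> last reported block number


def start_scan(relid, nblocks):
    """Mimics ss_get_location(): start where a concurrent scan already is."""
    return scan_locations.get(relid, 0) % nblocks


def report_location(relid, blockno):
    """Mimics ss_report_location(): advertise our progress for later scans."""
    scan_locations[relid] = blockno


print(start_scan(16384, 1000))   # no concurrent scan yet: block 0
report_location(16384, 437)      # first backend reports it is at block 437
print(start_scan(16384, 1000))   # a second backend joins at block 437
```

A sampling method using SAS_SEQUENTIAL would reasonably want to participate in this, while system-style block sampling gains nothing from it.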

>
> c.
> for bernoulli method, it will first get the tupoffset with
> the help of function and then apply visibility check, it seems
> that could reduce the sample set way lower than expected
> in case a lot of tuples are not visible, shouldn't it be done in
> reverse
> way(first visibility check, then call function to see if we want to
> include it in the sample)?
> I think something similar is done in acquire_sample_rows which
> seems right to me.
>
>
> I don't think so, the way bernoulli works is that it returns every
> tuple with equal probability, so the visible tuples have the same chance
> of being returned as the invisible ones so the issue should be
> smoothed away automatically (IMHO).
>
>
> How will it get smoothed out for cases when, let us say,
> more than 50% of tuples are not visible? I think it is
> due to Postgres's non-overwriting storage architecture
> that we maintain multiple versions of rows in the same heap;
> otherwise, for a different kind of architecture (like mysql/oracle)
> where multiple row versions are not maintained in the same
> heap, the same function (with the same percentage) could return
> an entirely different number of rows.
>

I don't know how else to explain it: if we loop over every tuple in the
relation and return each one with equal probability, then visibility
checks don't matter, as the percentage of visible tuples will be the same
in the result as in the relation.

For example, if you have 30% visible tuples and you are interested in 10%
of tuples overall, the scan will return 10% of all tuples, and since every
tuple has a 10% chance of being returned, about 30% of those should be
visible (assuming a smooth distribution of the generated random numbers).
So in the end you get 10% of the visible tuples, because the scan will
obviously skip the invisible ones, and that's what you wanted.
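To illustrate, here is a toy Monte Carlo in Python (not the patch's code; the 30% visibility rate and 10% sampling fraction are made-up numbers):

```python
import random

# Toy Monte Carlo of the argument above (NOT the patch's code).
random.seed(42)

N = 100_000       # tuples in the relation
P_VISIBLE = 0.3   # 30% of tuples pass the visibility check
P_SAMPLE = 0.1    # TABLESAMPLE BERNOULLI (10)

tuples = [random.random() < P_VISIBLE for _ in range(N)]  # True = visible

# Bernoulli scan order: decide inclusion for *every* tuple first,
# then let the visibility check discard the invisible ones.
sampled_then_visible = sum(
    1 for visible in tuples if random.random() < P_SAMPLE and visible
)

visible_total = sum(tuples)
# The visible rows returned are still ~10% of all visible rows,
# even though sampling ran before the visibility check.
print(sampled_then_visible / visible_total)  # close to 0.1
```

The invisible tuples consume sampling decisions but never reach the result, so the returned fraction of visible tuples converges to the requested percentage regardless of the visibility ratio.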

As I said, the problem with analyze is that it uses a tuple limit instead
of a probability.

> 5.
>
>
> So the above test yields 60% of the rows the first time and 100% the
> next, when the test requested 80%.
> I understand that sample percentage will result an approximate
> number of rows, however I just wanted that we should check if the
> implementation has any problem or not?
>
>
> I think this is caused by random generator not producing smooth
> enough random distribution on such a small sample (or possibly in
> general?).
>
>
> Yeah, it could be possible; I feel we should check with a large
> sample of rows to see if there is any major difference.
>
>
> In this regard, I have one question related to below code:
>
> Why are we not considering tuple in above code
> if tupoffset is less than maxoffset?
>
>
> We consider it, I will rename the samplesize to probability or
> something to make it more clear what it actually is and maybe expand
> the comment above the loop.

Yes, the differences are smaller (in relative terms) for bigger tables
when I test this. On a 1k-row table the spread over 20 runs was between
770 and 818, and on a 100k-row table it was between 79868 and 80154. I
think that's acceptable variance, and it's IMHO indeed the random
generator that produces it.
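For what it's worth, those spreads are about what plain binomial sampling variance predicts. A quick back-of-the-envelope check (assuming each of the n tuples is an independent Bernoulli(p) inclusion decision):

```python
import math

def binomial_2sigma(n, p):
    """Expected count plus/minus two standard deviations for n
    independent Bernoulli(p) trials: mean = n*p, sd = sqrt(n*p*(1-p))."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean - 2 * sd, mean + 2 * sd

# ~ (774.7, 825.3); the observed 20-run spread was 770-818
print(binomial_2sigma(1_000, 0.8))
# ~ (79747, 80253); the observed 20-run spread was 79868-80154
print(binomial_2sigma(100_000, 0.8))
```

Both observed ranges sit inside (or barely outside) the two-sigma band, which supports the view that the random generator, not the scan logic, accounts for the variation.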

Oh, and BTW, when I delete 50k of the tuples (making them invisible), the
results of 20 runs fall between 30880 and 40063 rows.

>
>
> I originally started from SeqScan but well, it requires completely
> different State structure, it needs to call sampling methods in
> various places (not just for next tuple), it needs different
> handling in explain and in optimizer (costing). If we add all this
> to sequential scan it will not be that much different from new scan
> node (we'd save the couple of new copy functions and one struct, but
> I'd rather have clearly distinct scan).
>
>
> I understand that point, but I think it is worth considering if
> it can be done as SeqScan node especially because plan node
> doesn't need to store any additional information for sample scan.
>
> I think this point needs some more thoughts and if none of us
> could come up with a clean way to do it via the seqscan node, then we
> can proceed with the separate node idea.
>
> Another reason why I am asking to consider it is that after
> spending effort on further review and making the current approach
> robust, it should not happen that someone else (probably one
> of the committers) says that it can be done with the sequential scan
> node without much trouble.

I am sure it could be done with the sequential scan node. I am not sure
it would be pretty, and more importantly, TABLESAMPLE is *not* a
sequential scan. The fact that the bernoulli method looks similar should
not make us assume that every other method does as well, especially when
the system method is completely different.

>
> One another separate observation:
>
> typedef struct SamplePath
> {
>     Path        path;
>     Oid         tsmcost;    /* table sample method costing function */
>     List       *tsmargs;    /* arguments to a TABLESAMPLE clause */
> } SamplePath;
>
> a.
> Do we really need to have tsmcost and tsmargs stored in SamplePath
> when we don't want to maintain them in plan (SamplePlan), patch gets
> all the info via RTE in executor, so it seems to me we can do without
> them.
>

Hmm, yeah, we actually don't; I removed them.

Anyway, attached is a new version with some of the updates you mentioned
(all_visible, etc.).
I also added an optional interface for the sampling method to access the
tuple contents, as I can imagine sampling methods that will want to do
that. And I updated the test/sample module to use this new API.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-separate-block-sampling-functions-v2.patch text/x-diff 21.9 KB
0002-tablesample-v7.patch text/x-diff 99.0 KB
0003-tablesample-ddl-v3.patch text/x-diff 48.9 KB

