Re: TABLESAMPLE patch

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jaime Casanova <jaime(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: TABLESAMPLE patch
Date: 2015-01-31 19:08:19
Message-ID: 54CD2823.6080803@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 31/01/15 14:27, Amit Kapila wrote:
> On Fri, Jan 23, 2015 at 5:39 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com
> <mailto:petr(at)2ndquadrant(dot)com>> wrote:
>
> On 19/01/15 07:08, Amit Kapila wrote:
>
> On Sun, Jan 18, 2015 at 12:46 AM, Petr Jelinek
> <petr(at)2ndquadrant(dot)com <mailto:petr(at)2ndquadrant(dot)com>
> <mailto:petr(at)2ndquadrant(dot)com <mailto:petr(at)2ndquadrant(dot)com>>> wrote:
>
>
> I think that's actually good to have, because we still do costing
> and the partial index might help produce better estimate of number
> of returned rows for tablesample as well.
>
>
> I don't understand how will it help, because for tablesample scan
> it doesn't consider partial index at all as per patch.
>

Oh I think we were talking abut different things then, I thought you
were talking about the index checks that planner/optimizer sometimes
does to get more accurate statistics. I'll take another look then.

>
> Well similar, not same as we are not always fetching whole page or
> doing visibility checks on all tuples in the page, etc. But I don't
> see why it can't be inside nodeSamplescan. If you look at bitmap
> heap scan, that one is also essentially somewhat modified sequential
> scan and does everything within the node nodeBitmapHeapscan because
> the algorithm is not used anywhere else, just like sample scan.
>
>
> I don't mind doing everything in nodeSamplescan, however if
> you can split the function, it will be easier to read and understand,
> if you see in nodeBitmapHeapscan, that also has function like
> bitgetpage().
> This is just a suggestion and if you think that it can be splitted,
> then it's okay, otherwise leave it as it is.

Yeah I can split it to separate function within the nodeSamplescan, that
sounds reasonable.

>
>
> Refer parameter (HeapScanDesc->rs_syncscan) and syncscan.c.
> Basically during sequiantial scan on same table by different
> backends, we attempt to keep them synchronized to reduce the I/O.
>
>
> Ah this, yes, it makes sense for bernoulli, not for system though. I
> guess it should be used for sampling methods that use SAS_SEQUENTIAL
> strategy.
>
>
> Have you taken care of this in your latest patch?

Not yet. I think I will need to make the strategy property of the
sampling method instead of returning it by costing function so that the
info can be used by the scan.

>
> Oh and BTW when I delete 50k of tuples (make them invisible) the
> results of 20 runs are between 30880 and 40063 rows.
>
>
> This is between 60% to 80%, lower than what is expected,
> but I guess we can't do much for this except for trying with
> reverse order for visibility test and sample tuple call,
> you can decide if you want to try that once just to see if that
> is better.
>

No, that's because I can't write properly, the lower number was supposed
to be 39880 which is well within the tolerance, sorry for the confusion
(9 and 0 are just too close).

> Anyway, attached is new version with some updates that you mentioned
> (all_visible, etc).
> I also added optional interface for the sampling method to access
> the tuple contents as I can imagine sampling methods that will want
> to do that.
>
> +/*
> +* Let the sampling method examine the actual tuple and decide if we
> +* should return it.
> +*
> +* Note that we let it examine even invisible tuples.
> +*/
> +if (OidIsValid(node->tsmreturntuple.fn_oid))
> +{
> +found = DatumGetBool(FunctionCall4(&node->tsmreturntuple,
> + PointerGetDatum
> (node),
> + UInt32GetDatum
> (blockno),
> + PointerGetDatum
> (tuple),
> + BoolGetDatum
> (visible)));
> +/* XXX: better error */
> +if (found && !visible)
> +elog(ERROR, "Sampling method wanted to return invisible tuple");
> +}
>
> You have mentioned in comment that let it examine invisible tuple,
> but it is not clear why you want to allow examining invisible tuple
> and then later return error, why can't it skips invisible tuple.
>

Well I think returning invisible tuples to user is bad idea and that's
why the check, but I also think it might make sense to examine the
invisible tuple for the sampling function in case it wants to create
some kind of stats about the scan and wants to use those for making the
decision about returning other tuples. The interface should be probably
called tsmexaminetuple instead to make it more clear what the intention is.

> 1.
> How about statistics (pgstat_count_heap_getnext()) during
> SampleNext as we do in heap_getnext?

Right, will add.

>
> 2.
> +static TupleTableSlot *
> +SampleNext(SampleScanState *node)
> +{
> ..
> +/*
> +* Lock the buffer so we can safely assess tuple
> +* visibility later.
> +*/
> +LockBuffer(buffer, BUFFER_LOCK_SHARE);
> ..
> }
>
> When is this content lock released, shouldn't we release it after
> checking visibility of tuple?

Here,
+ if (!OffsetNumberIsValid(tupoffset))
+ {
+ UnlockReleaseBuffer(buffer);

but yes you are correct, it should be just released there and we can
unlock already after visibility check.

>
> 3.
> +static TupleTableSlot *
> +SampleNext(SampleScanState *node)
> {
> ..
> }
>
> Currently in this function as soon as it sees one valid tuple,
> it return's the same, however isn't it better to do some caching
> for tuples on same page like we do in heapgetpage()
> (scan->rs_vistuples[ntup++] = lineoff;). Basically that can avoid
> taking content lock and some other overhead of operating on a
> page.
>

That's somewhat hard question, it would make sense in cases where we
read most of the page (which is true for system sampling for example)
but it would probably slow things down in case where we select small
number of tuples (like say 1) which is true for bernoulli with small
percentage parameter, it's actually quite easy to imagine that on really
big tables (which is where TABLESAMPLE makes sense) we might get blocks
where we don't actually read any tuples. This is where optimizing for
one sampling method will hurt another so I don't know what's better
here. Perhaps the sampling method should have option that says if it
prefers page mode reading or not, because only the author knows this.

Anyway, I am thinking of making the heapgetpage() public and using it
directly. It will mean that we have to initialize HeapScanDesc which
might add couple of lines but we anyway already have to keep last buffer
and last tuple and position info in the scan info so we can instead use
HeapScanDesc for that. There will couple of properties of HeapScanDesc
we don't use but I don't think we care.

BTW I don't expect to have time to work on this patch in next ~10 days
so I will move it to Feb commitfest.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2015-01-31 19:56:48 Re: Providing catalog view to pg_hba.conf file - Patch submission
Previous Message Erik Rijkers 2015-01-31 16:22:58 Re: File based Incremental backup v8