Re: TABLESAMPLE doesn't actually satisfy the SQL spec, does it?

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
Date: 2015-07-14 06:31:35
Message-ID: CANP8+jL-i_8G4GRM-JmQU0NS3DWrV17X6ubxmB3k9v0BirmMWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12 July 2015 at 18:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Sun, Jul 12, 2015 at 12:02 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> As best I can tell (evidence below), the SQL standard requires that if a
> >> single query reads a table with a TABLESAMPLE clause multiple times
> (say,
> >> because it's on the inside of a nestloop), then the exact same set of
> >> sampled rows are returned each time.
>
> > Hmm, I tend to agree that it would be good if it behaved that way.
> > Otherwise, it seems like the behavior could be quite surprising.
>
> Yeah. As a concrete example, consider
>
> select * from t1, t2 tablesample ... where t1.x = t2.x
>
> and suppose that there are multiple occurences of x = 10 in both tables.
> As things stand, if the join is done as a nestloop then a particular t2
> row with x = 10 might appear in the output joined with some of the t1 rows
> with x = 10 but not with others. On the other hand, the results of a hash
> join would not be inconsistent in that way, since t2 would be read only
> once.
>

Hmm, a non-key join to a sampled table. What would the meaning of such a
query be? The table would need to big enough to experience updates and also
be under current update activity. BERNOULLI isn't likely to have many users
because it is so slow. So overall, such a query is not useful and as such
unlikely.

The mechanism of sampling was discussed heavily before and there wasn't an
approach that met all of the goals: IIRC we would need to test visibility
twice on each tuple to get around these problems. Given that users of
TABLESAMPLE have already explicitly stated their preference for speed over
accuracy, minor tweaks to handle corner cases don't seem warranted.

If you have a simple, better way I would not object. Forgive me, I haven't
yet understood your proposal about sampling rule above.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Shulgin, Oleksandr 2015-07-14 06:32:51 Re: [PATCH] Generalized JSON output functions
Previous Message Noah Misch 2015-07-14 05:07:08 Re: PostgreSQL 9.5 Alpha 1 build fail with perl 5.22