Re: PG 7.2b4 bug?

From: Don Baccus <dhogaza(at)pacifier(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PG 7.2b4 bug?
Date: 2001-12-18 22:05:30
Message-ID: 3C1FBDAA.6080101@pacifier.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:

> Don Baccus <dhogaza(at)pacifier(dot)com> writes:
>
>>Maybe the behavior's implementation defined ... if not, I'd presume SQL3
>> states that a function in the above context is called either once per
>>row or once per query, not sometimes one or sometimes the other.

This is still bothering me so I decided to plunge into the standard
myself. First of all...

>
> AFAICT, the relevant concept in SQL99 is whether a function is
> "deterministic" or not:

Actually this argument may well apply to the function all within the
subselect or view, but I fail to see any language in the standard that
suggests that this trumps the following declaration about the execution
of a <query specification> (what many of us informally refer to as a
"SELECT"):

(from section 7.12, Query Specification)

a) If T is not a grouped table, then

Case:

(I deleted Case i, which refers to standard aggregates like COUNT)

ii) If the <select list> does not include a <set function
specification> that contains a reference to T, then
each <value expression> is applied to each row of T

yielding a table of M rows, where M is the cardinality

of T ...

(FYI a <set function specification> is a standard aggregate like COUNT,
i.e. Case ii pertains to those queries that don't fall into Case i)

ISTM that this quite clearly states that a subselect in a target list
should be applied to each row to be returned in M. I don't see any
waffle-room here. I would have to dig more deeply into the standard's
view regarding VIEW semantics but I would assume it would knit together
in a consistent manner.

For instance, earlier we saw the following exchange between Stephen and Tom:

Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:

> > However, I'd think:
> > "select (select nextval from test_seq) from multiple_rows;"
> > should give you different values and doesn't, although
> > "select (select nextval from test_seq where i IS NULL or i IS NOT NULL)
> > from multiple_rows;" does give you different values.

>
> In the first case, the subselect is visibly not dependent on the outer
> query, so it's evaluated only once; in the second case it has to be
> re-evaluated for each row using that row's value of i.

Note that the standard does not give you this freedom. It says the
<value expression> (in this case the subselect, and yes subselects are
valid <value expressions> in SQL3, at least in my draft) must be applied
to each row.

IMO this means that the optimizer can choose to evaluate the <value
expression> once only if it knows for certain that multiple calls will
return the same value. For example "my_cachable_function()", not
"my_uncachable_function()" or "nextval()".

Or IMO a view built using a non-cachable function.

In other words it can only suppress evaluation if it can be certain that
doing so doesn't change the result.

Another nit-pick, the claim's not even strictly true. "i IS NULL OR i
IS NOT NULL" can be folded to true, so the subselect's not "visibly
dependent on i". In fact, it is quite visibly *not* dependent on the
outer query. PG just isn't smart enough to fold the boolean expression
into the known value "true".

It's this kind of uncertainty that makes the current behavior so ...
ugly. You get different answers depending on various optimization
values, the complexity of the query, etc.

ISTM that the standard is quite clearly worded to avoid this unpleasantness.

...

> It looks to me like the spec does NOT attempt to nail down the behavior
> of non-deterministic functions; in the places where they talk about
> non-deterministic functions at all, it's mostly to forbid their use in
> contexts where nondeterminism would affect the final result. Otherwise
> the results are implementation-dependent.

I've been looking at a few of the "non-deterministic" clauses in the
General Rules, out of curiousity.

They generally aren't involved with the execution or non-execution of
expressions. Ordering of execution is in many cases non-deterministic
and implementation-dependent. There are plenty of General Rules of this
sort.

We also have this big gaping black hole of non-determinism due to
character set collation.

In other words:

select foo
from bar
order by foo;

is non-deterministic (we don't know the order in which the rows will be
returned) if foo is a character type. This can even be true within
implementations, for instance in PG it changes with when you change
locales (and have locale support enabled).

However, it seems clear that:

select foo, my_function()
from bar
order by foo;

requires my_function() to be called for every row - we just can't depend
on the order in which it will be applied to those rows in the case where
foo is a character type. Of course, iscachable tells the optimizer that
it's OK to just call it once but that's an extension outside SQL3's
domain.

Obviously if you run this query over and over again with the same
collation order the "order by" is deterministic. The non-determinism is
in respect to the portability of the query to implementations built on
differing character sets.

I'm just not seeing justification for claiming that Section 7.12 can be
ignored if the subselect or view happens to contain a function that's
not cachable.

--
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2001-12-18 22:23:42 Re: [PATCHES] Problem compiling postgres sql --with-tcl
Previous Message Hiroshi Inoue 2001-12-18 21:39:34 Re: problems with table corruption continued