Skip site navigation (1) Skip section navigation (2)

Re: [INTERFACES] Re: [HACKERS] changes in 6.4

From: David Hartwig <daybee(at)bellatlantic(dot)net>
To: Bruce Momjian <maillist(at)candle(dot)pha(dot)pa(dot)us>
Cc: hannu(at)trust(dot)ee, pgsql-interfaces(at)postgreSQL(dot)org, hackers(at)postgreSQL(dot)org
Subject: Re: [INTERFACES] Re: [HACKERS] changes in 6.4
Date: 1998-08-30 15:40:31
Message-ID: 35E9726E.C6E73049@bellatlantic.net (view raw or flat)
Thread:
Lists: pgsql-generalpgsql-hackerspgsql-interfaces

Bruce Momjian wrote:

> OK, let me try this one.
>
> Why is the system cnf'ifying the query.  Because it  wants to have a
> list of qualifications that are AND'ed, so it can just pick the most
> restrictive/cheapest, and evaluate that one first.
>
> If you have:
>
>         (a=b and c=d) or e=1
>
> In this case, without cnf'ify, it has to evaluate both of them, because
> if one is false, you can't be sure another would be true.  In the
> cnf'ify case,
>
>         (a=b or e=1) and (c=d or e=1)
>
> In this case, it can choose either, and act on just one, if a row fails
> to meet it, it can stop and not evaluate it using the other restriction.
>
> The fact is that it is only going to use fancy join/index in one of the
> two cases, so it tries to pick the best one, and does a brute-force
> qualification test on the remaining item if the first one tried is true.
>
> The problem is of course large where clauses can exponentially expand
> this.  What it really trying to do is to pick a cheapest restriction,
> but the memory explosion and query failure are serious problems.
>
> The issue is that it thinks it is doing something to help things, while
> it is actually hurting things.
>
> In the ODBC case of:
>
>         (x=3 and y=4) or
>         (x=3 and y=5) or
>         (x=3 and y=6) or ...
>
> it clearly is not going to gain anything by choosing any CHEAPEST path,
> because they are all the same in terms of cost, and the use by ODBC
> clients is hurting reliability.
>
> I am inclined to agree with David's solution of breaking apart the query
> into separate UNION queries in certain cases.  It seems to be the most
> logical solution, because the cnf'ify code is working counter to its
> purpose in these cases.
>
> Now, the question is how/where to implement this.  I see your idea of
> making the OR a join to a temp table that holds all the constants.
> Another idea would be to do actual UNION queries:
>
>         SELECT * FROM tab
>         WHERE (x=3 and y=4)
>         UNION
>         SELECT * FROM tab
>         WHERE (x=3 and y=5)
>         UNION
>         SELECT * FROM tab
>         WHERE (x=3 and y=6) ...
>
> This would work well for tables with indexes, but for a sequential scan,
> you are doing a sequential scan for each UNION.

Practically speaking, the lack of an index concern, may not be justified.   The reason
these queries are being generated, with this shape, is because remote data objects on the
client side are being told that a primary key exists on these tables.  The object is told
about these keys  in one of two ways.

1.  It queries the database for the primary key of the table.  The ODBC driver serviced
this request by querying for the attributes used in {table_name}_pkey.

2.  The user manually specifies the primary key.  In this case an actual index may not
exist.   (i.e. MS Access asks the user for this information if a primary key is not found
in a table)

The second case is the only one that would cause a problem.  Fortunately, the solution is
simple.  Add a primary key index!

My only concern is to be able to accurately identify a query with the proper signature
before rewriting it as a UNION.   To what degree should this inspection be taken?

BTW,  I would not do the rewrite on OR's without AND's since you have fixed the OR's use
of the index.

There is one other potential issue.  My experience with using arrays in tables and UNIONS
creates problems.  There are missing array comparison operators which are used by the
implied DISTINCT.

> Another idea is
> subselects.  Also, you have to make sure you return the proper rows,
> keeping duplicates where they are in the base table, but not returning
> them when the meet more than one qualification.
>
>         SELECT * FROM tab
>         WHERE (x,y) IN (SELECT 3, 4
>                         UNION
>                         SELECT 3, 5
>                         UNION
>                         SELECT 3, 6)
>
> I believe we actually support this.  This is not going to use an index
> on tab, so it may be slow if x and y are indexed.
>
> Another more bizarre solution is:
>
>         SELECT * FROM tab
>         WHERE (x,y) = (SELECT 3, 4) OR
>               (x,y) = (SELECT 3, 5) OR
>               (x,y) = (SELECT 3, 6)
>
> Again, I think we do this too.  I don't think cnf'ify does anything with
> this.  I also believe "=" uses indexes on subselects, while IN does not
> because IN could return lots of rows, and an index is slower than a
> non-index join on lots of rows.  Of course, now that we index OR's.
>
> Let me ask another question.  If I do:
>
>         SELECT * FROM tab WHERE x=3 OR x=4
>
> it works, and uses indexes.  Why can't the optimizer just not cnf'ify
> things sometimes, and just do:
>
>         SELECT * FROM tab
>         WHERE   (x=3 AND y=4) OR
>                 (x=3 AND y=5) OR
>                 (x=3 AND y=6)
>
> Why can it handle x=3 OR x=4, but not the more complicated case above,
> without trying to be too smart?  If x,y is a multi-key index, it could
> use that quite easily.  If not, it can do a sequentail scan and run the
> tests.
>
> Another issue.  To the optimizer, x=3 and x=y are totally different.  In
> x=3, it is a column compared to a constant, while in x=y, it is a join.
> That makes a huge difference.
>
> In the case of (a=b and c=d) or e=1, you pick the best path and do the
> a=b join, and throw in the e=1 entries.  You can't easily do both joins,
> because you also need the e=1 stuff.
>
> I wounder what would happen if we prevent cnf'ifying of cases where the
> OR represent only column = constant restrictions.
>
> I meant to really go through the optimizer this month, but other backend
> items took my time.
>
> Can someone run some tests on disabling the cnf'ify calls.  It is my
> understanding that with the non-cnf-ify'ed query, it can't choose an
> optimial path, and starts to do either straight index matches,
> sequential scans, or cartesian products where it joins every row to
> every other row looking for a match.
>
> Let's say we turn off cnf-ify just for non-join queries.  Does that
> help?
>
> I am not sure of the ramifications of telling the optimizer it no longer
> has a variety of paths to choose for evaluating the query.

I did not try this earlier because I thought it was too good to be true.   I was right.
I tried commenting out the normalize() function in the cnfify().   The EXPLAIN showed a
sequential scan and the resulting tuple set was empty.   Time will not allow me to dig
into this further this weekend.

Unless you come up with a better solution, I am going to submit my patch on Monday to
make the Sept. 1st deadline.  It includes a SET switch to activate the rewrite so as not
to cause problems outside the ODBC users.    We can either improve, it or yank it, by the
Oct. 1st deadline.


In response to

Responses

pgsql-hackers by date

Next:From: Massimo Dal ZottoDate: 1998-08-30 15:58:23
Subject: updated contrib modules
Previous:From: Massimo Dal ZottoDate: 1998-08-30 15:24:36
Subject: bug + patch

pgsql-interfaces by date

Next:From: Sbragion DenisDate: 1998-08-31 06:53:12
Subject: Re: [INTERFACES] Re: [HACKERS] changes in 6.4
Previous:From: Thomas G. LockhartDate: 1998-08-30 13:24:17
Subject: Re: [INTERFACES] Feature in tcl interface gone AWOL

pgsql-general by date

Next:From: Cho Yan WongDate: 1998-08-31 00:36:10
Subject: Re: [SQL] copy probs
Previous:From: Lorenzo HuertaDate: 1998-08-29 21:41:01
Subject: big numbers

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group