Re: Direct XML interfaces to optimizer and even executor?

From: Gunther Schadow <gunther(at)aurora(dot)regenstrief(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-interfaces <pgsql-interfaces(at)postgresql(dot)org>, alex <alex(at)pilosoft(dot)com>
Subject: Re: Direct XML interfaces to optimizer and even executor?
Date: 2002-05-30 23:24:13
Message-ID: 3CF6B49D.6000009@aurora.regenstrief.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-interfaces

Tom Lane wrote:

> Gunther Schadow <gunther(at)aurora(dot)regenstrief(dot)org> writes:
>
>>- Sending a parse tree in XML for processing by the optimizer.
>>- Sending an execution plan in XML directly to the executor.
>>
>
> Supporting this would imply freezing our internal representations of
> parse and plan trees (maybe we could move some bits around a little,
> but we couldn't make large changes without breaking the XML API).
> Like Alex, I don't see any gain that's likely to be worth that price;
> especially not when there are so many things I'm unhappy about in both
> representations. There are TODO items to revamp both representations
> in major ways.

I understand, thanks for responding.

Not to belabor that point too much, but the potential for change
would not necessarily be a live or die issue. Everybody would know
that the direct interface is not a standard and could change
drastically at any time. I guess the bigger problem might be with
documenting these changes other than in comments inside code.
You don't want to have people depend on something they shouldn't
and then get angry about you making changes.

I can understand Alex' lapidar response, but from what I read in the
various tuning guides, all SQL optimizers have their ideosynchrasies.
And the question I'm really still not sure about is whether in many
cases in applications you know the best access plan better than you
know how to convince the optimizer to picking it (all vacuuming etc.
granted.) Or, even worse, you may have to at runtime use application
level knowledge to choose an access plan (and then change your
SQL in cryptic ways to convey your desire to the optimizer.) Besides,
if the optimizer can change, the hacks one employs for convincing
it on the SQL level might change too.

For example, we are dealing with research queries to huge tables
(>200 million rows) of patient observations. The best choice of access
plan needs to consider various statistics that a simple histogram
won't cover. I.e., you may want to know what the likelyhood is for
any patient to have an observation of a certain kind and then use
that as your primary criterion to cut down on the size of your
result set as early as possible. For instance, one of our queries
might be "find all patients with a colorectal cancer and hemoglobin
greater than 15 mg/dL." It is extremely critical to first look for the
cancer patients (using an index) and then constrain those with the
hemoglobin, because just about every patient has hemoglobins whereas
cancer is much rarer. So, if you went through hemoglobin you'd end
up doing a complete table scan. Certainly, a simple histogram might
catch that, but there are confounding factors. Since we are searching
for patients, not for individual values, we want to use those
constraints first that have the highest selectivity for patients, not
necessarily for records in the observation table.

These and other issues are all nicely tweakable by SQL if you have
static queries. But if the queries can be in zillions of combinations,
the problem can't be solved by massaging every single SQL query.
(And yes, the problem of n-way joins with n > 6, 7, 8, etc. is very
much a possibility.)

So, Alex suggests to improve the optimizer. That's a thought, however,
it is a much heavier task to do that in general (abstract from the
kind of application domain knowledge) AND it is harder to do that
directly in the backend code than it would be if one could just
send an execution plan.)

Another use case for this that I am considering is that we are
developing a massive distributed querying facility that deal with
horizontal partitioning of data and wants to maintain privacy.
It would be great of course if postgersql had a distributed
database query interface, however, it is much more difficult a
task to implementing this so general that it could be part of the
standard postgresql code than it would be if it was on an experimental
client that directly feeds access plans into the server.

regards
-Gunther

--
Gunther Schadow, M.D., Ph.D. gschadow(at)regenstrief(dot)org
Medical Information Scientist Regenstrief Institute for Health Care
Adjunct Assistant Professor Indiana University School of Medicine
tel:1(317)630-7960 http://aurora.regenstrief.org

In response to

Responses

Browse pgsql-interfaces by date

  From Date Subject
Next Message Tom Lane 2002-05-30 23:40:07 Re: Direct XML interfaces to optimizer and even executor?
Previous Message Tom Lane 2002-05-30 22:38:11 Re: Direct XML interfaces to optimizer and even executor?