Re: Optimizing BIG joins ?

From: <kaiq(at)realtyideas(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ed <post2menow(at)hotmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: Optimizing BIG joins ?
Date: 2000-03-30 21:26:18
Message-ID: Pine.LNX.4.10.10003301518310.10432-100000@picasso.realtyideas.com
Lists: pgsql-sql

On Thu, 30 Mar 2000, Tom Lane wrote:

> "Ed" <post2menow(at)hotmail(dot)com> writes:
> > Ok, so here is the query, I hope this will give some ideas:
> > select P.Nam, P.Zip, B.Nam, E.Nam, P.WrkStaDtm
> > from PplTab as P, TchTab as T, PplTchTab as PT,
> > EduTab as E, BrnTab as B
> > where P.PplRef = PT.PplRef and P.BrnRef = B.BrnRef
> > and PT.TchRef = T.TchRef
> > and P.PplRef = E.PplRef
> > and P.Nam = Nam and P.Zip = Zip
> > and P.BrnRef = Brn and T.Nam = Tch
> > and E.BrnRef = Edu and P.WrkStaDtm = WrkStaDtm
>
> Seems pretty reasonable.
>
> > What isn't really clear to me is whether Postgres, or any DB engine in
> > general, is smart enough to take the WHERE restriction into account right
> > from the beginning when performing the join, OR whether it first performs
> > the complete join (as I read in some books) and only then applies the
> > WHERE restriction to the resulting temporary table??
>
> That book misled you. No database engine is stupid enough to try to
> do things that way, because as you correctly perceive, it'd have to
> generate and then discard a huge number of uninteresting tuples.
>
> The way Postgres does it, which I believe is a pretty standard approach,
> is to join two tables together, then join another table to that result,
> then another, etc. At each step, a join pair is selected that is
> constrained by one of the available WHERE clauses. So, for example
> we might start by joining tables P and PT using P.PpRef = PT.PplRef.
> Only the tuples satisfying that restriction get generated. So if you
> have 100 tuples in each relation, you don't get 10000 tuples out but
> only maybe 100 or so (depending on what the data is like of course).
> At the next level up, perhaps we join B using P.BrnRef = B.BrnRef,
> and again we are producing ~100 tuples, not 1000000.
>
> BTW this all happens on the fly --- ordinarily, none of the intermediate
> "tables" is actually produced and stored anywhere. What we have is a
> series of filters that work on streams of tuples, so there are only
> a few tuples actually present in the memory of the backend at any
> instant. Even if there is a very large intermediate result, that just
> translates to processing time not space.
>
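(Side note: you can watch the planner make these choices. A minimal sketch,
assuming Ed's tables exist as named in his query; plain EXPLAIN prints the
join tree the planner picked, with its row and cost estimates, without
executing the query:

  EXPLAIN
  SELECT P.Nam, P.Zip, B.Nam, E.Nam, P.WrkStaDtm
  FROM PplTab P, TchTab T, PplTchTab PT, EduTab E, BrnTab B
  WHERE P.PplRef = PT.PplRef
    AND P.BrnRef = B.BrnRef
    AND PT.TchRef = T.TchRef
    AND P.PplRef = E.PplRef;

The nesting order of the joins in the output is the pair-wise join order
described above; the literal comparisons from the original query are dropped
here only to keep the sketch short.)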
how do you like this? can it be realized in PG in the near future?
(I feel that OO and data warehouse support are the two next steps
after WAL is done?)
-------------------------------
cut/paste from
http://www.redbrick.com/products/white/papers/star/star.html#starschemas

--------------------------------
Picking Better Join Pairs

A common optimization that provides some relief for the star schema join
problem is evaluating more combinations of pair-wise join orderings. The
basic idea is to try to get around the pair-wise join strategy of only
selecting related tables. To understand how this works, you first need to
understand what it means to join unrelated tables. When two tables are
joined and no columns "link" the tables, every combination of the two
tables' rows is produced. In RDBMS-speak, this is called a "Cartesian
product." For example, if the ITEM table had two rows ("paper," "tape")
and the COMPANY table had three rows ("Sears," "KMart," "Walmart"), the
Cartesian product would contain six rows: "paper"+"Sears," "tape"+"Sears,"
"paper"+"KMart," "tape"+"KMart," "paper"+"Walmart," and "tape"+"Walmart."
Normally, the RDBMS would never consider Cartesian products as reasonable
pair-wise join candidates, but for star schemas, there are times when
these Cartesian products improve query performance.
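To make that six-row product concrete, here is a minimal sketch using
hypothetical one-column versions of the two tables (listing tables with no
linking WHERE clause is how you ask a SQL engine for a Cartesian product):

  CREATE TABLE item    (name text);   -- 'paper', 'tape'
  CREATE TABLE company (name text);   -- 'Sears', 'KMart', 'Walmart'

  SELECT i.name, c.name
  FROM item i, company c;             -- no join clause: 2 * 3 = 6 rows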
Generating Cartesian products can sometimes improve query performance
because the fact table in a star schema is typically much larger than its
dimension tables. Remember that one of the problems with selecting related
tables to join is that the fact table is chosen very early on for a
pair-wise join. This can be a very bad choice because it may generate a
large intermediate result due to the sheer size of the fact table.
Alternatively, if a Cartesian product of all dimension tables is first
generated (by successive pair-wise joins), the join to the fact table can
be deferred until the very end. The key benefit is that the large fact
table does not find its way into any of the intermediate join results. The
key cost is the need to generate the Cartesian product of the dimension
tables. As long as the cost of generating the Cartesian product is less
than the cost of generating intermediate results with the fact table, this
optimization has some benefit.
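In SQL, the shape of that plan looks roughly like the sketch below. This is
an illustration only: the SALES fact table and the key and filter columns
are assumptions, not taken from the paper, and the subselect is just a way
to spell out "dimension product first, fact table last":

  SELECT s.units_sold
  FROM (SELECT i.item_key, c.company_key
        FROM item i, company c              -- Cartesian product of the
        WHERE i.name = 'paper'              -- *filtered* dimension rows
          AND c.name = 'Sears') dims,
       sales s                              -- big fact table joined last,
  WHERE s.item_key    = dims.item_key       -- once, against the small
    AND s.company_key = dims.company_key;   -- dimension product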
This simple optimization trick clearly isn't a panacea. Remember that
although the query against the star schema is naturally a multi-table
join, the RDBMS is still artificially breaking it down into a set of
pair-wise joins. Worse yet, this strategy is only viable if the Cartesian
product of the dimension rows selected is much smaller than the number of
rows in the fact table. To illustrate using our example schema, if there
were 10,000 ITEM rows, 500 SHIP_FROM rows, 500 SHIP_TO rows, 3,000 DATE
rows, and 1,000 COMPANY rows, the final intermediate result size could be
10,000 x 500 x 500 x 3,000 x 1,000 = 7.5 x 10^15, i.e. 7,500 trillion
rows! Using this strategy, processing would probably never complete, and
the RDBMS would have to use the more traditional pair-wise join ordering.
The multiplicative nature of the Cartesian join makes the optimization
helpful only for relatively small problems. This last point comes close
to the heart of the matter: "small" and "data warehouse" are not
synonymous.
---------------------------------------------
