Q on views and performance

From: "Kynn Jones" <kynnjo(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Q on views and performance
Date: 2008-02-22 20:49:59
Message-ID: c2350ba40802221249h37913d0cyaa6387a93c55fe69@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi. I'm trying to optimize the performance of a database whose main purpose
is to support two (rather similar) kinds of queries. The first kind, which
is expected to be the most common (I estimate it will account for about 90%
of all the queries performed on this DB), has the following general
structure:

(Q1) SELECT a1.word, a2.word
FROM T a1 JOIN T a2 USING ( zipk )
WHERE a1.type = <int1>
AND a2.type = <int2>;

...where <int1> and <int2> stand for some two integers. In English, this
query essentially executes an inner join between two "virtual subtables" of
table T, which are defined by the value of the type column. For brevity, I
will refer to these (virtual) subtables as T<int1> and T<int2>. (I should
point out that T holds about 2 million records, spread roughly evenly over
about 100 values of the type column. So each of these virtual subtables has
about 20K records. Also, for practical purposes T may be regarded as an
immutable, read-only table, since it gets re-built from scratch about once a
month. And, FWIW, all the columns mentioned in this post have a NOT
NULLconstraint.)

The second form is similar to the first, except that now the join is taken
between T and T<int2>:

(Q2) SELECT a1.word, a2.word
FROM T a1 JOIN T a2 USING ( zipk )
WHERE a2.type = <int2>;

(Both the forms above are somewhat oversimplified relative to the actual
situation; in our actual application, the joins are actually left outer
ones, and each query also involves an additional inner join with another
table, S. For the sake of completeness, I give the "real-world" versions of
these queries at the end of this post, but I think that for the purpose of
my question, the additional complications they entail can be neglected.)

One way to speed (Q1) would be to break T into its subtables, i.e. to create
T1, T2, T3, ... , T100 as bona fide tables. Then the query would become a
simple join without the two condition of the original's WHERE clause, which
I figure should make it noticeably faster.

But since the second kind of query (Q2) requires T, we can't get rid of this
table, so all the data would need to be stored twice, once in T and once in
some T<int*>.

In trying to come up with a way around this duplication, it occurred to me
that instead of creating tables T1, T2, etc., I could create the analogous
views V1, V2, etc. (e.g. CREATE VIEW V1 AS SELECT * FROM T WHERE type = 1).
With this design, the two queries above would become

(Q1*) SELECT V<int1>.word, V<int2>.word
FROM V<int1> JOIN V<int2> USING ( zipk );

(Q2*) SELECT T.word, V<int2>.word
FROM T JOIN V<int2> USING ( zipk );

Of course, I expect that using views V<int1> and V<int2>... would result in
a loss in performance relative to a version that used bona fide tables
T<int1> and T<int2>. My question is, how can I minimize this performance
loss?

More specifically, how can I go about building table T and the views
V<int?>'s to maximize the performance of (Q1)? For example, I'm thinking
that if T had an additional id column and were built in such a way that all
the records belonging to each V<int?> were physically contiguous, and (say)
had contiguous values in the id column, then I could define each view like
this

CREATE VIEW V<int1> AS SELECT * FROM T
WHERE <start_int1> <= id AND id < <start_int1+1>;

So my question is, what design would make querying V1, V2, V3 ... as fast as
possible? Is it possible to approach the performance of the design that
uses bona fide tables T1, T2, T3, ... instead of views V1, V2, V3 ...?

Thank you very much for reading this long post, and many thanks in advance
for your comments!

Kynn

P.S. Here are the actual form of the queries. They now include an initial
join with table S, and the join with T<int2> (or V<int2>) is a left outer
join. Interestingly, even though the queries below that use views (i.e.
Q1*** and Q2***) are not much more complex-looking than before, the other
two (Q1** and Q2**) are. I don't know if this is because my ineptitude with
SQL, but I am not able to render (Q1**) and (Q2**) without resorting to the
subquery sq.

(Q1**) SELECT a1.word, sq.word FROM
S JOIN T a1 USING ( word )
LEFT JOIN ( SELECT * FROM T a2
WHERE a2.type = <int2> ) sq USING ( zipk )
WHERE a1.type = <int1>;

(Q2**) SELECT a1.word, sq.word FROM
S JOIN T a1 USING ( word )
LEFT JOIN ( SELECT * FROM T a2
WHERE a2.type = <int2> ) sq USING ( zipk )

---------------------------------------------

(Q1***) SELECT V<int1>.word, V<int2>.word FROM
S JOIN V<int1> USING ( word )
LEFT JOIN V<int2> USING ( zipk );

(Q2***) SELECT T.word, V<int2>.word
FROM S JOIN T USING ( word )
LEFT JOIN V<int2> USING ( zipk );

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Dean Gibson (DB Administrator) 2008-02-23 01:48:48 Re: Q on views and performance
Previous Message Geoffrey 2008-02-22 19:27:49 Re: loading same instance of dump to two different servers simultaneously?