Re: query produces 1 GB temp file

From: John A Meinel <john(at)arbash-meinel(dot)com>
To: Dirk Lutzebaeck <lutzeb(at)aeccom(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: query produces 1 GB temp file
Date: 2005-02-05 19:26:09
Message-ID: 42051DD1.7020802@arbash-meinel.com
Lists: pgsql-performance

Dirk Lutzebaeck wrote:

>Hi,
>
>here is a query which produces over 1G temp file in pgsql_tmp. This
>is on pgsql 7.4.2, RHEL 3.0, XEON MP machine with 32GB RAM, 300MB
>sort_mem and 320MB shared_mem.
>
>Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
>tables have been analyzed before.
>
>Can some please explain why the temp file is so huge? I understand
>there are a lot of rows.
>
>Thanks in advance,
>
>Dirk
>
>
...

> -> Nested Loop (cost=0.00..8346.73 rows=3 width=1361) (actual time=34.104..18016.005 rows=703677 loops=1)
>
>
Well, there is this particular step where the planner thinks there will
only be 3 rows, but in fact there are 703,677 of them. And the previous line:

> -> Sort (cost=8346.75..8346.76 rows=3 width=1361) (actual time=75357.448..75499.263 rows=22439 loops=1)
>
>
Seems to indicate that after sorting you still have 22,439 rows, which
then get pared down again to 1,000.

I'm assuming that the sort you are trying to do is extremely expensive.
You are sorting 700k rows, which takes up too much memory (about 1GB),
which forces PostgreSQL to spill the sort to a temporary file on disk.
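As a rough sanity check on the size: the plan's reported width times the
actual row count, 703,677 rows * 1361 bytes/row, comes to roughly 950 MB,
so a temp file of about 1 GB is pretty much what you'd expect for
spilling that intermediate result.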

I didn't analyze it a lot, but you might get a lot better performance
from doing a subselect, rather than the query you wrote.

You are joining 4 tables (bi, en, df AS ft, es); I don't know what size
each table is. In the end, though, you don't really care about the en or
es tables (they aren't in your output).

So maybe one of your subselects could be:

where bi.en = (select en from es where es.co = bi.co and es.spec=122293729);
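Just to sketch the shape of it (I can only see the fragment above, so
everything here other than bi.en, es.co, bi.co and es.spec is a made-up
placeholder, not your real schema):

    SELECT bi.*, ft.*                    -- whichever columns you actually return
      FROM bi, df AS ft
     WHERE ft.en = bi.en                 -- placeholder for your real bi/ft join
       AND bi.en = (SELECT en
                      FROM es
                     WHERE es.co = bi.co
                       AND es.spec = 122293729);

The point is just to keep es out of the main join list, so its rows can't
multiply the intermediate result before the filter is applied.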

I'm pretty sure the reason you need 1GB of temp space is that at one
point you have 700k rows. Is it possible to rewrite the query so that it
does more filtering earlier? Your DISTINCT criteria seem to filter it
down to about 20k rows. So maybe it's possible to do some sort of
DISTINCT in part of the subselect, before you start joining against other
tables.
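Purely as an illustration of the shape (the names and the filter inside
the subselect are placeholders, not your real columns):

    SELECT ft.*
      FROM (SELECT DISTINCT en, co           -- whatever your DISTINCT is really on
              FROM bi
             WHERE co = 1                    -- placeholder for your existing filters on bi
           ) AS bi_small
      JOIN df AS ft ON ft.en = bi_small.en;  -- placeholder join condition

That way the DISTINCT runs over bi alone (hopefully 20k-ish rows) instead
of over the 700k-row join result.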

If you have that much redundancy, you might also want to think about a
different normalization.

Just some thoughts.

Also, I thought using the "oid" column wasn't really recommended, since
in *high*-volume databases OIDs aren't even guaranteed to be unique. (I
think it is a 32-bit counter that wraps around.) And on a database dump
and restore they don't stay the same, unless you take extra care to
include them in both the dump and the restore. I believe it is better to
create your own "id" column per table (say SERIAL or BIGSERIAL).
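For instance (a generic sketch, nothing to do with your actual schema):

    CREATE TABLE example_docs (
        id    BIGSERIAL PRIMARY KEY,   -- your own surrogate key, instead of relying on oid
        title TEXT NOT NULL
    );

Then reference example_docs.id from other tables rather than its oid.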

John
=:->
