
Re: cost and actual time

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-20 10:00:19
Message-ID: 94095vcb2p7db03jmbhnu1agh3k2iqrodf@4ax.com
Lists: pgsql-performance
On Wed, 19 Feb 2003 10:38:54 +0100, Chantal Ackermann
<chantal(dot)ackermann(at)biomax(dot)de> wrote:
>Nested Loop: 53508.86 msec
>Merge Join: 113066.81 msec
>Hash Join:  439344.44 msec

Chantal,

you might have reached the limit of what Postgres (or any other
database?) can do for you with these data structures.  Time for
something completely different:  Try calculating the counts in
advance.

    CREATE TABLE occ_stat (
        did INT NOT NULL,
        gid INT NOT NULL,
        cnt INT NOT NULL
    ) WITHOUT OIDS;

    CREATE INDEX occ_stat_dg ON occ_stat(did, gid);
    CREATE INDEX occ_stat_gd ON occ_stat(gid, did);

There is *no* UNIQUE constraint on (did, gid).  You get the numbers
you're after by
    SELECT did, sum(cnt) AS cnt
      FROM occ_stat
     WHERE gid = 'whatever'
     GROUP BY did
     ORDER BY cnt DESC;

occ_stat is initially loaded by

    INSERT INTO occ_stat
    SELECT did, gid, count(*)
      FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
     GROUP BY did, gid;

Doing it in chunks
    WHERE sid BETWEEN a::bigint AND b::bigint
might be faster.
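
For example, one chunk of the load might look like this (the bounds
0 and 999999 are placeholders; pick ranges that fit your sid values):

    INSERT INTO occ_stat
    SELECT did, gid, count(*)
      FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
     WHERE g_o.sid BETWEEN 0::bigint AND 999999::bigint
     GROUP BY did, gid;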

You have to block any INSERT/UPDATE/DELETE activity on d_o and g_o
while you do the initial load.  If it takes too long, see below for
how to do it in the background; hopefully the load task will catch up
some day :-)
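
One way to block writers for the duration of the load (just a sketch;
SHARE mode allows concurrent reads but blocks INSERT/UPDATE/DELETE):

    BEGIN;
    LOCK TABLE d_o IN SHARE MODE;
    LOCK TABLE g_o IN SHARE MODE;
    INSERT INTO occ_stat
    SELECT did, gid, count(*)
      FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
     GROUP BY did, gid;
    COMMIT;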

Keeping occ_stat current:

    CREATE RULE d_o_i AS ON INSERT
        TO d_o DO (
            INSERT INTO occ_stat
            SELECT NEW.did, g_o.gid, 1
              FROM g_o
             WHERE g_o.sid = NEW.sid);

    CREATE RULE d_o_d AS ON DELETE
        TO d_o DO (
            INSERT INTO occ_stat
            SELECT OLD.did, g_o.gid, -1
              FROM g_o
             WHERE g_o.sid = OLD.sid);

On UPDATE do both.  Create a set of similar rules for g_o.
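
The UPDATE rule would simply do both inserts (the rule name d_o_u is
just a suggestion):

    CREATE RULE d_o_u AS ON UPDATE
        TO d_o DO (
            INSERT INTO occ_stat
            SELECT OLD.did, g_o.gid, -1
              FROM g_o
             WHERE g_o.sid = OLD.sid;
            INSERT INTO occ_stat
            SELECT NEW.did, g_o.gid, 1
              FROM g_o
             WHERE g_o.sid = NEW.sid);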

These rules will create a lot of duplicates on (did, gid) in occ_stat.
Updating existing rows and inserting only new combinations might seem
obvious, but this method has concurrency problems (cf. the thread
"Hard problem with concurrency" on -hackers).  So occ_stat calls for
reorganisation from time to time:

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    CREATE TEMP TABLE t (did INT, gid INT, cnt INT) WITHOUT OIDS;
    
    INSERT INTO t
    SELECT did, gid, sum(cnt)
      FROM occ_stat
     GROUP BY did, gid
    HAVING count(*) > 1;
    
    DELETE FROM occ_stat
     WHERE t.did = occ_stat.did
       AND t.gid = occ_stat.gid;
    
    INSERT INTO occ_stat SELECT * FROM t;
    
    DROP TABLE t;
    COMMIT;
    VACUUM ANALYZE occ_stat;  -- very important!!

Now this should work, but the rules could kill INSERT/UPDATE/DELETE
performance.  Depending on your rate of modifications you might be
forced to push the statistics calculation to the background.

    CREATE TABLE d_o_change (
        sid BIGINT NOT NULL,
        did INT NOT NULL,
        cnt INT NOT NULL
    ) WITHOUT OIDS;

    ... ON INSERT TO d_o DO (
        INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));

    ... ON DELETE TO d_o DO (
        INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1));

    ... ON UPDATE TO d_o
        WHERE OLD.sid != NEW.sid OR OLD.did != NEW.did
        DO both

And the same for g_o.
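
Spelled out, that UPDATE rule might look like (again, the rule name is
just a suggestion):

    CREATE RULE d_o_change_u AS ON UPDATE
        TO d_o
        WHERE OLD.sid != NEW.sid OR OLD.did != NEW.did
        DO (
            INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1);
            INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));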

You need a task that periodically scans [dg]_o_change and does ...

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT <any row (or some rows) from x_o_change>;
    INSERT INTO occ_stat <see above>;
    DELETE <the selected row(s) from x_o_change>;
    COMMIT;
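
For d_o_change one iteration could be something like this (untested
sketch, expanding each change row against the current contents of g_o;
g_o_change is handled analogously, joining against d_o instead):

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    INSERT INTO occ_stat
    SELECT c.did, g_o.gid, c.cnt
      FROM d_o_change c INNER JOIN g_o ON (g_o.sid = c.sid)
     WHERE c.sid = (SELECT min(sid) FROM d_o_change);

    DELETE FROM d_o_change
     WHERE sid = (SELECT min(sid) FROM d_o_change);
    COMMIT;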

Don't forget to VACUUM!

If you invest a little more work, I guess you can combine the
reorganisation into the loader task ...

I have no idea whether this approach is better than what you have now.
With a high INSERT/UPDATE/DELETE rate it may lead to a complete
performance disaster.  You have to try ...

Servus
 Manfred
