Re: cost and actual time

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-20 10:00:19
Message-ID: 94095vcb2p7db03jmbhnu1agh3k2iqrodf@4ax.com
Lists: pgsql-performance

On Wed, 19 Feb 2003 10:38:54 +0100, Chantal Ackermann
<chantal(dot)ackermann(at)biomax(dot)de> wrote:
>Nested Loop: 53508.86 msec
>Merge Join: 113066.81 msec
>Hash Join: 439344.44 msec

Chantal,

you might have reached the limit of what Postgres (or any other
database?) can do for you with these data structures. Time for
something completely different: Try calculating the counts in
advance.

CREATE TABLE occ_stat (
    did INT NOT NULL,
    gid INT NOT NULL,
    cnt INT NOT NULL
) WITHOUT OIDS;

CREATE INDEX occ_stat_dg ON occ_stat(did, gid);
CREATE INDEX occ_stat_gd ON occ_stat(gid, did);

There is *no* UNIQUE constraint on (did, gid). You get the numbers
you're after by
SELECT did, sum(cnt) AS cnt
FROM occ_stat
WHERE gid = 'whatever'
GROUP BY did
ORDER BY cnt DESC;

occ_stat is initially loaded by

INSERT INTO occ_stat
SELECT did, gid, count(*)
FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
GROUP BY did, gid;

Doing it in chunks
WHERE sid BETWEEN a::bigint AND b::bigint
might be faster.
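
Just to make the chunking concrete, one chunk might look like this (the
sid bounds are only an example; rerun with the next range until the
whole sid range is covered):

-- one chunk of the initial load
INSERT INTO occ_stat
SELECT did, gid, count(*)
FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
WHERE g_o.sid BETWEEN 0::bigint AND 999999::bigint
GROUP BY did, gid;

Partial counts per chunk are harmless, since occ_stat allows duplicates
on (did, gid) and the query above sums them anyway.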

You have to block any INSERT/UPDATE/DELETE activity on d_o and g_o
while you do the initial load. If it takes too long, see below for
how to do it in the background; hopefully the load task will catch up
some day :-)

Keeping occ_stat current:

CREATE RULE d_o_i AS ON INSERT
TO d_o DO (
INSERT INTO occ_stat
SELECT NEW.did, g_o.gid, 1
FROM g_o
WHERE g_o.sid = NEW.sid);

CREATE RULE d_o_d AS ON DELETE
TO d_o DO (
INSERT INTO occ_stat
SELECT OLD.did, g_o.gid, -1
FROM g_o
WHERE g_o.sid = OLD.sid);

On UPDATE do both. Create a set of similar rules for g_o.
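
Spelled out, the UPDATE rule on d_o could look like this; the rule
names (d_o_u, g_o_i) are only examples, and the g_o rules mirror the
d_o ones with the roles of did and gid swapped:

CREATE RULE d_o_u AS ON UPDATE
TO d_o DO (
    -- retract the old row's contribution ...
    INSERT INTO occ_stat
    SELECT OLD.did, g_o.gid, -1
    FROM g_o
    WHERE g_o.sid = OLD.sid;
    -- ... and add the new row's
    INSERT INTO occ_stat
    SELECT NEW.did, g_o.gid, 1
    FROM g_o
    WHERE g_o.sid = NEW.sid);

CREATE RULE g_o_i AS ON INSERT
TO g_o DO (
    INSERT INTO occ_stat
    SELECT d_o.did, NEW.gid, 1
    FROM d_o
    WHERE d_o.sid = NEW.sid);

An UPDATE that leaves sid and did unchanged still inserts a cancelling
-1/+1 pair; a WHERE clause on the rule (like the one in the background
variant below) would avoid that.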

These rules will create a lot of duplicates on (did, gid) in occ_stat.
Updating existing rows and inserting only new combinations might seem
obvious, but this method has concurrency problems (cf. the thread
"Hard problem with concurrency" on -hackers). So occ_stat calls for
reorganisation from time to time:

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
CREATE TEMP TABLE t (did INT, gid INT, cnt INT) WITHOUT OIDS;

INSERT INTO t
SELECT did, gid, sum(cnt)
FROM occ_stat
GROUP BY did, gid
HAVING count(*) > 1;

DELETE FROM occ_stat
WHERE t.did = occ_stat.did
AND t.gid = occ_stat.gid;

INSERT INTO occ_stat SELECT * FROM t;

DROP TABLE t;
COMMIT;
VACUUM ANALYZE occ_stat; -- very important!!

Now this should work, but the rules could kill INSERT/UPDATE/DELETE
performance. Depending on your rate of modifications you might be
forced to push the statistics calculation to the background.

CREATE TABLE d_o_change (
    sid BIGINT NOT NULL,
    did INT NOT NULL,
    cnt INT NOT NULL
) WITHOUT OIDS;

... ON INSERT TO d_o DO (
INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));

... ON DELETE TO d_o DO (
INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1));

... ON UPDATE TO d_o
WHERE OLD.sid != NEW.sid OR OLD.did != NEW.did
DO both

And the same for g_o.
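
Written out (the rule name d_o_change_u is just an example), the UPDATE
rule and the matching change table for g_o might be:

CREATE RULE d_o_change_u AS ON UPDATE
TO d_o
WHERE OLD.sid != NEW.sid OR OLD.did != NEW.did
DO (
    INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1);
    INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));

CREATE TABLE g_o_change (
    sid BIGINT NOT NULL,
    gid INT NOT NULL,
    cnt INT NOT NULL
) WITHOUT OIDS;

g_o then gets the same three rules, writing (sid, gid, +/-1) into
g_o_change.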

You need a task that periodically scans [dg]_o_change and does ...

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT <any row (or some rows) from x_o_change>;
INSERT INTO occ_stat <see above>;
DELETE <the selected row(s) from x_o_change>;
COMMIT;

Don't forget to VACUUM!
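
As a rough sketch of one round of that loop for d_o_change (the :sid,
:did, :cnt placeholders stand for the values returned by the first
SELECT):

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- grab one (sid, did) pair and the net change logged for it
SELECT sid, did, sum(cnt) AS cnt
FROM d_o_change
GROUP BY sid, did
LIMIT 1;

-- fold it into occ_stat (the "<see above>" insert)
INSERT INTO occ_stat
SELECT :did, g_o.gid, :cnt
FROM g_o
WHERE g_o.sid = :sid;

-- every change row for that pair is now accounted for
DELETE FROM d_o_change
WHERE sid = :sid AND did = :did;

COMMIT;
VACUUM ANALYZE occ_stat;  -- and d_o_change now and then

The g_o_change variant is the same with d_o and g_o (and did/gid)
swapped in the INSERT.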

If you invest a little more work, I guess you can combine the
reorganisation into the loader task ...

I have no idea whether this approach is better than what you have now.
With a high INSERT/UPDATE/DELETE rate it may lead to a complete
performance disaster. You have to try ...

Servus
Manfred
