Performance HOWTO - pseudo counter example

From: Jean-Michel POURE <jm(dot)poure(at)freesurf(dot)fr>
To: "S Grannis" <sjg(at)email(dot)com>, pgsql-general(at)postgresql(dot)org
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Performance HOWTO - pseudo counter example
Date: 2002-04-29 15:45:08
Message-ID: 200204291745.08315.jm.poure@freesurf.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-general

Le Jeudi 25 Avril 2002 20:54, S Grannis a écrit :
> I think our work-around for now will be to SELECT the column we wish to
> analyze into a flat file and then run a Perl script to do the actual
> counting.

Dear all,

I wrote a small howto to solve S Grannis performance questions on Count()
function. The idea is to create and maintain a pseudo-count table using
triggers and PLpgSQL.

Unfortunately, I could not test the PLpgSQL scripts in 16 Million records for
lack of space reason (only on 1 Million records). Code is included to
generate fake test data. Could someone help me test the howto on 16 million
records?

Thank you for your feedback,
Cheers,
Jean-Michel

*************************************************************************
Performance HOWTO - pseudo counter example
This document is released under PostgreSQL license
*************************************************************************

This tutorial demonstrates how to create fast pseudo-counters in PostgreSQL
using PLpgSQL and triggers.

1) Performance background
This small howto is insprired in reply to an email on
pgsql-general(at)postgresql(dot)org complaining about PostgreSQL speed.

The user needs to run COUNT statements on a large database of 65.000.000
records. The table structure is basically as follows:

CREATE TABLE "data" (
"data_oid" serial8,
"data_yd" int4
);
In our example, data_yd is a year value between 1950 and 2050.

The user needs to run the following query:
SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year.
where foo_year is a date between 1950 and 2050.

The query takes more than two hours to execute on a double-processor computer
running PostgreSQL and GNU/Linux. The proposed solution creates a
pseudo-counter on PostgreSQL using PLpgSQL and triggers. The aim is to return
a result in 0.005 second. Initilisation itself of the pseudo-counter table
should take less than 30 minutes.

1) INSTALLATION
a) Database creation

Open a terminal windows, connect as 'postgres' user:
root(at)localhost>su postgres

Create an empty database:
postgresql(at)localhost>psql template1;
template1=\CREATE DATABASE pseudo_counter;
template1=\q

b) PLpgSQL declaration

PLpgSQL is compiled by default in PostgreSQL. But you should enable PLpgSQL on
the database itself:

postgresql(at)localhost>CREATELANG plpgsql pseudo_counter

c) Data table

We first need to create the table stucture:
CREATE TABLE "data" (
"data_oid" serial8,
"data_yd" int4,
"data_counterenabled" bool DEFAULT 'f'
) WITH OIDS;
CREATE INDEX data_yd_idx ON data USING btree (data_yd);
CREATE INDEX data_counter_idx ON data USING btree (data_counterenabled);

And create a PLpgSQL function to add fake records:

CREATE FUNCTION "init_fakedata"("int8", "int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;
tStart timestamp ;

BEGIN
tStart = now ();
IF ($1>0) AND ($2 >0) AND ($3 >0) AND ($3>$2) THEN
FOR iLoop in 1 .. $1 LOOP
INSERT INTO data (data_yd)
VALUES (
int8 (random () * ($3-$2) +$2)
);
END LOOP;
RETURN ''t'';
ELSE
RETURN ''f'';
END IF;
END;
' LANGUAGE 'plpgsql';

To insert 16 million records with a year range between 1950 and 2050, enter:
SELECT init_fakedata(16000000, 1950, 2050); which should take a while... and
fill mor than 3.2 Gb on disc.

If you cannot wait that long :

For testing, insert 126.953 records :
SELECT init_fakedata(126953, 1950, 2050);
This takes 40s on my server.

Then, repeat 8 times:
INSERT INTO data (data_yd)
SELECT data_yd FROM data;

This should produce 64999936 fake records more quickly as no random function
is
used.

Enter:
CHECKPOINT;
VACUUM ANALYSE;
to clear data cache and update statistics.

d) Pseudo-count table
Now, let's create a pseudo-count table.
CREATE TABLE "pseudo_count" (
"count_oid" serial,
"count_year" int4,
"count_value" int4 DEFAULT 0
) WITH OIDS;
CREATE INDEX count_value_idx ON pseudo_count USING btree (count_value);
CREATE INDEX count_year_idx ON pseudo_count USING btree (count_year);

... and initialize it with the required data (values in the 1950 - 2050 range)
:
CREATE FUNCTION "init_pseudocount"("int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;

BEGIN
IF (($1>0) AND ($2>0) AND ($2>=$1)) THEN
FOR iLoop in $1 .. $2 LOOP
INSERT INTO pseudo_count (count_year)
VALUES (iLoop);
END LOOP;
RETURN ''t'';
ELSE
RETURN ''f'';
END IF;
END;
' LANGUAGE 'plpgsql';

Example :
SELECT init_pseudocount(1950, 2050) ;
will create the required records for years 1900 to 2100.

e) PLpgSQL function and trigger

Pseudo count is handled using a single trigger 'tg_data' running 'tg_data()'
function. TG_OP is used to catch the trigger context ('insert', 'update' or
'delete').

CREATE FUNCTION "tg_data"() RETURNS "opaque" AS 'DECLARE
rec record;
BEGIN

IF (TG_OP=''UPDATE'') THEN
IF (new.data_counterenabled = ''t'') AND (old.data_counterenabled = ''f'')
THEN
UPDATE pseudo_count
SET count_value = count_value +1
WHERE count_year = new.data_yd
AND count_value >= 0;
END IF;

IF (new.data_counterenabled = ''f'') AND (old.data_counterenabled = ''t'')
THEN
UPDATE pseudo_count
SET count_value = count_value -1
WHERE count_year = new.data_yd
AND count_value > 0;
END IF;

IF (old.data_yd <> new.data_yd) THEN
UPDATE pseudo_count
SET count_value = count_value -1
WHERE count_year = old.data_yd
AND count_value > 0;

UPDATE pseudo_count
SET count_value = count_value + 1
WHERE count_year = new.data_yd
AND count_value >= 0 ;
END IF;
END IF;

IF (TG_OP=''DELETE'') THEN
UPDATE pseudo_count
SET count_value = count_value - 1
WHERE count_year = old.data_yd
AND count_value >= 0 ;
END IF;

IF (TG_OP=''UPDATE'') THEN
RETURN new;
ELSE
RETURN old;
END IF;

END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data" FOR EACH
ROW EXECUTE PROCEDURE tg_data();

2) USAGE

a) Initialisation
Set "pseudo_countenabled" flag on:

UPDATE data
SET data_counterenabled = 't'
WHERE data_counterenabled = 'f'

b) Run pseudo-count queries

Instead of :
SELECT COUNT (data_yd)
FROM data
WHERE data_yd = foo_year.

you now can run:
SELECT count_value
FROM pseudo_count
WHERE pseudo_date = foo_year

The anwer comes in 0.005 second.

c) Limits
Before loading large amount of data, triggers on table 'data' should be
dropped and recreated afterwards.

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Mike Baker 2002-04-29 20:42:07 ALTER TABLE hangs
Previous Message berger, ralf 2002-04-29 14:15:55 Re: How do I get encrypted password access from a cgi app

Browse pgsql-general by date

  From Date Subject
Next Message Uros Gruber 2002-04-29 16:34:04 Strange indexing
Previous Message Steve Lane 2002-04-29 15:37:54 Re: Postgres utils chewing RAM