Re: Hash join on int takes 8..114 seconds

From: "Andrus" <kobruleht2(at)hot(dot)ee>
To: <tv(at)fuzzy(dot)cz>
Cc: "Richard Huxton" <dev(at)archonet(dot)com>, "PFC" <lists(at)peufeu(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Hash join on int takes 8..114 seconds
Date: 2008-11-21 19:00:09
Message-ID: EDAB641B4D004D1BB5126A75CFBDD79F@andrusnotebook
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thomas,

Thank you.

> Just the most important points:
>
> 1) "dok" table contains 1235086 row versions in 171641 pages (with 8kB
> pages this means 1.4GB MB of data), but there are 1834279 unused item
> pointers (i.e. about 60% of the space is wasted)
>
> 2) "rid" table contains 3275189 roiws in 165282 (with 8kB pages this means
> about 1.3GB of data), but there are 1878923 unused item pointers (i.e.
> about 30% of the space is wasted)
>
> 3) don't forget to execute analyze after vacuuming (or vacuum analyze)

autovacuum is running.
So if I understand properly, I must ran
VACUUM FULL ANALYZE dok;
VACUUM FULL ANALYZE rid;

Those commands cause server probably to stop responding to other client like
vacuum full pg_shdepend
did.

Should vacuum_cost_delay = 2000 allow other users to work when running those
commands ?

> 4) I'm not sure why the sizes reported by you (for example 2.3GB vs 1.5GB
> for "doc" table) - the difference seems too large for me.

I used pg_total_relation_size(). So 2.3 GB includes indexes also:

8 44286 dok_tasudok_idx 245 MB
10 44283 dok_klient_idx 142 MB
18 44288 dok_tasumata_idx 91 MB
19 44289 dok_tellimus_idx 89 MB
20 44284dok_krdokumnr_idx 89 MB
21 44285 dok_kuupaev_idx 84 MB
22 43531 makse_pkey 77 MB
23 43479 dok_pkey 74 MB
24 44282 dok_dokumnr_idx 74 MB
26 18663923 dok_yksus_pattern_idx 43 MB
27 18801591 dok_sihtyksus_pattern_idx 42 MB

> Anyway the amount of wasted rows seems significant to me - I'd try to
> solve this first. Either by VACUUM FULL or by CLUSTER. The CLUSTER will
> lock the table exclusively, but the results may be better (when sorting by
> a well chosen index). Don't forget to run ANALYZE afterwards.

How to invoke those commands so that other clients can continue work?
I'm using 8.1.4.
Log files show that autovacuum is running.

I'm planning the following solution:

1. Set

vacuum_cost_delay=2000

2. Run the following commands periodically in this order:

VACUUM FULL;
vacuum full pg_shdepend;
CLUSTER rid on (toode);
CLUSTER dok on (kuupaev);
REINDEX DATABASE mydb;
REINDEX SYSTEM mydb;
ANALYZE;

Are all those command required or can something leaved out ?

> Several other things to consider:
>
> 1) Regarding the toode column - why are you using CHAR(20) when the values
> are actually shorter? This may significantly increase the amount of space
> required.

There may be some products whose codes may be up to 20 characters.
PostgreSQL does not hold trailing spaces in db, so this does *not* affect to
space.

> 2) I've noticed the CPU used is Celeron, which may negatively affect the
> speed of hash computation. I'd try to replace it by something faster - say
> INTEGER as an artificial primary key of the "toode" table and using it as
> a FK in other tables. This might improve the "Bitmap Heap Scan on rid"
> part, but yes - it's just a minor improvement compared to the "Hash Join"
> part of the query.

Natural key Toode CHAR(20) is used widely in different queries. Replacing it
with
INT surrogate key requires major application rewrite.

Should I add surrogate index INT columns to toode and rid table and measure
test query speed in this case?

> Materialized views seem like a good idea to me, but maybe I'm not seeing
> something. What do you mean by "reports are different"? If there is a lot
> of rows for a given product / day, then creating an aggregated table with
> (product code / day) as a primary key is quite simple. It may require a
> lot of disk space, but it'll remove the hash join overhead. But if the
> queries are very different, then it may be difficult to build such
> materialized view(s).

log file seems that mostly only those queries are slow:

SELECT ...
FROM dok JOIN rid USING (dokumnr)
JOIN ProductId USING (ProductId)
WHERE rid.ProductId LIKE :p1 || '%' AND dok.SaleDate>=:p2

:p1 and :p2 are parameters different for different queries.

dok contains several years of data. :p2 is usually only few previous months
or last year ago.
SELECT column list contains fixed list of known columns from all tables.

How to create index or materialized view to optimize this types of queries ?

Andrus.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message PFC 2008-11-21 19:08:27 Re: Hash join on int takes 8..114 seconds
Previous Message PFC 2008-11-21 18:31:42 Re: Hash join on int takes 8..114 seconds