Re: Caching (was Re: choosing the right platform)

From: "Ron Mayer" <ron(at)intervideo(dot)com>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Matthew Nuzum" <cobalt(at)bearfruit(dot)org>, "'Pgsql-Performance'" <pgsql-performance(at)postgresql(dot)org>
Cc: <ron(at)intervideo(dot)com>
Subject: Re: Caching (was Re: choosing the right platform)
Date: 2003-04-10 23:06:40
Message-ID: POEDIPIPKGJJLDNIEMBEOEAACJAA.ron@intervideo.com
Lists: pgsql-performance


Short summary...

I think sort_mem matters quite a bit (a 20-40% difference in run time)
in my data-warehousing applications.

Am I doing something wrong to need so much sort_mem?

Josh wrote:
>> ** It's better to not use huge amount of sort-mem...
>
>...However, I have never seen a database use more than 32mb sort mem
>in a single process, so I don't think the 2GB limit will hurt you much ...

Do you think this is true in data warehousing applications as well?

During the ETL part of data warehousing, large sorts are often
used to get the "new" values that need to be inserted
into "dimension" tables, like this:
INSERT INTO dimension_val (id, val)
SELECT nextval('val_seq'), val
  FROM (SELECT DISTINCT val FROM import_table
        EXCEPT
        SELECT val FROM dimension_val) AS a;
As far as I can tell, this query typically does two sorts:
one for the DISTINCT and one for the EXCEPT.
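
(A quick way to see where the sorts come from is to EXPLAIN just the
inner expression; the exact plan will vary with the version and the
data, but that is how I would check it:)

EXPLAIN
SELECT DISTINCT val FROM import_table
EXCEPT
SELECT val FROM dimension_val;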

In a data warehouse we have here, we load about 3 million rows
each week; load time improved from about 9 hours to 7 by breaking
such queries up into expressions that only require one sort at a
time, and surrounding those expressions with "SET sort_mem=something_big"
statements so each sort has enough memory to avoid spilling to disk.

SET SORT_MEM=300000;
CREATE TEMPORARY TABLE potential_new_values AS
  SELECT DISTINCT val FROM import_table;
...
SET SORT_MEM=1000;
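
The elided steps in the middle look roughly like this (just a sketch
using the same table names as above, not the exact production SQL;
actually_new_values is only an illustrative name):

-- one sort here, for the EXCEPT
CREATE TEMPORARY TABLE actually_new_values AS
  SELECT val FROM potential_new_values
  EXCEPT
  SELECT val FROM dimension_val;

-- no large sort here; just assign ids and insert
INSERT INTO dimension_val (id, val)
SELECT nextval('val_seq'), val
  FROM actually_new_values;

Each statement needs only one big sort at a time, so each sort can
fit in the sort_mem I give it.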

Anyone else have similar experience, or am I doing something
wrong to need so much SORT_MEM?

Ron

PS:

Below is an example of another real-world query from the same
reporting system that benefits from a sort_mem over 32MB.
EXPLAIN ANALYZE (below) shows about a 40% improvement from having
the sort fit in memory.

With sort_mem at 10MB or 32MB the query takes over 22 seconds; at
100MB it takes about 14.

====================================================================================================
logs2=#
logs2=#
logs2=# set sort_mem=10000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE: QUERY PLAN:

Unique  (cost=71612.82..72838.69 rows=49035 width=17) (actual time=20315.47..22457.21 rows=2914 loops=1)
  ->  Sort  (cost=71612.82..71612.82 rows=490348 width=17) (actual time=20315.46..21351.42 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.08..2932.72 rows=511368 loops=1)
Total runtime: 22475.63 msec

EXPLAIN
logs2=# set sort_mem=32000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE: QUERY PLAN:

Unique  (cost=60442.82..61668.69 rows=49035 width=17) (actual time=22657.31..24794.19 rows=2914 loops=1)
  ->  Sort  (cost=60442.82..60442.82 rows=490348 width=17) (actual time=22657.30..23714.43 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.07..3020.83 rows=511368 loops=1)
Total runtime: 24811.65 msec

EXPLAIN
logs2=# set sort_mem=100000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE: QUERY PLAN:

Unique  (cost=60442.82..61668.69 rows=49035 width=17) (actual time=12205.19..14012.57 rows=2914 loops=1)
  ->  Sort  (cost=60442.82..60442.82 rows=490348 width=17) (actual time=12205.18..12710.16 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.08..3001.05 rows=511368 loops=1)
Total runtime: 14187.96 msec

EXPLAIN
logs2=#
