From: "Francisco Reyes" <lists(at)stringsutils(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Reducing memory usage of insert into select operations?
Date: 2008-07-18 21:46:18
Message-ID: 5f70023f6a96824ce8b56acec7cca5c5@stringsutils.com
Lists: pgsql-general
On 4:55 pm 07/18/08 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The only thing I can think of is that you had a huge number of rows
> with all the same hash value, so that there wasn't any way to split
> the batch into smaller sections. What are the join keys exactly in
> this query, and what can you tell us about their data distributions?
I can't share the actual table or column names, so below are the actual select
and explain with all names changed.
insert into customer_transactions
(record_id, date, type, amount, ids, groupid)
select
ca.record_id, coh.date, coh.type, coh.amount, coh.ids, ids.groupid
from
customer_original_historical coh,
cards ca,
customer_ids ids
where
ca.natural_key = coh.natural_key
and ids.ids = coh.ids
and coh.yearmo > '200703';
Hash Join  (cost=712213.57..27293913.33 rows=234402352 width=24)
  Hash Cond: (coh.id = ids.id)
  ->  Hash Join  (cost=551387.26..18799378.16 rows=234402352 width=22)
        Hash Cond: (coh.user_id = ca.user_id)
        ->  Seq Scan on customer_original_historical coh  (cost=0.00..6702501.40 rows=234402352 width=47)
              Filter: (yearmo > '200703'::bpchar)
        ->  Hash  (cost=268355.67..268355.67 rows=14637567 width=32)
              ->  Seq Scan on cards ca  (cost=0.00..268355.67 rows=14637567 width=32)
  ->  Hash  (cost=77883.25..77883.25 rows=5055525 width=6)
        ->  Seq Scan on customer_ids ids  (cost=0.00..77883.25 rows=5055525 width=6)
The old schema was a single table, customer_original_historical, which used a
natural key stored in a text field.
Most queries used customer_original_historical by itself, or joined it against
a single other table, which we should call "area".
The new schema I am testing splits the one large table into 12 tables per
month.
In addition, I replaced the natural keys with a synthetic integer key.
I also replaced the "area" table with a customer_ids table that has only two
columns: the synthetic key for historical and a region.
To get 12 tables per month, I grouped all the regions into 12 groups. Queries
usually stay within a single region, so what I am trying to benchmark is
whether dividing 24 months of data into 24 sets of 12 regions performs better
than a single large table.
The distribution of the join keys is:
There are about 1,000,000 unique natural keys. Each natural key has on
average 15 rows per month.
ids are the regions where the natural_keys are located; figure tens of
thousands of natural_keys per id.
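To check the skew question concretely, a query along these lines (a sketch
against the anonymized names above, not something I have run on the real
data) would show whether a single ids value holds enough of the 234M rows to
make a hash batch unsplittable:

```sql
-- Count how many historical rows share each ids value in the affected
-- date range; a heavily dominant ids would explain the hash-join blowup.
SELECT ids, count(*) AS rows_per_id
FROM customer_original_historical
WHERE yearmo > '200703'
GROUP BY ids
ORDER BY rows_per_id DESC
LIMIT 10;
```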
Is that along the lines of what you were looking for?