From: "Francisco Reyes" <lists(at)stringsutils(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Reducing memory usage of insert into select operations?
Date: 2008-07-18 21:46:18
Message-ID: 5f70023f6a96824ce8b56acec7cca5c5@stringsutils.com
Lists: pgsql-general
On 4:55 pm 07/18/08 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The only thing I can think of is that you had a huge number of rows
> with all the same hash value, so that there wasn't any way to split
> the batch into smaller sections. What are the join keys exactly in
> this query, and what can you tell us about their data distributions?
I can't share the actual table or column names, so below are the actual select
and explain with all names changed.
insert into customer_transactions
(record_id, date, type, amount, ids, groupid)
select
ca.record_id, coh.date, coh.type, coh.amount, coh.ids, ids.groupid
from
customer_original_historical coh,
cards ca,
customer_ids ids
where
ca.natural_key = coh.natural_key
and ids.ids = coh.ids
and coh.yearmo > '200703';
Hash Join  (cost=712213.57..27293913.33 rows=234402352 width=24)
  Hash Cond: (coh.id = ids.id)
  ->  Hash Join  (cost=551387.26..18799378.16 rows=234402352 width=22)
        Hash Cond: (coh.user_id = ca.user_id)
        ->  Seq Scan on customer_original_historical coh  (cost=0.00..6702501.40 rows=234402352 width=47)
              Filter: (yearmo > '200703'::bpchar)
        ->  Hash  (cost=268355.67..268355.67 rows=14637567 width=32)
              ->  Seq Scan on cards ca  (cost=0.00..268355.67 rows=14637567 width=32)
  ->  Hash  (cost=77883.25..77883.25 rows=5055525 width=6)
        ->  Seq Scan on customer_ids ids  (cost=0.00..77883.25 rows=5055525 width=6)
The old schema was a single table, customer_original_historical, which used a
natural key stored in a text field.
Most queries used customer_original_historical by itself, or joined it against
a single other table, which we should call "area".
The new schema I am testing splits the one large table into 12 tables per
month.
In addition, I replaced the natural keys with a synthetic integer key.
I also replaced the "area" table with a customer_ids table that has only two
columns: the synthetic key for historical and a region.
To get 12 tables per month, I grouped all the regions into 12 groups. Queries
usually stay within a single region, so what I am trying to benchmark is
whether dividing 24 months of data into 24 sets of 12 regions performs better
than a single large table.
The distribution of the join keys is:
There are about 1,000,000 unique natural keys. Each natural key has on
average 15 rows per month.
ids are the regions where the natural_keys are located; figure tens of
thousands of natural_keys per id.
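To check the skew question concretely, a query along these lines (a sketch
against the anonymized names above, not something I have run on the real
data) would show whether a single ids value holds enough of the 234M rows to
make a hash batch unsplittable:

```sql
-- Count how many historical rows share each ids value in the affected
-- date range; a heavily dominant ids would explain the hash-join blowup.
SELECT ids, count(*) AS rows_per_id
FROM customer_original_historical
WHERE yearmo > '200703'
GROUP BY ids
ORDER BY rows_per_id DESC
LIMIT 10;
```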
Is that along the lines of what you were looking for?