Quick Links

Re: Populating large DB from Perl script

From:	"Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To:	"Kynn Jones" <kynnjo(at)gmail(dot)com>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Populating large DB from Perl script
Date:	2007-11-02 18:57:59
Message-ID:	758d5e7f0711021157g7eca84e8y7c682ba83e86cb04@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On 11/1/07, Kynn Jones <kynnjo(at)gmail(dot)com> wrote:
> Hi. This is a recurrent problem that I have not been able to find a
> good solution for. I have large database that needs to be built from
> scratch roughly once every month. I use a Perl script to do this.
>
> The tables are very large, so I avoid as much as possible using
> in-memory data structures, and instead I rely heavily on temporary
> flat files.
>
> The problem is the population of tables that refer to "internal" IDs
> on other tables. By "internal" I mean IDs that have no meaning
> external to the database; they exist only to enable relational
> referencing. They are always defined as serial integers. So the
> script either must create and keep track of them, or it must populate
> the database in stages, letting Pg assign the serial IDs, and query
> the database for these IDs during subsequent stages.

If it is possible, perhaps you could load "raw" data into temporary
table and then create ids using these tables.
For instance:
CREATE TEMP TABLE foo_raw (host text, city text, who text, value int);
INSERT INTO hosts (host) SELECT DISTINCT host FROM foo; -- group by perhaps?
INSERT INTO [...]
INSERT INTO foo SELECT host_id,city_id,who_id,value
FROM foo_raw
JOIN hosts USING (host)
JOIN cities USING (city)
JOIN who USING (who);
This may or may not work, depending on your setup.

But perhaps a better approach, while needing more work would be:

Your script establishes two DB connections, one for "processing data"
and one for maintaining IDs. Now whenever you need to get and ID do:
1) query memcached if found, return it
2) query database if found return it and insert into memcached
3) insert into database, and insert into memcached, and perhaps commit it.
Befriend thyself with Cache::* perl modules. :)

Regards,
Dawid

In response to

Populating large DB from Perl script at 2007-11-01 18:57:36 from Kynn Jones

Browse pgsql-general by date

	From	Date	Subject
Next Message	andy	2007-11-02 20:33:20	Re: Populating large DB from Perl script
Previous Message	Kynn Jones	2007-11-02 18:45:05	Re: Populating large DB from Perl script