Re: Populating large DB from Perl script

From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: "Kynn Jones" <kynnjo(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Populating large DB from Perl script
Date: 2007-11-02 18:57:59
Message-ID: 758d5e7f0711021157g7eca84e8y7c682ba83e86cb04@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 11/1/07, Kynn Jones <kynnjo(at)gmail(dot)com> wrote:
> Hi. This is a recurrent problem that I have not been able to find a
> good solution for. I have large database that needs to be built from
> scratch roughly once every month. I use a Perl script to do this.
>
> The tables are very large, so I avoid as much as possible using
> in-memory data structures, and instead I rely heavily on temporary
> flat files.
>
> The problem is the population of tables that refer to "internal" IDs
> on other tables. By "internal" I mean IDs that have no meaning
> external to the database; they exist only to enable relational
> referencing. They are always defined as serial integers. So the
> script either must create and keep track of them, or it must populate
> the database in stages, letting Pg assign the serial IDs, and query
> the database for these IDs during subsequent stages.

If it is possible, perhaps you could load "raw" data into temporary
table and then create ids using these tables.
For instance:
CREATE TEMP TABLE foo_raw (host text, city text, who text, value int);
INSERT INTO hosts (host) SELECT DISTINCT host FROM foo; -- group by perhaps?
INSERT INTO [...]
INSERT INTO foo SELECT host_id,city_id,who_id,value
FROM foo_raw
JOIN hosts USING (host)
JOIN cities USING (city)
JOIN who USING (who);
This may or may not work, depending on your setup.

But perhaps a better approach, while needing more work would be:

Your script establishes two DB connections, one for "processing data"
and one for maintaining IDs. Now whenever you need to get and ID do:
1) query memcached if found, return it
2) query database if found return it and insert into memcached
3) insert into database, and insert into memcached, and perhaps commit it.
Befriend thyself with Cache::* perl modules. :)

Regards,
Dawid

In response to

Browse pgsql-general by date

  From Date Subject
Next Message andy 2007-11-02 20:33:20 Re: Populating large DB from Perl script
Previous Message Kynn Jones 2007-11-02 18:45:05 Re: Populating large DB from Perl script