Re: Normalization tools for postgres?

From: Richard Huxton <dev(at)archonet(dot)com>
To: Dane Springmeyer <blake(at)hailmail(dot)net>, PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Normalization tools for postgres?
Date: 2007-11-21 20:15:13
Message-ID: 474491D1.9090403@archonet.com
Lists: pgsql-general pgsql-php

Dane Springmeyer wrote:
> On Nov 21, 2007, at 1:37 AM, Richard Huxton wrote:
>> Dane Springmeyer wrote:

>>> These mostly reside in MS Access in flat tables and I am
>>> importing them into postgres.
>>
>> If you are familiar/comfortable with Access and VB, I'd probably do
>> the work there.
>
> Unfortunately I am not. I am only familiar with very basic SQL in Access
> and not with VB. And I am not interested in investing any time learning
> Microsoft products. I'd rather put time into learning postgres and php or
> other languages to manipulate data in postgres.

Fair enough. It's easy enough to get started with PHP. Note that you don't
need to run it under a webserver - you can run it from the command-line
too. Any of the scripting languages will do nicely for this sort of thing -
Perl, Python, Ruby etc.

>> 1. Import as-is into a table called e.g. raw_data
>>
>> 2. Identify/add the primary-key (presumably ID in this case) in raw_data
>> If no ID, you can add a column of type SERIAL to raw_data and let it
>> be populated automatically.
>>
>> 3. CREATE TABLE lookups.region (id SERIAL, description text NOT NULL
>> UNIQUE, PRIMARY KEY (id));
>> INSERT INTO lookups.region (description) SELECT DISTINCT region from
>> raw_data;
>>
>> 4. Repeat #3 for other lookups
>>
>> 5. CREATE TABLE processed_data (...);
>> INSERT INTO processed_data (id, region_id, ...)
>> SELECT raw.id, lkp_reg.id, ...
>> FROM raw_data raw
>> JOIN lookups.region lkp_reg ON raw.region = lkp_reg.description
>> JOIN lookups.whatever...

> Wow. That was EXTREMELY helpful. With those concepts I've now been able
> to do EXACTLY what I was shooting for AND now understand the SQL well
> enough to start thinking of better ways to do it as well. Thank you.

Good.

> Here is the SQL which inserts the sample data and processes it into 4
> different tables. Perhaps after you take a look I could post this back
> to the group?

I've cc:ed the list on this one for you - plenty of smart people on that
list.

> CREATE TABLE raw_data (

Later on, you might want to consider CREATE TEMPORARY TABLE... but don't
worry about that for the moment.

> wid int4,
> region character(35),
> drain character(65),
> eco character(29)

These should almost certainly be "varchar" (or "character varying") -
unless you actually want each field padded with spaces. In fact, since
this is just a temporary thing I'd make them all type "text".

> );
>
> INSERT INTO raw_data VALUES (11210, 'SW Oregon / N Cali Coast', 'Rogue
> River', 'California Coast');
> INSERT INTO raw_data VALUES (11100, 'SW Oregon / N Cali Coast', 'Coastal
> grouping of Chetco River and Pistol River', 'California Coast');

You might want to read the manuals regarding the COPY <table> FROM STDIN
command. Good for bulk-loading data.
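For instance, assuming comma-separated input and the raw_data table above,
something along these lines (not checked):

COPY raw_data (wid, region, drain, eco) FROM STDIN WITH CSV;
11210,SW Oregon / N Cali Coast,Rogue River,California Coast
11100,SW Oregon / N Cali Coast,Coastal grouping of Chetco River and Pistol River,California Coast
\.

The terminating "\." line tells psql where the data ends.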

Also, you could download the pg-odbc driver and link to PG from Access
to copy the data over.

>
> drop table region;

There's an "IF EXISTS" clause you can add to DROP TABLE - prevents errors
when the table doesn't exist yet.
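e.g.

DROP TABLE IF EXISTS region;

(IF EXISTS on DROP TABLE arrived in 8.2, if I remember rightly.)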

> CREATE TABLE region (id SERIAL, name text NOT NULL UNIQUE, PRIMARY KEY
> (id));
> INSERT INTO region (name) SELECT DISTINCT region from raw_data;
> Select * from region;
[snip repeats for other tables]
> drop table processed_data;
> CREATE TABLE processed_data (wid int4 NOT NULL UNIQUE, region_id int4
> NOT NULL, eco_id int4 NOT NULL, drain_id int4 NOT NULL, PRIMARY KEY (wid));
> INSERT INTO processed_data (wid, region_id, eco_id, drain_id)
> SELECT w.wid, r.id, e.id, d.id
> FROM raw_data as w, region as r, eco as e, drain as d
> WHERE w.region = r.name AND w.drain = d.name AND w.eco = e.name;

Good. You'll want to read up on foreign-keys too. You can define them
when you build the table, or add them after. Something like (not checked):
ALTER TABLE processed_data ADD CONSTRAINT valid_region
FOREIGN KEY (region_id) REFERENCES region (id);
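The in-table form would be something like this (again, not checked):

CREATE TABLE processed_data (
    wid       int4 PRIMARY KEY,
    region_id int4 NOT NULL REFERENCES region (id),
    eco_id    int4 NOT NULL REFERENCES eco (id),
    drain_id  int4 NOT NULL REFERENCES drain (id)
);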

>> 6. Wrap the entire lot above in BEGIN...END so it either all works or
>> it doesn't, then VACUUM FULL the database at key points.
>
> I'm not familiar with these ideas, but I'll look into them in the docs...

Read up on VACUUM, ANALYSE and the autovacuum daemon (agent).
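In outline it's something like this (rough sketch - note that VACUUM can't
run inside a transaction block, so it goes after the COMMIT):

BEGIN;
    -- all of the CREATE TABLE / INSERT INTO ... SELECT statements above
COMMIT;

VACUUM ANALYSE;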

>> If you're comfortable with a bit of VB/Perl/Python/PHP/plpgsql then
>> you can automate that fairly simply. If not, a bit of cut+paste will
>> see you there.
>
> Yes, I think I'll experiment with trying to produce this SQL text with
> PHP. Seems like in a very short time I could have a custom script to
> parse any table given column names and types which I'd like to break
> out. Very nice.

The fun comes when you have to clean up the data - correct spelling
mistakes, remove duplicates etc.
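A couple of queries can help there - list the distinct values to spot
near-duplicate spellings, then correct them in raw_data before building
the lookups. Just a sketch (the misspelling below is made up):

SELECT region, count(*) FROM raw_data GROUP BY region ORDER BY region;

UPDATE raw_data SET region = 'SW Oregon / N Cali Coast'
 WHERE region = 'SW Oregon/N Cali Coast';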

--
Richard Huxton
Archonet Ltd
