Re: Recommended Protocol: Adding Rows to Table

From: David Johnston <polobo(at)yahoo(dot)com>
To: Rich Shepard <rshepard(at)appl-ecosys(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Recommended Protocol: Adding Rows to Table
Date: 2011-11-25 16:19:21
Message-ID: 9FD7A75C-1ABE-4143-9E07-3B872693CDD8@yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Nov 25, 2011, at 11:05, Rich Shepard <rshepard(at)appl-ecosys(dot)com> wrote:

> The data originated in a spreadsheet and, based on my experience, contains
> duplicate records. After reformatting there are 143,260 rows to insert in
> the table. The approach I tried seems to have problems (explained below) and
> I would like to learn the proper way to insert rows in either an empty table
> or one with existing rows since I'll need to do this procedure for my
> projects.
>
> The table was created with the primary key and I used INSERT INTO ... to
> load the data. Many duplicate records, so I split the file into smaller ones
> and re-ran the command to load them. I'd then remove the reported duplicate
> rows from the source (text) file. Between runs, I'd issue the DROP TABLE
> command within psql and check it was empty using 'select count(*) from
> waterchem;'.
>
> It appeared that I removed duplicates from the first couple of smaller
> files so I combined them into one file named ok.sql. But, when I tested the
> combined file it, too, reported many duplicate records. Something wrong
> here.
>
> Perhaps a better approach is to put the CREATE TABLE command above the
> INSERT INTO section of the file (without specifying a PK), load that using
> the command 'psql -d <database> -f waterchem.sql', then add the PK and
> remove duplicates as postgres presents them. Or, perhaps there is a much
> more efficient way to accomplish this task.
>
> Rather than my flailing around and spending a lot of time failing to load
> all non-duplicate rows into the table I'd like to learn the proper way to
> accomplish this task. Suggestions, recommendations, and your experiences are
> requested.
>
> TIA,
>
> Rich
>
>

Simplistically you load all the data into a staging table that has no natural primary key and then write a query that will result in only a single record for whatever you define as a primary key. Insert the results of that query into the final table.

If you only use a subset of columns to uniquely determine the PK than you need to decide how to resolve situations where the PK fields are duplicated but the extra fields are unique. That step is why the first description is simplistic.

David J.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2011-11-25 16:36:10 Re: error when defining a search configuration named "default"
Previous Message Rich Shepard 2011-11-25 16:05:47 Recommended Protocol: Adding Rows to Table