Re: Bulk inserts into two (related) tables

From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: Rich Shepard <rshepard(at)appl-ecosys(dot)com>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Bulk inserts into two (related) tables
Date: 2019-05-21 21:57:47
Message-ID: CAMa1XUhNmzNOzmFrY8N83Jii_H53AzrhT57BMTSiNG-sQSi6=g@mail.gmail.com
Lists: pgsql-general

On Tue, May 21, 2019 at 12:24 PM Rich Shepard <rshepard(at)appl-ecosys(dot)com>
wrote:

> On Tue, 21 May 2019, Francisco Olarte wrote:
>
> > From how you say it, I assume you have some data in your original
> > dumps which can relate both; let's assume it's org_name, but it may be
> > an org-code. If you do not have it, it means you cannot match people
> > to orgs in your data, and all is lost.
>
> Francisco,
>
> Not yet with these new data.
>
> I'll manually insert the org_id numbers from the organizations table into
> the people table.
>

To me, this is the key to your problem, and it's what will make this either
a time saver or a time waster. Somehow you are accounting for what uniquely
identifies organizations, right?

Say there are 5 fields that correspond to an organization. I assume, then,
that you are creating only one new org_id for each unique combination of
those fields?
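
If you want to sanity-check that before assigning ids, a quick distinct
count over the org fields tells you how many org_id values you should end
up with. A minimal sketch, assuming the 3-field loader staging table
created in step 1 below:

-- count distinct org-field combinations; this should match the number of
-- rows you expect to create in organizations (4 with the sample data below)
SELECT count(DISTINCT (org_name, org_stuff_1, org_stuff_2)) AS distinct_orgs
FROM loader;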

Then take Francisco's suggestion, but use an md5 of the organization fields
to create yourself a unique identifier. You can then use ctid (the unique
internal identifier of each row) to join back. (Note that ctid is only
stable as long as the rows aren't updated in the meantime, which is fine
for a one-off load like this.) The SQL looks like this:

SELECT md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
FROM my_table;

Assume in the example below that your unique "org" rows are the first 3 fields:

1. Load your data file into a loading table like so:
CREATE TABLE loader
  (org_name text, org_stuff_1 text, org_stuff_2 text, person_name text);
\copy loader from 'my_data.csv' with csv header
ALTER TABLE loader ADD COLUMN org_id INT;

Example data:
INSERT INTO loader VALUES ('a', ' ', ' ', 'Jerry');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Bob');
INSERT INTO loader VALUES ('a', ' ', 'b', 'Janice');
INSERT INTO loader VALUES ('a', ' ', 'c', 'Chris');
INSERT INTO loader VALUES ('b', ' ', 'c', 'Jason');
INSERT INTO loader VALUES ('a', ' ', ' ', 'Alice');

2. Load org table:
test=# CREATE TABLE organizations
test-#   (org_id serial primary key, org_name text, org_stuff_1 text, org_stuff_2 text);
CREATE TABLE
test=# INSERT INTO organizations (org_name, org_stuff_1, org_stuff_2)
test-# SELECT DISTINCT org_name, org_stuff_1, org_stuff_2
test-# FROM loader;
INSERT 0 4
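
(A side note, and an assumption on my part rather than something from your
schema: if those three fields are what defines an organization, you could
enforce that with a unique index so a future load can't create duplicate
orgs. The index name below is arbitrary.)

-- optional: guarantee one organizations row per field combination
CREATE UNIQUE INDEX organizations_org_fields_key
  ON organizations (org_name, org_stuff_1, org_stuff_2);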

3. Build mapping directly and update:
-- build hash of org fields in loader table, taking ctid in order to map back later
WITH map_source AS (
  SELECT ctid, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
  FROM loader)

-- build hash of org fields in organizations table to join back to loader,
-- bringing in org_id of course
, map_org AS (
  SELECT org_id, md5(row(org_name, org_stuff_1, org_stuff_2)::text) AS hash_identifier
  FROM organizations)

-- map by joining together on hash_identifier
, final_map AS (
  SELECT org_id, ctid
  FROM map_source l
  INNER JOIN map_org o USING (hash_identifier)
)

-- Perform update
UPDATE loader l
SET org_id = fm.org_id
FROM final_map fm
WHERE fm.ctid = l.ctid;
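
After the update, it's worth a quick check that every loader row actually
found its org; any rows left over mean the org fields didn't match exactly
(stray whitespace is a common culprit):

-- should return 0 if every row was mapped
SELECT count(*) FROM loader WHERE org_id IS NULL;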

Final data ready for the person table to be populated:
test=# table organizations;
 org_id | org_name | org_stuff_1 | org_stuff_2
--------+----------+-------------+-------------
      1 | a        |             | b
      2 | a        |             |
      3 | a        |             | c
      4 | b        |             | c
(4 rows)

test=# table loader;
 org_name | org_stuff_1 | org_stuff_2 | person_name | org_id
----------+-------------+-------------+-------------+--------
 a        |             |             | Jerry       |      2
 a        |             | b           | Bob         |      1
 a        |             | b           | Janice      |      1
 a        |             | c           | Chris       |      3
 b        |             | c           | Jason       |      4
 a        |             |             | Alice       |      2
(6 rows)
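
From there, populating the person table is a plain INSERT ... SELECT. A
minimal sketch, assuming a people table with person_name and org_id columns
(your actual table and column names will differ):

-- copy each person along with the org_id mapped above
INSERT INTO people (person_name, org_id)
SELECT person_name, org_id
FROM loader;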

Hope this helps!
Thanks,
Jeremy
