From: Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>
To: Eugene Dzhurinsky <jdevelop(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Import large data set into a table and resolve duplicates?
Date: 2015-02-15 18:19:42
Message-ID: CA+renyVAexDNedKWgpGKEOXO3un_0MGCwFZPNUy5dkbf4kNepQ@mail.gmail.com
Lists: pgsql-general

Hi Eugene,
> Now I need to import the patch into the database, and produce another file as
> - if the passed "series" field exists in the database, then return ID:series
> - otherwise insert a new row to the table and generate new ID and return ID:series
> for each row in the source file.
I think Francisco's approach is good, and I agree that ~200k rows is
hardly anything. My approach is similar but uses CTEs to combine a lot
of Francisco's queries into one. I still have a separate COPY command
though. (It'd be great if you could COPY into a CTE, but I guess
COPYing into a temporary table is pretty close.) Anyway, when I run
this on my machine, the import finishes in a few seconds:
# Makefile
database=dictionary
port=5432
words=/usr/share/dict/american-english
SHELL=/bin/bash

# Starter data set: three copies of the system word list,
# each line prefixed with a series number.
initial.txt:
	for i in {1..3}; do \
	  cat "${words}" | while read line; do \
	    echo $$i "$$line"; \
	  done; \
	done > initial.txt

tables: initial.txt
	sudo su postgres -c 'psql -p ${port} ${database} -f tables.sql < initial.txt'

# Two "patch" files that partly overlap the starter data and each other.
a.txt:
	for i in {1,4}; do \
	  cat "${words}" | while read line; do \
	    echo $$i "$$line"; \
	  done; \
	done > a.txt

b.txt:
	for i in {4,5}; do \
	  cat "${words}" | while read line; do \
	    echo $$i "$$line"; \
	  done; \
	done > b.txt

a: a.txt
	sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < a.txt'

b: b.txt
	sudo su postgres -c 'psql -p ${port} ${database} -f import.sql < b.txt'

clean:
	rm -f initial.txt a.txt b.txt

.PHONY: tables a b clean
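So running "make tables" builds initial.txt and seeds the dictionary from it, and "make a" / "make b" then each run an incremental import (this assumes GNU make, bash, and a local postgres superuser you can reach with sudo).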
# tables.sql
DROP TABLE IF EXISTS dictionary;
CREATE TABLE dictionary (id SERIAL PRIMARY KEY, series VARCHAR NOT NULL);
\copy dictionary (series) from pstdin
CREATE UNIQUE INDEX idx_series ON dictionary (series);
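(Creating the unique index on series after the bulk COPY, rather than before it, generally makes the initial load faster than inserting into an already-indexed table.)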
# import.sql
CREATE TEMPORARY TABLE staging (
  series VARCHAR NOT NULL
);

\copy staging (series) from pstdin

CREATE INDEX idx_staging_series ON staging (series);

WITH already AS (
  -- every incoming series, paired with its existing dictionary id
  -- (id comes back NULL for series we haven't seen before):
  SELECT dictionary.id, staging.series
  FROM staging
  LEFT OUTER JOIN dictionary
    ON dictionary.series = staging.series
),
adding AS (
  -- insert the missing series and capture their freshly generated ids:
  INSERT INTO dictionary (series)
  SELECT series::text
  FROM already
  WHERE id IS NULL
  RETURNING id, series
)
-- hand back id/series pairs for both new and pre-existing rows:
SELECT id, series
FROM adding
UNION
SELECT id, series
FROM already WHERE id IS NOT NULL
;
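If you want those ID:series pairs written straight to a file instead of the terminal, one way (just a sketch; adjust the connection details to taste) is to run psql in quiet, unaligned, tuples-only mode with ':' as the field separator:

    psql -q -A -t -F ':' -p 5432 -f import.sql dictionary < a.txt > a.results.txt

That should leave a.results.txt with one id:series line for each distinct series in the input.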
Good luck!
Paul