From: Brent Wood <Brent(dot)Wood(at)niwa(dot)co(dot)nz>
To: Seb <spluque(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: bulk loading table via join of 2 large staging tables
Date: 2013-12-31 02:23:30
Message-ID: B30242D206AB9543A3406649674DB4199592BC49@welwexmb01.niwa.local
Lists: pgsql-general
This should help...
In each temporary table, convert the time parts to a timestamp, create an index on that column, then join on the timestamp:
ALTER TABLE rmc ADD COLUMN timer timestamp without time zone;
UPDATE rmc SET timer = (utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
  utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp;
CREATE INDEX rmc_timer_idx ON rmc (timer);
ALTER TABLE gyro ADD COLUMN timer timestamp without time zone;
UPDATE gyro SET timer = (utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
  utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp;
CREATE INDEX gyro_timer_idx ON gyro (timer);
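(Editor's note, not in the original mail: after a bulk UPDATE like the above, the planner's statistics for the new column are empty, so it is worth running ANALYZE before the join — a small addition, but it helps the planner choose the index:)

```sql
-- Refresh planner statistics after the bulk UPDATEs,
-- so the join on the new timer columns is costed correctly.
ANALYZE rmc;   -- rmc is the staging table from the original post
ANALYZE gyro;
```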
So something like this should work if you use PostGIS (which I recommend for GPS data):
SELECT DISTINCT ON (project_id, platform_id, supplier_id, timer)
2 AS project_id,
1 AS platform_id,
6 AS supplier_id,
m.timer,
m.latitude,
m.longitude,
ST_SetSRID(ST_MakePoint(m.longitude, m.latitude), 4326) AS location,
m.sog AS speed_over_ground,
m.cog AS course_over_ground,
g.heading
FROM rmc m,
gyro g
WHERE m.timer = g.timer;
One comment: if either table records times at better than 1-second precision (i.e. more than one value per second), you might join on avg() values with a GROUP BY to collapse the output to 1-second resolution.
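That averaging step might look like this (a sketch, not from the original mail; it assumes the timer column added above and sub-second readings in gyro):

```sql
-- Sketch: collapse sub-second gyro readings to one row per second
-- before joining, using the truncated timestamp as the group key.
SELECT date_trunc('second', timer) AS timer,
       avg(heading)                AS heading
FROM gyro
GROUP BY date_trunc('second', timer);
```

Joining rmc against this aggregate (as a subquery or temp table) instead of raw gyro keeps the one-row-per-second join key unique.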
Cheers
Brent Wood
Programme leader: Environmental Information Delivery
NIWA
DDI: +64 (4) 3860529
________________________________________
From: pgsql-general-owner(at)postgresql(dot)org [pgsql-general-owner(at)postgresql(dot)org] on behalf of Seb [spluque(at)gmail(dot)com]
Sent: Tuesday, December 31, 2013 2:53 PM
To: pgsql-general(at)postgresql(dot)org
Subject: [GENERAL] bulk loading table via join of 2 large staging tables
Hi,
I have two large CSV files that need to be merged and loaded into a
single table of a PostgreSQL 9.3 database. I thought I'd do this by
first staging the data in these files in two temporary tables:
---<--------------------cut here---------------start------------------->---
CREATE TEMPORARY TABLE rmc (
utc_year character varying(6),
utc_month character varying(4),
utc_day character varying(4),
utc_hour character varying(4),
utc_minute character varying(4),
utc_second character varying(8),
latitude numeric,
longitude numeric,
sog numeric,
cog numeric);
CREATE TEMPORARY TABLE gyro (
utc_year character varying(6),
utc_month character varying(4),
utc_day character varying(4),
utc_hour character varying(4),
utc_minute character varying(4),
utc_second character varying(8),
heading numeric);
---<--------------------cut here---------------end--------------------->---
And the target table in the database looks like this:
---<--------------------cut here---------------start------------------->---
Table "public.navigation_series"
Column | Type | Modifiers
----------------------+-----------------------------+----------------------------------------------------------------------------------
navigation_record_id | integer | not null default nextval('navigation_series_navigation_record_id_seq'::regclass)
project_id | integer |
platform_id | integer |
supplier_id | integer |
time | timestamp without time zone | not null
longitude | numeric |
latitude | numeric |
speed_over_ground | numeric |
course_over_ground | numeric |
heading | numeric |
Indexes:
"navigation_series_pkey" PRIMARY KEY, btree (navigation_record_id)
"navigation_series_project_id_platform_id_supplier_id_time_key" UNIQUE CONSTRAINT, btree (project_id, platform_id, supplier_id, "time")
Foreign-key constraints:
"navigation_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(project_id) ON UPDATE CASCADE ON DELETE RESTRICT
"navigation_series_platform_id_fkey" FOREIGN KEY (platform_id) REFERENCES platforms(platform_id) ON UPDATE CASCADE ON DELETE RESTRICT
"navigation_series_supplier_id_fkey" FOREIGN KEY (supplier_id) REFERENCES suppliers(supplier_id) ON UPDATE CASCADE ON DELETE RESTRICT
---<--------------------cut here---------------end--------------------->---
Loading the temporary tables was very quick (about 3 min; input files
were 580 MB and 3.5 GB) in psql, using:
\copy gyro FROM 'gyro.csv' CSV
\copy rmc FROM 'rmc.csv' CSV
I then created a temporary view with:
CREATE TEMPORARY VIEW rmc_gyro AS
SELECT DISTINCT ON (project_id, platform_id, supplier_id, "time")
2 AS project_id,
1 AS platform_id,
6 AS supplier_id,
(utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp AS "time",
longitude, latitude,
sog AS speed_over_ground,
cog AS course_over_ground,
heading
FROM rmc
FULL JOIN gyro USING (utc_year, utc_month, utc_day, utc_hour, utc_minute, utc_second)
ORDER BY project_id, platform_id, supplier_id, "time";
But at this point even just selecting a few rows of data from the view
is too slow (I haven't seen the output after many hours).
Given that the process involves a full join, I'm not sure I can do this
in chunks (say breaking down the files into smaller pieces). Any
suggestions would be greatly appreciated.
Cheers,
--
Seb
--
Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general