Re: importing db as text files

From: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>
To: "expect" <expect(at)ihubbell(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: importing db as text files
Date: 2003-08-13 21:59:29
Message-ID: 71E37EF6B7DCC1499CEA0316A2568328DC9819@loki.globexplorer.com
Lists: pgsql-general

I tend to use Perl to preprocess dumps (in our case from Informix); it tends to run much faster than shell scripts, although your mileage may vary. I have to fix missing numeric values (if the column allows nulls, why can't the COPY command accept an empty field, I wonder?) and missing dates. In other cases it massages date formats, spatial data, etc. For example, the crude program below cleans DOQQ metadata:

Greg W.
==================================
firenze% more infxunl2psql
#!/usr/dist/bin/perl -w

# usage: infxunl2psql <input file> <output file> <mode>
$FILE   = $ARGV[0];
$OUTPUT = $ARGV[1];
$MODE   = $ARGV[2];

open (INFILE, "<", $FILE)    or die "cannot open $FILE: $!";
open (OUTFILE, ">", $OUTPUT) or die "cannot open $OUTPUT: $!";

foreach $line (<INFILE>)
{
    # two chops: strip the trailing carriage-return/newline pair
    chop($line);
    chop($line);
    if ($MODE eq "DOQ") {
        ($t_source_filename, $t_quadrangle_name,
         $t_west_longitude, $t_east_longitude,
         $t_north_latitude, $t_south_latitude,
         $t_production_date, $t_raster_order,
         $t_band_organization, $t_band_content,
         $t_bits_per_pixel, $t_samples_and_lines,
         $t_horizontal_datum, $t_horizontal_coordinate_system,
         $t_coordinate_zone, $t_horizontal_units,
         $t_horizontal_resolution, $t_secondary_horizontal_datum,
         $t_xy_origin, $t_secondary_xy_origin,
         $t_nw_quad_corner_xy, $t_ne_quad_corner_xy,
         $t_se_quad_corner_xy, $t_sw_quad_corner_xy,
         $t_secondary_nw_quad_xy, $t_secondary_ne_quad_xy,
         $t_secondary_se_quad_xy, $t_secondary_sw_quad_xy,
         $t_rmse_xy, $t_image_source,
         $t_source_dem_date, $t_agency,
         $t_producer, $t_production_system,
         $t_standard_version, $t_metadata_date,
         $t_data_file_size, $byte_count) = split(/\|/, $line);
        if (length($t_production_date) == 0) {
            $t_production_date = "\\N"; # \N is COPY's null marker for a blank date
        }
        if (length($t_coordinate_zone) == 0) { # an integer
            $t_coordinate_zone = 0;
        }
        if (length($t_band_content) == 0) {
            $t_band_content = 0; # also an int
        }
        if (length($t_bits_per_pixel) == 0) {
            $t_bits_per_pixel = 0; # reasonable default for an int?
        }
        if (length($t_horizontal_resolution) == 0) {
            $t_horizontal_resolution = 0.0;
        }
        if (length($t_secondary_horizontal_datum) == 0) {
            $t_secondary_horizontal_datum = "\'\'";
        }
        if (length($t_rmse_xy) == 0) {
            $t_rmse_xy = 0.0;
        }
        if (length($t_metadata_date) == 0) {
            $t_metadata_date = "\\N";
        }
        if (length($t_data_file_size) == 0) {
            $t_data_file_size = 0; # a big int
        }
        if (length($byte_count) == 0) {
            $byte_count = 0; # reasonable default for an int?
        }
        $out_line = join("|",
            $t_source_filename, $t_quadrangle_name,
            $t_west_longitude, $t_east_longitude,
            $t_north_latitude, $t_south_latitude,
            $t_production_date, $t_raster_order,
            $t_band_organization, $t_band_content,
            $t_bits_per_pixel, $t_samples_and_lines,
            $t_horizontal_datum, $t_horizontal_coordinate_system,
            $t_coordinate_zone, $t_horizontal_units,
            $t_horizontal_resolution, $t_secondary_horizontal_datum,
            $t_xy_origin, $t_secondary_xy_origin,
            $t_nw_quad_corner_xy, $t_ne_quad_corner_xy,
            $t_se_quad_corner_xy, $t_sw_quad_corner_xy,
            $t_secondary_nw_quad_xy, $t_secondary_ne_quad_xy,
            $t_secondary_se_quad_xy, $t_secondary_sw_quad_xy,
            $t_rmse_xy, $t_image_source,
            $t_source_dem_date, $t_agency,
            $t_producer, $t_production_system,
            $t_standard_version, $t_metadata_date,
            $t_data_file_size, $byte_count);
        print OUTFILE "$out_line\n";
    }
    else {
        print OUTFILE "$line\n";
    }
}
close INFILE;
close OUTFILE;
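For the simple case where every empty field may legitimately become a SQL null (the integer defaults in the script above show that is not always so), the same fix can be sketched as a single awk pass over the pipe-delimited data; a rough equivalent, not a drop-in replacement:

```shell
# Replace every empty |-delimited field with \N, COPY's default null marker.
# Assumes all empty columns are nullable; NOT NULL int columns would still
# need a typed default, as in the Perl above.
printf 'file1.tif|quad A||32\n' |
awk -F'|' -v OFS='|' '{
    for (i = 1; i <= NF; i++) if ($i == "") $i = "\\N"
    print
}'
# -> file1.tif|quad A|\N|32
```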

-----Original Message-----
From: Jason Godden [mailto:jasongodden(at)optushome(dot)com(dot)au]
Sent: Wednesday, August 13, 2003 2:35 PM
To: expect; pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] importing db as text files

Hi expect,

The best way, in my opinion, is to use the copy table command. That way Pg will
actually 'massage' the data (string escapes and all) for you.

If you have complex files, it's best to use a combo of sed/awk to pre-massage the
field delimiters and import the data into a temporary table. In most instances
it is best to pipe the data to the psql command using copy table from stdin,
since copy from a server-side file requires superuser rights.

Always use a temporary import table and perform validation/further conversion
in that table (IMO).
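That staging step might look something like this (table and column names are invented for illustration, and the regex check is just one possible validation):

```sql
-- Load everything as text first so no row is rejected on a type error:
CREATE TEMP TABLE import_raw (name text, amount text);
-- COPY import_raw FROM stdin DELIMITER '|';

-- Then promote only the rows that validate, casting as you go:
INSERT INTO accounts (name, amount)
SELECT name, amount::numeric
FROM import_raw
WHERE amount ~ '^[0-9]+(\\.[0-9]+)?$';
```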

eg (from one of my earlier posts)

cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g"
| psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"

The first sed replaces all instances of "," with a tab character
(I get a tab in my shell scripts with a line like TAB=$(printf "\t")).

The second sed removes the leading and trailing " because the first only
matches ",".

The third sed removes the Windows carriage returns (if any - depends on
the platform the file was generated from).

Note here that my file format never contains any " in a field so I can safely
run the second sed. If your import file contains these then you will need to
change the regex or use awk.
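If the fields are quoted precisely because they can contain commas, one awk variant is to split on the three-character sequence "," itself and then strip the outer quotes; the same idea as the seds above in a single pass (this still assumes no literal " inside a field, and that carriage returns have already been stripped):

```shell
# Split on the "," separator so plain commas inside fields survive,
# then strip the opening and closing quotes and emit pipe-delimited output.
printf '%s\n' '"Smith, John","42","D'\''Adario"' |
awk -F'","' -v OFS='|' '{
    gsub(/^"/, "", $1)    # strip the opening quote of the line
    gsub(/"$/, "", $NF)   # strip the closing quote of the line
    $1 = $1               # force $0 to be rebuilt with OFS
    print
}'
# -> Smith, John|42|D'Adario
```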

Regards,

Jason

On Thu, 14 Aug 2003 07:14 am, expect wrote:
> What's the big deal with importing text files? I have a 70 MB file to
> import and it's been one problem after another. I used the copy command
> and it appears that it's just not possible. I finally massaged the file
> into a .sql file and ran that using \i db.sql but that failed too because I
> overlooked ' in names like D'Adario. The other problem I encountered was
> that a numeric field had to have data in it, pg would not default to the
> default value. So instead of massaging all the data again I decided to
> change the data type for that column. This is my first experience with
> postgresql and I'm wondering if I should expect to encounter similar pain
> as I go further into this? So far it's been very painful trying to do what
> I thought would be easy and what I think should be easy.
>
> PostgreSQL 7.3.4 on linux redhat 9
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
