Re: Better way to bulk-load millions of CSV records into

From: Marc Spitzer <marc(at)oscar(dot)eng(dot)cv(dot)net>
To: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Better way to bulk-load millions of CSV records into
Date: 2002-05-22 22:07:43
Message-ID: 20020522180743.A38860@oscar.eng.cv.net
Lists: pgsql-novice

On Wed, May 22, 2002 at 01:51:45PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 13:11, Marc Spitzer wrote:
> > On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> > > On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > > > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> [snip]
> > for i in load_data/* ;do
> > echo "datafile $i"
> > awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"; $0=$0} print $0}' $i >$i.tmp
> > mv $i.tmp $i
> > grep -E "[0-9]+([.][0-9]+)+" $i
> > grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
> > mv $i.tmp $i
> > echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
> > done
> [snip]
>
> I'm not an awk programmer. What does that command do?
>

it splits each line on commas ( -F, ) and prints the output with
a field separator of comma (OFS=",") then for each row it checks
the 15th field for junk (anything with a '.' in it) and substitutes
-1 for the junk.
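Here is a small worked example of that awk step on made-up sample
data (the file name and field values are just for illustration):

```shell
# Build a one-line sample file where field 15 holds a stray decimal value
printf '%s\n' 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,3.14,p' > sample.csv

# Same idea as the loop above: if field 15 contains a '.',
# replace it with -1, then rewrite the file in place
awk -F, 'BEGIN{OFS=","}{if ($15 ~ /[.]/) $15 = "-1"; print}' sample.csv > sample.csv.tmp
mv sample.csv.tmp sample.csv

cat sample.csv
```

Field 15 comes out as -1 and every other field passes through
untouched.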

> Also, all my fields have double-quotes around them. Is there
> a tool (or really clever use of sed) that will strip them

tr will take care of that, if you just want to delete them:

tr -d "\"" <file >newfile

should do it; my syntax might be a little off, check the man page.
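To check on a throwaway file first (the file names here are just
examples):

```shell
# A quoted sample line
printf '%s\n' '"one","two","three"' > quoted.csv

# Delete every double-quote character in the stream
tr -d '"' < quoted.csv > unquoted.csv

cat unquoted.csv
```

Note this deletes every double quote, including any that appear
inside field data, so only use it if the quotes are pure decoration.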

Another question is: do you need the quotes around any of the
fields? Is it possible that you have a field separator (a comma,
for example) embedded in any of the fields? If the answer is
yes, or perhaps, then it gets harder to deal with.

a quick check can be done using this script
(it assumes the field separator is a comma):
awk -F, '{print NF}' <data_file |sort |uniq -c

this will tell you how many lines have a given
number of fields. if things are good, they should
all have the same number, and it should agree with
what you think it is. NF is defined in the awk
man page.
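For instance, on a made-up three-line file where one row is short:

```shell
# Two good 3-field rows and one bad 2-field row
printf '%s\n' 'a,b,c' 'd,e,f' 'g,h' > check.csv

# Count how many lines have each field count
awk -F, '{print NF}' check.csv | sort | uniq -c
```

That prints two distinct counts (two lines with 3 fields, one with
2), which flags the short row; a clean file shows a single line of
output.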

> away from the fields that don't need them? I actually have
> _comma_ delimited files, and any fields with commas in them
> need the double quotes...
>

In that case there is Perl with DBI + DBD::CSV, which handles all
the special cases for you. I think there is an awk library that
does this too, if you want to look for it.

Or, if possible, you could use a different character as a field
separator; semicolon, colon, and pipe come to mind.

Good luck

marc

> --
> +---------------------------------------------------------+
> | Ron Johnson, Jr. Home: ron(dot)l(dot)johnson(at)cox(dot)net |
> | Jefferson, LA USA http://ronandheather.dhs.org:81 |
> | |
> | "I have created a government of whirled peas..." |
> | Maharishi Mahesh Yogi, 12-May-2002, |
> ! CNN, Larry King Live |
> +---------------------------------------------------------+
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
