
Re: COPY-able csv log outputs

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: FAST PostgreSQL <fastpgs(at)fast(dot)fujitsu(dot)com(dot)au>, pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY-able csv log outputs
Date: 2007-05-21 03:22:53
Message-ID: 4651108D.9070806@dunslane.net
Lists: pgsql-patches

Greg Smith wrote:
> I got a chance to review this patch over the weekend.  Basic API seems 
> good, met all my requirements, no surprises with how the GUC variable 
> controlled the feature.
>
> The most fundamental issue I have with the interface is that using 
> COPY makes it difficult to put any unique index on the resulting 
> table.  I like to have a unique index on my imported log table because 
> it rejects the dupe records if you accidentally import the same 
> section of log file twice.  COPY tosses the whole thing if there's an 
> index violation, which is a problem during a regular import because 
> you will occasionally come across lines with the same timestamp that 
> are identical in every way except for their statement; putting an index 
> on timestamp+statement seems impractical.
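
The failure mode Greg describes can be reproduced with a sketch like this (the table and column names are illustrative, not the patch's actual schema):

```sql
-- Hypothetical import table with a unique index to reject duplicates:
CREATE TABLE pglog (
    log_time  timestamptz,
    statement text
);
CREATE UNIQUE INDEX pglog_uniq ON pglog (log_time, statement);

-- If any line in the file collides with an existing row, the entire
-- COPY is rolled back and no rows from the file are loaded:
COPY pglog FROM '/opt/pgsql/testlog.csv' WITH CSV;
```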

Does the format not include the per-process line number? (I know I 
briefly looked at this patch previously, but I forget the details.) One 
reason I originally included line numbers in log_line_prefix was to 
handle this sort of problem.
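
For reference, the per-session line number Andrew mentions is the %l escape; a postgresql.conf sketch (the exact prefix string is just an example):

```
log_line_prefix = '%m %p %l '   # timestamp, PID, per-session log line number
```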

>
> I've had a preference for INSERT from the beginning here that this 
> reinforces.

COPY is our standard bulk insert mechanism. I think arguing against it 
would be a very hard sell.

> I'm planning to just work around this issue by doing the COPY into a 
> temporary table and then INSERTing from there.  I didn't want to just 
> let the concern pass by without mentioning it though.  It crosses my 
> mind that inserting some sort of unique log file line ID number would 
> prevent the dupe issue and make for better ordering (it's possible to 
> have two lines with the same timestamp show up in the wrong order 
> now), not sure that's a practical idea to consider.
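
That staging workaround might look like this, assuming a destination table pglog with a unique index on (log_time, statement); all names here are illustrative:

```sql
-- Stage the file in a temp table with no constraints:
CREATE TEMP TABLE pglog_stage (LIKE pglog);
COPY pglog_stage FROM '/opt/pgsql/testlog.csv' WITH CSV;

-- Move over only the rows not already present, so re-importing the
-- same section of log file is harmless:
INSERT INTO pglog
SELECT s.*
FROM pglog_stage s
WHERE NOT EXISTS (
    SELECT 1 FROM pglog p
    WHERE p.log_time = s.log_time
      AND p.statement = s.statement
);
```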

I guess that answers my question. We should definitely provide a unique 
line key.
>
> The basic coding of the patch seemed OK to me, but someone who is much 
> more familiar than myself with the mechanics of pipes should take a 
> look at that part of the patch before committing; it's complicated 
> code and I can't comment on it.  There are some small formatting 
> issues that need to be fixed, particularly in the host+port mapping.  
> I can fix those myself and submit a slightly updated patch.  There's 
> some documentation improvements I want to make before this goes in as 
> well.
>
> The patch is actually broken fairly hard right now because of the 
> switch from INSERT to COPY FROM CSV as the output format at the last 
> minute.  It outputs missing fields as NULL (fine for INSERT), which 
> chokes the CSV import when the session_start timestamp is missing.  
> All of those NULL values need to be replaced with nothing for proper 
> CSV syntax; there should just be the comma for the next field.  I 
> worked around this with
>
> copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';
>
>

I missed that before - yes it should be fixed.
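
To illustrate the difference (field values here are made up): a missing session_start should produce an empty field, not the literal string NULL:

```
2007-05-21 03:22:53 EST,NULL,"select 1"   current output: NULL chokes a timestamp column
2007-05-21 03:22:53 EST,,"select 1"       proper CSV: empty field for a missing value
```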

cheers

andrew

