Re: COPY-able csv log outputs

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: FAST PostgreSQL <fastpgs(at)fast(dot)fujitsu(dot)com(dot)au>, pgsql-patches(at)postgresql(dot)org
Subject: Re: COPY-able csv log outputs
Date: 2007-05-21 03:22:53
Message-ID: 4651108D.9070806@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

Greg Smith wrote:
> I got a chance to review this patch over the weekend. Basic API seems
> good, met all my requirements, no surprises with how the GUC variable
> controlled the feature.
>
> The most fundamental issue I have with the interface is that using
> COPY makes it difficult to put any unique index on the resulting
> table. I like to have a unique index on my imported log table because
> it rejects the dupe records if you accidentally import the same
> section of log file twice. COPY tosses the whole thing if there's an
> index violation, which is a problem during a regular import because
> you will occasionally come across lines with the same timestamp that
> are similar in every way except for their statment; putting an index
> on the timestamp+statement seems impractical.

Does the format not include the per-process line number? (I know i
briefly looked at this patch previously, but I forget the details.) One
reason I originally included line numbers in log_line-prefix was to
handle this sort of problem.

>
> I've had a preference for INSERT from the beginning here that this
> reinforces.

COPY is our standard bulk insert mechanism. I think arguing against it
would be a very hard sell.

> I'm planning to just work around this issue by doing the COPY into a
> temporary table and then INSERTing from there. I didn't want to just
> let the concern pass by without mentioning it though. It crosses my
> mind that inserting some sort of unique log file line ID number would
> prevent the dupe issue and make for better ordering (it's possible to
> have two lines with the same timestamp show up in the wrong order
> now), not sure that's a practical idea to consider.

I guess that answers my question. We should definitely provide a unique
line key.
>
> The basic coding of the patch seemed OK to me, but someone who is much
> more familiar than myself with the mechanics of pipes should take a
> look at that part of the patch before committing; it's complicated
> code and I can't comment on it. There are some small formatting
> issues that need to be fixed, particularly in the host+port mapping.
> I can fix those myself and submit a slightly updated patch. There's
> some documentation improvements I want to make before this goes in as
> well.
>
> The patch is actually broken fairly hard right now because of the
> switch from INSERT to COPY FROM CSV as the output format at the last
> minute. It outputs missing fields as NULL (fine for INSERT) that
> chokes the CSV import when the session_start timestamp is missing.
> All of those NULL values need to be just replaced with nothing for
> proper CSV syntax; there should just the comma for the next field. I
> worked around this with
>
> copy pglog from '/opt/pgsql/testlog.csv' with CSV null as 'NULL';
>
>

I missed that before - yes it should be fixed.

cheers

andrew

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Tom Lane 2007-05-21 03:47:50 Re: COPY-able csv log outputs
Previous Message Greg Smith 2007-05-21 01:24:09 Re: COPY-able csv log outputs