Re: NOLOGGING option, or ?

From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-02 15:05:50
Message-ID: 20050602150550.GA17081@gp.word-to-the-wise.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 02, 2005 at 07:33:13AM -0700, Luke Lonergan wrote:
> Oliver,
>
> > Haven't you just replaced one preprocessing step with another, then?
>
> Generally not. The most common problem with the current choice of escape
> character is that there are *lots* of data load scenarios with backslash in
> the text strings.

I can only think of one where it's common. Windows filenames. But if
you're going to support arbitrary data in a load then whatever escape
character you choose will appear sometimes.

> The extra preprocessing to escape them is unnecessary on
> other databases and, in effect, causes the load to be even slower because
> you have to prepare the data ahead of time.

> Also, note that this patch can also do escape processing and the net result
> will still be 5+ times faster than what is there.

I strongly suspect that a patch to improve performance without changing
behaviour would be accepted with no questions asked.

One that allowed specifying the field and record delimiters and the
escape character and null symbol might require more discussion about
an appropriate syntax at the very least. So you may want to separate
the two.

> In the data warehousing industry, data conversion and manipulation is
> normally kept distinct from data loading. Conversion is done by tools
> called ETL (Extract Transform Load) and the database will have a very fast
> path for direct loading of the resulting data. PostgreSQL is definitely a
> strange database right now in that there is a default filter applied to the
> data on load.

> It's even more strange because the load path is so slow, and now that we've
> found that the slowness is there mostly because of non-optimized parsing and
> attribute conversion routines. The question of how to do escape processing
> is a separate one, but is wrapped up in the question of whether to introduce
> a new loading routine or whether to optimize the old one.

There are already two loader routines. One of them is text-based and is
designed for easy generation of data load format using simple text
manipulation tools by using delimiters. It also allows (unlike your
suggestion) for loading of arbitrary data from a text file.

Because it allows for arbitrary data and uses delimiters to separate
fields it has to use an escaping mechanism.

If you want to be able to load arbitrary data and not have to handle
escape characters there's are two obvious ways to do it.

The first is that used by MIME and suggested by you. That is to use a
separator that you believe will not appear in the data. That can be
done by using a long multicharacter separator containing random
characters and assuming that sequence won't appear, it can be done by
parsing the input data twice, looking for strings that don't appear
for use as delimiters or it can take advantage of knowledge about
what characters can and can not appear in the input data. (I can't
imagine any case involving data-mining of web logs where the last
is likely to be relevant).

The other is to use length+data format for each tuple, avoiding all
issues of escapes in the data and allowing arbitrary data to be
represented. That's how the binary load format PG supports works, I
believe. If you're really concerned about speed of load that may be a
better format for your front-end to generate, perhaps?

Cheers,
Steve

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vishal Kashyap @ [SaiHertz] 2005-06-02 15:19:18 Re: Google's Summer of Code ...
Previous Message Jonah H. Harris 2005-06-02 14:54:57 Re: Google's Summer of Code ...