Re: SUBSTRING performance for large BYTEA

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: SUBSTRING performance for large BYTEA
Date: 2007-08-18 20:32:33
Message-ID: 87mywoshwe.fsf@oxford.xeocode.com
Lists: pgsql-general


"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:

> Well this is a guess, but:
>
> Set existing column to storage external
> update existing column with existing data:
>
> UPDATE foo SET bar = bar;

Well, not quite. That would actually reuse the toast pointer without
decompressing it. We try to be clever about not decompressing and duplicating
toast pointers unnecessarily on updates -- in this case too clever.
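
A quick way to see that, reusing the foo/bar names from above, is to compare
pg_column_size(), which reports the size of the datum as it is stored, with
length(), which reports the full byte length; a still-compressed value shows a
stored size smaller than its length even after the no-op update:

  SELECT length(bar) AS data_bytes, pg_column_size(bar) AS stored_bytes
    FROM foo;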

You could do this:

postgres=# ALTER TABLE foo ALTER bar TYPE bytea, ALTER bar SET STORAGE external;
ALTER TABLE

(Note that you have to include the 'ALTER bar SET STORAGE external' in the
same command, or the storage setting will get reset to bytea's default of
'extended' even if it was previously set to 'external'.)
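
You can verify which storage setting actually took effect by checking
attstorage in pg_attribute ('x' is extended, 'e' is external), for example:

  SELECT attname, attstorage
    FROM pg_attribute
   WHERE attrelid = 'foo'::regclass
     AND attname = 'bar';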

When I tested this, though, I noticed it did *not* decompress compressed data
that was small enough to be stored internally. That may actually be desirable
in your case, since anything small enough to be stored internally is probably
not worth decompressing just so it can be streamed out. It also won't be
compressed again the next time you update it, so the leftover compression
doesn't really matter in the long term.
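
To get an idea how many rows are still compressed after such an ALTER, you can
use the same size comparison as the UPDATE further down, something like:

  SELECT count(*)
    FROM foo
   WHERE length(bar) > pg_column_size(bar);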

If you want to decompress everything you have to do something like:

postgres=# ALTER TABLE foo ALTER bar TYPE bytea USING bar||'', ALTER bar SET STORAGE external;
ALTER TABLE

However note that this will require extra memory for both the decompressed
original value and the new value after "appending" the empty string.

Another option would be to update only the records that actually need to be
decompressed, with something like:

UPDATE foo SET bar=bar||'' WHERE length(bar) > pg_column_size(bar)

This at least gives you the option of doing them in small groups or even one
by one. I would suggest vacuuming between each update.
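
For example, if foo had an integer primary key (call it "id", purely for
illustration) you could walk the table in ranges, vacuuming as you go:

  UPDATE foo SET bar = bar || ''
   WHERE length(bar) > pg_column_size(bar)
     AND id BETWEEN 1 AND 1000;
  VACUUM foo;
  -- then repeat for the next id range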

I do have to wonder how you're getting the data *in* though. If it's large
enough to have to stream out like this then how do you initially load the
data?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
