Re: allocation limit for encoding conversion

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: allocation limit for encoding conversion
Date: 2019-09-24 20:19:41
Message-ID: 974.1569356381@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2019-08-16 17:31:49 -0400, Tom Lane wrote:
>> I fear that allowing pg_do_encoding_conversion to return strings longer
>> than 1GB is just going to create failure cases somewhere else.
>>
>> However, it's certainly true that 4x growth is a pretty unlikely worst
>> case. Maybe we could do something like
>> 1. If string is short (say up to a few megabytes), continue to do it
>> like now. This avoids adding overhead for typical cases.
>> 2. Otherwise, run some lobotomized form of encoding conversion that
>> just computes the space required (as an int64, I guess) without saving
>> the result anywhere.
>> 3. If space required > 1GB, fail.
>> 4. Otherwise, allocate just the space required, and convert.

> It's probably too big a hammer for this specific case, but I think at
> some point we ought to stop using fixed size allocations for this kind
> of work. Instead we should use something roughly like our StringInfo,
> except that when exceeding the current size limit, the overflowing data
> is stored in a separate allocation. And only once we actually need the
> data in a consecutive form, we allocate memory that's large enough to
> store the all the separate allocations in their entirety.

That sounds pretty messy :-(.

I spent some time looking at what I proposed above, and concluded that
it's probably impractical. In the first place, we'd have to change
the API spec for encoding conversion functions. Now maybe that would
not be a huge deal, because there likely aren't very many people outside
the core code who are defining their own conversion functions, but it's
still a negative. More importantly, unless we wanted to duplicate
large swaths of code, we'd end up inserting changes about like this
into the inner loops of encoding conversions:

- *dest++ = code;
+ if (dest)
+ *dest++ = code;
+ outcount++;

which seems like it'd be bad for performance.

So I now think that Alvaro's got basically the right idea, except
that I'm still afraid to allow strings larger than MaxAllocSize
to run around loose in the backend. So in addition to the change
he suggested, we need a final check on strlen(result) not being
too large. We can avoid doing a useless strlen() if the input len
is small, though.

It then occurred to me that we could also repalloc the output buffer
down to just the required size, which is pointless if it's small
but not if we can give back several hundred MB. This is conveniently
mergeable with the check to see whether we need to check strlen or not.

... or at least, that's what I thought we could do. Testing showed
me that AllocSetRealloc never actually gives back any space, even
when it's just acting as a frontend for a direct malloc. However,
we can fix that for little more than the price of swapping the order
of the is-it-a-decrease and is-it-a-large-chunk stanzas, as in the
0002 patch below.

I also put back the missing overflow check --- although that's unreachable
in a 64-bit machine, it's not at all in 32-bit. The patch is still
useful in 32-bit though, since it still doubles the size of string
we can cope with.

I think this is committable, though surely another pair of eyeballs
on it wouldn't hurt. Also, is it worth having a different error
message for the case where the output does exceed MaxAllocSize?

regards, tom lane

Attachment Content-Type Size
v2-0001-cope-with-large-encoding-conversions.patch text/x-diff 3.7 KB
v2-0002-actually-recover-space-in-repalloc.patch text/x-diff 5.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2019-09-24 20:45:10 Re: abort-time portal cleanup
Previous Message Alvaro Herrera 2019-09-24 20:12:14 Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)