Re: DBT-3 with SF=20 got failed

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: DBT-3 with SF=20 got failed
Date: 2015-09-09 15:54:03
Message-ID: 55F0561B.3000503@2ndquadrant.com
Lists: pgsql-hackers

On 09/09/2015 03:55 PM, Robert Haas wrote:
> On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Also, I'm not sure what other places do you have in mind (could you list
>> some examples?) but I'd bet we limit the allocation to 1GB because of the
>> palloc() limit and not because of fear of over-estimates.
>
> I don't really think those two things are different from each other.
> The palloc() limit is a means of enforcing a general policy of
> limiting all allocations to 1GB except in places where we've made a
> very conscious decision to allow a specific exception. This limit
> happens to dovetail nicely with the varlena size limit, so in many
> cases it is the exactly correct limit just for that reason. But even
> when, as here, that's not at issue, it's still a useful limit, because
> there are many ways that some garbage value can get passed to palloc
> -- bad planner estimates, corrupted tuples, bugs in other parts of our
> code. And at least on my old MacBook Pro (I haven't tested the
> current one), passing a sufficiently-large value to malloc() causes a
> kernel panic. That's probably a particularly bad bug, but there are
> lots of systems where "accidentally" allocating an unreasonable amount
> of space will have all kinds of unpleasant consequences. So, I
> believe that palloc()'s limit improves the overall stability of the
> system considerably even if it causes some occasional annoyance.

I'm not really buying this. The 1GB limit has nothing to do with
platform limits; it's there exactly to make allocations varlena-like
(varlena has exactly the same limit), and because it allows using a
32-bit int to track all the bytes. Neither of these is relevant here.
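
For reference, the limit is spelled out in src/include/utils/memutils.h
roughly like this (paraphrased from memory, with a Size typedef added so
it stands alone - take it as a sketch rather than a verbatim copy):

    #include <stddef.h>

    typedef size_t Size;            /* PostgreSQL's Size is a size_t */

    /* Ordinary palloc() requests must stay below 1GB ... */
    #define MaxAllocSize        ((Size) 0x3fffffff)     /* 1 gigabyte - 1 */
    #define AllocSizeIsValid(size)  ((Size) (size) <= MaxAllocSize)

    /* ... while the "huge" variants only reject absurd requests. */
    #define MaxAllocHugeSize    ((Size) -1 >> 1)        /* SIZE_MAX / 2 */
    #define AllocHugeSizeIsValid(size)  ((Size) (size) <= MaxAllocHugeSize)

0x3fffffff is exactly the varlena payload limit and still fits in a
signed 32-bit int, which is the whole point of the cap.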

It has nothing to do with malloc() limits on various platforms, and if
there really are such limits that we think we should worry about, we
should address those properly, not by band-aiding all the various
places independently.

And most importantly, these platform limits would apply to both the
initial allocation and to the subsequent resize. It's utterly useless to
just "fix" the initial allocation and then allow failure when we try to
resize the hash table.
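
To make that concrete: the bucket array gets (re)allocated in two
places in src/backend/executor/nodeHash.c, and the fix has to cover
both. A simplified sketch of what that means (not the exact code; the
variable names and the repalloc_huge() call in the resize path are my
paraphrase of what the fix boils down to):

    /* ExecHashTableCreate(): initial bucket array, sized from the estimate */
    hashtable->buckets = (HashJoinTuple *)
        palloc(nbuckets * sizeof(HashJoinTuple));

    /*
     * ExecHashIncreaseNumBuckets(): resize, sized from the actual number
     * of tuples. This is the call that trips over MaxAllocSize once the
     * array legitimately grows past 1GB, hence the huge variant.
     */
    hashtable->buckets = (HashJoinTuple *)
        repalloc_huge(hashtable->buckets,
                      hashtable->nbuckets * sizeof(HashJoinTuple));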

> Most of the time, you can just palloc() and not worry too much about
> whether you're going to blow up the machine: you won't, because you
> aren't going to allocate more than 1GB. Any place that wants to
> allocate more than that needs to be someplace where we can be pretty
> sure that we're not going to accidentally allocate some completely
> unreasonable amount of memory, like say 1TB. Nothing in this
> discussion convinces me that this is such a place. Note that

We're not going to allocate a completely unreasonable amount of memory,
because there already are some checks in place.

Firstly, you can't really get a bucket array larger than ~25% of
work_mem, because a bucket pointer is only 8B while a hash-join tuple
takes a 16B header plus the data (IIRC). For wider tuples the share
taken by the buckets decreases further.

Secondly, we limit the number of buckets to INT_MAX, which caps the
bucket array at about 16GB (because buckets are just 8B pointers). No
matter how awful an estimate you get (or how insanely high you set
work_mem), you can't exceed this.
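
If you want to sanity-check those two numbers, here's a throwaway
back-of-the-envelope snippet (standalone, nothing to do with the actual
executor code; the 1GB work_mem is just an example value):

    #include <stdio.h>
    #include <limits.h>
    #include <stdint.h>

    int main(void)
    {
        /* Cap 1: the bucket array gets at most ~25% of work_mem
         * (using work_mem = 1GB as an example). */
        uint64_t work_mem_bytes = 1024ULL * 1024 * 1024;
        uint64_t cap_from_work_mem = work_mem_bytes / 4;

        /* Cap 2: at most INT_MAX buckets, each an 8B pointer (64-bit). */
        uint64_t cap_from_nbuckets = (uint64_t) INT_MAX * sizeof(void *);

        printf("25%% of 1GB work_mem: %llu MB\n",
               (unsigned long long) (cap_from_work_mem >> 20));
        printf("INT_MAX 8B buckets:  %.1f GB\n",
               (double) cap_from_nbuckets / (1024.0 * 1024 * 1024));
        return 0;
    }

Whichever of the two caps is lower wins - with sane work_mem settings
it's the 25% one, and even with an absurd work_mem you still hit the
16GB ceiling.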

> tuplesort.c and tuplestore.c, the only existing callers of
> repalloc_huge, only allocate such large amounts of memory when they
> actually have enough tuples to justify it - it is always based on the
> actual number of tuples, never an estimate. I think that would be a
> sound principle here, too. Resizing the hash table to such a large
> size based on the actual load factor is very reasonable; starting with
> such a large size seems less so. Admittedly, 512MB is an arbitrary
> point: and if it so happened that the limit was 256MB or 1GB or 128MB
> or even 2GB I wouldn't advocate for changing it just for fun. But
> you're saying we should just remove that limit altogether, and I think
> that's clearly unreasonable. Do you really want to start out with a
> TB or even PB-sized hash table when the actual number of tuples is,
> say, one? That may sound crazy, but I've seen enough bad query plans
> to know that, yes, we are sometimes off by nine orders of magnitude.
> This is not a hypothetical problem.

No, I'm not saying anything like that - I actually explicitly stated
that I'm not against such a change (further restricting the initial
hash table size), if someone takes the time to do a bit of testing and
provide some numbers.

Moreover, as I explained, there already are limits in place (25% of
work_mem or 16GB, whichever is lower), so I don't really see the bugfix
as unreasonable.

Maybe if we decide to lift this restriction (using int64 to address
the buckets, which removes the 16GB limit), this issue will get much
more pressing. But I guess hash tables handling 2B buckets will be
enough for the near future.

>>> More importantly, removing the cap on the allocation size makes the
>>> problem a lot worse. You might be sad if a bad planner estimate
>>> causes the planner to allocate 1GB when 64MB would have been enough,
>>> but on modern systems it is not likely to be an enormous problem. If
>>> a similar mis-estimation causes the planner to allocate 16GB rather
>>> than 1GB, the opportunity for you to be sad is magnified pretty
>>> considerably. Therefore, I don't really see the over-estimation bug
>>> fix as being separate from this one.
>>
>> Perhaps. But if you want to absolutely prevent such sadness then maybe you
>> should not set work_mem that high?
>
> I think that's a red herring for a number of reasons. One, the
> allocation for the hash buckets is only a small portion of the total
> memory. Two, the fact that you are OK with the hash table growing to
> a certain size does not mean that you want it to start out that big on
> the strength of a frequently-flawed planner estimate.

True, although I don't think the herring is entirely red. It might just
as well be a mackerel ;-)

>> Anyway, I do see this as a rather orthogonal problem - an independent
>> improvement, mostly unrelated to the bugfix. Even if we decide to redesign
>> it like this (and I'm not particularly opposed to that, assuming someone
>> takes the time to measure how expensive the additional resize actually is),
>> we'll still have to fix the repalloc().
>>
>> So I still fail to see why we shouldn't apply this fix.
>
> In all seriousness, that is fine. I respect your opinion; I'm just
> telling you mine, which happens to be different.

Likewise.

Let me repeat my previous proposal:

1) Let's apply the proposed bugfix (and also backpatch it), because
the current code *is* broken.

2) Do a bunch of experiments with limiting the initial hash size,
decide whether the impact on well-estimated cases is acceptable.

I'm strongly opposed to just limiting the initial size without actually
measuring how expensive the resize is, as that simply adds cost to the
well-estimated cases (which may easily be the vast majority of all the
plans, as we tend to notice just the poorly estimated ones).

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
