Re: DBT-3 with SF=20 got failed

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DBT-3 with SF=20 got failed
Date: 2015-06-11 23:29:21
Message-ID: 557A19D1.9050107@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 06/11/15 16:28, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
>> The attached patch replaces this palloc0() by MemoryContextAllocHuge() + memset().
>> Indeed, this hash table is constructed towards the relation with nrows=119994544,
>> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> You forgot to attach the patch, I think. It looks to me like the size
> of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
> That's a lot of buckets, but maybe not unreasonably many if you've got
> enough memory.

Actually, HashJoinTuple is just a pointer, so it's 8 bytes, so 1GB is
enough for 134217728 buckets, which is more than the 119994544 rows
from the plan.
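
Just to spell out the arithmetic:

    1073741824 B / 8 B per pointer = 134217728 bucket slots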

Also, looking at the error message again:

ERROR: invalid memory alloc request size 1073741824

but looking at it more closely this time, I noticed that the amount
reported is
exactly 1GB. The backtrace also shows the error happens right inside
ExecHashTableCreate (and not in the resize which may happen later),
which means it gets the nbuckets from ExecChooseHashTableSize directly.
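
IIRC the allocation that trips the limit is (roughly) this one, in
ExecHashTableCreate:

    hashtable->buckets = (HashJoinTuple *)
        palloc0(nbuckets * sizeof(HashJoinTuple));

and palloc0 rejects any request larger than MaxAllocSize, which is just
under 1GB.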

The resize is probably still broken as I mentioned before, but this
crashes before reaching that code, because the initial estimates are
already high enough to trigger the issue. ExecChooseHashTableSize is
supposed to keep all the checks from previous versions, though, and I
think it actually does.

But I don't see any check there regarding the 1GB boundary. What I see
is this:

max_pointers = (work_mem * 1024L) / sizeof(void *);
max_pointers = Min(max_pointers, INT_MAX / 2);

...

dbuckets = Min(dbuckets, max_pointers);

That has nothing to do with the 1GB limit, and it's been in the code
since the days when work_mem was capped at 2GB, so perhaps the
reasoning was that this is sufficient (because the tuples stored in the
hash table will need more than half of the memory, or something like
that).

But today this issue is more likely, because people have more RAM and
use higher work_mem values, so max_pointers gets much higher. In the
extreme it may reach INT_MAX/2, i.e. ~1 billion pointers, so the bucket
array alone would be ~8GB on 64-bit machines (on 32-bit machines we'd
get twice the number of pointers for the same work_mem, but that's
mostly irrelevant because of the memory size limits there).

It's also true that the hash-join improvements in 9.5 - namely the
decrease of NTUP_PER_BUCKET from 10 to 1 - made this error more likely.
With 9.4 we'd use only 16777216 buckets (128MB), because that gets us
below 10 tuples per bucket. But now we're shooting for 1 tuple per
bucket, so we end up with 134217728 buckets, and that's exactly 1GB.
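
To make the comparison explicit (nbuckets is rounded up to the next
power of two, and each bucket is an 8-byte pointer):

    9.4:  119994544 / 10 -> 16777216 buckets  (2^24) * 8 B = 128MB
    9.5:  119994544 / 1  -> 134217728 buckets (2^27) * 8 B = 1GB

and the 9.5 value matches the alloc request size from the error message.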

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
necessary) - see the sketch below

(2) make it work with hash tables over 1GB
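
For (1), a minimal sketch (untested) would be to also clamp max_pointers
in ExecChooseHashTableSize so that the bucket array can never exceed
MaxAllocSize, something like:

    max_pointers = (work_mem * 1024L) / sizeof(void *);
    /* never request a bucket array larger than MaxAllocSize (~1GB) */
    max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void *));
    max_pointers = Min(max_pointers, INT_MAX / 2);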

I'm in favor of (2) if there's a good way to do that. It seems a bit
stupid not to be able to use a fast hash table just because of some
artificial allocation limit. Are there any fundamental reasons not to
use the MemoryContextAllocHuge fix proposed by KaiGai-san?
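
For (2), my understanding is that KaiGai-san's patch does essentially
this in ExecHashTableCreate (just a sketch here, not the actual patch):

    hashtable->buckets = (HashJoinTuple *)
        MemoryContextAllocHuge(CurrentMemoryContext,
                               (Size) nbuckets * sizeof(HashJoinTuple));
    memset(hashtable->buckets, 0,
           (Size) nbuckets * sizeof(HashJoinTuple));

The resize path would presumably need the same treatment (repalloc_huge
instead of repalloc), which is the part I suspect is still broken.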

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
