Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Serge Rielau <serge(at)rielau(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)
Date: 2017-11-21 15:29:37
Message-ID: CA+TgmobA8JzmBQLok=xG6grM8XBHE5QeFXY-GXZocg6CB7rYbA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 17, 2017 at 9:22 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I think that it's reasonable for us to make it a goal of the executor
> to have operations that have a smooth cost function, in order to
> manage the risk of misestimation well, and to make it a goal to have
> operations that are otherwise adaptive to misestimation.

Hash joins are a place where we could have a smoother cost function
than we do. When we run out of memory, instead of switching from
(say) a single batch to two batches, switch to 64 batches, but
initially keep 63 of them in memory and only write the very last one
to disk. Every time we again run out of memory, dump another batch to
disk. If we end up dumping more than half or so of the batches to
disk, switch to an even larger number of batches to make it even more
fine-grained. The current system is bad because you jump from
spooling NO tuples to a tuplestore to spooling HALF of the inner AND
outer tuples to a tuplestore. If the hash table is just a little too
big to fit, we could write 1/64 or 2/64 or 3/64 of the inner and outer
tuples to a tuplestore instead of HALF of them, which would be a huge
win.
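
To make the bookkeeping concrete, here is a rough, self-contained sketch
of the idea -- not actual executor code, and none of these names
(SpillState, spill_next_batch, tuple_inserted, NBATCH_INITIAL) exist
anywhere; the real thing would live in nodeHash.c / nodeHashjoin.c and
deal with real tuplestores, repartitioning, skew, and so on:

#include <stddef.h>

#define NBATCH_INITIAL 64

typedef struct SpillState
{
    size_t batch_mem[NBATCH_INITIAL]; /* in-memory bytes per batch */
    int    nbatch;                    /* 64 to start with */
    int    next_to_spill;             /* highest-numbered batch still in memory */
    int    nspilled;                  /* batches already written out */
    size_t mem_used;                  /* total in-memory bytes */
    size_t mem_limit;                 /* work_mem, more or less */
} SpillState;

/* Dump the highest-numbered batch that is still in memory to a tuplestore. */
static void
spill_next_batch(SpillState *ss)
{
    if (ss->next_to_spill < 0)
        return;                       /* everything is already on disk */

    /* ... write that batch's tuples to a tuplestore here ... */
    ss->mem_used -= ss->batch_mem[ss->next_to_spill];
    ss->batch_mem[ss->next_to_spill] = 0;
    ss->next_to_spill--;
    ss->nspilled++;
}

/*
 * Called after a tuple of 'tuple_size' bytes lands in an in-memory batch.
 * (Tuples routed to a batch that has already been spilled would go straight
 * to its tuplestore and never show up here.)
 */
static void
tuple_inserted(SpillState *ss, size_t tuple_size)
{
    ss->mem_used += tuple_size;

    /* Spill only as many batches as it takes to get back under the limit. */
    while (ss->mem_used > ss->mem_limit && ss->next_to_spill >= 0)
        spill_next_batch(ss);

    /*
     * If more than half of the batches have gone to disk, the partitioning
     * is too coarse: repartition into a larger nbatch (not shown) so that
     * each spilled slice is a smaller fraction of the input.
     */
    if (ss->nspilled > ss->nbatch / 2)
    {
        /* increase_nbatch(ss); */
    }
}

The point of structuring it this way is that memory pressure costs us one
batch's worth of spilling at a time, instead of half the input all at once.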

That having been said, I think the place where our plans most commonly
go wrong is where we incorrectly estimate the number of tuples by
multiple orders of magnitude - 100x is common, 1000x is common, a
million x is not uncommon, and even a billion x is not unheard-of. And I
don't think there's any way to make a hash join happy if it thinks
it's going to need 1 batch and it ends up needing a million batches.
At that point, even if the cost function is very smooth, you've moved so
far along the curve that you're probably not in a good place. So, while I
think that smoothing out the cost functions is a good idea, I think we
also need to consider what more can be done to improve the estimates -
and especially to avoid estimates that are off by huge multiples.
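
To put rough numbers on that: the batch count scales more or less
linearly with the size of the inner relation, so an underestimate of the
row count translates almost directly into an underestimate of the batch
count. This is back-of-the-envelope arithmetic only, a simplification of
what the real sizing code (ExecChooseHashTableSize) does, with a made-up
average tuple size:

#include <stdio.h>

/*
 * Simplification: pick the smallest power-of-two batch count whose
 * per-batch share of the inner relation fits in work_mem.
 */
static long
nbatch_for(double inner_bytes, double work_mem_bytes)
{
    long nbatch = 1;

    while (inner_bytes / nbatch > work_mem_bytes)
        nbatch *= 2;
    return nbatch;
}

int
main(void)
{
    double work_mem = 4.0 * 1024 * 1024;   /* 4MB, the default work_mem */
    double tuple_bytes = 64;               /* made-up average tuple size */

    /* Planned: 10k inner rows -> everything fits in one batch. */
    printf("estimated 10,000 rows      -> %ld batch\n",
           nbatch_for(10000 * tuple_bytes, work_mem));

    /* Actual: a million times more rows -> hundreds of thousands of batches. */
    printf("actual 10,000,000,000 rows -> %ld batches\n",
           nbatch_for(1e10 * tuple_bytes, work_mem));
    return 0;
}

No amount of smoothing rescues a plan that is off by six orders of
magnitude; the estimate itself has to get better.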

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
