Re: Performance degradation in TPC-H Q18

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance degradation in TPC-H Q18
Date: 2017-03-06 20:32:00
Message-ID: 20170306203200.kczd7xldxirsbgwl@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-03-04 11:09:40 +0530, Robert Haas wrote:
> On Sat, Mar 4, 2017 at 5:56 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > attached is a patch to address this problem, and the one reported by
> > Dilip. I ran a lot of TPC-H and other benchmarks, and so far this
> > addresses all the performance issues, often being noticeably faster than
> > with the dynahash code.
> >
> > Comments?
>
> I'm still not convinced that raising the fillfactor like this is going
> to hold up in testing, but I don't mind you committing it and we'll
> see what happens.

I didn't see anything in testing, but I agree that it's debatable. But
I'd rather commit it now, when we all know it's new code. Raising it in
a new release will be a lot harder.

> I think DEBUG1 is far too high for something that could occur with
> some frequency on a busy system; I'm fairly strongly of the opinion
> that you ought to downgrade that by a couple of levels, say to DEBUG3
> or so.

I actually planned to remove it entirely before committing. It was
mostly left in so that testers could see when the code triggers.

> > On 2017-03-03 11:23:00 +0530, Kuntal Ghosh wrote:
> >> On Fri, Mar 3, 2017 at 8:41 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> > On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> >> the resulting hash-values aren't actually meaningfully influenced by the
> >> >> IV. Because we just xor with the IV, most hash-values that without the IV
> >> >> would have fallen into a single hash-bucket, fall into a single
> >> >> hash-bucket afterwards as well; just somewhere else in the hash-range.
> >> >
> >> > Wow, OK. I had kind of assumed (without looking) that setting the
> >> > hash IV did something a little more useful than that. Maybe we should
> >> > do something like struct blah { int iv; int hv; }; newhv =
> >> > hash_any(&blah, sizeof(blah)).
> >
> > The hash invocations are already noticeable performance-wise, so I'm a
> > bit hesitant to go there. I'd rather introduce a decent 'hash_combine'
> > function or such.
>
> Yes, that might be better. I wasn't too sure the approach I proposed
> would actually do a sufficiently-good job mixing in the bits from the
> IV anyway. It's important to keep in mind that the values we're using
> as IVs aren't necessarily going to be uniformly distributed in any
> meaningful way. They're just PIDs, so you might only have 1-3 bits of
> difference between one value and another within the same parallel
> query. If you don't do something fairly aggressive to make that
> change perturb the final hash value, it probably won't.

FWIW, I played with some better mixing, and it helps a bit with
accurately sized hashtables and multiple columns.

What's more interesting, however, is that a better-mixed IV and/or
better iteration now *slightly* *hurts* performance with grossly
misestimated sizes, because resizing has to copy more rows... Not what
I predicted.
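
To make the xor-versus-mixing point from the quoted discussion concrete, here is a minimal C sketch. It is not code from the patch: `bucket_of`, `hash_xor_iv`, `hash_combine_sketch` and the two `collide_*` helpers are hypothetical names, the table is assumed to be power-of-two sized with the bucket taken from the low bits of the hash, and the combine step uses the well-known Boost-style formula purely as an illustration of a cheap combine function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bucket index in a power-of-two sized table; mask is nbuckets - 1 (assumption). */
static inline uint32_t
bucket_of(uint32_t hash, uint32_t mask)
{
    return hash & mask;
}

/* The scheme criticized in the thread: perturb the hash by xor-ing in the IV. */
static inline uint32_t
hash_xor_iv(uint32_t hash, uint32_t iv)
{
    return hash ^ iv;
}

/*
 * Hypothetical combine step using the Boost-style formula. The shifts and
 * the addition let the IV interact with the hash bits through carries,
 * so the result is not just a relocated copy of the input.
 */
static inline uint32_t
hash_combine_sketch(uint32_t hash, uint32_t iv)
{
    hash ^= iv + 0x9e3779b9u + (hash << 6) + (hash >> 2);
    return hash;
}

/* Do h1 and h2 land in the same bucket after xor-ing in the IV? */
static inline bool
collide_xor(uint32_t h1, uint32_t h2, uint32_t iv, uint32_t mask)
{
    return bucket_of(hash_xor_iv(h1, iv), mask) ==
           bucket_of(hash_xor_iv(h2, iv), mask);
}

/* Do h1 and h2 land in the same bucket after the combine step? */
static inline bool
collide_combine(uint32_t h1, uint32_t h2, uint32_t iv, uint32_t mask)
{
    return bucket_of(hash_combine_sketch(h1, iv), mask) ==
           bucket_of(hash_combine_sketch(h2, iv), mask);
}
```

With a 16-bucket table (`mask = 0xF`), hash values 5 and 21 share a bucket under `hash_xor_iv` for every IV, so the IV only moves the collision elsewhere. Under `hash_combine_sketch`, hash values 0 and 7 share a bucket for IV 2 but not for IV 3, so which keys collide depends on the IV. This formula only pulls in nearby bits; it illustrates the principle and says nothing about how well any particular function would mix closely spaced PIDs.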

Greetings,

Andres Freund
