Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: Joshua Tolley <eggyknap(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bryce Cutt <pandasuit(at)gmail(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-23 14:51:51
Message-ID: 20081223145146.GA5882@uber
Lists: pgsql-hackers

On Tue, Dec 23, 2008 at 09:22:27AM -0500, Robert Haas wrote:
> On Tue, Dec 23, 2008 at 2:21 AM, Bryce Cutt <pandasuit(at)gmail(dot)com> wrote:
> > Because there is no nice way in PostgreSQL (that I know of) to derive
> > a histogram after a join (on an intermediate result), currently
> > usingMostCommonValues is only enabled on a join when the outer (probe)
> > side is a table scan (seq scan only, actually). See
> > getMostCommonValues (soon to be called
> > ExecHashJoinGetMostCommonValues) for the logic that determines this.

So my test case of "do a whole bunch of hash joins in a test query"
isn't really valid. Makes sense. I did another, more haphazard test on a
query with fewer joins, and saw noticeable speedups.

> It's starting to seem to me that the case where this patch provides a
> benefit is so narrow that I'm not sure it's worth the extra code.

Not that anyone asked, but I don't consider myself qualified to render
judgement on that point. Code size is, I guess, a maintainability issue,
and I'm not terribly experienced maintaining PostgreSQL :)

> Is it realistic to think that the MCVs of the base relation might
> still be applicable to the joinrel? It's certainly easy to think of
> counterexamples, but it might be a good approximation more often than
> not.

It's equivalent to our assumption that distributions of values in
columns in the same table are independent. Making that assumption in
this case would probably result in occasional dramatic speed
improvements similar to the ones we've seen in less complex joins,
offset by just-as-occasional dramatic slowdowns of similar magnitude. In
other words, it will increase the variance of our results.
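
To make the concern concrete, here is a minimal Python sketch (not
PostgreSQL code, and not part of the patch). The table contents and the
helper names most_common_values and semi_join are made up purely for
illustration. It shows one join that preserves the base relation's most
common value and one that invalidates it completely.

```python
from collections import Counter

def most_common_values(values, n=1):
    """Return the n most frequent values, most frequent first."""
    return [v for v, _ in Counter(values).most_common(n)]

def semi_join(rows, allowed_products):
    """Keep customer_ids of rows whose product survives an earlier join."""
    return [cust for cust, prod in rows if prod in allowed_products]

# Hypothetical base relation: (customer_id, product) pairs, skewed toward
# customer 1 (6 rows), then customer 2 (3 rows), then customer 3 (2 rows).
orders = (
    [(1, "a")] * 3 + [(1, "b")] * 3
    + [(2, "a")] * 2 + [(2, "b")] * 1
    + [(3, "c")] * 2
)

# MCV computed from the base relation's statistics.
base_mcv = most_common_values([cust for cust, _ in orders])

# Case 1: the earlier join keeps product "a" only. Customer 1 is still the
# most common value, so the base MCV remains a good approximation.
friendly_mcv = most_common_values(semi_join(orders, {"a"}))

# Case 2: the earlier join keeps product "c" only. Customer 1 vanishes
# from the intermediate result, so the base MCV is simply wrong.
hostile_mcv = most_common_values(semi_join(orders, {"c"}))

print("base:", base_mcv)
print("after join on product a:", friendly_mcv)
print("after join on product c:", hostile_mcv)
```

In the first case, memory reserved for the base MCVs would still pay
off. In the second it would be spent on values that never show up in the
probe input.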

- Josh
