Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Bryce Cutt" <pandasuit(at)gmail(dot)com>
Cc: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date: 2009-01-02 04:03:06
Message-ID: 603c8f070901012003s60a6cfb9l1d91ba6326f85b0b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 30, 2008 at 12:29 AM, Bryce Cutt <pandasuit(at)gmail(dot)com> wrote:
> Here is the next patch version.

Thanks for posting this update. This is definitely getting better,
but I still see some style issues. We can work on fixing those once
the rest of the details have been finalized.

However, one question in this area - isn't
ExecHashFreezeNextMCVPartition actually a most common TUPLE partition,
rather than a most common VALUE partition (and similarly for
ExecHashGetMCVPartition)? I'm not quite sure what to do about this as
the names are already quite long - is there some better name for the
functions and structure members than MostCommonTuplePartition? Maybe
we could call it the in-memory partition and abbreviate it IMPartition
throughout. I think that might make things more clear.

> The code can now find the the MCVs in more cases. Even if the probe
> side is an operator other than a seq scan (such as another hashjoin)
> the code can now find the stats tuple for the underlying relation.

You're using varnoold in a way that directly contradicts the comment
in primnodes.h (essentially, that it's not used for anything other
than debugging). I don't think this is a bad thing, but you have to
patch the comment.

Have you done any performance testing on the impact of this change?

> The new idea of limiting the number of MCVs to a percentage of memory
> has not been added yet.

That's a pretty important change, I think, though it would be nice to
have one of the committers chime in here. For those who may not have
been following the thread closely, the current implementation's memory
usage can go quite a bit higher than work_mem - the in-memory open
hash table can be up to 1MB or so (if statistics_target = 10K) plus it
can contain up to work_mem of tuples plus each batch can contain
another work_mem of tuples. The proposal is to carve out 1-3% of
work_mem for the in-memory hash table and leave the rest for the
batches, thus hopefully not affecting the # of batches very much. If
it doesn't look like the whole MCV list will fit, we'll take a shot at
guessing what length prefix of it will.

...Robert

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-01-02 04:11:33 Re: posix_fadvise v22
Previous Message Robert Treat 2009-01-02 04:00:22 Re: Copyright update