Quick Links

Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

From:	"Bryce Cutt" <pandasuit(at)gmail(dot)com>
To:	"Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc:	"Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date:	2008-12-30 05:29:06
Message-ID:	1924d1180812292129v2d973b05rf319f3dc4049a5be@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Here is the next patch version.

The naming and style concerns have been addressed. The patch now only
touches 5 files. 4 of those files are hashjoin specific and 1 is to
add a couple lines to a hashjoin specific struct in another file.

The code can now find the the MCVs in more cases. Even if the probe
side is an operator other than a seq scan (such as another hashjoin)
the code can now find the stats tuple for the underlying relation.

The new idea of limiting the number of MCVs to a percentage of memory
has not been added yet.

- Bryce Cutt

On Mon, Dec 29, 2008 at 8:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I think that setting aside a minimum percentage of work_mem may be a
>> reasonable approach. For instance, setting aside 1% at even 1 MB
>> work_mem would be 10 KB which is enough to store about 40 MCV tuples of
>> the TPC-H database. Such a small percentage would be very unlikely (but
>> still possible) to change the number of batches used. Then, given the
>> memory allocation and the known tuple size + overhead, only that number
>> of MCVs are selected for the MCV table regardless how many there are.
>> The MCV table size would then increase as work_mem is changed up to a
>> maximum given by the number of MCVs.
>
> Sounds fine. Maybe 2-3% would be better.
>
>> The code when building the MCV hash table keeps track of the order of
>> insertion of the best MCVs. It then flushes the MCV partitions in
>> decreasing order of frequency of MCVs. Thus, by the end of the build
>> partitioning phase the MCV hash table should only store the most
>> frequent MCV tuples. Even with many-to-many joins as long as we keep
>> all build tuples that have a given MCV in memory, then everything is
>> fine. You would get into problems if you only flushed some of the
>> tuples of a certain MCV but that will not happen.
>
> OK, I'll read it again - I must not have understood.
>
> It would be good to post an updated patch soon, even if not everything
> has been addressed.
>
> ...Robert
>

Attachment	Content-Type	Size
histojoin_v4.patch	application/octet-stream	21.8 KB

In response to

Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets at 2008-12-30 04:55:02 from Robert Haas

Responses

Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets at 2009-01-02 04:03:06 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Nikhil Sontakke	2008-12-30 07:04:02	Re: plpgsql: numeric assignment to an integer variable errors out
Previous Message	Robert Haas	2008-12-30 04:55:02	Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets