From: | "Joshua Tolley" <eggyknap(at)gmail(dot)com> |
---|---|
To: | "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca> |
Cc: | pgsql-hackers(at)postgresql(dot)org, "Bryce Cutt" <pandasuit(at)gmail(dot)com> |
Subject: | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
Date: | 2008-11-01 22:41:48 |
Message-ID: | e7e0a2570811011541x28612963w1f17dcb6d2fe846a@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon(dot)lawrence(at)ubc(dot)ca> wrote:
> We propose a patch that improves hybrid hash join's performance for large
> multi-batch joins where the probe relation has skew.
>
> Project name: Histojoin
> Patch file: histojoin_v1.patch
>
> This patch implements the Histojoin join algorithm as an optional feature
> added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or
> disable the Histojoin features. When Histojoin is disabled, HHJ acts as
> normal. The Histojoin features allow HHJ to use PostgreSQL's statistics to
> do skew-aware partitioning. The basic idea is to keep build relation tuples
> whose join values occur frequently in the probe relation in a small
> in-memory hash table. For skewed data sets, this improves the performance
> of HHJ by 10% to 50% when multiple batches are used. The
> performance improvements of this patch can be seen in the paper (pages
> 25-30) at:
>
> http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
>
> All generators and materials needed to verify these results can be provided.
>
> This is a patch against the HEAD of the repository.
>
> This patch does not contain platform specific code. It compiles and has
> been tested on our machines in both Windows (MSVC++) and Linux (GCC).
>
> Currently the Histojoin feature is enabled by default and is used whenever
> HHJ is used and there are Most Common Value (MCV) statistics available on
> the probe side base relation of the join. To disable this feature simply
> set the enable_hashjoin_usestatmcvs flag to off in the database
> configuration file or at run time with the 'set' command.
>
> One potential improvement not included in the patch concerns Most Common
> Value (MCV) statistics, which are currently only determined when the probe
> relation is produced by a scan operator. There is a benefit to using MCVs
> even when the probe relation is not a base scan, but we were unable to
> determine how to find statistics from a base relation after other operators
> have been applied.
>
> This patch was created by Bryce Cutt as part of his work on his M.Sc.
> thesis.
>
> --
> Dr. Ramon Lawrence
> Assistant Professor, Department of Computer Science, University of British
> Columbia Okanagan
> E-mail: ramon(dot)lawrence(at)ubc(dot)ca
I'm interested in trying to review this patch. Having not done patch
review before, I can't exactly promise grand results, but could you
provide me with the data to check your results? In the meantime I'll
go read the paper.
- Josh / eggyknap
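
As a reading aid for the proposal quoted above, here is a minimal Python sketch of the skew-aware partitioning idea as the message describes it: build tuples whose join key is a probe-side MCV go into a small in-memory table, and everything else is hash-partitioned into batches. This is not the patch's code. The function name `skew_aware_hash_join`, its parameters, and the `deferred` counter are hypothetical, and the sketch simplifies by treating every regular batch as deferred, whereas a real hybrid hash join also keeps its first batch in memory.

```python
from collections import defaultdict


def skew_aware_hash_join(build_rows, probe_rows, probe_mcvs, num_batches,
                         key=lambda row: row[0]):
    """Toy inner equi-join; returns (joined_pairs, deferred_probe_count).

    probe_mcvs stands in for the MCV statistics on the probe relation.
    deferred_probe_count is the number of probe tuples that had to wait
    for a later batch (a stand-in for tuples written to temporary files).
    """
    mcv_keys = set(probe_mcvs)

    # Partition the build side: MCV-keyed tuples stay in a small in-memory
    # table, all other tuples go to their regular hash batch.
    skew_table = defaultdict(list)
    build_batches = [defaultdict(list) for _ in range(num_batches)]
    for row in build_rows:
        k = key(row)
        if k in mcv_keys:
            skew_table[k].append(row)
        else:
            build_batches[hash(k) % num_batches][k].append(row)

    # Scan the probe side: tuples with an MCV key are joined immediately,
    # the rest are queued for their batch.
    results = []
    probe_batches = [[] for _ in range(num_batches)]
    deferred = 0
    for row in probe_rows:
        k = key(row)
        if k in mcv_keys:
            for build_row in skew_table.get(k, ()):
                results.append((build_row, row))
        else:
            probe_batches[hash(k) % num_batches].append(row)
            deferred += 1

    # Process the regular batches one at a time.
    for batch_no in range(num_batches):
        table = build_batches[batch_no]
        for row in probe_batches[batch_no]:
            for build_row in table.get(key(row), ()):
                results.append((build_row, row))

    return results, deferred


if __name__ == "__main__":
    build = [(1, "a"), (2, "b"), (3, "c")]
    # Key 1 dominates the probe side, so it plays the role of an MCV.
    probe = [(1, "x")] * 5 + [(2, "y"), (3, "z"), (4, "w")]
    pairs, deferred = skew_aware_hash_join(build, probe, probe_mcvs=[1],
                                           num_batches=2)
    print(len(pairs), "joined pairs;", deferred, "probe tuples deferred")
```

With the sample data, the five probe tuples carrying the frequent key are joined during the initial probe scan, so only the three remaining probe tuples are deferred. Passing an empty `probe_mcvs` list makes the function behave like a plain batched hash join, which roughly corresponds to the feature being switched off.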