Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-28 22:28:20
Message-ID: 6EEA43D22289484890D119821101B1DF2C182C@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> I thought about this, but upon due reflection I think it's the wrong
> approach. Raising work_mem is a pretty common tuning step - it's 4MB
> even on my small OLTP systems, and in a data-warehousing environment
> where this optimization will bring the most benefit, it could easily
> be higher. Furthermore, if someone DOES change the statistics target
> for that column to 10,000, there's a pretty good chance that they had
> a reason for doing so (or at the very least it's not for us to assume
> that they were doing something stupid). I think we need some kind of
> code to try to tune this based on the actual situation.
>
> We might try to size the in-memory hash table to be the largest value
> that won't increase the total number of batches, but if the number of
> batches is large then this won't be the right decision. Maybe we
> should insist on setting aside some minimum percentage of work_mem for
> the in-memory hash table, and fill it with however many MCVs we think
> will fit.

I think that setting aside a minimum percentage of work_mem may be a
reasonable approach. For instance, setting aside 1% of even a 1 MB
work_mem would be 10 KB, which is enough to store about 40 MCV tuples of
the TPC-H database. Such a small percentage would be very unlikely
(though still possible) to change the number of batches used. Then, given
the memory allocation and the known tuple size plus overhead, only that
many MCVs are selected for the MCV table, regardless of how many there
are. The MCV table size would then grow as work_mem is increased, up to a
maximum given by the number of MCVs.
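
To illustrate the arithmetic (just a sketch; the 1% fraction, the names,
and the overhead parameter are assumptions for illustration, not the
patch's actual code):

    #include <stddef.h>

    /* Sketch only: names and constants are illustrative. */
    #define MCV_MEM_FRACTION 0.01    /* reserve ~1% of work_mem for MCVs */

    static int
    mcv_tuples_that_fit(size_t work_mem_bytes, size_t tuple_size,
                        size_t per_tuple_overhead)
    {
        size_t budget = (size_t) (work_mem_bytes * MCV_MEM_FRACTION);
        size_t per_tuple = tuple_size + per_tuple_overhead;

        return (int) (budget / per_tuple);
    }

With 1 MB of work_mem and roughly 256 bytes per TPC-H build tuple
including overhead, that gives about 40 MCV slots, matching the estimate
above; raising work_mem simply raises the cap until all MCVs fit.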

> I agree. However, there's no reason at all to assume that the tuples
> we flush out of the table are any better or worse than the new ones we
> add back in later. In fact, although it's far from a guarantee, if
> the order of the tuples in the table is random, then we're more likely
> to encounter the most common values first. We might as well just keep
> the ones we had rather than dumping them out and adding in different
> ones. Err, except, maybe we can't guarantee correctness that way, in
> the case of a many-to-many join?

While building the MCV hash table, the code keeps track of the order in
which the best MCVs are inserted. When memory must be reclaimed, it
flushes MCV partitions starting with the least frequent, so by the end of
the build partitioning phase the MCV hash table should hold only the most
frequent MCV tuples. Even with many-to-many joins, as long as we keep all
build tuples that have a given MCV in memory, everything is fine. You
would only get into problems if you flushed some, but not all, of the
tuples of a certain MCV, and that will not happen.
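
In outline (again only a sketch with hypothetical names, including the
spill_partition() helper; this is not the patch's actual code), the
flushing behaves like this:

    #include <stddef.h>

    /*
     * MCV partitions are kept ordered by decreasing estimated frequency, so
     * flushing works from the tail: whole partitions are spilled, least
     * frequent first, until the table fits its budget.  Because a partition
     * holds every build tuple sharing that MCV join key, a value is either
     * entirely in memory or entirely spilled, which is what keeps
     * many-to-many joins correct.
     */
    typedef struct McvPartition
    {
        double  frequency;   /* estimated frequency from the column's stats */
        void   *tuples;      /* all build tuples with this MCV join key */
        size_t  mem_used;    /* memory consumed by this partition */
    } McvPartition;

    /* hypothetical helper: writes a partition's tuples to its batch file */
    static void spill_partition(McvPartition *victim);

    static void
    flush_least_frequent_mcvs(McvPartition *parts, int *nparts,
                              size_t *total_mem, size_t budget)
    {
        while (*total_mem > budget && *nparts > 0)
        {
            McvPartition *victim = &parts[--(*nparts)];

            spill_partition(victim);
            *total_mem -= victim->mem_used;
        }
    }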

--
Ramon Lawrence
