Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Cc: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-28 14:47:47
Message-ID: 603c8f070812280647i6256130bre142f975659bc762@mail.gmail.com
Lists: pgsql-hackers

> I totally agree that 10,000 MCVs changes things. Ideally, these 10,000
> MCVs should be kept in memory because they will join with the most
> tuples. However, the size of the MCV hash table (as you point out) can
> be bigger than work_mem *by itself*, not even considering the tuples in
> the table or in the in-memory batch.
>
> So, basically, we have a decision to make: whether to try to support a
> larger number of MCVs or cap it at a reasonable number like 100. You
> can come up with situations where using all 10,000 MCVs is good (for
> instance, if all MCVs have frequency 1/10000), but I expect 100 MCVs
> will capture the majority of cases, as usually the top 100 MCVs are
> significantly more frequent than later MCVs.

I thought about this, but upon due reflection I think it's the wrong
approach. Raising work_mem is a pretty common tuning step - it's 4MB
even on my small OLTP systems, and in a data-warehousing environment
where this optimization will bring the most benefit, it could easily
be higher. Furthermore, if someone DOES change the statistics target
for that column to 10,000, there's a pretty good chance that they had
a reason for doing so (or at the very least it's not for us to assume
that they were doing something stupid). I think we need some kind of
code to try to tune this based on the actual situation.

We might try to size the in-memory hash table to be the largest value
that won't increase the total number of batches, but if the number of
batches is large then this won't be the right decision. Maybe we
should insist on setting aside some minimum percentage of work_mem for
the in-memory hash table, and fill it with however many MCVs we think
will fit.
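
To make that concrete, here is a toy, standalone sketch of the sort of
sizing calculation I have in mind.  The 2% reservation and all of the
struct and function names are placeholders I made up for illustration;
none of this is code from the patch:

/* Illustrative sketch only: reserve an assumed fixed share of work_mem
 * for the in-memory MCV hash table and work out how many MCV entries
 * fit in that budget.  The 2% figure and the struct layout are made-up
 * assumptions, not values from the patch. */
#include <stddef.h>
#include <stdio.h>

#define MCV_MEM_PERCENT 2            /* assumed share of work_mem */

typedef struct MCVEntry              /* hypothetical per-MCV bookkeeping */
{
    unsigned int hashvalue;          /* 32-bit hash of the MCV */
    void        *tuples;             /* chain of matching inner tuples */
} MCVEntry;

static int
mcv_entries_that_fit(size_t work_mem_bytes, size_t avg_tuple_bytes,
                     int num_mcvs)
{
    size_t      budget = work_mem_bytes * MCV_MEM_PERCENT / 100;
    size_t      per_entry = sizeof(MCVEntry) + avg_tuple_bytes;
    int         fit = (int) (budget / per_entry);

    return (fit < num_mcvs) ? fit : num_mcvs;   /* never keep more than exist */
}

int
main(void)
{
    /* e.g. work_mem = 4MB, ~100-byte tuples, 10,000 MCVs available */
    printf("%d MCVs fit\n",
           mcv_entries_that_fit(4 * 1024 * 1024, 100, 10000));
    return 0;
}

With work_mem at 4MB that keeps a few hundred MCVs; with a
data-warehousing work_mem it would happily keep all 10,000.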

> The issue with building the MCV table is that the hash operator will not
> be receiving tuples in MCV frequency order. It is possible that the MCV
> table is filled up with tuples of less frequent MCVs when a more
> frequent MCV tuple arrives. In that case, we would like to keep the
> more frequent MCV and bump one of the less frequent MCVs.

I agree. However, there's no reason at all to assume that the tuples
we flush out of the table are any better or worse than the new ones we
would add in their place later. In fact, although it's far from a
guarantee, if
the order of the tuples in the table is random, then we're more likely
to encounter the most common values first. We might as well just keep
the ones we had rather than dumping them out and adding in different
ones. Err, except, maybe we can't guarantee correctness that way, in
the case of a many-to-many join?
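
If we did go with "keep what we already have", the admission test could
be as simple as the following.  This is purely an illustrative,
self-contained toy; the names and numbers are mine, not the patch's:

/* Sketch of the "don't bother evicting" policy: once the in-memory MCV
 * area is full, a later-arriving tuple is written to its batch file
 * instead of displacing a tuple that is already resident.  All names
 * here are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct McvArea
{
    size_t      used;                /* bytes currently held in memory */
    size_t      budget;              /* cap carved out of work_mem */
} McvArea;

/* Returns true if the tuple was kept in memory, false if the caller
 * should spill it to the batch file instead. */
static bool
mcv_try_keep(McvArea *area, size_t tuple_bytes)
{
    if (area->used + tuple_bytes > area->budget)
        return false;                /* full: keep what we have, spill this one */
    area->used += tuple_bytes;       /* room left: keep it resident */
    return true;
}

int
main(void)
{
    McvArea     area = {0, 1024};

    printf("%d\n", mcv_try_keep(&area, 600));   /* 1: kept in memory */
    printf("%d\n", mcv_try_keep(&area, 600));   /* 0: would overflow, spilled */
    return 0;
}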

I don't think there's any way to get around the possibility of a
hash-table overflow completely. Besides many-to-many joins, there's
also the possibility of hash collisions. The code assumes that
anything that hashes to the same 32-bit value as an MCV is in fact an
MCV, which is obviously false, but doesn't seem worth worrying about
since the chances of a collision are very small and the equality test
might be expensive. But clearly we want to minimize overflows as much
as we can.
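
For clarity, here is roughly what that assumption amounts to, as a
self-contained toy (the tiny table, the probing scheme, and the names
are all mine, not the patch's): any outer hash value that happens to
equal an MCV's 32-bit hash gets routed to the in-memory table, and a
colliding non-MCV value simply rides along, which costs a little memory
but not correctness, since the join qual is still evaluated when tuples
are actually matched.

/* Toy sketch of "same 32-bit hash as an MCV => treat it as an MCV".
 * The table size, probing scheme, and names are illustrative only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MCV_BUCKETS 8            /* tiny table for the example */

static uint32_t mcv_hashes[NUM_MCV_BUCKETS];
static bool     mcv_used[NUM_MCV_BUCKETS];

/* Returns the MCV bucket this hash value maps to, or -1 for "not an MCV". */
static int
get_mcv_bucket(uint32_t hashvalue)
{
    int         bucket = hashvalue % NUM_MCV_BUCKETS;
    int         probes;

    /* linear probing; bounded so the sketch can never loop forever */
    for (probes = 0; probes < NUM_MCV_BUCKETS; probes++)
    {
        if (!mcv_used[bucket])
            return -1;               /* hit an empty slot: not an MCV hash */
        if (mcv_hashes[bucket] == hashvalue)
            return bucket;           /* same 32-bit hash: treated as an MCV */
        bucket = (bucket + 1) % NUM_MCV_BUCKETS;
    }
    return -1;
}

int
main(void)
{
    /* pretend 100 is the hash of the single most common value */
    mcv_hashes[100u % NUM_MCV_BUCKETS] = 100u;
    mcv_used[100u % NUM_MCV_BUCKETS] = true;

    printf("hash 100 -> bucket %d\n", get_mcv_bucket(100));   /* treated as MCV */
    printf("hash 101 -> bucket %d\n", get_mcv_bucket(101));   /* batched normally */
    return 0;
}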

...Robert
