Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Cc: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-25 15:47:57
Message-ID: 603c8f070812250747t3193a27fj8c50ff4c34cb7daa@mail.gmail.com
Lists: pgsql-hackers

> There is almost zero penalty for selecting incorrect MCV tuples to
> buffer in memory. Since the number of MCVs is approximately 100, the
> "overhead" is keeping these 100 tuples in memory where they *might* not
> be MCVs. The cost is the little extra memory and the checking of the
> MCVs which is very fast.

I looked at this some more. I'm a little concerned about the way
we're maintaining the in-memory hash table. Since the highest legal
statistics target is now 10,000, it's possible that we could have two
orders of magnitude more MCVs than what you're expecting. As I read
the code, that could lead to construction of an in-memory hash table
with 64K slots. On a 32-bit machine, I believe that works out to 16
bytes per partition (12 and 4), which is a 1MB hash table. That's not
necessarily problematic, except that I don't think you're considering
the size of the hash table itself when evaluating whether you are
blowing out work_mem, and the default size of work_mem is 1MB.
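To make that arithmetic concrete, here is a tiny standalone calculation -- not
taken from the patch, just my reading of the 32-bit case with the 12-and-4-byte
slot layout mentioned above:

/*
 * Not from the patch -- just an illustration of the sizing math:
 * 64K slots at 12 + 4 bytes apiece on a 32-bit build.
 */
#include <stdio.h>

int
main(void)
{
    long    slots = 64L * 1024;         /* 2^16 hash slots */
    long    bytes_per_slot = 12 + 4;    /* per-slot struct plus pointer */
    long    total = slots * bytes_per_slot;

    printf("in-memory MCV hash table: %ld bytes (%.1f MB)\n",
           total, total / (1024.0 * 1024.0));
    /* prints 1048576 bytes = 1.0 MB, i.e. all of the default work_mem */
    return 0;
}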

I also don't really understand why we're trying to control the size of
the hash table by flushing tuples after the fact. Right now, when the
in-memory table fills up, we just keep adding tuples to it, which in
turn forces us to flush out other tuples to keep the size down. This
seems quite inefficient - not only are we doing a lot of unnecessary
allocating and freeing, but those flushed slots in the hash table
degrade performance (because they don't stop the scan for an empty
slot). It seems like we could simplify things considerably by adding
tuples to the in-memory hash table only to the point where the next
tuple would blow it out. Once we get to that point, we can skip the
isAMostCommonValue() test and send any future tuples straight to temp
files. (This would also reduce the memory consumption of the
in-memory table by a factor of two.)
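Something along these lines is what I have in mind -- a sketch only, with
made-up names rather than the patch's actual functions and structures:

/*
 * Sketch of the proposed policy (hypothetical names, not the patch's
 * code): insert into the in-memory MCV table only while it fits, and
 * once the next tuple would blow the budget, freeze the table and stop
 * testing for MCVs altogether.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct McvTableState
{
    size_t  space_used;     /* bytes currently held in the MCV table */
    size_t  space_allowed;  /* the share of work_mem reserved for it */
    bool    enabled;        /* false once the budget has been reached */
} McvTableState;

/*
 * Returns true if the tuple should be kept in memory; false means the
 * caller writes it to the appropriate batch temp file instead.
 */
static bool
mcv_table_try_insert(McvTableState *st, size_t tuple_size)
{
    if (!st->enabled)
        return false;           /* frozen: caller can skip the MCV test */

    if (st->space_used + tuple_size > st->space_allowed)
    {
        st->enabled = false;    /* full: freeze the table, flush nothing */
        return false;           /* this tuple goes straight to temp files */
    }

    st->space_used += tuple_size;   /* caller stores the tuple in the table */
    return true;
}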

We could potentially improve on this even further if we can estimate
in advance how many MCVs we can fit into the in-memory hash table
before it gets blown out. If, for example, we have only 1MB of
work_mem but there are 10,000 MCVs, getMostCommonValues() might decide to
only hash the first 1,000 MCVs. Even if we still blow out the
in-memory hash table, the earlier MCVs are more frequent than the
later MCVs, so the ones that actually make it into the table are
likely to be more beneficial. I'm not sure exactly how to do this
tuning though, since we'd need to approximate the size of the
tuples... I guess the query planner makes some effort to estimate that
but I'm not sure how to get at it.
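As a rough illustration of the kind of up-front tuning I mean (again with
hypothetical names; the planner's width estimate would have to be plumbed
through somehow):

/*
 * Hypothetical sketch: given an estimated tuple width and the portion of
 * work_mem we are willing to spend, hash only the first N MCVs.  The MCV
 * list in pg_statistic is ordered by descending frequency, so the values
 * we keep are the ones most likely to pay off.
 */
#include <stddef.h>

static int
choose_num_mcvs_to_hash(int num_mcvs,           /* MCVs available in the stats */
                        double est_tuple_width, /* planner's width estimate */
                        size_t work_mem_bytes,
                        double mcv_fraction)    /* share of work_mem to spend */
{
    double  budget = (double) work_mem_bytes * mcv_fraction;
    int     fit = (int) (budget / est_tuple_width);

    return (fit < num_mcvs) ? fit : num_mcvs;
}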

> However, the join with Part and LineItem *should* show a benefit but may
> not because of a limitation of the patch implementation (not the idea).
> The MCV optimization is only enabled currently when the probe side is a
> sequential scan. This limitation is due to our current inability to
> determine a stats tuple of the join attribute on the probe side for
> other operators. (This should be possible - help please?).

Not sure how to get at this either, but I'll take a look and see if I
can figure it out.
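In case it helps, the kind of lookup I'd expect to end up with looks roughly
like this -- a sketch only, assuming the probe-side join key has already been
resolved to a relation OID and attribute number (which is exactly the part
that's currently limited to the seqscan case), and I haven't double-checked
these signatures against HEAD:

/*
 * Hypothetical sketch, not part of the patch: fetch the MCV slot for a
 * (relid, attnum) pair via the syscache.
 */
#include "postgres.h"

#include "catalog/pg_statistic.h"
#include "utils/lsyscache.h"
#include "utils/syscache.h"

static void
lookup_probe_side_mcvs(Oid relid, int16 attnum, Oid atttype, int32 atttypmod)
{
    HeapTuple   statsTuple;
    Datum      *values;
    int         nvalues;
    float4     *numbers;
    int         nnumbers;

    statsTuple = SearchSysCache(STATRELATT,
                                ObjectIdGetDatum(relid),
                                Int16GetDatum(attnum),
                                0, 0);
    if (!HeapTupleIsValid(statsTuple))
        return;                 /* no stats for this column */

    if (get_attstatsslot(statsTuple, atttype, atttypmod,
                         STATISTIC_KIND_MCV, InvalidOid,
                         &values, &nvalues,
                         &numbers, &nnumbers))
    {
        /* values[] / numbers[] hold the common values and frequencies */
        free_attstatsslot(atttype, values, nvalues, numbers, nnumbers);
    }
    ReleaseSysCache(statsTuple);
}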

Merry Christmas,

...Robert
