Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-27 20:10:26
Message-ID: 6EEA43D22289484890D119821101B1DF2C182B@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas(at)gmail(dot)com]
> I looked at this some more. I'm a little concerned about the way
> we're maintaining the in-memory hash table. Since the highest legal
> statistics target is now 10,000, it's possible that we could have two
> orders of magnitude more MCVs than what you're expecting. As I read
> the code, that could lead to construction of an in-memory hash table
> with 64K slots. On a 32-bit machine, I believe that works out to 16
> bytes per partition (12 and 4), which is a 1MB hash table. That's not
> necessarily problematic, except that I don't think you're considering
> the size of the hash table itself when evaluating whether you are
> blowing out work_mem, and the default size of work_mem is 1MB.

I totally agree that 10,000 MCVs changes things. Ideally, these 10,000
MCVs should be kept in memory because they will join with the most
tuples. However, as you point out, the MCV hash table *by itself* can be
bigger than work_mem, even before considering the tuples it holds or the
in-memory batch. Supporting that many MCVs would require more
modifications to the hash join algorithm.

100 MCVs should be able to fit in memory, though. Since the number of
batches is rounded up to a power of 2, there are often some
hash_table_bytes left unused by the in-memory batch that can be "used" to
store the MCV table. The absolute amount of memory consumed should also
be reasonable (depending on the tuple size in bytes).

So, basically, we have a decision to make: whether to try to support a
larger number of MCVs or to cap it at a reasonable number like 100. You
can construct situations where using all 10,000 MCVs is good (for
instance, if every MCV has frequency 1/10000), but I expect 100 MCVs will
capture the majority of cases, as the top 100 MCVs are usually
significantly more frequent than the later ones.

I now also see that the code should be changed to keep track of the MCV
bytes separately from hashtable->spaceUsed, since spaceUsed is used to
determine when to dynamically increase the number of batches.
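
For instance, the accounting in ExecHashTableInsert could be split along
these lines (a rough sketch only; tupleIsMCV, the mcv* fields, and the
flush helper are hypothetical names, not from the current patch):

/*
 * Sketch: account for MCV tuples separately so that MCV memory does not
 * trigger the dynamic batch-growth check used for the main hash table.
 */
if (tupleIsMCV)
{
    hashtable->mcvSpaceUsed += HJTUPLE_OVERHEAD + tuple->t_len;
    if (hashtable->mcvSpaceUsed > hashtable->mcvSpaceAllowed)
        flushLeastFrequentMCV(hashtable);       /* hypothetical helper */
}
else
{
    hashtable->spaceUsed += HJTUPLE_OVERHEAD + tuple->t_len;
    if (hashtable->spaceUsed > hashtable->spaceAllowed)
        ExecHashIncreaseNumBatches(hashtable);
}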

> I also don't really understand why we're trying to control the size of
> the hash table by flushing tuples after the fact. Right now, when the
> in-memory table fills up, we just keep adding tuples to it, which in
> turn forces us to flush out other tuples to keep the size down. This
> seems quite inefficient - not only are we doing a lot of unnecessary
> allocating and freeing, but those flushed slots in the hash table
> degrade performance (because they don't stop the scan for an empty
> slot). It seems like we could simplify things considerably by adding
> tuples to the in-memory hash table only to the point where the next
> tuple would blow it out. Once we get to that point, we can skip the
> isAMostCommonValue() test and send any future tuples straight to temp
> files. (This would also reduce the memory consumption of the
> in-memory table by a factor of two.)

In the ideal case, we select a number of MCVs to support that we know
will always fit in memory. The flushing handles the case where we are
doing a many-to-many join and the build relation may contain multiple
tuples with a given MCV value.

The issue with building the MCV table is that the hash operator will not
be receiving tuples in MCV frequency order. It is possible that the MCV
table is filled up with tuples of less frequent MCVs when a more
frequent MCV tuple arrives. In that case, we would like to keep the
more frequent MCV and bump one of the less frequent MCVs.
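
Roughly, the insertion path for an MCV tuple could look like this (a
sketch only; all of these helper names are hypothetical):

/*
 * Sketch: the MCV table is full.  Keep the incoming tuple only if its
 * MCV is more frequent than the least frequent MCV currently resident;
 * otherwise send it to the batch files as usual.
 */
if (mcvTableIsFull(hashtable))
{
    int     victim = leastFrequentResidentMCV(hashtable);

    if (mcvFrequency(hashtable, mcvIndex) > mcvFrequency(hashtable, victim))
        flushMCVToBatchFile(hashtable, victim); /* frees room, then insert */
    else
        insertIntoBatchFile(hashtable, tuple, hashvalue);
}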

> We could potentially improve on this even further if we can estimate
> in advance how many MCVs we can fit into the in-memory hash table
> before it gets blown out. If, for example, we have only 1MB of
> work_mem but there 10,000 MCVs, getMostCommonValues() might decide to
> only hash the first 1,000 MCVs. Even if we still blow out the
> in-memory hash table, the earlier MCVs are more frequent than the
> later MCVs, so the ones that actually make it into the table are
> likely to be more beneficial. I'm not sure exactly how to do this
> tuning though, since we'd need to approximate the size of the
> tuples... I guess the query planner makes some effort to estimate that
> but I'm not sure how to get at it.

The number of batches (nbatch), inner_rel_bytes, and hash_table_bytes
are calculated in ExecChooseHashTableSize in nodeHash.c.

The number of bytes "free" (not allocated to the in-memory batch) is then:

hash_table_bytes - inner_rel_bytes/nbatch

Depending on the power-of-2 rounding of nbatch, this may be almost 0 or
quite large. You could change the calculation of nbatch or try to
resize the in-memory batch, but that opens up a can of worms. It may be
best to assume a small number of MCVs, say 10 or 100.
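
At the end of ExecChooseHashTableSize, that could translate into
something like the following (inner_rel_bytes, hash_table_bytes, nbatch,
and tupsize are the existing local variables; the MCV-related names are
just illustrative):

/*
 * Sketch: estimate how many MCV tuples fit in the space left over after
 * nbatch is rounded up to a power of 2.  E.g. with work_mem = 1MB and a
 * 6MB inner relation, nbatch rounds from 6 up to 8, each batch needs
 * about 768KB, leaving roughly 256KB "free" for the MCV table.
 */
long    batch_bytes    = inner_rel_bytes / nbatch;
long    free_bytes     = hash_table_bytes - batch_bytes;
int     max_mcv_tuples = 0;

if (nbatch > 1 && free_bytes > 0)
    max_mcv_tuples = free_bytes / tupsize;

num_mcvs = Min(stats_mcv_count, Min(max_mcv_tuples, 100));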

>
> > However, the join with Part and LineItem *should* show a benefit but may
> > not because of a limitation of the patch implementation (not the idea).
> > The MCV optimization is only enabled currently when the probe side is a
> > sequential scan. This limitation is due to our current inability to
> > determine a stats tuple of the join attribute on the probe side for
> > other operators. (This should be possible - help please?).
>
> Not sure how to get at this either, but I'll take a look and see if I
> can figure it out.

After more digging, we can extract the original relation id and
attribute id of the join attribute using the instance variables varnoold
and varoattno of Var. It is documented that these variables are just
kept around for debugging, but they are definitely useful here.

New code would be:
relid = getrelid(variable->varnoold, estate->es_range_table);
relattnum = variable->varoattno;
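
Continuing from those two lines, a sketch of how the statistics tuple
could then be fetched (error handling omitted; surrounding declarations
assumed):

HeapTuple   statsTuple;

statsTuple = SearchSysCache(STATRELATT,
                            ObjectIdGetDatum(relid),
                            Int16GetDatum(relattnum),
                            0, 0);
if (HeapTupleIsValid(statsTuple))
{
    /* ... extract the MCV slot with get_attstatsslot() ... */
    ReleaseSysCache(statsTuple);
}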

Thanks for working with us on the patch.

Happy Holidays Everyone,

Ramon Lawrence
