mergejoin null handling (was Re: [PERFORM] merge join killing performance)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: mergejoin null handling (was Re: [PERFORM] merge join killing performance)
Date: 2010-05-25 20:06:49
Message-ID: 28813.1274818009@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> So, Tom, so you think it's possible that the planner isn't noticing
> all those nulls and thinks it'll just take a row or two to get to the
> value it needs to join on?

I dug through this and have concluded that it's really an oversight in
the patch I wrote some years ago in response to this:
http://archives.postgresql.org/pgsql-performance/2005-05/msg00219.php

That patch taught nodeMergejoin that a row containing a NULL key can't
possibly match anything on the other side. However, its response to
observing a NULL is just to advance to the next row of that input.
What we should do, if the NULL is in the first merge column and the sort
order is nulls-high, is realize that every following row in that input
must also contain a NULL and so we can just terminate the mergejoin
immediately. The original patch works well for cases where there are
just a few nulls in one input and the important factor is to not read
all the rest of the other input --- but it fails to cover the case where
there are many nulls and the important factor is to not read all the
rest of the nulls. The problem can be demonstrated if you modify the
example given in the above-referenced message so that table t1 contains
lots of nulls rather than just a few: explain analyze will show that
all of t1 gets read by the mergejoin, and that's not necessary.

I'm inclined to think this is a performance bug and should be
back-patched, assuming the fix is simple (which I think it is, but
haven't coded/tested yet). It'd probably be reasonable to go back to
8.3; before that, sorting nulls high versus nulls low was pretty poorly
defined and so there'd be risk of breaking cases that gave the right
answers before.

Comments?

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Joseph Adams 2010-05-25 20:11:08 Re: Fwd: Hiding data in postgresql
Previous Message Jan Wieck 2010-05-25 19:58:14 Re: Exposing the Xact commit order to the user

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2010-05-25 21:03:33 Re: shared_buffers advice
Previous Message Joshua Tolley 2010-05-25 19:01:49 Re: prepared query performs much worse than regular query