Skip site navigation (1) Skip section navigation (2)

mergejoin null handling (was Re: [PERFORM] merge join killing performance)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Matthew Wakeling <matthew(at)flymine(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: mergejoin null handling (was Re: [PERFORM] merge join killing performance)
Date: 2010-05-25 20:06:49
Message-ID: 28813.1274818009@sss.pgh.pa.us (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-performance
Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> So, Tom, so you think it's possible that the planner isn't noticing
> all those nulls and thinks it'll just take a row or two to get to the
> value it needs to join on?

I dug through this and have concluded that it's really an oversight in
the patch I wrote some years ago in response to this:
http://archives.postgresql.org/pgsql-performance/2005-05/msg00219.php

That patch taught nodeMergejoin that a row containing a NULL key can't
possibly match anything on the other side.  However, its response to
observing a NULL is just to advance to the next row of that input.
What we should do, if the NULL is in the first merge column and the sort
order is nulls-high, is realize that every following row in that input
must also contain a NULL and so we can just terminate the mergejoin
immediately.  The original patch works well for cases where there are
just a few nulls in one input and the important factor is to not read
all the rest of the other input --- but it fails to cover the case where
there are many nulls and the important factor is to not read all the
rest of the nulls.  The problem can be demonstrated if you modify the
example given in the above-referenced message so that table t1 contains
lots of nulls rather than just a few: explain analyze will show that
all of t1 gets read by the mergejoin, and that's not necessary.

I'm inclined to think this is a performance bug and should be
back-patched, assuming the fix is simple (which I think it is, but
haven't coded/tested yet).  It'd probably be reasonable to go back to
8.3; before that, sorting nulls high versus nulls low was pretty poorly
defined and so there'd be risk of breaking cases that gave the right
answers before.

Comments?

			regards, tom lane

In response to

pgsql-performance by date

Next:From: Merlin MoncureDate: 2010-05-25 21:03:33
Subject: Re: shared_buffers advice
Previous:From: Joshua TolleyDate: 2010-05-25 19:01:49
Subject: Re: prepared query performs much worse than regular query

pgsql-hackers by date

Next:From: Joseph AdamsDate: 2010-05-25 20:11:08
Subject: Re: Fwd: Hiding data in postgresql
Previous:From: Jan WieckDate: 2010-05-25 19:58:14
Subject: Re: Exposing the Xact commit order to the user

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group