Re: Bug in numeric multiplication

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in numeric multiplication
Date: 2015-11-19 22:58:17
Message-ID: 1120.1447973897@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On 18 November 2015 at 22:19, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Still, this proves that we are onto a real problem.

> Agreed. I had a go at turning that example into something using
> log(base, num) so that the result would be visible in a pure SQL test,
> but I didn't have any luck.

I experimented with that idea some, and found a test case that would
trigger the Assert during log_var's final div_var_fast call:

select log(
exp(.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999995050012831598249352382434889825483159817)
,
exp(.99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999009999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999));

However, the emitted answer was the same with or without my proposed
patch. So I dug deeper by putting in some debug printf's, and found that
the overflow occurs at this point:

div[81] will overflow: div[qi] = 218943 qdigit = 7673 maxdiv = 210375
div[]:
0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 -001 9999 9999 9999 5049 9871 6840 1750 6476 1756 5110 1745 1684 0183 0860 9567 3786 7660 2671 2843 5974 6515 4013 7886 0978 7230 5372 8162 7077 2013 6185 3746 3095 8528 1595 3071 7541 7920 7950 218943 -2103498897 -2103499629 -2103499629 -2103499629 -2103504578

Note that div[qi+1], and indeed all the remaining dividend digits, are
large negative values. This is what you'd expect if the carry propagation
step hasn't run for awhile, which is a precondition for div[qi] being
large enough to cause an issue. When we compute 218943 * 10000, we will
indeed get an overflow, and the result will wrap around to some large
negative value (2^32 less than it should be). Then we will add that to
div[qi+1], and we'll get *another* overflow, wrapping what nominally
should have been a negative sum around to a positive value (2^32 more than
it should be). So the two overflows cancel and we get exactly the correct
new value of div[qi+1].

I do not know whether it's possible to devise a test case where you don't
get offsetting overflows. It may be that there's no operational bug here.
Still, the code is surely not behaving per face value.

> I had a go at trying to find a simpler approach and came up with the
> attached, which computes the value intended for div[qi+1] near the top
> of the loop (using doubles) so that it can detect if it will overflow,
> and if so it enters the normalisation block. It still needs some work
> to prove that the normalised value for fnextdigit won't overflow, but
> it feels like a promising, easier-to-follow approach to me. What do
> you think?

I'll look at this later ...

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2015-11-19 23:22:36 ROLES and ALTER DEFAULT PRIVILEGES
Previous Message Peter Geoghegan 2015-11-19 22:53:47 Re: Using quicksort for every external sort run