numeric_normalize() is a few bricks shy of a load

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: numeric_normalize() is a few bricks shy of a load
Date: 2014-11-05 23:19:26
Message-ID: 24216.1415229566@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

9.4 introduced a function numeric_normalize() whose charter is to
produce a string representing a numeric, such that two numerics' string
representations will compare equal if and only if the numeric values
compare equal. (This is used, so far, only by GIN indexes on JSONB.)
Thomas Fanghaenel pointed out to me that this function is utterly,
completely broken: it will reduce distinct values such as 1 and 1000
to the same string, and what's much worse for the GIN use-case,
it can reduce values that should compare equal to distinct strings.
Here's a test case demonstrating the latter assertion:

regression=# create table t1 (f1 jsonb);
CREATE TABLE
regression=# insert into t1 values('{"x": 12000}');
INSERT 0 1
regression=# insert into t1 values('{"x": 12000.00}');
INSERT 0 1
regression=# select * from t1 where f1 @> '{"x": 12000}';
f1
-----------------
{"x": 12000}
{"x": 12000.00}
(2 rows)

regression=# create index on t1 using gin (f1);
CREATE INDEX
regression=# set enable_seqscan TO 0;
SET
regression=# select * from t1 where f1 @> '{"x": 12000}';
f1
--------------
{"x": 12000}
(1 row)

regression=# select * from t1 where f1 @> '{"x": 12000.00}';
f1
-----------------
{"x": 12000.00}
(1 row)

Since JSONB GIN indexes are always considered lossy, the
reduction-to-same-string case is masked by index rechecks, resulting
only in an inefficient index that fails to distinguish keys it could
distinguish. However, the reduction-to-different-strings case results
in wrong answers, as illustrated above.

I think there's no choice but to fix this in 9.4. The only reason
it even needs discussion is that this would invalidate index entries
in beta testers' JSONB GIN indexes, such that queries would not find
entries that they did find before. I'm not sure we can do anything
about this except recommend that beta testers REINDEX such indexes
after updating to 9.4next.

Thoughts?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2014-11-05 23:19:58 Re: WIP: dynahash replacement for buffer table
Previous Message Josh Berkus 2014-11-05 23:15:26 Re: Amazon Redshift