| From: | Richard Guo <guofenglinux(at)gmail(dot)com> |
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Wrong results from inner-unique joins caused by collation mismatch |
| Date: | 2026-04-24 15:44:32 |
| Message-ID: | CAMbWs4_pqvDepQapm6+vF=cPAQkKgerKEbO60dbfULx6+heitQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Richard Guo <guofenglinux(at)gmail(dot)com> writes:
> > My first thought was to fix this by:
>
> > + if (!IndexCollMatchesExprColl(ind->indexcollations[c],
> > + exprInputCollation((Node *) rinfo->clause)))
> > + continue;
>
> > However, this caused an unexpected plan diff in join.out where a
> > left-join removal over (name, text) stopped working, because name and
> > text use different collations. So this check is too strict: a
> > mismatch between two deterministic collations should be OK for
> > uniqueness proof, as a deterministic collation treats two strings as
> > equal iff they are byte-wise equal (see CREATE COLLATION).
> Yes, we'd be taking a serious performance hit if we insisted on
> exact collation matches for this purpose. I agree that disallowing
> non-matching non-deterministic collations is the right fix.
Thanks for taking a look!
> > Hence, I got attached patch. Thoughts?
> I don't love doing it like this, for two reasons:
>
> 1. I think there are other places in the planner that will need
> substantially this same logic. I recommend breaking out a
> subroutine defined more or less as "do these collations have
> equivalent notions of equality".
Right. I just found several other places that need this same logic,
which are in query_is_distinct_for(). Without it, we produce wrong
results:
select * from t t1 join
(select distinct a from t) t2 on t1.a = t2.a COLLATE "ci";
a | a
---+---
A | a
a | a
(2 rows)
select * from t t1 join
(select a from t group by a) t2 on t1.a = t2.a COLLATE "ci";
a | a
---+---
A | a
a | a
(2 rows)
> 2. I find the test next to unreadable as written --- for example,
> it's more difficult than it should be to figure out what happens
> if one collation is deterministic and the other not. Using a
> subroutine would help here by letting you break down the test
> into multiple steps.
Agreed. Will wrap the logic in a subroutine.
- Richard
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Alexander Lakhin | 2026-04-24 17:00:00 | Re: meson: Make test output much more useful on failure (both in CI and locally) |
| Previous Message | Tom Lane | 2026-04-24 14:53:17 | Re: Wrong results from inner-unique joins caused by collation mismatch |