Re: Permute underscore separated components of columns before fuzzy matching

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mikhail Gribkov <youzhick(at)gmail(dot)com>
Cc: Arne Roland <A(dot)Roland(at)index(dot)de>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Permute underscore separated components of columns before fuzzy matching
Date: 2023-11-17 21:23:04
Message-ID: 799067.1700256184@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Mikhail Gribkov <youzhick(at)gmail(dot)com> writes:
> Honestly I'm not entirely sure fixing only two switched words is worth the
> effort, but the declared goal is clearly achieved.

> I think the patch is good to go, although you need to fix code formatting.

I took a brief look at this. I concur that we shouldn't need to be
hugely concerned about the speed of this code path. However, we *do*
need to be concerned about its maintainability, and I think the patch
falls down badly there: it adds a chunk of very opaque and essentially
undocumented code, that people will need to reverse-engineer anytime
they are studying this function. That could be alleviated perhaps
with more work on comments, but I have to wonder whether it's worth
carrying this logic at all. It's a rather strange behavior to add,
and I wonder if many users will want it.

One thing that struck me is that no care is being taken for adjacent
underscores (that is, "foo__bar" and similar cases). It seems
unlikely that treating the zero-length substring between the
underscores as a word to permute is helpful; moreover, it adds
an edge case that the string-moving logic could easily get wrong.
I wonder if the code should treat any number of consecutive
underscores as a single separator. (Somewhat related: I think it
will behave oddly when the first or last character is '_', since the
outer loop ignores those positions.)

> And it would be much more convenient to work with your patch if every next
> version file will have a unique name (maybe something like "_v2", "_v3"
> etc. suffixes)

Please. It's very confusing when there are multiple identically-named
patches in a thread.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2023-11-17 21:36:44 Re: Lifetime of commit timestamps
Previous Message Gurjeet Singh 2023-11-17 21:22:57 Re: Change GUC hashtable to use simplehash?