Re: Doing better at HINTing an appropriate column within errorMissingColumn()

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Ian Barwick <ian(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Jim Nasby <jim(at)nasby(dot)net>, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Subject: Re: Doing better at HINTing an appropriate column within errorMissingColumn()
Date: 2014-11-17 20:04:33
Message-ID: CAM3SWZRx=rc+53vYsgGnAShU69unjLmqTQtPqe+4jKjkRCeeNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 17, 2014 at 10:15 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I'm grumpy about the distanceName() function. That seems too generic.
> If we're going to keep this as it is, I suggest something like
> computeRTEColumnDistance(). But see below.

Fair point.

> On a related note, I'm also grumpy about this comment:
>
> + /* Charge half as much per deletion as per insertion or per substitution */
> + return varstr_levenshtein_less_equal(actual, len, match, match_len,
> + 2, 1, 2, max);
>
> The purpose of a code comment is to articulate WHY we did something,
> rather than simply to restate what the code quite obviously does. I
> haven't heard a compelling argument for why this should be 2, 1, 2
> rather than the default 1, 1, 1; and I'm inclined to do the latter
> unless you can make some very good argument for this combination of
> weights. And if you can make such an argument, then there should be
> comments so that the next person to come along and look at this code
> doesn't go, huh, that's whacky, and change it.

Okay. I agree that that deserves a comment. The actual argument for
this formulation is that it just seems to work better that way. For
example:

"""
postgres=# \d orderlines
Table "public.orderlines"
Column | Type | Modifiers
-------------+----------+-----------
orderlineid | integer | not null
orderid | integer | not null
prod_id | integer | not null
quantity | smallint | not null
orderdate | date | not null

postgres=# select qty from orderlines ;
ERROR: 42703: column "qty" does not exist
LINE 1: select qty from orderlines ;
^
HINT: Perhaps you meant to reference the column "orderlines"."quantity".
"""

The point is that the fact that the user supplied "qty" string has so
many fewer characters than what was obviously intended - "quantity" -
deserves to be weighed less. If you change the costing to weigh
character deletion as being equal to substitution/addition, this
example breaks. I also think it's pretty common to have noise words in
every attribute (e.g. every column in the "orderlines" table matches
"orderlines_*"), which might otherwise mess things up by overcharging
for deletion. Having extra characters in the correctly spelled column
name seems legitimately less significant to me.

Or, in other words: having actual characters from the misspelling
match the correct spelling (and having actual characters given not
fail to match) is most important. What was given by the user is more
important than what was not given but should have been, which is not
generally true for uses of Levenshtein distance. I reached this
conclusion through trying out the patch with a couple of real schemas,
and seeing what works best.

It's hard to express that idea tersely, in a comment, but I guess I'll try.

> + int location, FuzzyAttrMatchState *rtestate)
>
> I suggest calling this "fuzzystate" rather than "rtestate"; it's not
> the state of the RTE, but the state of the fuzzy matching.

The idea here was to differentiate this state from the overall range
table state (in general, FuzzyAttrMatchState may be one or the other).
But okay.

> Within the scanRTEForColumn block, we have a rather large chunk of
> code protected by if (rtestate), which contains the only call to
> distanceName(). I suggest that we move all of this logic to a
> separate, static function, and merge distanceName into it. I also
> suggest testing against NULL explicitly instead of implicitly. So
> this block of code would end up as something like:
>
> if (fuzzystate != NULL)
> updateFuzzyAttrMatchState(rte, attcolname, colname, &fuzzystate);

Okay.

> In searchRangeTableForCol, I'm fairly certain that you've changed the
> behavior by adding a check for if (rte->rtekind == RTE_JOIN) before
> the call to scanRTEForColumn(). Why not instead put this check into
> updateFuzzyAttrMatchState? Then you can be sure you're not changing
> the behavior in any other case.

I thought that I had avoided changing things (beyond what was
advertised as changed in relation to this most recent revision)
because I also changed things WRT multiple matches per RTE. It's
fuzzy. Anyway, yeah, I could do it there instead.

> On a similar note, I think the dropped-column test should happen early
> as well, probably again in updateFuzzyAttrMatchState(). There's
> little point in adding a suggestion only to throw it away again.

Agreed.

> + /*
> + * Charge extra (for inexact matches only) when an alias was
> + * specified that differs from what might have been used to
> + * correctly qualify this RTE's closest column
> + */
> + if (wrongalias)
> + rtestate.distance += 3;
>
> I don't understand what situation this is catering to. Can you
> explain? It seems to account for a good deal of complexity.

Two cases:

1. Distinguishing between the case where there was an exact match to a
column that isn't visible (i.e. the existing reason for
errorMissingColumn() to call here), and the case where there is a
visible column, but our alias was the wrong one. I guess that could
live in errorMissingColumn(), but overall it's more convenient to do
it here, so that errorMissingColumn() handles things almost uniformly
and doesn't really have to care.

2. For non-exact (fuzzy) matches, it seems more useful to give one
match rather than two when the user gave an alias that matches one
particular RTE. Consider this:

"""
postgres=# select ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR: 42703: column "ordersid" does not exist
LINE 1: select ordersid from orders o join orderlines ol on o.orderi...
^
HINT: Perhaps you meant to reference the column "o"."orderid" or the
column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:3166

postgres=# select ol.ordersid from orders o join orderlines ol on
o.orderid = ol.orderid;
ERROR: 42703: column ol.ordersid does not exist
LINE 1: select ol.ordersid from orders o join orderlines ol on o.ord...
^
HINT: Perhaps you meant to reference the column "ol"."orderid".
LOCATION: errorMissingColumn, parse_relation.c:3147
"""

One suggestion is better than two if it's evident that that single
suggestion is a better fit. And, more broadly, the fact that an alias
was given and matches ought to be weighed.

> ERROR: column "oid" does not exist
> LINE 1: select oid > 0, * from altwithoid;
> ^
> +HINT: Perhaps you meant to reference the column "altwithoid"."col".
>
> That seems like a stretch. I think I suggested before using a
> distance threshold of at most 3 or half the word length, whichever is
> less. For a three-letter column name that means not suggesting
> anything if more than one character is different. What you
> implemented here is close to that, yet somehow we've got a suggestion
> slipping through that has 2 out of 3 characters different. I'm not
> quite sure I see how that's getting through, but I think it shouldn't.

It's because I don't apply the test on smaller strings. I felt that it
was riskier to apply an absolute quality test when the user-supplied
column reference does not exceed 6 characters. Consider my "qty" vs
"quantity" example. If I make this change:

--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -929,7 +929,7 @@ searchRangeTableForCol(ParseState *pstate, const
char *alias, char *colname,
* seen when 6 deletions are required against actual attribute
name, or 3
* insertions/substitutions.
*/
- if (state->distance > 6 && state->distance > strlen(colname) / 2)
+ if (state->distance > strlen(colname) / 2)
{
state->rsecond = state->rfirst = NULL;
state->second = state->first = InvalidAttrNumber;

Then a lot of the examples you complain about are fixed. But the "qty"
example is broken. Plus, this happens when the regression tests are
run:

*** /home/pg/postgresql/src/test/regress/expected/alter_table.out
2014-11-17 11:50:16.476426191 -0800
--- /home/pg/postgresql/src/test/regress/results/alter_table.out
2014-11-17 11:57:40.776410110 -0800
***************
*** 1343,1349 ****
ERROR: column "f1" does not exist
LINE 1: select f1 from c1;
^
- HINT: Perhaps you meant to reference the column "c1"."f2".
drop table p1 cascade;
NOTICE: drop cascades to table c1
create table p1 (f1 int, f2 int);

And:

*** /home/pg/postgresql/src/test/regress/expected/join.out 2014-11-17
11:50:16.480426191 -0800
--- /home/pg/postgresql/src/test/regress/results/join.out 2014-11-17
11:57:08.916411263 -0800
***************
*** 3452,3458 ****
ERROR: column atts.relid does not exist
LINE 1: select atts.relid::regclass, s.* from pg_stats s join
^
- HINT: Perhaps you meant to reference the column "atts"."indexrelid".
--
-- Test LATERAL
--

(So no hint given in either case)

> ERROR: column fullname.text does not exist
> LINE 1: select fullname.text from fullname;
> ^
> +HINT: Perhaps you meant to reference the column "fullname"."last".

> Basically the same problem again. I think the distance threshold in
> this case should be half the shorter column name, i.e. 0.

Well, there is always going to be the most marginal possible case that
still gets to see a suggestion. These are non-organic examples from
the regression tests. I'm more worried about having the suggestions
work well for organic/representative cases than I am about suppressing
non-useful suggestions in non-organic/non-representative cases. As I
mentioned, the costing is more or less derived by what I found to work
well in what I thought to be representative cases.

> Your new test cases include no negative test cases; that is, cases
> where the machinery declines to suggest a hint because of, say, 3
> equally good possibilities. They probably should have something like
> that.

I'll think about that.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2014-11-17 20:47:37 Re: [TODO] Track number of files ready to be archived in pg_stat_archiver
Previous Message Tomas Vondra 2014-11-17 20:03:07 Re: 9.5: Better memory accounting, towards memory-bounded HashAgg