Re: Fuzzy substring searching with the pg_trgm extension

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fuzzy substring searching with the pg_trgm extension
Date: 2016-01-11 23:31:30
Message-ID: 20160111233130.GA792371@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I gave a quick look through the patch and noticed a few minor things
while trying to understand it.

I think the test corpus isn't particularly interesting for how big it
is. I'd rather have (a) a small corpus (say 100 words) with which to do
detailed regression testing, and (b) some larger document for more
extensive testing. I'm not sure (b) is actually necessary.

Overall I think the new functions could stand a lot more commentary.

> --- 104,125 ----
> #define GETARR(x) ( (trgm*)( (char*)x+TRGMHDRSIZE ) )
> #define ARRNELEM(x) ( ( VARSIZE(x) - TRGMHDRSIZE )/sizeof(trgm) )
>
> + #ifdef DIVUNION
> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) ((len1) + (len2) - (count)))
> + #else
> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) (((len1) > (len2)) ? (len1) : (len2)))
> + #endif

This macro has a multiple evaluation gotcha. Maybe we don't really care
because of the limited applicability but I'd put a comment on top.


> /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
> ! res = (*(int *) &tmpsml == *(int *) &trgm_limit || tmpsml > trgm_limit) ? true : false;
> }
> else if (ISALLTRUE(key))
> { /* non-leaf contains signature */
> --- 288,304 ----
> switch (strategy)
> {
> case SimilarityStrategyNumber:
> ! case SubstringSimilarityStrategyNumber:
> ! /* Similarity search is exact. Substring similarity search is inexact */
> ! *recheck = (strategy == SubstringSimilarityStrategyNumber) ? true : false;
> ! nlimit = (strategy == SimilarityStrategyNumber) ? trgm_limit : trgm_substring_limit;
>
> if (GIST_LEAF(entry))
> { /* all leafs contains orig trgm */
> ! float4 tmpsml = cnt_sml(qtrg, key, *recheck);
>
> /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
> ! res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit) ? true : false;

What's the compiler bug about? This code and comment were introduced in
cbfa4092bb (May 2004) without any explanation. Do we still need to keep
it, if &nlimit is now a local variable instead of a global? FWIW the
oldest GCC in the buildfarm is 3.4.2/3.4.3 (except for Gaur which uses
GCC 2.95 under HPPA)

BTW you don't need a ternary op just to say "? true : false" -- just
make it
res = foo == bar;
which immediately assigns the correct boolean value. (This pattern
occurs in many places both in the original and in your submitted code.
I see no reason to perpetuate it.)


> + Datum
> + set_substring_limit(PG_FUNCTION_ARGS)
> + {
> + float4 nlimit = PG_GETARG_FLOAT4(0);
> +
> + if (nlimit < 0 || nlimit > 1.0)
> + elog(ERROR, "wrong limit, should be between 0 and 1");

Should be an ereport().

> ! static void
> ! protect_out_of_mem(int slen)
> ! {
> ! /*
> ! * Guard against possible overflow in the palloc requests below. (We
> ! * don't worry about the additive constants, since palloc can detect
> ! * requests that are a little above MaxAllocSize --- we just need to
> ! * prevent integer overflow in the multiplications.)
> ! */
> ! if ((Size) (slen / 2) >= (MaxAllocSize / (sizeof(trgm) * 3)) ||
> ! (Size) slen >= (MaxAllocSize / pg_database_encoding_max_length()))
> ! ereport(ERROR,
> ! (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> ! errmsg("out of memory")));
> ! }

I think the comment should now be above the function, not inside it.

> ***************
> *** 254,259 **** generate_trgm(char *str, int slen)
> --- 292,534 ----
> return trg;
> }
>
> + /* Trigram with position */
> + typedef struct
> + {
> + trgm t;
> + int i;
> + } ptrgm;

This struct definition could stand a better name, and better member
names. Also, our policy (not always enforced) is to have these near the
top of the file rather than in the middle. If the struct is used in
many places then IMO it's better that it go at the top; if it's used
only in one function I'd say it's okay to have it just after that
function.

> + /*
> + * Make array of positional trigrams from two trigram arrays. The first array
> + * is required substing which positions don't matter and replaced with -1.
> + * The second array is haystack where we search and have to store its
> + * positions.
> + */
> + static ptrgm *
> + make_positional_trgm(trgm *trg1, int len1, trgm *trg2, int len2)
> + {
> + ptrgm *result;
> + int i, len = len1 + len2;
> +
> + result = (ptrgm *) palloc(sizeof(ptrgm) * len);
> +
> + for (i = 0; i < len1; i++)
> + {
> + memcpy(&result[i].t, &trg1[i], sizeof(trgm));
> + result[i].i = -1;
> + }
> +
> + for (i = 0; i < len2; i++)
> + {
> + memcpy(&result[i + len1].t, &trg2[i], sizeof(trgm));
> + result[i + len1].i = i;
> + }
> +
> + return result;
> + }

So the output trigram array is the concatenation of both input trigrams?
That's not quite evident from the comment above the function ... I think
some rephrasing there is in order.

> --- 856,869 ----
> }
> }
>
> ! /* if inexact then len2 is equal to count */
> ! if (inexact)
> ! return CALCSML(count, len1, count);
> ! else
> ! return CALCSML(count, len1, len2);

This is where a ternary op would make better code.
CALCSML(count, len1, inexact ? count : len2);
but a better comment would actually explain why we do this, rather than
just stating we do.

> + <row>
> + <entry><function>show_substring_limit()</function><indexterm><primary>show_substring_limit</primary></indexterm></entry>
> + <entry><type>real</type></entry>
> + <entry>
> + Returns the current similarity threshold used by the <literal>&lt;%</>
> + operator. This sets the minimum substring similarity between
> + two phrases.
> + </entry>
> + </row>
> + <row>
> + <entry><function>set_substring_limit(real)</function><indexterm><primary>set_substring_limit</primary></indexterm></entry>
> + <entry><type>real</type></entry>
> + <entry>
> + Sets the current substring similarity threshold that is used by
> + the <literal>&lt;%</> operator. The threshold must be between
> + 0 and 1 (default is 0.6). Returns the same value passed in.
> + </entry>

I don't quite understand why aren't we using a custom GUC variable here.
These already have SHOW and SET support ...

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-01-11 23:33:19 Re: Making plpython 2 and 3 coexist a bit better
Previous Message Stas Kelvich 2016-01-11 23:11:12 Re: Speedup twophase transactions