Re: Fuzzy substring searching with the pg_trgm extension

From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fuzzy substring searching with the pg_trgm extension
Date: 2016-01-18 09:46:23
Message-ID: 569CB46F.3060505@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

I have did some changes to the patch:
- changed version from 1.3 to 1.2
- added check_only parameter to the function
calc_substring_similarity(). If check_only == true then a
calc_substring_similarity() calculates similarity which greater than
pg_trgm.substring_limit variable. This behavior is used in the
substring_similarity_op() and it should increase performance of the
operator. In the substring_similarity() function check_only == false.

On 12.01.2016 02:31, Alvaro Herrera wrote:
> I gave a quick look through the patch and noticed a few minor things
> while trying to understand it.
>
> I think the test corpus isn't particularly interesting for how big it
> is. I'd rather have (a) a small corpus (say 100 words) with which to do
> detailed regression testing, and (b) some larger document for more
> extensive testing. I'm not sure (b) is actually necessary.
>
> Overall I think the new functions could stand a lot more commentary.

I added some comments to new functions iterate_substring_similarity()
and calc_substring_similarity().

>>
>> + #ifdef DIVUNION
>> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) ((len1) + (len2) - (count)))
>> + #else
>> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) (((len1) > (len2)) ? (len1) : (len2)))
>> + #endif
>
> This macro has a multiple evaluation gotcha. Maybe we don't really care
> because of the limited applicability but I'd put a comment on top.

I added a comment to explain various similarity formula.

>
> BTW you don't need a ternary op just to say "? true : false" -- just
> make it
> res = foo == bar;
> which immediately assigns the correct boolean value. (This pattern
> occurs in many places both in the original and in your submitted code.
> I see no reason to perpetuate it.)
>

Fixed all ternary operators.

>
>> + Datum
>> + set_substring_limit(PG_FUNCTION_ARGS)
>> + {
>> + float4 nlimit = PG_GETARG_FLOAT4(0);
>> +
>> + if (nlimit < 0 || nlimit > 1.0)
>> + elog(ERROR, "wrong limit, should be between 0 and 1");
>
> Should be an ereport().

Fixed. But an elog() function occurs in trgm_gin.c and trgm_gist.c also.
I'm not sure that it should be replaced with an ereport().

>
>
>> ! static void
>> ! protect_out_of_mem(int slen)
>> ! {
>> ! /*
>> ! * Guard against possible overflow in the palloc requests below. (We
>> ! * don't worry about the additive constants, since palloc can detect
>> ! * requests that are a little above MaxAllocSize --- we just need to
>> ! * prevent integer overflow in the multiplications.)
>> ! */
>> ! if ((Size) (slen / 2) >= (MaxAllocSize / (sizeof(trgm) * 3)) ||
>> ! (Size) slen >= (MaxAllocSize / pg_database_encoding_max_length()))
>> ! ereport(ERROR,
>> ! (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>> ! errmsg("out of memory")));
>> ! }
>
> I think the comment should now be above the function, not inside it.

Fixed.

>>
>> + /* Trigram with position */
>> + typedef struct
>> + {
>> + trgm t;
>> + int i;
>> + } ptrgm;
>
> This struct definition could stand a better name, and better member
> names. Also, our policy (not always enforced) is to have these near the
> top of the file rather than in the middle. If the struct is used in
> many places then IMO it's better that it go at the top; if it's used
> only in one function I'd say it's okay to have it just after that
> function.

Renamed this structure to pos_trgm and move it near the top since this
structure is used in a couple of functions. I renamed member names to
trg and index accordingly.

>
>> + /*
>> + * Make array of positional trigrams from two trigram arrays. The first array
>> + * is required substing which positions don't matter and replaced with -1.
>> + * The second array is haystack where we search and have to store its
>> + * positions.
>> + */
>> + static ptrgm *
>> + make_positional_trgm(trgm *trg1, int len1, trgm *trg2, int len2)
>> + {
>> + ptrgm *result;
>> + int i, len = len1 + len2;
>> +
>> + result = (ptrgm *) palloc(sizeof(ptrgm) * len);
>> +
>> + for (i = 0; i < len1; i++)
>> + {
>> + memcpy(&result[i].t, &trg1[i], sizeof(trgm));
>> + result[i].i = -1;
>> + }
>> +
>> + for (i = 0; i < len2; i++)
>> + {
>> + memcpy(&result[i + len1].t, &trg2[i], sizeof(trgm));
>> + result[i + len1].i = i;
>> + }
>> +
>> + return result;
>> + }
>
> So the output trigram array is the concatenation of both input trigrams?
> That's not quite evident from the comment above the function ... I think
> some rephrasing there is in order.

Rephrased the comment.

>>
>> ! /* if inexact then len2 is equal to count */
>> ! if (inexact)
>> ! return CALCSML(count, len1, count);
>> ! else
>> ! return CALCSML(count, len1, len2);
>
> This is where a ternary op would make better code.
> CALCSML(count, len1, inexact ? count : len2);
> but a better comment would actually explain why we do this, rather than
> just stating we do.

Fixed a code and a comment.

>
>
>> + <row>
>> + <entry><function>show_substring_limit()</function><indexterm><primary>show_substring_limit</primary></indexterm></entry>
>> + <entry><type>real</type></entry>
>> + <entry>
>> + Returns the current similarity threshold used by the <literal>&lt;%</>
>> + operator. This sets the minimum substring similarity between
>> + two phrases.
>> + </entry>
>> + </row>
>> + <row>
>> + <entry><function>set_substring_limit(real)</function><indexterm><primary>set_substring_limit</primary></indexterm></entry>
>> + <entry><type>real</type></entry>
>> + <entry>
>> + Sets the current substring similarity threshold that is used by
>> + the <literal>&lt;%</> operator. The threshold must be between
>> + 0 and 1 (default is 0.6). Returns the same value passed in.
>> + </entry>
>
> I don't quite understand why aren't we using a custom GUC variable here.
> These already have SHOW and SET support ...
>

Added GUC variables:
- pg_trgm.limit
- pg_trgm.substring_limit
I added this variables to the documentation.
show_limit() and set_limit() functions work correctly and they are
marked as deprecated.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
substring_similarity.patch text/x-patch 185.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2016-01-18 10:46:39 Re: postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)
Previous Message Shulgin, Oleksandr 2016-01-18 09:46:20 Re: More stable query plans via more predictable column statistics