Re: pg_trgm version 1.2

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_trgm version 1.2
Date: 2015-07-07 13:33:43
Message-ID: CAPpHfdv8ZazkU5sWwaABMA9MZDLK=fYFEuKDbqsx-j-9v5Bn=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 30, 2015 at 11:28 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov <
> a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>
>> On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>>> This patch implements version 1.2 of contrib module pg_trgm.
>>>
>>> This supports the triconsistent function, introduced in version 9.4 of
>>> the server, to make it faster to implement indexed queries where some keys
>>> are common and some are rare.
>>>
>>
>> Thank you for the patch! Lack of pg_trgm triconsistent support was
>> significant miss after "fast scan" implementation.
>>
>>
>>> I've included the paths to both upgrade and downgrade between 1.1 and
>>> 1.2, although after doing so you must close and restart the session before
>>> you can be sure the change has taken effect. There is no change to the
>>> on-disk index structure
>>>
>>
>> pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do
>> you expect them in final commit? As I can see in other contribs we have
>> only last version and upgrade scripts.
>>
>
>
> I had thought that pg_trgm--1.1.sql was needed for pg_upgrade to work, but
> I see that that is not the case.
>
> I did see another downgrade path for different module, but on closer
> inspection it was one that I wrote while trying to analyze that module, and
> then never removed. I have no objection to removing pg_trgm--1.2--1.1.sql
> before the commit, but I don't see what we gain by doing so. If a
> downgrade is feasible and has been tested, why not include it?
>

See Tom Lane's comment about downgrade scripts. I think just remove it is a
right solution.

> You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
>>> core) from 4 to some higher value (which it probably should be anyway, but
>>> there will always be a case where it needs to be higher than you can afford
>>> it to be, so a real solution is needed).
>>>
>>
>> Actually, it depends on how long it takes to calculate consistent
>> function. The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES
>> could be. Since all functions in PostgreSQL may define its cost, GIN could
>> calculate MAX_MAYBE_ENTRIES from the cost of consistent function.
>>
>
> The consistent function gets called on every candidate tuple, so if it is
> expensive then it is also all the more worthwhile to reduce the set of
> candidate tuples. Perhaps MAX_MAYBE_ENTRIES could be calculated from the
> log of the maximum of the predictNumberResult entries? Anyway, a subject
> for a different day....
>

Yes, that also a point. However, we optimize not only costs of consistent
calls but also costs of getting candidate item pointers which are both CPU
and IO.

There may also be some gains in the similarity and regex cases, but I
>>> didn't really analyze those for performance.
>>>
>>
>>
>>> I've thought about how to document this change. Looking to other
>>> example of other contrib modules with multiple versions, I decided that we
>>> don't document them, other than in the release notes.
>>>
>>>
>>> The same patch applies to 9.4 code with a minor conflict in the
>>> Makefile, and gives benefits there as well.
>>>
>>
>> Some other notes about the patch:
>> * You allocate boolcheck array and don't use it.
>>
>
> That was a bug (at least notionally). trigramsMatchGraph was supposed to
> be getting boolcheck, not check, passed in to it.
>
> It may not have been a bug in practise, because GIN_MAYBE and GIN_TRUE
> both test as true when cast to booleans. But it seems wrong to rely on
> that. Or was it intended to work this way?
>
> I'm surprised the compiler suite doesn't emit some kind of warning on that.
>

I'm not sure. It's attractive to get rid of useless memory allocation and
copying.
You can also change trigramsMatchGraph() to take GinTernaryValue *
argument. Probably casting bool * as GinTernaryValue * looks better than
inverse casting.

>
>
>> * Check coding style and formatting, in particular "check[i]==GIN_TRUE"
>> should be "check[i] == GIN_TRUE".
>>
>
> Sorry about that, fixed. I also changed it in other places to check[i] !=
> GIN_FALSE, rather than checking against both GIN_TRUE and GIN_MAYBE. At
> first I was concerned we might decide to add a 4th option to the type which
> would render != GIN_FALSE the wrong way to test for it. But since it is
> called GinTernaryValue, I think we wouldn't add a fourth thing to it. But
> perhaps the more verbose form is clearer to some people.
>

Thanks!

> * I think some comments needed in gin_trgm_triconsistent() about
>> trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
>> that way because trigramsMatchGraph() implements monotonous boolean
>> function.
>>
>
> I have a function-level comment that in all cases, GIN_TRUE is at least as
> favorable to inclusion of a tuple as GIN_MAYBE. Should I reiterate that
> comment before trigramsMatchGraph() as well? Calling it a monotonic
> boolean function is precise and concise, but probably less understandable
> to people reading the code. At least, if I saw that, I would need to go do
> some reading before I knew what it meant.
>

Let's consider '^(?!.*def).*abc' regular expression as an example. It
matches strings which contains 'abc' and don't contains 'def'.

# SELECT 'abc' ~ '^(?!.*def).*abc';
?column?
----------
t
(1 row)

# SELECT 'def abc' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

# SELECT 'abc def' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

Theoretically, our trigram regex processing could support negative matching
of 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false) = true
but trigramsMatchGraph(abc = true, def = true) = false. Actually, it
doesn't because trigramsMatchGraph() implements a monotonic function. I
just think it should be stated explicitly.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2015-07-07 13:36:29 Re: creating extension including dependencies
Previous Message Stephen Frost 2015-07-07 13:31:55 Re: FPW compression leaks information