Re: spgist text_ops and LIKE

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: spgist text_ops and LIKE
Date: 2012-02-02 13:45:44
Message-ID: CA+Tgmoaww0Dgb=v5T_b1zviy4AxZ+BoU5vMZDVBGGWMA8OkLcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 1, 2012 at 10:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Is spgist intended to support prefix searches with LIKE?
>
> Too lazy to look at the code right now, but I think indxpath.c contains
> hardwired assumptions that LIKE prefix optimizations are only possible
> with btree indexes.  Possibly it would be worth relaxing that.  (The
> whole "special index operator" mechanism is undesirably special-purpose,
> but I currently have no ideas about how to make it more flexible.)

Is it as simple as the attached?

Given the set of operators that are available for spgist, one would
assume that it's intended to handle both pattern-matching and regular
comparison operators. On the flip side, spg_text_inner_consistent()
seems to fall back to a full index-scan for the regular comparison
operators, which makes me wonder why anyone would bother having the
operator at all. Indeed, in my testing, it's slightly slower than a
sequential scan:

select sum(1) from person where last_name > 'WAR';
- btree: 113-114ms
- btree with text_pattern_ops: can't do it
- spgist: 920-935ms
- sequential scan: 895-910ms

With the attached patch, it seems to work as well as text_pattern_ops
for pattern matching:

select sum(1) from person where last_name LIKE 'WAR%';
- btree: can't do it
- btree with text_pattern_ops: 6-7ms
- spgist: 7-9ms
- sequential scan: 170-180ms

It's fine for equality, too:

select sum(1) from person where last_name = 'WARNER';
- btree: 1.7-1.9ms
- btree with text_pattern_ops: 1.7-1.9ms
- spgist: 1.7-1.9ms
- sequential scan: 160-165ms

It seems that there's at least the potential for spgist indexes to be
extremely efficient, given adequate knowledge about the collation
under which comparisons are being done. If for example we're trying
to find strings where last_name > 'WAR', we certainly need to chase
down the subtrees where the first character is 'W', 'X', 'Y', or 'Z'.
If the collation is case-insensitive, we might also need to chase 'w',
'x', 'y', and 'z'. And there might be other characters that need to
be chased as well, depending on the collation: some collations might
sort numbers before letters, while others might do the reverse. But
it's likely that there are few if any collations where we need to
chase down the subtree for 'Q' on this query, because anything
beginning with 'Q' is likely to sort before anything beginning with
'WAR' pretty much anywhere. I'm not sure there's a good way to get
that information without a lot of grotty hard-coding, but it seems
like the potential is there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
spgist-pattern-match.patch application/octet-stream 1.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-02-02 13:54:32 Re: JSON for PG 9.2
Previous Message Pavel Stehule 2012-02-02 11:44:01 Re: Patch pg_is_in_backup()