Re: Problem search on text arrays, using the overlaps (&&) operator

From: nha <lyondif02(at)free(dot)fr>
To: John Cheng <jlcheng(at)ymail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Problem search on text arrays, using the overlaps (&&) operator
Date: 2009-07-06 16:12:22
Message-ID: 4A522266.9070602@free.fr
Lists: pgsql-general

Hello,

On 2/07/09 2:07, John Cheng wrote:
> We use text[] on one of our tables. This text[] column allows us to
> search for records that matches a keyword in a set of keywords. For
> example, if we want to find records that has a keyword of "foo" or
> "bar", we can use the condition:
>
> keywords && '{foo, bar}'::text[]
>
> Another way is to do this:
>
> (keywords && '{foo}'::text[] OR keywords && '{bar}'::text[])
>
> I am noticing a big difference between the two ways. I'm trying to
> find out if we need to re-write our queries to speed them up, or
> perhaps I am just missing something about how to use text[].
>
> [...]
> For some reason, I am seeing a big difference in our real database. I
> don't want to just rewrite all of our queries yet. I'm guessing the
> data makes a big difference. What would be a good way to examine the
> data to figure out what's the best way to write our queries? Is there
> any features in PostgreSQL that can help me improve the performance?
>
> Any advice would be greatly appreciated!
>

With your exhaustive example statements based on table foo and cars, I
performed some measures on my side (PostgreSQL 8.3.1 server). Here are
some statistical results:

seq    rtm   delta   ratio  deco
----  ----  ------   -----  --------
s1cc   873  -1.74%    2/9   1+1
s1or   889  +1.71%    2/9   1+1
s2cc  1322  +8.53%    3/9   1+2
s2or  1209  -9.32%    3/9   1+2
s3cc   892  -2.61%    2/9   1+(.5*2)
s3or   915  +2.54%    2/9   1+(.5*2)
s4cc   511  -3.00%    1/9   (.5*2)
s4or   526  +2.91%    1/9   (.5*2)
s5cc  1635  +2.13%    4/9   1+1+2
s5or  1600  -2.17%    4/9   1+1+2
----  ----  ------   -----  --------

seq where clauses
--- -------------
s1cc keywords && '{ford, toyota}'::text[]
s1or keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]
s2cc keywords && '{ford, honda}'::text[]
s2or keywords && '{ford}'::text[] OR keywords && '{honda}'::text[]
s3cc keywords && '{honda, ferrari}'::text[]
s3or keywords && '{honda}'::text[] OR keywords && '{ferrari}'::text[]
s4cc keywords && '{ferrari, hummer}'::text[]
s4or keywords && '{ferrari}'::text[] OR keywords && '{hummer}'::text[]
s5cc keywords && '{ford, toyota, porsche}'::text[]
s5or keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[] OR
keywords && '{porsche}'::text[]

legend
------
seq    measurement sequence (10 consecutive EXPLAIN ANALYZE runs each)
rtm    runtime mean of the 10 measures (in milliseconds)
delta  percentage difference between the cc and or variants
cc     single WHERE clause with a multi-element array (e.g. {foo,bar})
or     multiple WHERE clauses, each with a one-element array (e.g. {foo})
ratio  fraction of table rows returned in the result
deco   breakdown of result rows among the constant array values in the
       WHERE clause
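For reference, the measures were taken with statements of the following
shape -- a sketch only, assuming a sample table cars with a keywords
text[] column as in your example (table and column names are
illustrative):

```sql
-- Sample table, as assumed from the example discussed in this thread.
CREATE TABLE cars (
    id       serial PRIMARY KEY,
    keywords text[]
);

-- One EXPLAIN ANALYZE per condition form; each rtm value above is the
-- mean over 10 consecutive runs of the same statement.
EXPLAIN ANALYZE
SELECT * FROM cars
WHERE keywords && '{ford, toyota}'::text[];           -- cc form

EXPLAIN ANALYZE
SELECT * FROM cars
WHERE keywords && '{ford}'::text[]
   OR keywords && '{toyota}'::text[];                 -- or form
```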

In the following, I refer to your condition forms as:
- arr&&{f,b}
- arr&&{f} or arr&&{b}

I noticed first that, contrary to your observation, for the "ford or
toyota" case (sequence s1, developed into the two subcases s1cc and s1or
for the two condition forms), the mean runtime is shorter for s1cc
(arr&&{f,t}) than for s1or (arr&&{f} or arr&&{t}). The difference,
however, is only about 1.7%, i.e. not really significant.

This empirical advantage of the form arr&&{f,t} over the form (arr&&{f}
or arr&&{t}) is also observed in 2 cases (s3 and s4) out of 4 (s2 up to
s5), where the difference looks more meaningful (about 3%). Cases s3 and
s4 differ from the others (s1, s2, and s5) in that the sets of rows
matched by the compared array values intersect: every row matched by
ferrari is also matched by honda (strict inclusion, not equality), and
every row matched by ferrari is also matched by hummer and conversely
(double inclusion, i.e. equality). In the other 3 cases, each compared
array value matches a set of rows that does not intersect the match sets
of the other value(s). We may therefore assume that the form arr&&{f,t}
fits better when many rows are matched by several terms -- although this
cannot be generalized at this stage.
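One way to check how much the match sets of two array values intersect
is a count per value and a count of the intersection -- a sketch, again
assuming the sample cars table; the CASE form works on 8.3-era servers:

```sql
-- How many rows match honda, ferrari, and both at once; if common_rows
-- equals ferrari_rows, the ferrari match set is included in honda's.
SELECT sum(CASE WHEN keywords && '{honda}'::text[]   THEN 1 ELSE 0 END) AS honda_rows,
       sum(CASE WHEN keywords && '{ferrari}'::text[] THEN 1 ELSE 0 END) AS ferrari_rows,
       sum(CASE WHEN keywords && '{honda}'::text[]
                 AND keywords && '{ferrari}'::text[] THEN 1 ELSE 0 END) AS common_rows
FROM cars;
```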

The reported data also suggest a roughly linear relationship between
runtime and the number of result rows. This relationship seems to hold
equally for both forms, arr&&{f,t} and (arr&&{f} or arr&&{t}).
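That rough linearity can be eyeballed by normalizing each mean runtime
by its result-row fraction; the values below are copied from the table
above (a back-of-the-envelope check of my own figures, not part of the
original measures):

```python
# rtm (ms) and result-row fraction for the cc sequences, from the table above.
measures = {
    "s1cc": (873, 2 / 9),
    "s2cc": (1322, 3 / 9),
    "s3cc": (892, 2 / 9),
    "s4cc": (511, 1 / 9),
    "s5cc": (1635, 4 / 9),
}

# If runtime grows linearly with the number of result rows, rtm/ratio
# should be roughly constant across sequences.
normalized = {seq: rtm / ratio for seq, (rtm, ratio) in measures.items()}
for seq, value in sorted(normalized.items()):
    print(f"{seq}: {value:7.0f} ms per full-table-equivalent of result rows")
```

The normalized values stay within about 25% of one another, consistent
with the approximate linear trend, s4 being the farthest outlier.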

Beyond these measures, I notice that, in the query plans of your queries
over real data, the difference between the actual runtimes reported for
the two condition forms is not that significant relative to the overall
runtime of the queries. At the bitmap heap scan on the table over which
the conditions are evaluated, when the last row is retrieved, the actual
runtime is respectively:
- for arr&&{f,b}: 1276.990 ms;
- for (arr&&{f} or arr&&{b}): 1211.535 ms.
While these are quite close (a difference of about 5%), the gap is not
really harmful relative to the overall runtimes (13197 ms and 7768 ms
respectively), i.e. in terms of share of the overall runtime:
(1276/13197=) 10% versus (1211/7768=) 16%.

Even if the overlap-operator runtimes were swapped between the two
cases, the shares would be (1211/13197=) 9% and (1276/7768=) 16%
respectively, i.e. not very different from the current plans.

On the other hand, the hash join accounts for a very large part of the
runtime:
- for the arr&&{f,b} case: (13170-1924=) 11246 ms;
- for the (arr&&{f} or arr&&{b}) case: (7743-1850=) 5893 ms.
These values represent (11246/13197=) 85% and (5893/7768=) 76% of the
overall runtimes respectively, i.e. they clearly dominate the query
plans.

In my opinion, analysis and optimization could be pushed further on the
table indexes used for join planning. As your reported query plans show,
the WHERE clauses are evaluated independently of the table ml_lead,
because all the attributes in those clauses belong to the table
lead_reporting_data. Time may therefore be gained on the join itself.
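As a sketch of the kind of step I mean -- the join column name (lead_id
here) is purely hypothetical, since your excerpt does not show the join
condition; adjust to the actual key:

```sql
-- Hypothetical: index the join key on the larger side of the hash join,
-- refresh the statistics, then re-check the plan.
CREATE INDEX lead_reporting_data_lead_id_idx
    ON lead_reporting_data (lead_id);

ANALYZE lead_reporting_data;
```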

I hope these observations contribute a little to your analysis.

For what it is worth, I attached to this email a document detailing the
measures I took with the overlap operator (OpenDocument Spreadsheet v2
file, 24 KiB). The third sheet, "various", presents the detailed
measures behind the data reported in this email.

Regards.

--
nha / Lyon / France.

Attachment Content-Type Size
overlap-opr-measures.ods application/octet-stream 23.7 KB
