Re: GIN improvements part2: fast scan

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: GIN improvements part2: fast scan
Date: 2014-02-05 10:42:33
Message-ID: CAPpHfdtSuSoZS4SwYQu4Rc8NOwDHoX3DsLGaxJF9HZ6gW0TdVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 5, 2014 at 1:23 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Mon, Feb 3, 2014 at 6:31 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> On Mon, Jan 27, 2014 at 7:30 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com
>> > wrote:
>>
>>> On Mon, Jan 27, 2014 at 2:32 PM, Alexander Korotkov <
>>> aekorotkov(at)gmail(dot)com> wrote:
>>>
>>>> Every single change you did in fast scan seems to be reasonable, but
>>>> testing shows that something went wrong. Simple test with 3 words of
>>>> different selectivities.
>>>>
>>>> After applying your patches:
>>>>
>>>> # select count(*) from fts_test where fti @@ plainto_tsquery('english',
>>>> 'gin index select');
>>>> count
>>>> ───────
>>>> 627
>>>> (1 row)
>>>>
>>>> Time: 21,252 ms
>>>>
>>>> In original fast-scan:
>>>>
>>>> # select count(*) from fts_test where fti @@ plainto_tsquery('english',
>>>> 'gin index select');
>>>> count
>>>> ───────
>>>> 627
>>>> (1 row)
>>>>
>>>> Time: 3,382 ms
>>>>
>>>> I'm trying to get deeper into it.
>>>>
>>>
>>> I had two guesses about why it's become so slower than in my original
>>> fast-scan:
>>> 1) Not using native consistent function
>>> 2) Not sorting entries
>>>
>>> I attach two patches which rollback these two features (sorry for awful
>>> quality of second). Native consistent function accelerates thing
>>> significantly, as expected. Tt seems that sorting entries have almost no
>>> effect. However it's still not as fast as initial fast-scan:
>>>
>>> # select count(*) from fts_test where fti @@ plainto_tsquery('english',
>>> 'gin index select');
>>> count
>>> ───────
>>> 627
>>> (1 row)
>>>
>>> Time: 5,381 ms
>>>
>>> Tomas, could you rerun your tests with first and both these patches
>>> applied against patches by Heikki?
>>>
>>
>> I found my patch "0005-Ternary-consistent-implementation.patch" to be
>> completely wrong. It introduces ternary consistent function to opclass, but
>> don't uses it, because I forgot to include ginlogic.c change into patch.
>> So, it shouldn't make any impact on performance. However, testing results
>> with that patch significantly differs. That makes me very uneasy. Can we
>> now reproduce exact same?
>>
>> Right version of these two patches in one against current head is
>> attached. I've rerun tests with it, results are
>> /mnt/sas-raid10/gin-testing/queries/9.4-fast-scan-10. Could you rerun
>> postprocessing including graph drawing?
>>
>> Sometimes test cases are not what we expect. For example:
>>
>> =# explain SELECT id FROM messages WHERE body_tsvector @@
>> to_tsquery('english','(5alpha1-initdb''d)');
>> QUERY PLAN
>>
>>
>> ─────────────────────────────────────────────────────────────────────────────────────────────────────────
>> Bitmap Heap Scan on messages (cost=84.00..88.01 rows=1 width=4)
>> Recheck Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1'' &
>> ''initdb'' & ''d'''::tsquery)
>> -> Bitmap Index Scan on messages_body_tsvector_idx (cost=0.00..84.00
>> rows=1 width=0)
>> Index Cond: (body_tsvector @@ '''5alpha1-initdb'' & ''5alpha1''
>> & ''initdb'' & ''d'''::tsquery)
>> Planning time: 0.257 ms
>> (5 rows)
>>
>> 5alpha1-initdb'd is 3 gin entries with different frequencies.
>>
>> Also, these patches are not intended to change relevance ordering speed.
>> When number of results are high, most of time is relevance calculating and
>> sorting. I propose to remove ORDER BY clause from test cases to see scan
>> speed more clear.
>>
>> I've dump of postgresql.org search queries from Magnus. We can add them
>> to our test case.
>>
>
> It turns out that version 10 contained bug in ternary consistent function
> implementation for tsvector. Fixed in attached version.
>

Attached patch is "light" version of fast scan. It does extra consistent
function calls only on startScanKey, no extra calls during scan of the
index.
It finds subset of rarest entries absence of which guarantee false
consistent function result.

I've run real-life tests based of postgresql.org logs (33318 queries). Here
is a table with summary time of running whole test case.

=# select method, sum(time) from test_result group by method order by
method;
method | sum
---------------------+------------------
fast-scan-11 | 126916.111999999
fast-scan-light | 137321.211
heikki | 466751.693
heikki-true-ternary | 454113.623999997
master | 446976.288
(6 rows)

where 'heikki' is gin-ternary-logic binary-heap
preconsistent-only-on-new-page.patch and 'heikki-true-ternary' is version
with my catalog changes promoting ternary consistent function to opclass.

We can see fast-scan-light gives almost same performance benefit on "&"
queries. However, I test only CPU effect, not IO effect.
Any thoughts?

------
With best regards,
Alexander Korotkov.
>
>

Attachment Content-Type Size
gin-fast-scan-light.patch.gz application/x-gzip 11.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2014-02-05 10:48:25 Re: Fix picksplit with nan values
Previous Message Heikki Linnakangas 2014-02-05 08:50:03 Re: Performance Improvement by reducing WAL for Update Operation