Tsearch2 - bad performance with concatenated ts-vectors

From: Jan Wielgus <jan_w(at)tlen(dot)pl>
To: pgsql-performance(at)postgresql(dot)org
Subject: Tsearch2 - bad performance with concatenated ts-vectors
Date: 2011-08-02 06:22:36
Message-ID: 3479994c.402ea88e.4e3797ac.972cd@tlen.pl
Lists: pgsql-performance

Hello everyone,

This is my first post on this list. I tried to look for possible solutions in the archive as well as on Google, but I could not find an explanation for this specific situation.

I am facing a performance problem connected with Postgres Tsearch2 FTS mechanism.

Here is my query:

select participant.participant_id from participant participant
join person person on person.person_participant_id = participant.participant_id
left join registration registration on registration.registration_registered_participant_id = participant.participant_id
left join enrollment enrollment on registration.registration_enrollment_id = enrollment.enrollment_id
join registration_configuration registration_configuration on enrollment.enrollment_configuration_id = registration_configuration.configuration_id
left join event_context context on context.context_id = registration_configuration.configuration_context_id
where participant.participant_type = 'PERSON'
and participant_status = 'ACTIVE'
and context.context_code in ('GB2TST2010A')
and registration_configuration.configuration_type in ('VISITOR')
and registration_configuration.configuration_id is not null
and participant.participant_tsv || person.person_tsv @@ to_tsquery('simple',to_tsquerystring('Abigail'))
limit 100

As you can see, I concatenate two tsvectors and match the result against a tsquery.

Both vectors are indexed with GIN and updated with respective triggers in the following way:

ALTER TABLE person ALTER COLUMN person_tsv SET STORAGE EXTENDED;
CREATE INDEX person_ft_index ON person USING gin(person_tsv);
CREATE OR REPLACE FUNCTION update_person_tsv() RETURNS trigger AS $$
BEGIN
    NEW.person_tsv := to_tsvector('simple', create_tsv(
        ARRAY[NEW.person_first_name, NEW.person_last_name, NEW.person_middle_name]));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER person_tsv_update BEFORE INSERT or UPDATE ON person FOR EACH ROW EXECUTE PROCEDURE update_person_tsv();

ALTER TABLE participant ALTER COLUMN participant_tsv SET STORAGE EXTENDED;
CREATE INDEX participant_ft_index ON participant USING gin(participant_tsv);
CREATE OR REPLACE FUNCTION update_participant_tsv() RETURNS trigger AS $$
BEGIN
    NEW.participant_tsv := to_tsvector('simple', create_tsv(
        ARRAY[NEW.participant_login, NEW.participant_email]));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER participant_tsv_update BEFORE INSERT or UPDATE ON participant FOR EACH ROW EXECUTE PROCEDURE update_participant_tsv();
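For completeness, create_tsv() is a small helper of ours; a minimal equivalent (assuming it just space-joins the array elements, which is all it does in our case) would be:

```sql
-- Rough equivalent of our create_tsv() helper: join the array
-- elements with spaces into one string (NULL elements are skipped
-- by array_to_string).
CREATE OR REPLACE FUNCTION create_tsv(fields text[]) RETURNS text AS $$
    SELECT array_to_string($1, ' ');
$$ LANGUAGE sql IMMUTABLE;
```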

The database is quite big - almost one million participant records. The above query took almost 67 seconds to execute and fetch 100 rows, which is unacceptable for us.

My assumption is that when the vectors are concatenated, the individual indexes on each vector are not used. Here is the execution plan, taken after the first execution of the query:

"Limit (cost=46063.13..93586.79 rows=100 width=4) (actual time=4963.620..39703.645 rows=100 loops=1)"
" -> Nested Loop (cost=46063.13..493736.04 rows=942 width=4) (actual time=4963.617..39703.349 rows=100 loops=1)"
" Join Filter: (registration_configuration.configuration_id = enrollment.enrollment_configuration_id)"
" -> Nested Loop (cost=46063.13..493662.96 rows=3769 width=8) (actual time=4963.517..39701.557 rows=159 loops=1)"
" -> Nested Loop (cost=46063.13..466987.33 rows=3769 width=8) (actual time=4963.498..39698.542 rows=159 loops=1)"
" -> Hash Join (cost=46063.13..430280.76 rows=4984 width=8) (actual time=4963.464..39692.676 rows=216 loops=1)"
" Hash Cond: (participant.participant_id = person.person_participant_id)"
" Join Filter: ((participant.participant_tsv || person.person_tsv) @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Seq Scan on participant (cost=0.00..84680.85 rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
" Filter: (((participant_type)::text = 'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
" -> Hash (cost=25495.39..25495.39 rows=1012539 width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
" Buckets: 2048 Batches: 128 Memory Usage: 556kB"
" -> Seq Scan on person (cost=0.00..25495.39 rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"
" -> Index Scan using idx_registration_registered_participant_id on registration (cost=0.00..7.35 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=216)"
" Index Cond: (registration.registration_registered_participant_id = person.person_participant_id)"
" -> Index Scan using enrollment_pkey on enrollment (cost=0.00..7.07 rows=1 width=8) (actual time=0.011..0.013 rows=1 loops=159)"
" Index Cond: (enrollment.enrollment_id = registration.registration_enrollment_id)"
" -> Materialize (cost=0.00..16.55 rows=1 width=4) (actual time=0.002..0.005 rows=2 loops=159)"
" -> Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual time=0.056..0.077 rows=2 loops=1)"
" Join Filter: (registration_configuration.configuration_context_id = context.context_id)"
" -> Index Scan using idx_configuration_type on registration_configuration (cost=0.00..8.27 rows=1 width=8) (actual time=0.018..0.022 rows=3 loops=1)"
" Index Cond: ((configuration_type)::text = 'VISITOR'::text)"
" Filter: (configuration_id IS NOT NULL)"
" -> Index Scan using idx_event_context_code on event_context context (cost=0.00..8.27 rows=1 width=4) (actual time=0.008..0.010 rows=1 loops=3)"
" Index Cond: ((context.context_code)::text = 'GB2TST2010A'::text)"
"Total runtime: 39775.578 ms"

The assumption seems to be correct: no indexes on the vectors are used, and sequential scans are done instead:

" Join Filter: ((participant.participant_tsv || person.person_tsv) @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Seq Scan on participant (cost=0.00..84680.85 rows=996741 width=42) (actual time=0.012..3132.944 rows=1007151 loops=1)"
" Filter: (((participant_type)::text = 'PERSON'::text) AND ((participant_status)::text = 'ACTIVE'::text))"
" -> Hash (cost=25495.39..25495.39 rows=1012539 width=38) (actual time=3145.628..3145.628 rows=1007151 loops=1)"
" Buckets: 2048 Batches: 128 Memory Usage: 556kB"
" -> Seq Scan on person (cost=0.00..25495.39 rows=1012539 width=38) (actual time=0.062..1582.990 rows=1007151 loops=1)"

After I removed one of the vectors from the query and used only a single vector
...
and person.person_tsv @@ to_tsquery('simple', to_tsquerystring('Abigail'))
...
the execution was much faster, about 5 seconds.

Plan afterwards:

"Limit (cost=41.14..8145.82 rows=100 width=4) (actual time=3.776..13.454 rows=100 loops=1)"
" -> Nested Loop (cost=41.14..21923.77 rows=270 width=4) (actual time=3.773..13.248 rows=100 loops=1)"
" -> Nested Loop (cost=41.14..19730.17 rows=270 width=8) (actual time=3.760..11.971 rows=100 loops=1)"
" Join Filter: (registration_configuration.configuration_id = enrollment.enrollment_configuration_id)"
" -> Nested Loop (cost=0.00..16.55 rows=1 width=4) (actual time=0.051..0.051 rows=1 loops=1)"
" Join Filter: (registration_configuration.configuration_context_id = context.context_id)"
" -> Index Scan using idx_configuration_type on registration_configuration (cost=0.00..8.27 rows=1 width=8) (actual time=0.020..0.022 rows=2 loops=1)"
" Index Cond: ((configuration_type)::text = 'VISITOR'::text)"
" Filter: (configuration_id IS NOT NULL)"
" -> Index Scan using idx_event_context_code on event_context context (cost=0.00..8.27 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=2)"
" Index Cond: ((context.context_code)::text = 'GB2TST2010A'::text)"
" -> Nested Loop (cost=41.14..19700.12 rows=1080 width=12) (actual time=3.578..11.431 rows=269 loops=1)"
" -> Nested Loop (cost=41.14..12056.27 rows=1080 width=12) (actual time=3.568..8.203 rows=269 loops=1)"
" -> Bitmap Heap Scan on person (cost=41.14..3687.07 rows=1080 width=4) (actual time=3.553..4.401 rows=346 loops=1)"
" Recheck Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Bitmap Index Scan on person_ft_index (cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060 loops=1)"
" Index Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Index Scan using idx_registration_registered_participant_id on registration (cost=0.00..7.74 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=346)"
" Index Cond: (registration.registration_registered_participant_id = person.person_participant_id)"
" -> Index Scan using enrollment_pkey on enrollment (cost=0.00..7.07 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=269)"
" Index Cond: (enrollment.enrollment_id = registration.registration_enrollment_id)"
" -> Index Scan using participant_pkey on participant (cost=0.00..8.11 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=100)"
" Index Cond: (participant.participant_id = person.person_participant_id)"
" Filter: (((participant.participant_type)::text = 'PERSON'::text) AND ((participant.participant_status)::text = 'ACTIVE'::text))"
"Total runtime: 13.858 ms"

This time the index on the vector was used:

"Recheck Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"
" -> Bitmap Index Scan on person_ft_index (cost=0.00..40.87 rows=1080 width=0) (actual time=3.353..3.353 rows=1060 loops=1)"
" Index Cond: (person_tsv @@ to_tsquery('simple'::regconfig, to_tsquerystring('Abigail'::text)))"

So there is apparently a problem with concatenated vectors: the indexes are not used then. I also tried using the vectors separately, combining the single vector @@ tsquery checks with OR, but that did not help much (performance was better, but still over 20 seconds):
...
(participant.participant_tsv @@ to_tsquery('simple',to_tsquerystring('Abigail'))) OR (person.person_tsv @@ to_tsquery('simple',to_tsquerystring('Abigail')))
...

Is there a way to make this work with better performance? Or is it necessary to create a single vector that contains data from multiple tables and then index that? The latter would be problematic for us, because we use multiple complex queries with a variable number of selected columns. I know that another solution might be a union of multiple queries, each of which uses a single vector, but that solution is inconvenient too.

Greetings

Jan
