Re: temporary table as a subset of an existing table and indexes

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Matthieu Huin <matthieu(dot)huin(at)wallix(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: temporary table as a subset of an existing table and indexes
Date: 2010-11-09 11:47:30
Message-ID: AANLkTikwupVh=T9dDVgAymrffbWLW1ggzOPdvR+VhNGu@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Nov 9, 2010 at 4:47 AM, Matthieu Huin <matthieu(dot)huin(at)wallix(dot)com> wrote:
> Hello Merlin,
>
> So far the improvement in responsiveness has been very noticeable, even
> without indexing the temporary tables. Of course, this is just trading
> accuracy for speed as I simply narrow arbitrarily the search space ...
>
> The schema I am working on is close to the one I am referencing in this
> thread : http://archives.postgresql.org/pgsql-general/2010-11/msg00272.php
>
> Since we want to implement full text search and tags querying, it can lead
> to rather complex autogenerated queries such as this one (find log lines
> with the word 'root' in it, dated from 11/04 to 11/06, where the 'program'
> tag is sshd and the 'severity_code' tag is less than 3) :
>
>
> EXPLAIN ANALYZE SELECT r.logid FROM (((SELECT lcond84.logid, lcond84.date
> FROM
> ( SELECT tmp84.logid, tmp84.date FROM logs tmp84
>  WHERE TRUE  AND ( to_tsvector('simple',tmp84.body) @@
> plainto_tsquery('simple','root') ) AND tmp84.date > '2010-11-04 10:22:06.26'
> AND tmp84.date < '2010-11-06 10:22:06.26' ) AS lcond84
> NATURAL JOIN
> ( SELECT tmp85.logid FROM tags tmp85 WHERE
>  FALSE
> OR (tmp85.name='severity_code' AND num_lt(tmp85.value, 3) )
> OR (tmp85.name='program' AND tmp85.value = CAST(ROW('sshd') AS tagvalue) )
>  GROUP BY tmp85.logid HAVING COUNT(tmp85.logid) = 2 ) AS tcond84  )
>
> )) AS r ORDER BY r.date DESC LIMIT 1000;
>
>
>
> Giving the following query plan :
>
>  Limit  (cost=765445.54..765445.56 rows=9 width=16) (actual
> time=34744.257..34744.257 rows=0 loops=1)
>   ->  Sort  (cost=765445.54..765445.56 rows=9 width=16) (actual
> time=34744.255..34744.255 rows=0 loops=1)
>         Sort Key: tmp84.date
>         Sort Method:  quicksort  Memory: 17kB
>         ->  Hash Join  (cost=765005.46..765445.40 rows=9 width=16) (actual
> time=34744.202..34744.202 rows=0 loops=1)
>               Hash Cond: (tmp85.logid = tmp84.logid)
>               ->  HashAggregate  (cost=758440.29..758669.77 rows=15299
> width=8) (actual time=33343.816..33343.816 rows=0 loops=1)
>                     Filter: (count(tmp85.logid) = 2)
>                     ->  Bitmap Heap Scan on tags tmp85
>  (cost=92363.26..757225.45 rows=242968 width=8) (actual
> time=20676.354..33294.252 rows=32864 loops=1)
>                           Recheck Cond: ((name = 'severity_code'::text) OR
> (name = 'program'::text))
>                           Filter: (((name = 'severity_code'::text) AND
> num_lt(value, 3::double precision)) OR ((name = 'program'::text) AND
> ((value).storedvalue = 'sshd'::text)))
>                           ->  BitmapOr  (cost=92363.26..92363.26
> rows=2148625 width=0) (actual time=19688.915..19688.915 rows=0 loops=1)
>                                 ->  Bitmap Index Scan on nameval_idx
>  (cost=0.00..30388.35 rows=707841 width=0) (actual time=19337.358..19337.358
> rows=708719 loops=1)
>                                       Index Cond: (name =
> 'severity_code'::text)
>                                 ->  Bitmap Index Scan on nameval_idx
>  (cost=0.00..61853.42 rows=1440784 width=0) (actual time=351.551..351.551
> rows=1484703 loops=1)
>                                       Index Cond: (name = 'program'::text)
>               ->  Hash  (cost=6553.06..6553.06 rows=969 width=16) (actual
> time=1400.378..1400.378 rows=32516 loops=1)
>                     ->  Bitmap Heap Scan on logs tmp84
>  (cost=215.51..6553.06 rows=969 width=16) (actual time=1104.226..1383.745
> rows=32516 loops=1)
>                           Recheck Cond: (to_tsvector('simple'::regconfig,
> body) @@ '''root'''::tsquery)
>                           Filter: ((date > '2010-11-04
> 10:22:06.26+01'::timestamp with time zone) AND (date < '2010-11-06
> 10:22:06.26+01'::timestamp with time zone))
>                           ->  Bitmap Index Scan on fulltext_body_idx
>  (cost=0.00..215.27 rows=1740 width=0) (actual time=1097.874..1097.874
> rows=64340 loops=1)
>                                 Index Cond:
> (to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
>  Total runtime: 34756.938 ms
>
> This one isn't too bad, but the runtime seems to increase exponentially with
> the tables size. Therefore, using a temporary table based on the date
> condition can cut the query time by a factor of up to ten (table creation
> included, and provided the resulting table isn't too big - I make a COUNT
> check prior to creation so that I will eventually limit manually the table
> size.). But of course, I'd rather have speed AND accuracy ...
>
> To make things worse, the tables tend to grow very quickly since as you
> might have guessed, I am working on the database part of a logs collector;
> the current implementation doesn't scale well along the data.
>
> I hope this makes things clearer. Feel free to ask if you need more
> clarifications, and thanks for your time.

How are you partitioning the tags? Is the partitioned query doing the
same job as the non partitioned query? Is date a forced criteria?
(and if it is, have you considered date partition/brute force?)

merlin

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Robert Treat 2010-11-09 12:13:00 Looking for PostgreSQL Folks in New Orleans area
Previous Message Ivan Voras 2010-11-09 11:43:09 Re: postgresql scalability issue