GIN, pg_trgm and large table

From: Max Fomichev <max(dot)fomitchev(at)gmail(dot)com>
To: pgsql-novice(at)lists(dot)postgresql(dot)org
Subject: GIN, pg_trgm and large table
Date: 2018-03-10 10:31:46
Message-ID: 44c55f0b-a67e-94d3-07f3-6381204ef9e9@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Hello!
I have the following table and index with about 15 billion records.

CREATE TABLE us_en.ngrams
(
    ngram text COLLATE pg_catalog."default" NOT NULL,
    year smallint NOT NULL,
    occurrence bigint NOT NULL,
    words smallint NOT NULL
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

CREATE INDEX trgm_idx_ngram_ngrams_us_en
    ON us_en.ngrams USING gin
    (ngram COLLATE pg_catalog."default" gin_trgm_ops)
    TABLESPACE pg_default;

Configuration:
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
128GB RAM, 2TB SDD
Server settings:
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 4GB
maintenance_work_mem = 16GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
max_worker_processes = 16
gin_fuzzy_search_limit = 10000

My queries related to us_en.ngrams.ngram field are too slow:
ngrams=# set enable_seqscan=false;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE
ngram LIKE '%computer%' LIMIT 20;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=232884.91..232933.12 rows=20 width=36) (actual
time=44962.926..44966.214 rows=14 loops=1)
   Buffers: shared hit=18177345 read=124224
   ->  Bitmap Heap Scan on ngrams  (cost=232884.91..61129746.57
rows=25261021 width=36) (actual time=44962.925..44966.202 rows=14 loops=1)
         Recheck Cond: (ngram ~~ '%computer%'::text)
         Rows Removed by Index Recheck: 10
         Heap Blocks: exact=8
         Buffers: shared hit=18177345 read=124224
         ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en
(cost=0.00..226569.66 rows=25261021 width=0) (actual
time=44961.929..44961.929 rows=24 loops=1)
               Index Cond: (ngram ~~ '%computer%'::text)
               Buffers: shared hit=18177345 read=124216
 Planning time: 0.305 ms
 Execution time: 44966.271 ms
(12 rows)

ngrams=# set enable_seqscan=true;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE
ngram LIKE '%computer%' LIMIT 20;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..215.41 rows=20 width=36) (actual time=0.248..29.089
rows=20 loops=1)
   Buffers: shared hit=14 read=1509
   ->  Seq Scan on ngrams  (cost=0.00..272072628.00 rows=25261021
width=36) (actual time=0.247..29.074 rows=20 loops=1)
         Filter: (ngram ~~ '%computer%'::text)
         Rows Removed by Filter: 207598
         Buffers: shared hit=14 read=1509
 Planning time: 0.332 ms
 Execution time: 29.117 ms
(8 rows)

ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE
ngram LIKE '%version%' ORDER BY (occurrence, year) DESC LIMIT 20;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4661042.23..4661042.28 rows=20 width=36) (actual
time=144417.365..144417.365 rows=0 loops=1)
   Buffers: shared hit=28982531 read=167634
   ->  Sort  (cost=4661042.23..4664172.58 rows=1252138 width=36)
(actual time=144417.363..144417.363 rows=0 loops=1)
         Sort Key: (ROW(occurrence, year)) DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=28982531 read=167634
         ->  Bitmap Heap Scan on ngrams (cost=20372.07..4627723.29
rows=1252138 width=36) (actual time=144417.331..144417.331 rows=0 loops=1)
               Recheck Cond: (ngram ~~ '%version%'::text)
               Buffers: shared hit=28982527 read=167633
               ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en 
(cost=0.00..20059.04 rows=1252138 width=0) (actual
time=144417.328..144417.328 rows=0 loops=1)
                     Index Cond: (ngram ~~ '%version%'::text)
                     Buffers: shared hit=28982527 read=167633
 Planning time: 0.344 ms
 Execution time: 144417.522 ms
(14 rows)

It there any way to improve GIN/pg_tgrm performance on a such large table?
Thank you!

--
Best regards,
Max Fomichev

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Tom Lane 2018-03-10 17:00:27 Re: GIN, pg_trgm and large table
Previous Message Alvaro Herrera 2018-03-06 16:22:41 Re: STATISTICS retained in CREATE TABLE ... LIKE (INCLUDING ALL)?