Re: insert and query performance on big string table with pg_trgm

From: Matthew Hall <mhall(at)mhcomputing(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, surrano(at)gmail(dot)com
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: insert and query performance on big string table with pg_trgm
Date: 2017-12-06 06:15:13
Message-ID: 351E3FBB-7DFC-4BB0-841E-E7AB16DD57A3@mhcomputing.net
Lists: pgsql-performance

On Nov 21, 2017, at 12:05 AM, Matthew Hall <mhall(at)mhcomputing(dot)net> wrote:
>> Do you really need the artificial primary key, when you already have another column that would be used as the primary key? If you need to use it as a foreign key in another table, then you very well might. But maintaining two unique indexes doesn't come free.
>
> OK, fair enough, I'll test with it removed and see what happens.

With the integer primary key removed, it still takes ~9 hours to load the table, so it didn't seem to make a big difference.
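
For reference, the change amounted to something like the following (constraint and column names are my guesses, since I didn't post the original DDL):

ALTER TABLE huge_table DROP CONSTRAINT huge_table_pkey;
ALTER TABLE huge_table DROP COLUMN id;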

> Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I don't screw it up?

I also made this change, setting maintenance_work_mem to 1 GB.
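
For anyone reproducing this, the setting can be applied per-session before the load, e.g.:

SET maintenance_work_mem = '1GB';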

Queries on the table still take a long time with the PK removed:

# explain (analyze, buffers) select * from huge_table where value ilike '%yahoo%';

Bitmap Heap Scan on huge_table  (cost=593.72..68828.97 rows=18803 width=25) (actual time=3224.100..70059.839 rows=20909 loops=1)
  Recheck Cond: ((value)::text ~~* '%yahoo%'::text)
  Rows Removed by Index Recheck: 17
  Heap Blocks: exact=6682
  Buffers: shared hit=544 read=6760 dirtied=4034
  I/O Timings: read=69709.611
  ->  Bitmap Index Scan on huge_table_value_trgm_idx  (cost=0.00..589.02 rows=18803 width=0) (actual time=3216.545..3216.545 rows=20926 loops=1)
        Index Cond: ((value)::text ~~* '%yahoo%'::text)
        Buffers: shared hit=352 read=270
        I/O Timings: read=3171.872
Planning time: 0.283 ms
Execution time: 70065.157 ms
(12 rows)
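
Nearly all of the 70 s runtime is read I/O on the heap pages, per the read=69709.611 ms timing. Note the I/O Timings lines only appear with track_io_timing enabled; to reproduce, something like:

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM huge_table WHERE value ILIKE '%yahoo%';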

During inserts, the slow backend process (as shown in the OS process list) is:

postgres: username dbname [local] INSERT

An example of a slow statement, from the server log, is:

2017-12-06 04:27:11 UTC [16085-10378] username@dbname LOG:  duration: 5028.190 ms  plan:
        Query Text: INSERT INTO huge_table (value) VALUES
        .... 5000 values at once ...
        ON CONFLICT (value) DO NOTHING
        Insert on huge_table  (cost=0.00..75.00 rows=5000 width=40)
          Conflict Resolution: NOTHING
          Conflict Arbiter Indexes: huge_table_value_idx
          ->  Values Scan on "*VALUES*"  (cost=0.00..75.00 rows=5000 width=40)
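
In other words, each batch is a single multi-row statement of roughly this shape (the values here are placeholders; the real batches are 5000 rows):

INSERT INTO huge_table (value) VALUES
    ('example-value-1'),
    ('example-value-2')
    -- ... continuing to 5000 rows per statement ...
ON CONFLICT (value) DO NOTHING;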

> What is the size of the table and the gin index?

The table is 10 GB. The GIN index is 5.8 GB.
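
For reference, the standard size functions report these numbers, e.g.:

SELECT pg_size_pretty(pg_table_size('huge_table')) AS table_size,
       pg_size_pretty(pg_relation_size('huge_table_value_trgm_idx')) AS gin_index_size;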

> [From Gabor Szucs] [H]ow about adding a hash value column and creating the unique index on that one? It may block some false duplicates, but the unique index would be way smaller, speeding up inserts.

The mean length of the input items is about 18 bytes. The max length of the input items is about 67 bytes. The size of the md5 would of course be 16 bytes. I'm testing it now, and I'll write another update.
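
Here's a minimal sketch of what I'm trying, using an expression index on the md5 rather than a separate hash column (the index and value names below are mine, not from the thread):

CREATE UNIQUE INDEX huge_table_value_md5_idx
    ON huge_table (decode(md5(value), 'hex'));

-- inserts then arbitrate conflicts on the same expression:
INSERT INTO huge_table (value) VALUES ('example-value')
ON CONFLICT ((decode(md5(value), 'hex'))) DO NOTHING;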

Matthew.
