I did some testing of GIST/GIN vs BTree indexing…

From: Guyren Howe <guyren(at)gmail(dot)com>
To: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: I did some testing of GIST/GIN vs BTree indexing…
Date: 2014-12-03 09:15:50
Message-ID: D599302A-50E6-4758-856E-61D619F4434F@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Obviously, database benchmarking is basically a silly idea, because every workload, every hardware configuration, every schema are different, with very different consequences.

Still, I was left with wondering when one should choose a BTree vs GIST or GIN (I didn’t even try to look at SP-GIST: any thoughts?).

The documentation on heuristics for when to choose which doesn’t say exactly but suggests that one should just choose a BTree.

But what made me curious about the alternatives is this: you have to use a compound BTree index left-to-right (you can use the first column, or the first and second, or first, second and third…), whereas GIST and GIN both support using arbitrary fields. That right there is mighty compelling for any kind of ad-hoc query scenario. So I thought I’d do some testing about insert and query speed.

I made a table with six integer columns but no indexes, and made copies with a regular BT primary key, and then an index on all the columns, in separate tables using either a BTree, a GIST or a GIN index. The original table I generated 10 million rows of random values in the range 1..1 million.

This was on 9.4RC1 on a recent-model iMac. I used an external 2.5” hard drive so the SSD wasn’t mucking things up. I restarted Postgres (but not the computer) between tests. I tested a different range of values, of the same size, when comparing searches, in order to mostly avoid having the MacOS disk cache helping things out. I also tried running multiple versions of the search tests on different ranges in different orders, just to see if the comparison of search times was roughly consistent, which it was.

The results were… surprising, and don’t seem to accord with what the documentation says.

I was interested in insert speed. The docs suggest that GIN would be appallingly slow, but don’t really talk about GIST vs BTree. When I configured reasonable caches, working mem and such, I got the following times for insert… select * (values in ms; columns are BTree-GIST-GIN):

1,233,570
2,323,639
1,700,752

GIST was slower than GIN (which flat contradicts the documentation), but both are in shouting distance of BTree. When I tuned down the caches, to simulate a machine under load, I got:

20,252,299
3,583,346
10,495,424

So wow, BTree is much slower. This is basically a default setup, with tiny cache and working memory and whatnot.

Searching on the first column is just as interesting. With decent caches, find 100 values (so about a thousand rows):

936
508
2,215

So this result is interesting. GIN is *slowest*, and in particular is slower than GIST by a good margin, whereas the docs suggest GIN should always be faster. GIST is significantly faster than BTree.

Now finding 1,000 values (so 10,000 rows, give or take):

4,148
1,302
3,078

GIST is fastest again, but now GIN is faster than BTree.

Again, but 10,000 values/100,000 rows:

34,767
8,104
8,995

So in a memory-rich environment, GIST appears to be the clear winner. When I turn down the memory values, though, I get a very different result.

100 values/1000 rows:

7,931
7,209
1,034

1000 values/10,000 rows:

81,675
88,191
1,598

I tried that one several times, because I found it hard to believe. The results were consistent. With 10,000 values/100,000 rows, the results are even starker.

Index sizes in pages:

258,578
309,636
581,795

GIN is certainly not the “three times” size suggested in the docs, but perhaps that just hasn’t been updated for the 9.4 improvements. Certainly, there isn’t sufficient difference here to make the BTree advantage compelling in most applications.

Given the futility of database benchmarking in general, I didn’t want to go any further with this. What I was interested in was whether it might be worth switching from BTree to GIST/GIN indexes with regular sorts of data. It appears to be the case that GIST and GIN are often better than BTree in general, and given their much greater flexibility in satisfying queries on different columns, it might even be the case that one should recommend a single GIST or GIN index on the frequently-searched columns of a table in most cases?

I would absolutely *love* to hear what this community has to say about this question: should one consider GIST or GIN indexes on regular old numeric/text columns? When, theoretically? When, in practice (ie does anyone have comparable benchmarks on 9.4?). Other thoughts?

Responses

Browse pgsql-general by date

  From Date Subject
Next Message M Tarkeshwar Rao 2014-12-03 11:07:39 FW: getting error while running sql on mm_activealrm table
Previous Message Albe Laurenz 2014-12-03 08:54:28 Re: Is "WITH () UPDATE" Thread Safe ?