Quick Links

Re: WIP: store additional info in GIN index

From:	Tomas Vondra <tv(at)fuzzy(dot)cz>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: store additional info in GIN index
Date:	2012-12-23 23:48:49
Message-ID:	50D79861.3030303@fuzzy.cz
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi!

On 22.12.2012 17:15, Alexander Korotkov wrote:
> I'm not saying this is a perfect benchmark, but the differences (of
> querying) are pretty huge. Not sure where this difference comes from,
> but it seems to be quite consistent (I usually get +-10% results, which
> is negligible considering the huge difference).
>
> Is this an expected behaviour that will be fixed by another patch?
>
>
> Another patches which significantly accelerate index search will be
> provided. This patch changes only GIN posting lists/trees storage.
> However, it wasn't expected that this patch significantly changes index
> scan speed in any direction.

That was exactly my expectation - probably not an improvement, but
definitely not a worse performance.

>
> The database contains ~680k messages from the mailing list archives,
> i.e. about 900 MB of data (in the table), and the GIN index on tsvector
> is about 900MB too. So the whole dataset nicely fits into memory (8GB
> RAM), and it seems to be completely CPU bound (no I/O activity at all).
>
> The configuration was exactly the same in both cases
>
> shared buffers = 1GB
> work mem = 64 MB
> maintenance work mem = 256 MB
>
> I can either upload the database somewhere, or provide the benchmarking
> script if needed.
>
>
> Unfortunately, I can't reproduce such huge slowdown on my testcases.
> Could you share both database and benchmarking script?

It's strange, but no matter what I do I can't reproduce those results
(with the significant performance decrease). So either I've done some
strange mistake when running those tests, or there was something wrong
with my system, or whatever :-(

But when running the benchmarks now (double-checked everything, properly
repeated the tests, ...), I've noticed a different behaviour. But first
some info about the scripts I use for testing.

All the scripts are available here:

https://bitbucket.org/tvondra/archie

It's my "hobby project" implementing fulltext mbox archive. It should be
usable but it's still a bit WIP so let me know in case of any issues.

The README should give you all the instructions on how to setup and load
the database. I'm using ~1700 mbox files downloaded from
http://archives.postgresql.org/ for these lists (until 2012/11):

pgadmin-hackers
pgsql-advocacy
pgsql-announce
pgsql-bugs
pgsql-general
pgsql-hackers
pgsql-jdbc
pgsql-novice
pgsql-odbc
pgsql-patches
pgsql-sql

which in the end gives ~677k rows in the 'messages' table, occupying
~5.5GB disk space (including all the indexes etc).

Once you have the data loaded, you need to warmup the database and then
start benchmarking it - I'm using the warmup.py script to both things.
The script is quite simple, it basically just

To warmup the DB, just run this

./warmup.py --db archie --duration 300

until the %util drops near 0 (assuming you have enough RAM to fit the
whole database into memory). Then I usually do this as a benchmarking

./warmup.py --db archie --duration 300 --no-hash \
--no-thread --words 1

./warmup.py --db archie --duration 300 --no-hash \
--no-thread --words 2

which runs 60-second tests and outputs one line for worker (by default
equal to the number of CPUs).

The script itself is very simple, it fetches a random message and uses
the tsvector column as a source of words for the actual benchmark. It
takes N words from the tsvector, splits them into groups and performs a
simple fulltext query using plainto_tsquery('word1 word2 ...'). At the
end it prints info including the number of queries per second.

I've run the tests on the current master with and without the v3 patch.
I've tested it with 1GB or 2GB shared buffers, and 32MB or 64MB work mem.

The tests were run for 1, 2, 3, 4 and 5 words, and I've repeated it five
times for each configuration. Duration of each run was 5-minutes.

These are the averages (from the 5 runs) of queries per second for each
combination of parameters:

1 2 3 4 5
----------------------------------------------------
master 1GB/32MB 19 179 165 127 99
patched 1GB/32MB 19 175 163 124 96

master 1GB/64MB 20 181 165 127 99
patched 1GB/64MB 19 174 159 120 95

master 2GB/32MB 27 181 165 127 98
patched 2GB/32MB 25 176 156 120 93

master 2GB/64MB 27 180 166 128 102
patched 2GB/64MB 40 402 364 245 176

There's no significant difference in performance, except for the
2GB/64MB combination. And in that case it's actually the opposite
direction than I've reported before - i.e. this time it's up to 100%
faster than the unpatched master. The results are pretty consistent
(very small variance across the repeated runs), so I'm not sure about
the previous results.

Any idea what might cause such behavior? Why should it happen only with
this particular combination of shared_buffers and work_mem?

kind regards
Tomas

In response to

Re: WIP: store additional info in GIN index at 2012-12-22 16:15:49 from Alexander Korotkov

Responses

Re: WIP: store additional info in GIN index at 2013-03-03 14:53:15 from Craig Ringer

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Greg Stark	2012-12-24 00:01:39	Re: initdb and share/postgresql.conf.sample
Previous Message	Jeff Janes	2012-12-23 23:11:21	initdb and share/postgresql.conf.sample