Re: Remove 1MB size limit in tsvector

From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Remove 1MB size limit in tsvector
Date: 2017-09-11 09:33:32
Message-ID: 20170911123332.38d5853a@wp.localdomain
Lists: pgsql-hackers

On Thu, 7 Sep 2017 23:08:14 +0200
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> Hi,
>
> On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> > In my benchmarks when database fits into buffers (so it's
> > measurement of the time required for the tsvectors conversion) it
> > gives me these results:
> >
> > Without conversion:
> >
> > $ ./tsbench2 -database test1 -bench_time 300
> > 2017/08/17 12:04:44 Number of connections: 4
> > 2017/08/17 12:04:44 Database: test1
> > 2017/08/17 12:09:44 Processed: 51419
> >
> > With conversion:
> >
> > $ ./tsbench2 -database test1 -bench_time 300
> > 2017/08/17 12:14:31 Number of connections: 4
> > 2017/08/17 12:14:31 Database: test1
> > 2017/08/17 12:19:31 Processed: 43607
> >
> > I ran a bunch of these tests, and the results are stable on my
> > machine. So in these specific tests the performance regression is
> > about 15%.
> >
> > At the same time, I think this could be the worst case, because
> > usually the data is on disk and the conversion will not affect
> > performance as much.
> >
>
> That seems like a fairly significant regression, TBH. I don't quite
> agree we can simply assume in-memory workloads don't matter, plenty of
> databases have 99% cache hit ratio (particularly when considering not
> just shared buffers, but also page cache).

I think part of this regression is caused by the better compression of
the new format. I can't give an exact percentage here; I need to check
with perf.

If you care about performance, you create indexes, and then the tsvector
itself is no longer used for text search (except for ORDER BY rank); the
index machinery only looks at the tsquery. Moreover, a RUM index stores
positions together with lexemes, so it doesn't need tsvectors even for
ranked search. As a result, tsvector becomes storage for building indexes
(an indexable type) rather than something used at runtime. And the change
of format doesn't affect index creation time.
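
For illustration, a rough sketch of the RUM case (assuming the rum
extension with its rum_tsvector_ops opclass and the <=> ordering
operator; the table and column names here are just placeholders):

  CREATE EXTENSION rum;

  -- lexemes and their positions are kept in the index itself
  CREATE INDEX docs_tsv_idx ON docs USING rum (tsv rum_tsvector_ops);

  -- ranked search: the ordering comes from the index, so the heap
  -- tsvector doesn't have to be decoded at query time
  SELECT id
  FROM docs
  WHERE tsv @@ to_tsquery('english', 'tsvector & limit')
  ORDER BY tsv <=> to_tsquery('english', 'tsvector & limit')
  LIMIT 10;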

>
> Can you share the benchmarks, so that others can retry running them?

The benchmarks are published on GitHub:
https://github.com/ildus/tsbench. I'm not sure they are easy to use,
though.

Best regards,
Ildus Kurbangaliev
