From: Ants Aasma <ants.aasma@eesti.ee>
To: Darafei Komяpa Praliaskouski <me@komzpa.net>
Cc: PostgreSQL Developers <pgsql-hackers@lists.postgresql.org>
Subject: Re: All Taxi Services need Index Clustered Heap Append
Date: 2018-03-05 16:48:11
Message-ID: CA+CSw_tcUYdXTjM6sGw=FdXqdmVhjsZ2L66cX=2mR=9jDS=SSQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
>> This approach mixes well with hash
>> partitioning. It would be neat indeed if PostgreSQL did something
>> equivalent on its own, and the pluggable storage work being done could
>> enable index-organized tables, which would help. But you probably need
>> something right now.
>
>
> Fixing the glaring issues (no vacuum, and thus no index-only scans, on
> append-only tables; vacuum processing the entire eternity of the btree) by 11
> will get most of the spike-nails out of the microservice code, and we can
> probably live with them until 11 gets to RDS.
>
> I also don't see why pluggable storage is a must for clustered writes.
> Postgres does have a mechanism for selecting the next page to write a tuple
> to; right now it just looks at the FSM - but what if it also peeked at an
> existing index that already has enough data to route the tuple to the
> correct page on write?

The mechanism you outlined would likely work for your use case, but it
has many issues that prevent it from being universally useful. Off
the top of my head:

* One extra index descent per insertion (the I/O for this is necessary
anyway, but the CPU work is duplicated).
* We don't currently track the amount of bloat. A mechanism that does
this needs to be added.
* If the table hits the bloat limit there will be a sudden change in
behavior. This is pretty nasty from an operations point of view.
* With your (id, ts) clustering and data coming in mostly ordered by
timestamp, after an initial warmup each page will contain rows from a
single id, but different ids are arbitrarily interleaved. This is
better than the current state, but people might want an interleaving
granularity larger than 8kB to better utilize storage hardware (see
the schema sketch after this list for the concrete case).
* It seems that with a common (ts) clustering and timestamp ages
coming from an exponential distribution, this will quickly bloat to
the threshold and then insert data in a rather arbitrary order. That
is much worse than the default behavior.
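
To make the (id, ts) case concrete, I'm picturing something like the
following schema (names invented by me; your actual tables will differ):

    -- Rows arrive roughly in ts order across all drivers, while reads
    -- want contiguous (id, ts) ranges - hence clustering on write.
    CREATE TABLE driver_position (
        id   bigint      NOT NULL,  -- driver identifier
        ts   timestamptz NOT NULL,  -- position timestamp
        pos  point       NOT NULL   -- reported location
    );

    -- The index whose order the proposed write-time routing would try
    -- to follow when picking a target heap page for each new tuple.
    CREATE INDEX ON driver_position (id, ts);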

At least in my opinion, these problems make this a special-case
optimization that is hard to justify in core. A decent alternative
would be a plugin mechanism for locating free space for a tuple, so
you could write an extension that finds a suitable page for each row.

>> I guess I don't have to tell you that it looks like your needs have
>> outgrown what RDS works well with and you are in for a painful move
>> sooner or later.
>
>
> Painful move where to? If we just run a Postgres instance without RDS we'll
> get the pain of setting up Postgres and replication and backups and
> autofailover, with no visible gain unless we get some private / unaccepted
> patches applied to it. If we can get these things right upstream, why would
> we want to switch?

EC2, for example. Mainly because I3 instances with their ephemeral
NVMe storage provide an order of magnitude or two of performance
improvement while costing less. Being able to run custom extensions
and patches when necessary is a nice bonus. Yes, setting up
replication, autofailover and backups is extra work that you have to
weigh against the benefits. But don't overestimate the effort - there
are some pretty nice tools available that make a proper cluster
relatively simple to set up.

> Per my colleagues, MySQL offers clustered indexes, and MySQL is available on
> RDS without the "painful move", which is doable by writing to two
> locations for a day and then pointing readers at the new DB. But if we can
> instead do no move and be sure the issues are gone upstream before we hit
> the limit of the spike-nails we're running on currently, wouldn't that be
> better? :)

The move off of RDS is painful because getting data out of RDS
involves either downtime or building an ad-hoc logical replication
solution. You need to solve that regardless of where you move to.
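
For what it's worth, if both ends are on PostgreSQL 10 or later and
the RDS side can act as a publisher, even the ad-hoc variant can lean
on built-in logical replication. A rough sketch, with connection
details elided and the hypothetical table from the example above:

    -- On the RDS (source) database:
    CREATE PUBLICATION move_pub FOR TABLE driver_position;

    -- On the target cluster, after copying over the schema:
    CREATE SUBSCRIPTION move_sub
        CONNECTION 'host=... dbname=... user=... password=...'
        PUBLICATION move_pub;

Once the subscription has caught up, you can switch readers over with
minimal downtime.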

Providing an out-of-the-box solution in core PostgreSQL would of
course be best, but realistically you will be waiting at least 2 years
to get it on RDS. In the meantime, either the buffer partition
approach I described or a buffering microservice in front of
PostgreSQL, as Aleksander recommended, should fix data locality for
you. If you weren't running on RDS I would even propose using Redis as
the buffer, with one key per driver and redis_fdw to make the data
accessible from within PostgreSQL.
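
A rough, untested sketch of how that Redis buffer could be exposed
(option names as documented in the redis_fdw README; the key layout is
my invention):

    CREATE EXTENSION redis_fdw;

    CREATE SERVER redis_server
        FOREIGN DATA WRAPPER redis_fdw
        OPTIONS (address '127.0.0.1', port '6379');

    CREATE USER MAPPING FOR PUBLIC
        SERVER redis_server
        OPTIONS (password 'secret');

    -- One Redis list per driver, keyed like positions:<driver id>; the
    -- shared prefix lets one foreign table expose all drivers' buffers.
    CREATE FOREIGN TABLE driver_position_buffer (key text, val text[])
        SERVER redis_server
        OPTIONS (database '0', tabletype 'list',
                 tablekeyprefix 'positions:');

A periodic job could then drain the buffer into the heap in (id, ts)
order.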

Regards,
Ants Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
