Re: All Taxi Services need Index Clustered Heap Append

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: All Taxi Services need Index Clustered Heap Append
Date: 2018-03-12 08:55:39
Message-ID: 4b11c3e6-8c45-7d3a-cd2c-23d1c1e92667@postgrespro.ru
Lists: pgsql-hackers

On 02.03.2018 19:30, Darafei "Komяpa" Praliaskouski wrote:
> Hi,
>
> I work at a ride sharing company and we found a simple scenario that
> Postgres has a lot to improve at.
> After my talk at pgconf.ru <http://pgconf.ru> Alexander Korotkov
> encouraged me to share my story and thoughts in -hackers.
>
> Story setting:
>
>  - Amazon RDS (thus vanilla, unpatchable Postgres),
> synchronous_commit=off (we're OK with losing a few seconds of data on
> a crash).
>
>  - Amazon gp2 storage. It's almost like an SSD for small spikes, but
> it has an IOPS limit if you're doing a lot of operations.
>
>  - Drivers driving around the city and sending their GPS positions
> each second, for simplicity - 50k of them at a time.
>
>  - An append-only table of shape (id uuid, ts timestamp, geom
> geometry, heading speed accuracy float, source text).
> A btree index on (id, ts).
>
>  - A service that receives the measurements over the network, batches
> them into a 1-second buffer (~N=50k rows) and inserts them via COPY.
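
For reference, the setup described above corresponds roughly to this
sketch (table, column and index names are assumed; the geometry type
comes from PostGIS):

    create table positions (
        id       uuid,
        ts       timestamp,
        geom     geometry,
        heading  float,
        speed    float,
        accuracy float,
        source   text
    );
    create index positions_id_ts_idx on positions (id, ts);

    -- ingestion: one ~50k-row batch per second, sent via COPY
    copy positions from stdin;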
>
> After someone orders and completes a trip, we get the id of the driver
> and the trip's time interval from another service, and we want to fetch
> the trip's route to calculate the bill. Trips last from seconds up to
> 4 hours (N=4*3600=14400 rows; a page typically holds 60 rows).
>
> select * from positions where id = :id and ts between :start_ts and
> :end_ts;
>
> Data older than 24 hours is not needed in this service, so 70 GB of
> storage should be enough. To be on the safe side we give it 500 GB,
> which on gp2 gives a steady 1500 IOPS.
>
> In development (on synthetic data) the plans use the index and look
> great, so we proceed with Postgres and not MySQL. :)
>
> When deployed to production we found that a single query for an
> interval of more than half an hour (1800+ rows) can exhaust all the
> IOPS. Data is appended in increasing time order, which effectively
> ensures that no two rows from the same driver ever end up in the same
> heap page. A 4-hour request can degrade the system for 10 seconds: gp2
> provides at most 10000 IOPS, and we then need to fetch 14400 pages. We
> need the biggest available gp2 offering just to read 2 megabytes of
> data in 1.5 seconds. The best io-optimized io1 volumes provide 32000
> IOPS, which gets us as low as 500 ms.
>
> If the data were perfectly packed, it would be just 240 8 KB pages,
> which translates to 120 read operations on gp2's 16 KB blocks.
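
The arithmetic can be double-checked with a throwaway query (assuming
~60 rows per 8 KB page and 16 KB gp2 I/O blocks, as above):

    select 4 * 3600          as rows_per_trip,  -- 14400 rows, worst case one heap page each
           4 * 3600 / 60     as packed_pages,   -- 240 pages of 8 KB if rows were clustered
           4 * 3600 / 60 / 2 as gp2_reads;      -- 120 reads of gp2's 16 KB blocks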
>
> Our options were:
>
>  - partitioning. Not entirely trivial when your id is a uuid. To get
> visible gains, we need to make sure each driver gets their own
> partition. That would leave us with 50 000(+) tables, and rumor has it
> that's what is done in some bigger taxi services, where the relcache
> then eats up all the RAM and the system OOMs.
>
>  - CLUSTER the table using the index. Works perfectly on a test stand,
> but isn't available as an online option.
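
For completeness, that option boils down to something like the statement
below (index name assumed); CLUSTER takes an ACCESS EXCLUSIVE lock for
its whole duration, which is why it is not an online option:

    cluster positions using positions_id_ts_idx;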
>
>  - Postgres Pro suggested CREATE INDEX .. INCLUDE (on commitfest
> https://commitfest.postgresql.org/15/1350/). We can't use that, as it's
> not in upstream/Amazon Postgres yet.
>
>  - We decided to live with the overhead of unnecessary sorting by all
> fields and of keeping a copy of the heap, and created a btree over all
> the fields to utilize Index-Only Scans:
>
>   * Testing went well on a dump of the production database.
>
>   * After we built the indexes on production, we found out that
> performance was _worse_ than with the simpler index.
>
>   * EXPLAIN (BUFFERS) revealed that pages never get marked all-visible
> in the Visibility Map, as autovacuum ignores an append-only,
> never-updated, never-deleted table that is only truncated once a day.
> There is no way to force autovacuum on such a table.
>
>   * We created a microservice (it's hard to find a spot for crontab in
> a distributed system!) that periodically and aggressively runs VACUUM
> on the table.
> It indeed helped with queries, but... while VACUUM skips all-visible
> pages of the heap, it always walks over all the pages of the btree,
> which is even larger than the heap in our case.
> There is a patch to repair this behavior on commitfest
> (https://commitfest.postgresql.org/16/952/), but it is not yet in
> upstream/Amazon Postgres.
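
The periodic job itself is just a plain VACUUM, roughly as below (FREEZE
is optional here; an ordinary VACUUM already sets the visibility-map
bits that index-only scans rely on):

    vacuum (freeze, verbose) positions;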
>
>   * We ended up inventing a partitioning scheme that rotates to a new
> table every gigabyte of data, to keep VACUUM run time low. There are
> hundreds of partitions with indexes on all the fields. Finally the
> system is stable.
>
>   * EXPLAIN (BUFFERS) later showed that each reading query visits all
> the indexes of the partitioned table and fetches a page from each index
> just to learn that there are 0 rows there. To prune obviously unneeded
> partitions we decided to add a constraint on the timestamp once a
> partition is finalized. Timestamps are stamped on the client side and
> sanitized because of mobile network instability, so we don't know the
> bounds in advance.
> Unfortunately that means we need two seq scans to do it: a first one to
> get min(ts) and max(ts), and a second one for ALTER TABLE ADD
> CONSTRAINT. This operation also eats up IOPS.
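
For illustration, finalizing a partition looks roughly like this (the
partition name and bounds are made up; the CHECK constraint is validated
by scanning the whole partition):

    select min(ts), max(ts) from positions_p0042;      -- first seq scan
    alter table positions_p0042
        add constraint positions_p0042_ts_check
        check (ts >= '2018-03-11 14:00' and ts < '2018-03-11 18:05');  -- second scan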
>
> We are not a very large company, but we have already bumped into too
> many scalability issues on this path. Searching for solutions at every
> step turns up other people with tables named like "gpsdata" and
> "traces", so we're not alone with this problem. :)
>
> I gave this all some thought, and it looks like none of it would have
> happened if Postgres were able to cluster heap insertions by the
> (id, ts) index. We're OK with synchronous_commit=off, so the amplified
> writes won't immediately hit disk and can be absorbed along the way.
> Clustering doesn't require perfect sorting: we need to minimize the
> number of pages fetched; it's OK if the pages are not consecutive on
> disk.
>
> I see the option for clustered heap writes as follows:
>
>  - on tuple insertion, check whether the table is clustered by some
> index;
>
>  - if the table is clustered, we don't write into the last used page;
> instead we go into the index and get the page numbers of index tuples
> that are less than or equal to the current one, up to the index page
> boundary (or some other exit strategy; at least one heap page is
> needed);
>
>  - if we can fit the tuple into one of those pages, write it there;
>
>  - if we cannot fit it, consult statistics to see whether there is too
> much empty space in the pages. If more than 50% of the space is empty,
> get pages from the FSM; otherwise create a new page.
> (Looking into the FSM is a safety measure for people who specifically
> insert tuples in backward-sorted order.)
>
> This would not require more space than is currently needed to keep the
> all-fields index vacuumed, and would let us omit VACUUM passes over
> relations. A pencil-and-paper simulation shows it to be a good thing.
> Do you have thoughts, can you help with an implementation, or can you
> tell us why it would be a bad idea and we need to follow our competitor
> in moving away to another RDBMS? :)
>
> I encourage everyone to help with at least
> https://commitfest.postgresql.org/16/952/, as no good SQL-level
> workaround exists for it. After that, enabling autovacuum for tables
> that only have inserted and no deleted tuples would be cool to have, so
> that their pages get marked as all-visible.
>
> Hackers, can you help me keep Postgres in the system? :)
>
> Darafei Praliaskouski,
> GIS Engineer / Juno Minsk

A clustered updateable heap would have a structure very similar to a
B-Tree: we need to somehow locate the target page for an insert by key,
handle page overflow/underflow (split/merge), and so on.
So if something looks like a B-Tree and behaves like a B-Tree, then most
likely we should not reinvent the wheel and should just use a B-Tree.
From my point of view, a covering index (CREATE INDEX .. INCLUDE, on
commitfest at https://commitfest.postgresql.org/15/1350/) is what is
needed in this case.
Maybe some more options are needed to force an index-only scan for
append-only data. While this patch is not committed yet, it is possible
to try creating a standard compound index that includes all the record's
columns (the problem there can be types that lack the required
comparison operators).
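
For instance, with the INCLUDE patch the index could look like the first
statement below; until it is committed, the second one is the workaround
(column names are taken from the table described above; whether it works
depends on every column type having a btree operator class):

    -- with the covering-index (INCLUDE) patch:
    create index on positions (id, ts)
        include (geom, heading, speed, accuracy, source);

    -- available today: a plain compound index over all columns
    create index on positions (id, ts, geom, heading, speed, accuracy, source);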

An alternative solution is manual record clustering. It can be achieved
by storing several points of the same route in one Postgres record,
using Postgres array or JSON types.
The approach with JSON was used by the Platon company to solve a similar
task (tracking information about vehicle movements).
Although it may seem a little bit strange to use JSON for statically
structured data, Platon was able to reduce storage size several times
and increase query execution speed.
Instead of array/JSON types you can define your own "vector"/"tile"
types or use my extension VOPS. Certainly, in all these cases you will
have to rewrite your queries, and they will become more complicated.
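
A rough sketch of such manual clustering (purely illustrative: one row
per driver per minute, with the points packed into a jsonb array):

    create table positions_packed (
        id     uuid,
        ts_min timestamp,
        points jsonb,
        primary key (id, ts_min)
    );

    insert into positions_packed
    select id,
           date_trunc('minute', ts),
           jsonb_agg(jsonb_build_object('ts', ts, 'speed', speed,
                                        'geom', ST_AsGeoJSON(geom)::jsonb))
    from positions
    group by id, date_trunc('minute', ts);

A 4-hour trip then becomes at most 240 rows instead of 14400, at the
price of unpacking the array in every query.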

It is actually a very common problem that the order in which data is
inserted differs from the order in which it is traversed. The same thing
happens in most trading systems, where data is imported in
time-ascending order while it is usually accessed and analyzed grouped
by symbol. The most general solution to the problem is to maintain
several different representations of the data.
They can be, for example, "vertical" and "horizontal" representations.
In Vertica you can create an arbitrary number of "projections" where the
data is sorted in different ways.
In Postgres this can be achieved using indexes or external storage. I
hope that an extensible heap API will simplify the development and
integration of such alternative storages.
But even right now you can try to do some manual clustering using
indexes or compound types.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
