Re: Partitioning with a lot of number of partitions

From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Aleksej Trofimov <aleksej(dot)trofimov(at)ruptela(dot)lt>
Cc: pgsql-novice(at)postgresql(dot)org
Subject: Re: Partitioning with a lot of number of partitions
Date: 2011-10-18 02:42:09
Message-ID: CAK3UJRFyeFkt_4M-rPX3uqq1mKk_7X8YJkHvcdpqUb8qpK4DeQ@mail.gmail.com
Lists: pgsql-novice

On Wed, Oct 12, 2011 at 10:09 AM, Aleksej Trofimov
<aleksej(dot)trofimov(at)ruptela(dot)lt> wrote:
> By the moment we are using PostgreSQL 9.1.0. Our database is about 270GB and
> currently we have a problem with our "hot" table which is about 120GB of
> data and 20GB index.

That sounds quite manageable with decent hardware.

> By "hot" I mean, that we have a lot of
> insert/update/select operations on this table (about 30/30/120 per second).
> This hot table is called coordinates. In our logic we have several
> definitions: Coordinate - basic packet of data; Object - basic unit, which
> sends coordinates; Client - client which own object.  One client may has a
> lot of object. and each object may has a lot  of coordinates. All
> coordinates are stored in coordinates table.
> So at the moment we have an issue with slow insert/select queries... this is
> why we decided to partition this table. We have tested 2 types of
> partitioning: by date (weekly), and by object.
> These are our results:
>
> SELECT QUERY (1 object):
>
> ------------------------------------------
> Partition type   |    M |    W |      D |
> ------------------------------------------
> Non-partitioned  | 5830 |  460 |   2913 |
> Date (weekly)    | 1000 |  440 |    106 |
> Object           | 0.02 | 0.03 |  0.009 |
> ------------------------------------------
>
> M - query for month period
> W - query for week period
> D - query for day period
> All times in ms.
>
> So according to our results, we have much greater performance using
> partitioning by Object,

Well hold on, your results only show the timings for "SELECT QUERY (1
object):" -- does this mean you're just selecting one row? If so, that
doesn't seem like a particularly useful performance benchmark. What
are the _real_ bottleneck queries for your application when you're not
using partitioning?

Also, 5830 ms seems like a lot for what amounts to an indexed (primary
key?) lookup of a single row -- how long does it take for you to run
SELECT * FROM original_unpartitioned_big_table WHERE object_id = [some
object ID];? (Try the query a few times, with both the same and
different object IDs, to get a feel for cache effects.)

BTW, if you do decide to partition based on objects, you don't have to
make a partition for each object ID, if there are 5000 of those. You
can partition based on ranges of object IDs.
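
On 9.1 that means inheritance-based partitioning with CHECK
constraints on each child. A minimal sketch, assuming your parent
table is coordinates and using a made-up bucket width of 1000 IDs per
child:

    -- parent stays empty; each child holds a range of object IDs
    CREATE TABLE coordinates_r0 (
        CHECK (object_id >= 0 AND object_id < 1000)
    ) INHERITS (coordinates);
    CREATE TABLE coordinates_r1 (
        CHECK (object_id >= 1000 AND object_id < 2000)
    ) INHERITS (coordinates);

    -- each child needs its own indexes
    CREATE INDEX coordinates_r0_object_id_idx ON coordinates_r0 (object_id);
    CREATE INDEX coordinates_r1_object_id_idx ON coordinates_r1 (object_id);

You'd also need a trigger on the parent to route INSERTs to the right
child; the partitioning chapter in the docs has a full example.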

> BUT we have a lot of objects (about 5000, and we are planning to
> connect 2-5 times as many objects next year). These plans
> lead to the first question:
>
> 1) How postgres would "like" such a big count of partitions? Will it
> decrease performance in future?

5000 partitions is going to incur significant planner overhead for
each query; check the docs and archives (search term "large number of
partitions" or similar).

> At the moment we have noticed several performance
> issues with a large number of object IDs in a conditional statement like: SELECT
> * FROM coordinates_object.coordinates WHERE object_id IN (HERE ARE SEVERAL
> OBJECTS);
[snip]
> "              Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))"
> "Total runtime: 695.963 ms"
>
> 2) As you can see from the EXPLAIN, postgres is using an array for filtering:
> "Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))" -- why?

AFAIR, that's the normal transformation of an IN (...) query to = ANY
(...), and I bet you'll see the same thing without partitioning at
play.
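
You can check that for yourself against the unpartitioned table
(object IDs taken from your EXPLAIN output above):

    EXPLAIN SELECT * FROM coordinates
     WHERE object_id IN (722, 728, 727, 248);

    -- expect the same rewrite in the plan:
    --   Filter: (object_id = ANY ('{722,728,727,248}'::bigint[]))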

For a variety of reasons (performance, ease of maintenance), I think
partitioning is most useful for improving query times on large data
sets when you can separate out your "hot" and "cold" data. If you
can't do that, the next-best thing is to partition into sets that will
be queried together often (e.g. if you often query a cluster of nearby
object IDs together, partition by object ID, and likewise for your
other access patterns), so that your queries hit a small number of
partitions on average.
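
For instance, with your weekly date partitions, a query that
constrains on the partition key lets the planner skip all but one
child (a sketch -- I'm guessing at a timestamp column, here called
recorded_at, as the partition key):

    EXPLAIN SELECT * FROM coordinates
     WHERE object_id = 722
       AND recorded_at >= '2011-10-10'
       AND recorded_at <  '2011-10-17';

    -- only the one matching weekly child table should show up in the plan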

Josh
