Re: Guidance Requested - Bulk Inserting + Queries

From: Benjamin Johnson <benjamin(dot)johnson(at)getcarbonblack(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Guidance Requested - Bulk Inserting + Queries
Date: 2011-12-01 02:00:56
Message-ID: 4ED6DFD8.3080104@getcarbonblack.com
Lists: pgsql-performance

We're trying to split the current day into hourly tables so that the
indexes under heavy write load stay much smaller, letting us sustain
more rows across the day. We are also using numerics where bigints
would do, so we're going to work on that as well to see how much
smaller we can get the tables. Once a daily table is no longer
"today", we will remove duplicates, so we can combine that step with
rolling the hourly tables up into one daily table.
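In case it helps anyone following along, here's roughly what we mean,
sketched in 9.x-style inheritance partitioning (all table and column
names below are made up for illustration):

```sql
-- Hypothetical parent table; hourly children inherit from it.
CREATE TABLE events (
    event_time  timestamptz NOT NULL,
    sensor_id   bigint      NOT NULL,  -- bigint instead of numeric
    payload     text
);

-- One child per hour of "today"; the CHECK constraint lets
-- constraint_exclusion skip irrelevant hours on reads.
CREATE TABLE events_h13 (
    CHECK (event_time >= '2011-11-30 13:00'
       AND event_time <  '2011-11-30 14:00')
) INHERITS (events);

-- End of day: fold every hourly child into one daily table,
-- removing duplicates in the same pass, then index the result.
CREATE TABLE events_20111130 (LIKE events);
INSERT INTO events_20111130
SELECT DISTINCT * FROM events;   -- parent scan hits all hourly children
CREATE INDEX events_20111130_time_idx ON events_20111130 (event_time);
```

After the daily table is indexed, the hourly children can be dropped.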

In a *small* test (1-2 orders of magnitude smaller than some potential
customer environments), the cumulative size of the daily indexes is 3.6
GB, and that's with only about half of the test data loaded.
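For anyone who wants to check the same thing on their side, something
along these lines reports the cumulative index size per table
(pg_indexes_size() exists in 9.0 and later):

```sql
-- Per-table index footprint, biggest first (PostgreSQL 9.0+).
SELECT c.relname,
       pg_size_pretty(pg_indexes_size(c.oid)) AS index_size
FROM   pg_class c
WHERE  c.relkind = 'r'
ORDER  BY pg_indexes_size(c.oid) DESC
LIMIT  20;
```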

We're talking about 4 different daily partitioned tables, each with
1-6 indexes (yes, a lot!).
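One way to realize the "re-add indexes as soon as we can" idea from the
quoted thread, sketched with the same hypothetical names as above:

```sql
-- The current hour's child takes inserts with no indexes at all.
-- Once that hour closes, build its indexes without blocking readers:
CREATE INDEX CONCURRENTLY events_h13_sensor_idx
    ON events_h13 (sensor_id);
-- Queries spanning the day then get index scans on every closed hour
-- and only seq-scan the (small) current hour.
```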

I'll post another update when I have it.

Thanks Leonardo.

On 11/30/2011 10:17 AM, Leonardo Francalanci wrote:
>> We now found (thanks Andres and Snow-Man in #postgresql) that in our
>> tests, after the indexes get too large performance drops significantly
>> and our system limps forward due to disk reads (presumably for the
>> indexes). If we remove the indexes, performance for our entire sample
>> test is great and everything is written to postgresql very quickly.
> It's usually the fact that the data you index is "random" as opposed to,
> say, an always incremented value (could be a timestamp, or a sequence)
> that leads to insert problems with btrees.
>> My question is, what possible routes can I take where we can have both
>> fast inserts (with indexes removed until the end of the day), but still
>> allow a user to query against today's data? Is this even possible? One
>> idea would be possibly have hourly tables for today and as soon as we
>> can try to re-add indexes.
> Yep, that's the only way I've found: use smaller partitions. That leads
> to slower reads (due to the fact that you have to visit more indexes to
> read the same amount of data). But you'll get faster writes.
>
>> Another possible solution might be to stream
>> the data to another "reader" postgres instance that has indexes,
>> although I'm not very versed in replication.
> I don't think you can do that.
> Another option that you have is to use SSDs instead of HDs for the indexes
> only (that is, putting the indexes in a separate tablespace on SSDs). The
> problem is that spinning disks usually can't keep up with the number of
> random writes it takes to update N "random value" btrees; SSDs might help.
> Can you post some numbers, such as # of indexes, # of rows you're trying
> to insert per hour etc etc?
>
>

--
Benjamin Johnson
http://getcarbonblack.com/ | @getcarbonblack
cell: 312.933.3612
