Re: Hash partitioning.

From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Yuri Levinsky <yuril(at)celltick(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Browne <cbbrowne(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL Mailing Lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash partitioning.
Date: 2013-06-27 16:35:24
Message-ID: CAP-rdTZogd6dFc2AuXJ9HV1J8R6pbQoNZKP-Ovn6PcBRrpk7jw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2013/6/27 Markus Wanner <markus(at)bluegap(dot)ch>:

> On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
>
>> Imagine that there are a lot of indexes, e.g., 50. Although a lookup
>> (walking one index) is equally fast, an insertion must update al 50
>> indexes. When each index requires one extra I/O (because each index is
>> one level taller), that is 50 extra I/Os. In the partitioned case,
>> each index would require the normal smaller amount of I/Os. Choosing
>> which partition to use must only be done once: The result “counts” for
>> all indexes that are to be updated.
>
> I think you're underestimating the cost of partitioning. After all, the
> lookup of what index to update for a given partition is a a lookup in
> pg_index via pg_index_indrelid_index - a btree index.

I am assuming that this (comparatively very small and super-hot) index
is cached all the time, while for the other indexes (that are
supposedly super-huge) only the top part stays cached.

I am mostly just trying to find out where Yuri’s “partitioning is
needed for super-huge tables” experience might come from, and noting
that Heikki’s argument might not be 100% valid. I think that the
“PostgreSQL-answer” to this problem is to somehow cluster the data on
the “hotness column” (so that all hot rows are close to one another,
thereby improving the efficiency of the caching of relation blocks) +
partial indexes for the hot rows (as first mentioned by Claudio; to
improve the efficiency of the caching of index blocks).

> Additionally, the depth of an index doesn't directly translate to the
> number of I/O writes per insert (or delete). I'd rather expect the avg.
> number of I/O writes per insert into a b-tree to be reasonably close to
> one - depending mostly on the number of keys per page, not depth.

My reasoning was: To determine which index block to update (typically
one in both the partitioned and non-partitioned cases), one needs to
walk the index first, which supposedly causes one additional (read)
I/O in the non-partitioned case on average, because there is one extra
level and the lower part of the index is not cached (because of the
size of the index). I think that pokes a hole in Heikki’s argument of
“it really doesn’t matter, partitioning == using one big table with
big non-partial indexes.”

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2013-06-27 16:35:33 Re: Group Commits Vs WAL Writes
Previous Message Andrew Dunstan 2013-06-27 16:19:33 Re: Kudos for Reviewers -- straw poll