Re: Air-traffic benchmark

From: Nikolas Everett <nik9000(at)gmail(dot)com>
To: "Gurgel, Flavio" <flavio(at)4linux(dot)com(dot)br>
Cc: Matthew Wakeling <matthew(at)flymine(dot)org>, Lefteris <lsidir(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org, Ivan Voras <ivoras(at)freebsd(dot)org>
Subject: Re: Air-traffic benchmark
Date: 2010-01-07 18:11:15
Message-ID: d4e11e981001071011t2e919703paedb8d24c800072c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

This table is totally unnormalized. Normalize it and try again. You'll
probably see a huge speedup. Maybe even 10x. My mantra has always been
less data stored means less data to scan means faster scans.

On Thu, Jan 7, 2010 at 12:57 PM, Gurgel, Flavio <flavio(at)4linux(dot)com(dot)br>wrote:

> ----- "Matthew Wakeling" <matthew(at)flymine(dot)org> escreveu:
> > On Thu, 7 Jan 2010, Gurgel, Flavio wrote:
> > Postgres does not change a query plan according to the shared_buffers
> >
> > setting. It does not anticipate one step contributing to another step
> > in
> > this way. It does however make use of the effective_cache_size setting
> > to
> > estimate this effect, and that does affect the planner.
>
> That was what I was trying to say :)
>
> > In a situation like this, the opposite will be true. If you were
> > accessing
> > a very small part of a table, say to order by a field with a small
> > limit,
> > then an index can be very useful by providing the results in the
> > correct
> > order. However, in this case, almost the entire table has to be read.
> >
> > Changing the order in which it is read will mean that the disc access
> > is
> > no longer sequential, which will slow things down, not speed them up.
> >
> > The Postgres planner isn't stupid (mostly), there is probably a good
> > reason why it isn't using an index scan.
>
> Sorry but I disagree. This is the typical case where the test has to be
> made.
> The results are partial, let's say 50% of the table. Considerind that the
> disk is fast enough, the cost estimation of sequential and random reads are
> in a proportion of 1 to 4, considering default settings in PostgreSQL.
>
> If, and only if the data is time distributed in the table (which can be
> this case during bulk load) there will be some gain in seqscan.
> If, let's say 50% of the 50% (25% of the data) is time distributed (which
> can be the case in most data warehouses), the cost of random reads * number
> of reads can be cheaper then seqscan.
>
> The volume of data doesn't turn the index generations so deep, let's say
> 2^7 or 2^8. This can lead to very fast data retrieval.
>
> > > Ugh. I don't think so. That's why indexes were invented. PostgreSQL
> > is
> > > smart enough to "jump" over columns using byte offsets.
> > > A better option for this table is to partition it in year (or
> > year/month) chunks.
> >
> > Postgres (mostly) stores the columns for a row together with a row, so
> >
> > what you say is completely wrong. Postgres does not "jump" over
> > columns
> > using byte offsets in this way. The index references a row in a page
> > on
> > disc, and that page is fetched separately in order to retrieve the
> > row.
> > The expensive part is physically moving the disc head to the right
> > part of
> > the disc in order to fetch the correct page from the disc - jumping
> > over
> > columns will not help with that at all.
>
> If the index point to the right pages, I keep the circumstance of 1 to 4
> cost.
> Agreed about seqscans. When I talked about byte offsets I was talking of
> data in the same disk page, and this does not help I/O reduction at all.
>
> > Reducing the width of the table will greatly improve the performance
> > of a
> > sequential scan, as it will reduce the size of the table on disc, and
> >
> > therefore the time taken to read the entire table sequentially.
>
> I just don't understand if you're seeing this situation as OLTP or DW,
> sorry.
> DW tables are usually wider then OLTP.
>
> > Moreover, your suggestion of partitioning the table may not help much
> > with
> > this query. It will turn a single sequential scan into a UNION of many
> >
> > tables, which may be harder for the planner to plan. Also, for queries
>
> Partitioned plans are a collection of an independent plan for each table in
> the inheritance.
> If the data to be retrieved is confined in selected partitions, you won't
> seqscan the partitions you don't need.
> The cost of the "union" of the aggregations in memory is a lot cheaper then
> the avoided seqscans.
>
> I have at least 3 cases of partitioning in queries exactly like this that
> droped from 3min to 5s execution times.
> All of that DW tables, with aggregation and huge seqscans.
>
> I keep my word that the right use of indexes here has to be tested.
>
> Flavio Henrique A. Gurgel
> tel. 55-11-2125.4786
> cel. 55-11-8389.7635
> www.4linux.com.br
> FREE SOFTWARE SOLUTIONS
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Brian Cox 2010-01-07 18:43:25 Re: query looping?
Previous Message Gurgel, Flavio 2010-01-07 18:10:21 Re: Air-traffic benchmark