
Re: Index with all necessary columns - Postgres vs MSSQL

From: Gudmundur Johannesson <gudmundur(dot)johannesson(at)gmail(dot)com>
To: Igor Neyman <ineyman(at)perceptron(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Index with all necessary columns - Postgres vs MSSQL
Date: 2012-02-07 17:59:33
Message-ID: CAHNvtn-+1A0YMA0AtZafaoGSEYbREuvdZ62jXmMNF4=va4XjeA@mail.gmail.com
Lists: pgsql-performance
On Tue, Feb 7, 2012 at 3:11 PM, Igor Neyman <ineyman(at)perceptron(dot)com> wrote:

> From: Gudmundur Johannesson [mailto:gudmundur(dot)johannesson(at)gmail(dot)com]
> Sent: Thursday, February 02, 2012 11:42 AM
> To: Merlin Moncure
> Cc: pgsql-performance(at)postgresql(dot)org
> Subject: Re: Index with all necessary columns - Postgres vs MSSQL
>
> Hi,
>
> I want to start by thanking you guys for a quick response and I will try
> to provide all the information you request.
>
> 1) What version am I running:
> "PostgreSQL 9.1.2, compiled by Visual C++ build 1500, 64-bit"
>
> 2) Schema:
> CREATE TABLE test( id integer,  dtstamp timestamp without time zone,
> rating real) WITH ( OIDS=FALSE);
> CREATE INDEX test_all ON test USING btree  (id , dtstamp, rating);
> 200M rows
> Table size 9833MB
> Index size 7653 MB
>
> 3) Difference between the first and the second run time?
> The statement executed is:
> SELECT count(1) FROM test
> WHERE id in (58,83,88,98,124,141,170,195,
> 202,252,265,293,305,331,348)
> AND dtstamp between cast('2011-10-19 08:00:00' as timestamp)  and
> cast('2011-10-19 16:00:00' as timestamp)
> a) 1st run = 26 seconds
> b) 2nd run = 0.234 seconds
> c) 3rd-6th run = 0.06 seconds
>
> If I perform the query above for another day then I get 26 seconds for the
> 1st query.
>
> 4) What was the execution plan of it
> "Aggregate  (cost=151950.75..151950.76 rows=1 width=0)"
> "  ->  Bitmap Heap Scan on data_cbm_reading cbm  (cost=1503.69..151840.82
> rows=43974 width=0)"
> "        Recheck Cond: ((virtual_id = ANY
> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
> "        ->  Bitmap Index Scan on data_cbm_reading_all
> (cost=0.00..1492.70 rows=43974 width=0)"
> "              Index Cond: ((virtual_id = ANY
> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
>
> 5) In this case, I shut down the mssql server/machine and restart it.  To
> be on the safe side, I ensured the cache is empty using dbcc freeproccache
> and dbcc dropcleanbuffers.
> Then I tried the same statement as above:
> a) 1st run = 0.8 seconds
> b) 2nd, 3rd, ... run = 0.04 seconds
> c) changing the select statement to any other day and running it again
> gives a 1st run of 0.5 seconds
> d) 2nd, 3rd, ... run = 0.04 seconds
>
> 6) You wrote "I doubt covering indexes is going to make that query 23x
> faster."
> I decided to check out how mssql performs if it cannot use a covering
> index.  In order to do that, I drop my current index and create it again on
> id, dtstamp.  That forces mssql to look into the data file and the index is
> no longer sufficient.
> Running the following statement forces the "rating" column to be accessed:
> select sum(rating)
> FROM test
>                WHERE id in
> (58,83,88,98,124,141,170,195,202,252,265,293,305,331,348)
>                AND dtstamp >= '2011-10-19 08:00:00' AND dtstamp <=
> '2011-10-19 16:00:00'
> a) 1st run = 20 seconds
> b) 2nd run = 0.6 seconds
> c) 3rd, ... run = 0.3 seconds
> As you can see the response time gets just as bad as in Postgres.
> Now lets recreate the mssql index with all the columns and double check
> the response time:
> a) 1st run = 2 seconds
> b) 2nd run = 0.12 seconds
> c) 3rd, ... run = 0.3 seconds
>
>
> Therefore, I must conclude that in the case of mssql the "covering" index
> is making a huge impact.
>
> I have spent the whole day providing this data (takes a while to shuffle
> 200M rows) and tomorrow I will try your suggestion regarding two indexes.
>
> Do you think I should try using the latest build of the source for 9.2
> since index-only-scan is "ready" according to
> http://www.depesz.com/index.php/2011/10/08/waiting-for-9-2-index-only-scans/
> ?
>
> Thanks,
>    - Gummi
>
>
> Gudmundur,
>
> Just for clarification purposes:
>
> This schema:
>
> CREATE TABLE test( id integer,  dtstamp timestamp without time zone,
>  rating real) WITH ( OIDS=FALSE);
> CREATE INDEX test_all ON test USING btree  (id , dtstamp, rating);
>
> and this query plan:
>
> "Aggregate  (cost=151950.75..151950.76 rows=1 width=0)"
> "  ->  Bitmap Heap Scan on data_cbm_reading cbm  (cost=1503.69..151840.82
> rows=43974 width=0)"
> "        Recheck Cond: ((virtual_id = ANY
> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
> "        ->  Bitmap Index Scan on data_cbm_reading_all
>  (cost=0.00..1492.70 rows=43974 width=0)"
> "              Index Cond: ((virtual_id = ANY
> ('{58,83,88,98,124,141,170,195,202,252,265,293,305,331,348}'::integer[]))
> AND ("timestamp" >= '2011-10-19 08:00:00'::timestamp without time zone) AND
> ("timestamp" <= '2011-10-19 16:00:00'::timestamp without time zone))"
>
> reference different table and index names.
> Also, EXPLAIN ANALYZE would provide additional info compared to just
> EXPLAIN.
>
> One option you could try is to cluster your table on the "test_all"
> index, and see if it makes a difference.
> BTW., in SQL Server your "covering" index - is it clustered?
>
> Regards,
> Igor Neyman
>
>

Hi Igor,

1) I "simplified" the names when posting originally and forgot to replace
the names in the analyze output.  Sorry about the confusion.

2) The index in mssql is not clustered.
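(For the Postgres side of that question, Igor's CLUSTER suggestion would look roughly like this, using the simplified table and index names from my earlier post -- just a sketch, untested on the real 200M-row table:)

```sql
-- Physically reorder the heap to match the index order.  Note that on
-- 9.1 this is a one-off: the ordering is not maintained for later
-- inserts, so it would need re-running periodically.
CLUSTER test USING test_all;
ANALYZE test;
```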

3) I am now testing partitioning the 200-million-row table into one
partition per day to see how it performs.
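(In case it helps anyone following along: 9.1 has no declarative partitioning, so what I am testing is the usual inheritance/constraint-exclusion approach, roughly like this -- names are illustrative:)

```sql
-- One child table per day; the CHECK constraint lets the planner skip
-- partitions that cannot match the queried date range.
CREATE TABLE test_20111019 (
    CHECK (dtstamp >= '2011-10-19' AND dtstamp < '2011-10-20')
) INHERITS (test);
CREATE INDEX test_20111019_all ON test_20111019 (id, dtstamp, rating);

-- 'partition' is the default setting; it makes the planner use the
-- CHECK constraints above to prune child tables.
SET constraint_exclusion = partition;
```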

4) I compiled and installed Postgres 9.2 and proved to myself that
Postgres no longer looks into the table and relies on the index alone.
Therefore, this is looking bright at the moment.
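(For anyone who wants to reproduce the check: an index-only scan shows up directly in the plan. A sketch -- the costs and row counts will of course differ:)

```sql
-- A VACUUM keeps the visibility map current, which 9.2 needs in order
-- to skip the heap fetches entirely.
VACUUM ANALYZE test;

EXPLAIN SELECT count(1) FROM test
 WHERE id IN (58, 83, 88, 98)
   AND dtstamp BETWEEN '2011-10-19 08:00:00' AND '2011-10-19 16:00:00';
-- Look for "Index Only Scan using test_all on test" in the output.
```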

5) I must deliver the db for production in June, and it does not sound
wise to do that on 9.2 (unless it has been released by then).

Thanks,
   - Gummi

