Testing of various opclasses for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Testing of various opclasses for ranges
Date: 2012-07-09 23:33:03
Message-ID: CAPpHfduAJ_pX_v-MFraE7x2sgMSJ6jEty=s5eCFGT8d1axvzOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hackers,

I've tested various opclasses for ranges (including currently in-core one
and my patches). I've looked into scholar papers for which datasets they
are using for testing. The lists below show kinds of datasets used in
papers.

1) "Advanced Indexing Technique for Temporal Data", "Indexing Valid Time
Intervals" uses two kinds of datasets:
a) uniformly distributed start of range and uniformly distributed size of
range
b) uniformly distributed start of range and exponentially distributed
size of range

2) "The Time index+: An incremental access structure for temporal
databases" uses dataset produced by simulating of objects "life". Each
object consists of number of versions which lifetime ranges are adjuncted.
Start of object life is uniformly distributed. Object lifetime is normally
distributed. Version time is exponentially distributed.

3) "Top-k Queries on Temporal Data"
a) Datasets of clusters. Each cluster center is normally distributed.
Offset of range inside cluster is also normally distributed. Range size is
distributed exponentially.
b) Real-life datasets http://www.cs.ucr.edu/~eamonn/time_series_data/.
Unfortunately I can't get password for them ((((

4) "Indexing Valid Time Intervals" uses two kinds of datasets
a) uniformly distributed start of range and exponentially distributed
size of range
b) uniformly distributed start of range and normally distributed size of
range

5) "Managing intervals efficiently in object-relational databases",
"Segment indexes: dynamic indexing techniques for multi-dimensional
interval data"
a) uniformly distributed start of range and exponentially distributed
size of range
b) uniformly distributed start of range and uniformly distributed size of
range
a) exponentially distributed start of range and exponentially distributed
size of range
b) exponentially distributed start of range and uniformly distributed
size of range

Therefore 3 basic distributions are used in synthetic datasets:
1) uniform
2) exponential
3) normal

Datasets can be classified into 3 kinds:
1) simple: ranges are distributed independently
2) clusters: ranges are grouped into clusters
3) lifetime: ranges are produced by life simulation

Each kind of dataset require some distributions for generation:
1) simple: range start and range size
2) clusters: cluster center, range offset, range size
3) lifetime: object lifetime start, object lifetime length, version length

In my testsuite each of 3 distribution is used in each slot. Additionally
mean size of range was varied. See attached range_test.php and
range_test_schema.sql.

I've merged all 3 patches into 1 (see 2d_map_range_indexing.patch). In this
patch following opclasses are available for ranges:
1) range_ops - currently in-core GiST opclass
2) range_ops2 - GiST opclass based on 2d-mapping
3) range_ops_quad - SP-GiST quad tree based opclass
4) range_ops_kd - SP-GiST k-d tree based opclass

There is average results of index build time depending on used operator
class.

test=# select opclass, avg(buildtime) from indexes group by opclass order
by opclass;
opclass | avg
----------------+------------------
range_ops | 16.1305697569772
range_ops2 | 21.6557774392386
range_ops_quad | 6.1000143980223
range_ops_kd | 4.97456835754334
(4 rows)

2d-mapping GiST is longer for build than 1d GiST. It seems to be inevitable
because 2d-mapping GiST has to try split in both dimensions and it has more
complicated calculations in penalty too. K-d tree is faster for build than
quad tree, I don't have convincing explanation why.

There is average number of page hits and average execution in milliseconds
for test queries depending on operator and opclass. We can see that
2d-mapping GiST use about 2 times less pages for "<@" operators. However,
it use a little more pages in "@>" and "&&" while I expected superiority
also for "@>". But average time of query execution is lower for all
operators. Probably, it's because consistent function appears to be more
efficiently implemented.
SP-GiST uses more pages than GiST. Especially k-d tree. However, some pages
could be counted more than once, but I think it's OK to compare SP-GiST
opclasses by this parameter between themselves. Also, k-d tree appears to
be slower than quad tree.

test=# select tr.operator, opclass, avg(tr.hits), avg(tr.time) from
test_results tr join indexes i on i.id = tr.index_id where tr.count > 0
group by tr.operator, i.opclass order by tr.operator, i.opclass;
operator | opclass | avg | avg
----------+----------------+----------------------+------------------
<@ | range_ops | 104.6797374137490417 | 5.63374670968584
<@ | range_ops2 | 57.6100418476871965 | 4.38479106503973
<@ | range_ops_kd | 362.4359706427293637 | 5.42760708296065
<@ | range_ops_quad | 185.4336426654740608 | 4.41001823648733
@> | range_ops | 102.7120835405271009 | 3.95064963751995
@> | range_ops2 | 112.8144023659347274 | 3.64920412729987
@> | range_ops_kd | 318.5045800727577272 | 3.54148674396084
@> | range_ops_quad | 118.8078201470857651 | 2.52750201523206
&& | range_ops | 104.6941111111111111 | 7.07480315079371
&& | range_ops2 | 106.7263531746031746 | 6.45640819841263
&& | range_ops_kd | 426.3542380952380952 | 7.41615567063479
&& | range_ops_quad | 247.2961468253968254 | 6.17403029761907
(12 rows)

In order to find why test results don't always meet my expectations I
filter tests to show only low-selectivity queries. Now, we can see
2d-mapping GiST is much more dramatically faster than 1d GiST on "<@" and
use slightly less amount of pages than 1d GiST. k-d tree appears to be much
slower than quad tree and use dramatically more pages. Probably, it is
caused by some bug. I will examine it.

test=# select tr.operator, opclass, avg(tr.hits), avg(tr.time) from
test_results tr join indexes i on i.id = tr.index_id where tr.count > 0 and
tr.count < 100 group by tr.operator, i.opclass order by tr.operator,
i.opclass;
operator | opclass | avg | avg
----------+----------------+----------------------+-------------------
<@ | range_ops | 111.1455328421708000 | 2.36361217183771
<@ | range_ops2 | 6.2042648127010480 | 0.181787537615442
<@ | range_ops_kd | 259.1292933485524541 | 1.39777430735706
<@ | range_ops_quad | 54.7487288575282764 | 0.318179931513958
@> | range_ops | 10.6586927630784101 | 0.309286698079016
@> | range_ops2 | 8.9114413434819379 | 0.215462788449922
@> | range_ops_kd | 286.6884740848133382 | 1.54235544279329
@> | range_ops_quad | 39.8035520115984052 | 0.231929382626555
&& | range_ops | 3.6001236093943140 | 0.217711372064277
&& | range_ops2 | 3.7082818294190358 | 0.18267737948084
&& | range_ops_kd | 299.8071693448702101 | 2.02300927070457
&& | range_ops_quad | 24.7459826946847960 | 0.183641532756489
(12 rows)

You can download full results of testing at
http://test.i-gene.ru/uploads/range_test.sql.gz.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
2d_map_range_indexing.patch application/octet-stream 121.5 KB
range_test.php application/x-httpd-php 7.9 KB
range_test_schema.sql application/octet-stream 7.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2012-07-10 01:23:56 has_language_privilege returns incorrect answer for non-superuser
Previous Message Josh Berkus 2012-07-09 22:15:18 Re: Synchronous Standalone Master Redoux