Re: Real-life range datasets

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Real-life range datasets
Date: 2011-12-22 15:48:03
Message-ID: Pine.LNX.4.64.1112221945580.14072@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Bene,

we have pgfoundry project http://pgfoundry.org/projects/dbsamples/.
Since your sample database is very important (for me also), I suggest to use
this site.

Oleg
On Thu, 22 Dec 2011, Benedikt Grundmann wrote:

> Hello,
>
> We have a table in a postgres 8.4 database that would make use of date
> ranges and exclusion constraints if they were available. Sadly I cannot
> give you the data as it is based on data we are paying for and as part
> of the relevant licenses we are obliqued to not give the data to third
> parties.
>
> Basically the tables keep meta data about financial instruments on
> a given day. Thanks to corporate actions (companies merging, splitting
> up, etc...) that meta data can be different from one day to the next.
>
> One way to model such a table is:
>
> identifier, date, payload columns...
>
>
> unique index on (date, identifier)
>
> (and in fact some of the payload columns are unique per day indices
> as well).
>
> But because there are a large number of rows per day and most don't
> change this is a very wasteful representation.
>
> Instead we use this
>
> identifier, effective_from, effective_until, payload columns...
>
> And we have some clever plpgsql functions that merge a days snapshot
> into that representation. That happens only a few times per day and
> is currently quite slow mostly because a lot of time is spend in
> validation triggers to check that there are no overlapping entries
> for effective_from,effective_until for jane_symbol and a few other
> identifiers.
>
> The most common operations are:
>
> Get all or most rows of a given day
>
> (select ... from instruments where :date between effective_from and effective_until)
>
> left join of the instruments (again in the normal case constrained to
> one day but in same cases periods of a week or a few month)
>
> select ... from t left join instruments on
> t.jane_symbol = instruments.jane_symbol
> t.date between instruments.effective_from and t.effective_until
> where t.date = X
> and additional constraint on the number of rows from t
>
> With t a huge table clustered on date with roughly 500,000 to 2,000,000
> entries per day. The left join would work most of the time (my guess is
> more than 90%). But there are entries in t where the jane_symbol would
> not be in instruments (sadly).
>
> Current size (immediately after a cluster):
>
> table toast (all indices) total
> | 1268 MB | 900 MB | 693 MB | 2861 MB
>
> => select min(effective_from), max(effective_from) from instruments;
> min | max
> ------------+------------
> 2011-05-30 | 2011-12-21
> (1 row)
>
> b=> select count(*) from instruments where current_date - 1 between effective_from and effective_until ;
> count
> --------
> 358741
> (1 row)
>
> I should be able to give you a table with the same characteristics as
> the instruments table but bogus data by replacing all entries in the
> table with random strings of the same length or something like that.
> I can probably take a little bit of time during this or the next week
> to generate such "fake" real world data ;-) Is there an ftp site to
> upload the gzipped pg_dump file to?
>
> Cheers,
>
> Bene
>
> On 20/12/11 16:48, Alexander Korotkov wrote:
>> Hackers,
>>
>> For better GiST indexing of range types it's important to have real-life
>> datasets for testing on. Real-life range datasets would help to proof (or
>> reject) some concepts and get more realistic benchmarks. Also, it would be
>> nice to know what queries you expect to run fast on that datasets. Ideally
>> it should be real-life set of queries, but it also could be your
>> presentation of what are typical queries for such datasets.
>> Thanks!
>>
>> -----
>> With best regards,
>> Alexander Korotkov.
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Benedikt Grundmann 2011-12-22 15:51:48 Re: Typed hstore proposal
Previous Message Tom Lane 2011-12-22 15:44:48 Re: Typed hstore proposal