Re: Index creation takes more time?

From: Herouth Maoz <herouth(at)unicell(dot)co(dot)il>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, pgsql-general(at)postgresql(dot)org, tv(at)fuzzy(dot)cz
Subject: Re: Index creation takes more time?
Date: 2012-09-19 11:26:30
Message-ID: D43CCEEB-E655-4BBE-B037-D046B0BF2889@unicell.co.il
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On 18/09/2012, at 20:19, Jeff Janes wrote:
> I think the one below will show an even larger discrepancy. You are
> doing 2 casts for each comparison,
> so I think the casts overhead will dilute out the comparison.
>
> select count(distinct foo) from ( select cast(random() as varchar(14)) as foo
> from generate_series (1,100000000)) asdf;

Actually, it doesn't. I suspect that it doesn't actually do string comparison per se. I don't know how "distinct" is implemented in PostgreSQL, but if it was me, I'd implement it with a hash table, which means that you calculate the hash of the string rather than compare it. Even if it is done with actual comparison, I don't think it's a collation-based comparison, but rather a byte-by-byte comparison.

>
>
>> Finally, I created a test table, as you asked:
>>
>>
>>> create table foo as select msisdn,sme_reference from
>>> sms.billing__archive limit 1000000;
>>
>> Then I created an index on the msisdn and sme_reference columns together.
>> 99% of the data in the msisdn field consist of 11-digit phone numbers.
>> Result:
>>
>> PC: 5792.641 ms
>> Server: 23740.470 ms
>>
>> Huge discrepancy there.
>
> try:
> create index ON foo (msisdn COLLATE "C", sme_reference) ;
>
> This can only be done on 9.1 server, as that feature is new to that
> release. It should be much faster to create than the index with
> default collation.
>
> (or change the collation of msisdn column definition, rather than just
> in the index).
>
> This assumes you just need the index for equality, not for some
> precise locale-specific ordering (which for phone numbers seems like a
> safe bet).

Yes, this certainly reduced the index creation time to within a reasonable margin. OK, now we have to decide whether to move the entire database to the 'C' collation (which would require, I suppose, a dump and restore) with the option of changing collation for specific columns that actually need it, or to just solve the current problem by changing the index creation commands where relevant.

Thank you very much for your help with this issue, your input has been invaluable.

Herouth

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Guilherme Rodrigues 2012-09-19 11:42:16 Re: Change key primary for key foreign
Previous Message Raymond O'Donnell 2012-09-19 09:42:14 Re: Column aliases in WHERE clauses