Re: Fastest char datatype

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert James <srobertjames(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Fastest char datatype
Date: 2009-07-20 21:24:57
Message-ID: b42b73150907201424l1f5e5621p1b8ae03caa0e61e5@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sun, Jul 19, 2009 at 9:46 PM, Robert James<srobertjames(at)gmail(dot)com> wrote:
> I'm storing a lot of words in a database.  What's the fastest format for
> finding them? I'm going to be doing a lot of WHERE w LIKE 'marsh%' and WHERE
> w IN ('m', 'ma').  All characters are lowercase a-z, no punctuation, no
> other alphabets.  By default I'm using varchar in utf-8 encoding, but was
> wondering if I could specificy something else (perhaps 7bit ascii, perhaps
> lowercase only) that would speed things up even further.

All the charater types are basically the same except for char(n) which
pads out the string on disk. Reading downthread, [a-z] needs more
than 4 bits (4 bits could only represent 16 characters). 5 bits is a
very awkward number in computing, which may explain why this type of
encoding is rarely done. Coming from the 'cobol' world, where there
were all kinds of zany bit compressed encodings, I can tell you that
the trend is definitely in the other direction...standard data layouts
coupled with well known algorithms.

Any type of simple bitwise encoding that would get you any space
benefit would mean converting your text fields to bytea. This would
mean that any place you needed to deal with your text field as text
would require running your data through a decoder function...you would
encode going into the field and decode going out...ugh.

Better would be to use a functional index:
create index foo_idx on foo(compactify(myfield));

If you don't need index ordering, then you could swap a hash function
for compactify and have it return type 'int'. This should give the
best possible performance (probably better than the built in hash
index). You would probably only see a useful benefit if your average
string length was well over 10 characters though. In the end though,
I bet you're best off using a vanilla text field/index unless you
expect your table to get really huge. PostgreSQL's btree
implementation is really quite amazing.

merlin

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Kevin Grittner 2009-07-20 21:42:31 Re: Full text search with ORDER BY performance issue
Previous Message Kevin Grittner 2009-07-20 21:03:44 Re: Calling conventions