text datum VARDATA and strings

From: Reece Hart <reece(at)harts(dot)net>
To: pgsql-general(at)postgresql(dot)org
Cc: Michael Enke <michael(dot)enke(at)wincor-nixdorf(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: text datum VARDATA and strings
Date: 2006-08-14 18:04:30
Message-ID: 1155578671.4158.45.camel@tallac.gene.com
Lists: pgsql-bugs pgsql-general

Michael Enke recently asked in pgsql-bugs about VARDATA and C strings
(BUG #2574: C function: arg TEXT data corrupt). Since that's not a bug,
I've moved this follow-up to pgsql-general.

On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
> The usual way to get a C string from a TEXT datum is to call textout,
> eg
> str = DatumGetCString(DirectFunctionCall1(textout, datumval));

Yikes! I've been accessing VARDATA text data like Michael for years
(code below). I account for length and don't expect null-termination,
but I don't use anything like Tom's suggestion above. (I always try to
do what Tom says because that usually hurts less.)

I have three questions:

1) I based everything I did on examples lifted nearly verbatim from a
7.x manual, and I bet Michael did similarly. I've never heard of
DatumGetCString, DirectFunctionCall1, or textout. Are these and other
treasures documented somewhere?

2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do
something other than null terminate a string? All of the strings are
from [-A-Z0-1*]; server_encoding has been either SQL_ASCII or UTF8 in
case that's relevant.

3) Is there any reason to believe that the code below is problematic?
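
A minimal sketch of what question 2 is asking about, assuming the conversion amounts to a length-bounded copy followed by a terminator. This is standalone C, not backend code: fake_text and fake_text_to_cstring are hypothetical stand-ins, malloc stands in for palloc, and the real varlena header layout is not modeled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a text datum: a byte count plus the bytes,
   with no trailing NUL. Not the real varlena layout. */
typedef struct {
    size_t      len;    /* number of data bytes */
    const char *data;   /* NOT null-terminated */
} fake_text;

/* Copy exactly len bytes into a fresh buffer and append '\0'.
   Returns malloc'd memory (caller frees), or NULL if malloc fails. */
static char *fake_text_to_cstring(const fake_text *t)
{
    char *s = malloc(t->len + 1);

    if (s == NULL)
        return NULL;
    memcpy(s, t->data, t->len);
    s[t->len] = '\0';
    return s;
}
```

The point of the sketch is that the copy is bounded by the stored length, so bytes that happen to follow the data in memory never reach the C string. Whether textout does anything beyond this (for example, encoding conversion) is exactly what the question leaves open.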

Thanks,
Reece

#include <postgres.h>
#include <fmgr.h>
#include <ctype.h>
#include <string.h>

static char* clean_sequence(const char* in, int32 n);

PG_FUNCTION_INFO_V1(pg_clean_sequence);
Datum pg_clean_sequence(PG_FUNCTION_ARGS)
{
    text* t0;       /* in */
    text* t1;       /* out */
    char* tmp;
    int32 tmpl;

    if ( PG_ARGISNULL(0) )
        { PG_RETURN_NULL(); }

    t0 = PG_GETARG_TEXT_P(0);

    tmp = clean_sequence( VARDATA(t0), VARSIZE(t0)-VARHDRSZ );
    tmpl = (int32) strlen(tmp);

    /* copy temp sequence into new pg variable */
    t1 = (text*) palloc( tmpl + VARHDRSZ );
    if (!t1)
        { elog( ERROR, "couldn't palloc (%d bytes)", tmpl+VARHDRSZ ); }
    memcpy(VARDATA(t1),tmp,tmpl);
    VARATT_SIZEP(t1) = tmpl + VARHDRSZ;

    pfree(tmp);

    PG_RETURN_TEXT_P(t1);
}

/* clean_sequence -- strip non-IUPAC symbols
   The intent is to strip non-sequence data which might result from
   copy-pasting a fasta file or some such.

   in:  char*, length
   out: char*, |out|<=length, NULL-TERMINATED
        out is palloc'd memory; caller must free

   allow chars from IUPAC std 20
   + selenocysteine (U) + ambiguity (BZX) + gap (-) + stop (*)
*/

#define isseq(c) ( ((c)>='A' && (c)<='Z' && (c)!='J' && (c)!='O') \
                   || ((c)=='-') \
                   || ((c)=='*') )

char* clean_sequence(const char* in, int32 n) {
    char* out;
    char* oi;
    int32 i;

    out = palloc( n + 1 );      /* w/null */
    if (!out)
        { elog( ERROR, "couldn't palloc (%d bytes)", n+1 ); }

    for( i=0, oi=out; i<=n-1; i++ ) {
        char c = toupper(in[i]);
        if ( isseq(c) ) {
            *oi++ = c;
        }
    }
    *oi = '\0';
    return(out);
}
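
For trying the filtering logic outside the server, here is a standalone sketch of the same idea. It is an illustration, not the code above: is_seq_char and clean_sequence_standalone are hypothetical names, malloc replaces palloc, and the input is cast to unsigned char before toupper, which the C library requires for bytes that could be negative as plain char.

```c
#include <ctype.h>
#include <stdlib.h>

/* Same character set as the isseq macro: A-Z except J and O, plus
   gap (-) and stop (*). Assumes an ASCII-compatible character set. */
static int is_seq_char(int c)
{
    return (c >= 'A' && c <= 'Z' && c != 'J' && c != 'O')
        || c == '-'
        || c == '*';
}

/* Uppercase the first n bytes of in and keep only sequence characters.
   Returns a malloc'd, null-terminated string (caller frees), or NULL
   if malloc fails. */
static char *clean_sequence_standalone(const char *in, size_t n)
{
    char  *out = malloc(n + 1);   /* room for the terminator */
    char  *oi = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        int c = toupper((unsigned char) in[i]);

        if (is_seq_char(c))
            *oi++ = (char) c;
    }
    *oi = '\0';
    return out;
}
```

As in the original, the function reads exactly n bytes and never relies on the input being null-terminated; only the output carries a terminator.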

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
