Re: Help: Function for splitting VARCHAR column and migrating its data to 2 new tables

From: Tony Wasson <ajwasson(at)gmail(dot)com>
To: ogjunk-pgjedan(at)yahoo(dot)com
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Help: Function for splitting VARCHAR column and migrating its data to 2 new tables
Date: 2005-05-24 18:15:12
Message-ID: 6d8daee305052411157a750bc7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 5/23/05, ogjunk-pgjedan(at)yahoo(dot)com <ogjunk-pgjedan(at)yahoo(dot)com> wrote:
> I am restructuring my DB schema and need help migrating data from 1
> column of an existing table to two new tables. I have some Java code
> that can do this for me, but it's very slow, and I am now hoping I can
> migrate this data with some clever SQL instead.
>
> Here are my 3 tables:
>
> user_data (existing, old table)
> ---------
> id (PK),
> user_id (FK)
> keywords VARCHAR(256)
> -- this contains comma separated keywords
> -- e.g. "new york,san francisco, dallas, food"
> -- also "keywords without strings are really just 1 keyword"
> add_date TIMESTAMP
>
>
> So now I'm trying to migrate this "keywords" VARCHAR column to a more
> normalized schema:
>
> user_data_keyword (new lookup table to populate)
> -----------------
> id (PK) -- I may change PK to PK(user_data_id, keyword_id)
> user_data_id (FK)
> keyword_id (FK)
>
>
> keyword (new table to populate)
> -------
> id (PK)
> name VARCHAR(64) NOT NULL UNIQUE
> add_date TIMEZONE
>
>
> I just found
> http://www.postgresql.org/docs/current/static/functions-string.html ,
> but if anyone could lend me a hand by getting me started with writing a
> function for this, I'd really appreciate it.
>
> Thanks,
> Otis
>

I am not aware of any extremely clever SQL to make this ALL happen.
However you can do everything with pl/pgsql. My contribution below
will build the keyword table for you. Once you have this working and
you understand it, you could extend it to build your user_data_keyword
table.

CREATE OR REPLACE FUNCTION split_on_commas(TEXT) RETURNS SETOF TEXT
LANGUAGE 'plpgsql' AS '
DECLARE
mystring ALIAS FOR $1;
incomma BOOLEAN := FALSE; -- ## Catch the first word
endpos INTEGER; -- end character
startpos INTEGER := 0;
pos INTEGER;
outstring TEXT;
BEGIN
SELECT INTO endpos CHAR_LENGTH(mystring);
FOR pos IN 1 .. endpos
LOOP
-- There are 2 single quotes, a space, and 2 single quotes below:
IF SUBSTRING(mystring,pos,1) = '','' THEN
incomma := TRUE;
outstring := SUBSTRING(mystring,startpos,pos-startpos);
RETURN NEXT outstring;
ELSE
IF incomma IS TRUE THEN
incomma := FALSE;
startpos := pos;
END IF;
END IF;
END LOOP;
IF incomma IS FALSE THEN -- ## Catch the last phrase
outstring := SUBSTRING(mystring,startpos,endpos);
RETURN NEXT outstring;
END IF;
RETURN;
END;
';

-- try it! You will get a set of keyword all split out
SELECT * FROM split_on_commas('foo1,bar2,foo3');

From here, we need a migration function that will loop through row(s)
in your table. I also noticed that some of your words may have leading
and trailing spaces. The function drop_first_and_last_space should
take care of that. Notice that I use this function below in the
build_keyword_table function.

CREATE OR REPLACE FUNCTION drop_first_and_last_space(TEXT) RETURNS
TEXT LANGUAGE 'plpgsql' AS '
DECLARE
in_string ALIAS FOR $1;
out_msg TEXT;
BEGIN
out_msg := in_string;
IF SUBSTRING(out_msg,1,1) = '' '' THEN
RAISE NOTICE ''dropping leading space'';
out_msg := substring(out_msg,2,length(out_msg));
END IF;
--and the last space
IF SUBSTRING(out_msg,length(out_msg),1) = '' '' THEN
RAISE NOTICE ''dropping trailing space'';
out_msg := substring(out_msg,1,length(out_msg)-1);
END IF;
RETURN out_msg;
END
';

-- This function uses the INSERT ((SELECT)EXCEPT(SELECT)); syntax
mentioned in http://www.varlena.com/varlena/GeneralBits/19.htm. Note
that I am unsure why I didn't need to use FOR r in EXECUTE ''SELECT...
syntax. But this did work in my small scale test.

CREATE OR REPLACE FUNCTION build_keyword_table(TEXT) RETURNS TEXT
LANGUAGE 'plpgsql' AS '
DECLARE
match ALIAS FOR $1;
rec RECORD;
r RECORD;
out TEXT := ''done'';
BEGIN
FOR rec IN SELECT keywords
FROM user_data
WHERE id SIMILAR TO match
ORDER BY id
LOOP
RAISE NOTICE '' working on %'',rec.keywords;
FOR r IN SELECT drop_first_and_last_space(split_on_commas) AS
kw FROM split_on_commas(rec.keywords)
LOOP
RAISE NOTICE ''trying to insert - %'',r.kw;
INSERT INTO keyword (name) ( (SELECT r.kw) EXCEPT (SELECT
name FROM keyword WHERE name=r.kw));
END LOOP;
END LOOP;
RETURN out;
END;
';

Then try it all together, like so:
SELECT build_keyword_table('%');
The input goes to a SIMILAR TO on id - and % means all elements. You
can match a single row by using something like SELECT
build_keyword_table('123');

I also used this as my test data... It worked for me!

CREATE TABLE user_data (
id SERIAL,
user_id INTEGER,
keywords VARCHAR(256) NOT NULL,
add_date TIMESTAMP,
PRIMARY KEY(id)
);

INSERT INTO user_data (keywords) VALUES ('new york,san francisco,
dallas, food');
INSERT INTO user_data (keywords) VALUES ('phoenix, hot, summer, fun');

CREATE TABLE keyword (
name VARCHAR(64) NOT NULL,
id SERIAL,
add_date TIMESTAMP,
PRIMARY KEY(name)
);
-- todo put a UNIQUE INDEX on keyword (id)

I hope this helps.
Tony Wasson

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Alain 2005-05-24 18:26:50 Tip ?
Previous Message Tom Lane 2005-05-24 18:12:25 Re: [SQL] could not devise a query plan