Re: Need help: Find dirty rows, Update, Delete SQL

From: Janning Vygen <vygen(at)gmx(dot)de>
To: pgsql-sql(at)postgresql(dot)org
Cc: ogjunk-pgjedan(at)yahoo(dot)com
Subject: Re: Need help: Find dirty rows, Update, Delete SQL
Date: 2006-02-19 10:44:34
Message-ID: 200602191144.34335.vygen@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Am Samstag, 18. Februar 2006 18:41 schrieb ogjunk-pgjedan(at)yahoo(dot)com:
> Hello,
>
> I need a bit of help with some SQL.
> I have two tables, call them Page and Bookmark.
> Each row in Page can have many Bookmarks pointing to it, and
> they are joined via a FK (Page.id = Bookmark.page_id).
>
> Page has a 'url' column: Page.url, which has a unique index on it.
>
> My Page.url column got a little dirty, and I need to clean it up,
> and that's what I need help with.
>
> Here is an example of dirtiness:
>
> Page:
>
> id=1 url = 'http://example.com/'
> id=2 url = 'http://example.com/#' -- dirty
> id=3 url = 'http://example.com/#foo' -- dirty
>
> The last two rows are dirty. Normally I normalize URLs before
> inserting them, but these got in, and now I need to clean them.
>
> The problem is that rows in Bookmark table may point to dirty
> rows in Page, so I can't just remove the dirty rows, and I can't
> just update 'url' column in Page to 'http://example.com/',
> because that column is unique.
>
> Is there some fancy SQL that I can use them to find the dirty
> rows in page (... where url like '%#%') and then find rows in
> Bookmark table that point to them, then point those rows to
> good rows in Page (e.g. id=1 row above), and finally remove the
> dirty rows from Page?

try this. But please check if it really does its job. I just wrote it down in
a minute or two. There will be an easier way or nicer written SQL but a sit
is just a one time operation you shoudn't care too much. One more hint: you
should add a CHECK clause to your page_url like "page_url text NOT NULL
UNIQUE CHECK (page_url !~ '#')"

here is my test code

CREATE TABLE pages (
page_id SERIAL PRIMARY KEY,
page_url text NOT NULL UNIQUE
);

CREATE TABLE bookmarks (
bm_id SERIAL PRIMARY KEY,
bm_text text not null,
page_id int4 NOT NULL REFERENCES pages (page_id)
);

INSERT INTO pages (page_url) VALUES ('http://example.com/');
INSERT INTO pages (page_url) VALUES ('http://example.com/#');
INSERT INTO pages (page_url) VALUES ('http://example.com/#foo');
INSERT INTO pages (page_url) VALUES ('http://example2.com/#foo');
INSERT INTO pages (page_url) VALUES ('http://example3.com/#foobar');

insert into bookmarks (bm_text, page_id) values ('test1', 1);
insert into bookmarks (bm_text, page_id) values ('test2', 1);
insert into bookmarks (bm_text, page_id) values ('test3', 2);
insert into bookmarks (bm_text, page_id) values ('test4', 2);
insert into bookmarks (bm_text, page_id) values ('test5', 3);
insert into bookmarks (bm_text, page_id) values ('test6', 3);
insert into bookmarks (bm_text, page_id) values ('test7', 4);

BEGIN;
UPDATE bookmarks set page_id = pages2.page_id
FROM
pages AS pages1,
pages AS pages2
WHERE
pages1.page_id = bookmarks.page_id
AND pages2.page_url = split_part(pages1.page_url, '#', 1)
;

DELETE FROM pages WHERE page_id IN (
SELECT
pages1.page_id
FROM
pages AS pages1
JOIN pages AS pages2 ON (
pages1.page_id != pages2.page_id
AND pages2.page_url = split_part(pages1.page_url, '#', 1)
)
WHERE position('#' in pages1.page_url) > 0
AND pages1.page_id NOT IN (SELECT page_id FROM bookmarks)
);
;

UPDATE pages SET page_url = split_part(page_url, '#', 1)
WHERE position('#' in pages.page_url) > 0
;
select * from bookmarks;
select * from pages;
COMMIT;

kind regards,
janning

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Henry Ortega 2006-02-19 18:47:21 Re: Given 02-01-2006 to 02-28-2006, output all days.
Previous Message Stephan Szabo 2006-02-18 19:03:39 Re: [SQL] Interval subtracting