Re: extracting location info from string

From: Tarlika Elisabeth Schmitz <postgresql3(at)numerixtechnology(dot)de>
To: sql pgsql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: extracting location info from string
Date: 2011-05-23 22:39:49
Message-ID: 20110523233949.07fb0005@dick.coachhouse
Lists: pgsql-sql

On Mon, 23 May 2011 13:11:24 +1200
Andrej <andrej(dot)groups(at)gmail(dot)com> wrote:

>On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
><postgresql3(at)numerixtechnology(dot)de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3(at)numerixtechnology(dot)de> wrote:
>>
>>>A column contains location information, which may contain any of the
>>>following:
>>>
>>>1) null
>>>2) country name (e.g. "France")
>>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
>
>sanitising that data will be tedious, - particularly with the
>variations on region.

Indeed. However, the situation is not quite as bleak as it appears:
- I am only dealing with 50 countries (Luxembourg and the Vatican are
not amongst them).
- Only for two countries will city/region be displayed instead of
country.
- Ultimately, the only important bit of information is the country.
- The only really important persons are those from the two countries.

>Another thing of great import is whether the city can occur in the
>data column all by itself; if yes, it's next to impossible to
>distinguish it from a country.

Unfortunately this is the case.

>Especially if we assume that
>typos/misspellings are feasible on top of punctuation ...

and former countries like Czechoslovakia ...

>If I had a task like that to perform I'd dump the data out to file
>and have a good go at it w/ sed & awk, or perl, depending on
>how complex & voluminous the data is.

I had a quick look at the data (maybe not the most efficient SQL):

SELECT id, name,
       CASE
         WHEN location IS NULL THEN NULL
         -- no comma: the whole string should be a country name
         WHEN location !~ '.*,.*' THEN
           (SELECT id FROM country WHERE name = location)
         -- "city, region" / "city, Rg[.] region": drop the city and
         -- the optional "Rg"/"Rg." prefix, then look up the region
         ELSE
           (SELECT country FROM region WHERE name =
              regexp_replace(location, '.*, (Rg\\.? )?(.*)', '\\2'))
       END AS country,
       location
FROM temp_person;
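
For illustration, on the sample values from upthread the regexp strips
everything up to the comma plus the optional "Rg"/"Rg." prefix (same
backslash escaping as above):

SELECT regexp_replace('Bonn, Nordrhein-Westfalen',
                      '.*, (Rg\\.? )?(.*)', '\\2');
-- => 'Nordrhein-Westfalen'
SELECT regexp_replace('Frankfurt, Rg. Hessen',
                      '.*, (Rg\\.? )?(.*)', '\\2');
-- => 'Hessen'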

Of 17000 historical records, 4400 don't match this simple pattern.
Of those 4400, 1300 are "USA" or "Usa" instead of "United States", and
900 are "North America", whatever that is! There are also plenty of
common + valid region abbreviations.
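
One thing I'm considering is a synonym table consulted when the plain
name lookup fails. Only a sketch: the country_alias table is made up,
and I'm assuming country.id is the key:

CREATE TABLE country_alias (
    alias   text PRIMARY KEY,  -- spelling as found in the raw data
    country integer NOT NULL REFERENCES country(id)
);

-- map the known variants onto "United States"
INSERT INTO country_alias (alias, country)
SELECT v.alias,
       (SELECT id FROM country WHERE name = 'United States')
FROM (VALUES ('USA'), ('Usa'), ('North America')) AS v(alias);

-- lookup: try the canonical name first, then the aliases
SELECT COALESCE(
    (SELECT id FROM country WHERE name = 'Usa'),
    (SELECT country FROM country_alias WHERE alias = 'Usa')
);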

I get about 1000 new records of this type per year.

I presume that more recent data are more accurate. I know I won't be
able to clean them all up.

However, the import process has to be as automatic as possible, with
inconsistencies flagged up for later manual intervention. I say later
because, for instance, a person's data will have to be imported with
or without location info, as other new data will link to it.
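
To make the flagging concrete, something along these lines might do
(a sketch only; the location_exception table is invented): import the
person regardless, but park any location string that resolves to
neither a country nor a region:

CREATE TABLE location_exception (
    person_id    integer NOT NULL,
    raw_location text    NOT NULL,
    resolved     boolean NOT NULL DEFAULT false
);

INSERT INTO location_exception (person_id, raw_location)
SELECT p.id, p.location
FROM temp_person p
WHERE p.location IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM country c WHERE c.name = p.location)
  AND NOT EXISTS (SELECT 1 FROM region r
                  WHERE r.name = regexp_replace(p.location,
                                   '.*, (Rg\\.? )?(.*)', '\\2'));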

--

Best Regards,
Tarlika Elisabeth Schmitz
