Re: extracting location info from string

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Tarlika Elisabeth Schmitz <postgresql3(at)numerixtechnology(dot)de>
Cc: sql pgsql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: extracting location info from string
Date: 2011-05-24 00:59:21
Message-ID: 4DDB02E9.9040509@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote:

> Indeed. However, the situation is not quite as bleak as it appears:
> - I am only dealing with 50 countries (Luxemburg and Vatican are not
> amongst them)
> - Only for two countries will city/region be displayed instead of
> country.
> - Ultimately, where the only important bit of imformation is the
> country.
> - The only really important persons are those from the two countries.

Ah, see that's critical. You've just been able to restrict the problem
domain to a much simpler task with a smaller and well-defined range of
possibilities. Most of the complexity is in the nasty corner cases and
weirdness, and you've (probably) just cut most of that away.

> Of 17000 historical records, 4400 don't match this simple pattern.
> Of the 4400, 1300 are "USA" or "Usa" instead of "United States", 900
> "North America" whatever that is! There are plenty of common +
> valid region abbreviations.
>
> I get about 1000 new records of this type per year.

I'd do this kind of analysis in a PL/Perl or PL/python function myself.
It's easier to write "If <x> then <y> else <x>" logic in a readable
form, and such chained tests are usually better for this sort of work.
That also makes it easier to do a cleanup pass first, where you
substitute common spelling errors and canonicalize country names.

> However, the import process has to be as automatic as possible in such
> a way that inconsistencies are flagged up for later manual
> intervention. I say later because, for instance, a person's data will
> have to be imported with or without location info because other new
> data will link to it.

That's another good reason to use a PL function for this cleanup work.
It's easy to INSERT a record into a side table that flags it for later
examination if necessary, and to RAISE NOTICE or to issue a NOTIFY if
you need to do closer-to-realtime checking.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Adrian Klaver 2011-05-24 01:04:52 Re: problem with update data sets from front ends 8.4, Ubuntu 10.04
Previous Message Craig Ringer 2011-05-23 23:18:13 Re: Which version of PostgreSQL should I use.