Skip site navigation (1) Skip section navigation (2)

Re: Out of Memory and Configuration Problems (Big Computer)

From: Tom Wilcox <hungrytom(at)googlemail(dot)com>
To:
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Out of Memory and Configuration Problems (Big Computer)
Date: 2010-05-31 19:14:34
Message-ID: 4C040A9A.8020207@gmail.com (view raw or flat)
Thread:
Lists: pgsql-general
I am having difficulties. I have rerun my update that uses the python 
functions..

(1) UPDATE nlpg.match_data SET org = normalise(org);

And some other similar queries on neighbouring fields in the table. They 
have all now worked. Without any changes to the configuration. I have 
done one thing in an attempt to minimise the risk of memory leak 
normalise() I added "toks = None" to the end of the normalise() 
function. However this was done after query (1) succeeded on the rerun.

Why would I get inconsistent behaviour? Would it have anything to do 
with SQL Server running on the same machine (although not actually doing 
anything at the moment - just idle server running in background).

Tangent: Is there any way to increase the memory allocated to postgres 
by Windows using Job Objects?

Cheers,
Tom

On 29/05/2010 18:55, Bill Moran wrote:
> On 5/28/10 8:43:48 PM, Tom Wilcox wrote:
>> I ran this query:
>>
>> EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org;
>>
>> And I got this result:
>>
>> "Seq Scan on match_data  (cost=0.00..9762191.68 rows=32205168 width=206)
>> (actual time=76873.592..357450.519 rows=27777961 loops=1)"
>> "Total runtime: 8028212.367 ms"
>
> That would seem to indicate that the problem is in your Python
> functions.
>
> Some ideas for next steps:
> * Perhaps it's just a few rows that have data in them that the
>   function has difficulty with.  Add some debugging/logging to
>   the function and see if the row it bombs on has anything unusual
>   in it (such as a very large text field)
> * While large, that function is fairly simplistic.  You may want
>   to consider rewriting it as an SQL function, which should be
>   more efficient in any event.
>
>>
>>
>> On 28 May 2010 19:39, Tom Wilcox <hungrytom(at)googlemail(dot)com
>> <mailto:hungrytom(at)googlemail(dot)com>> wrote:
>>
>>     Oops. Sorry about that.
>>
>>     I am having this problem with multiple queries however I am
>>     confident that a fair number may involve the custom plpython
>>     "normalise" function which I have made myself. I didn't think it
>>     would be complicated enough to produce a memory problem.. here it 
>> is:
>>
>>     -- Normalises common address words (i.e. 'Ground' maps to 'grd')
>>     CREATE OR REPLACE FUNCTION normalise(s text) RETURNS text AS $$
>>     ADDR_FIELD_DELIM = ' '
>>
>>     # Returns distinct list without null or empty elements
>>     def distinct_str(list):
>>          seen = set()
>>          return [x for x in list if x not in seen and not seen.add(x)
>>     and x!=None and len(x)>0]
>>
>>     # normalise common words in given address string
>>     def normalise(match_data):
>>          if match_data==None: return ''
>>          import re
>>          # Tokenise
>>          toks = distinct_str(re.split(r'\s', match_data.lower()))
>>          out = ''
>>          for tok in toks:
>>              ## full word replace
>>              if tok == 'house' : out += 'hse'+ADDR_FIELD_DELIM
>>              elif tok == 'ground' : out += 'grd'+ADDR_FIELD_DELIM
>>              elif tok == 'gnd' : out += 'grd'+ADDR_FIELD_DELIM
>>              elif tok == 'front' : out += 'fnt'+ADDR_FIELD_DELIM
>>              elif tok == 'floor' : out += 'flr'+ADDR_FIELD_DELIM
>>              elif tok == 'floors' : out += 'flr'+ADDR_FIELD_DELIM
>>              elif tok == 'flrs' : out += 'flr'+ADDR_FIELD_DELIM
>>              elif tok == 'fl' : out += 'flr'+ADDR_FIELD_DELIM
>>              elif tok == 'basement' : out += 'bst'+ADDR_FIELD_DELIM
>>              elif tok == 'subbasement' : out += 'sbst'+ADDR_FIELD_DELIM
>>              elif tok == 'bsmt' : out += 'bst'+ADDR_FIELD_DELIM
>>              elif tok == 'lbst' : out += 'lower bst'+ADDR_FIELD_DELIM
>>              elif tok == 'street' : out += 'st'+ADDR_FIELD_DELIM
>>              elif tok == 'road' : out += 'rd'+ADDR_FIELD_DELIM
>>              elif tok == 'lane' : out += 'ln'+ADDR_FIELD_DELIM
>>              elif tok == 'rooms' : out += 'rm'+ADDR_FIELD_DELIM
>>              elif tok == 'room' : out += 'rm'+ADDR_FIELD_DELIM
>>              elif tok == 'no' : pass
>>              elif tok == 'number' : pass
>>              elif tok == 'and' : out += '&'+ADDR_FIELD_DELIM
>>              elif tok == 'rear' : out += 'rr'+ADDR_FIELD_DELIM
>>              elif tok == 'part' : out += 'pt'+ADDR_FIELD_DELIM
>>              elif tok == 'south' : out += 's'+ADDR_FIELD_DELIM
>>              elif tok == 'sth' : out += 's'+ADDR_FIELD_DELIM
>>              elif tok == 'north' : out += 'n'+ADDR_FIELD_DELIM
>>              elif tok == 'nth' : out += 'n'+ADDR_FIELD_DELIM
>>              elif tok == 'west' : out += 'w'+ADDR_FIELD_DELIM
>>              elif tok == 'wst' : out += 'w'+ADDR_FIELD_DELIM
>>              elif tok == 'east' : out += 'e'+ADDR_FIELD_DELIM
>>              elif tok == 'est' : out += 'e'+ADDR_FIELD_DELIM
>>              elif tok == 'first' : out += '1st'+ADDR_FIELD_DELIM
>>              elif tok == 'second' : out += '2nd'+ADDR_FIELD_DELIM
>>              elif tok == 'third' : out += '3rd'+ADDR_FIELD_DELIM
>>              elif tok == 'fourth' : out += '4th'+ADDR_FIELD_DELIM
>>              elif tok == 'fifth' : out += '5th'+ADDR_FIELD_DELIM
>>              elif tok == 'sixth' : out += '6th'+ADDR_FIELD_DELIM
>>              elif tok == 'seventh' : out += '7th'+ADDR_FIELD_DELIM
>>              elif tok == 'eighth' : out += '8th'+ADDR_FIELD_DELIM
>>              elif tok == 'ninth' : out += '9th'+ADDR_FIELD_DELIM
>>              elif tok == 'tenth' : out += '10th'+ADDR_FIELD_DELIM
>>              elif tok == 'eleventh' : out += '11th'+ADDR_FIELD_DELIM
>>              elif tok == 'twelfth' : out += '12th'+ADDR_FIELD_DELIM
>>              elif tok == 'thirteenth' : out += '13th'+ADDR_FIELD_DELIM
>>              elif tok == 'fourteenth' : out += '14th'+ADDR_FIELD_DELIM
>>              elif tok == 'fifteenth' : out += '15th'+ADDR_FIELD_DELIM
>>              elif tok == 'sixteenth' : out += '16th'+ADDR_FIELD_DELIM
>>              elif tok == 'seventeenth' : out += '17th'+ADDR_FIELD_DELIM
>>              elif tok == 'eighteenth' : out += '18th'+ADDR_FIELD_DELIM
>>              elif tok == 'nineteenth' : out += '19th'+ADDR_FIELD_DELIM
>>              elif tok == 'twentieth' : out += '20th'+ADDR_FIELD_DELIM
>>              # numbers 0 - 20
>>              elif tok == 'one' : out += '1'+ADDR_FIELD_DELIM
>>              elif tok == 'two' : out += '2'+ADDR_FIELD_DELIM
>>              elif tok == 'three' : out += '3'+ADDR_FIELD_DELIM
>>              elif tok == 'four' : out += '4'+ADDR_FIELD_DELIM
>>              elif tok == 'five' : out += '5'+ADDR_FIELD_DELIM
>>              elif tok == 'six' : out += '6'+ADDR_FIELD_DELIM
>>              elif tok == 'seven' : out += '7'+ADDR_FIELD_DELIM
>>              elif tok == 'eight' : out += '8'+ADDR_FIELD_DELIM
>>              elif tok == 'nine' : out += '9'+ADDR_FIELD_DELIM
>>              elif tok == 'ten' : out += '10'+ADDR_FIELD_DELIM
>>              elif tok == 'eleven' : out += '11'+ADDR_FIELD_DELIM
>>              elif tok == 'twelve' : out += '12'+ADDR_FIELD_DELIM
>>              elif tok == 'thirteen' : out += '13'+ADDR_FIELD_DELIM
>>              elif tok == 'fourteen' : out += '14'+ADDR_FIELD_DELIM
>>              elif tok == 'fifteen' : out += '15'+ADDR_FIELD_DELIM
>>              elif tok == 'sixteen' : out += '16'+ADDR_FIELD_DELIM
>>              elif tok == 'seventeen' : out += '17'+ADDR_FIELD_DELIM
>>              elif tok == 'eighteen' : out += '18'+ADDR_FIELD_DELIM
>>              elif tok == 'nineteen' : out += '19'+ADDR_FIELD_DELIM
>>              elif tok == 'twenty' : out += '20'+ADDR_FIELD_DELIM
>>              # town dictionary items
>>              elif tok == 'borough' : pass
>>              elif tok == 'city' : pass
>>              elif tok == 'of' : pass
>>              elif tok == 'the' : pass
>>              # a few extras (from looking at voa)
>>              elif tok == 'at' : pass
>>              elif tok == 'incl' : pass
>>              elif tok == 'inc' : pass
>>              else: out += tok+ADDR_FIELD_DELIM
>>          return out
>>
>>     return normalise(s)
>>     $$ LANGUAGE plpythonu;
>>
>>
>>     Here's the create script for the table from pgAdmin (I hope that
>>     will be good enough instead of \d as I can't do that right now)..
>>
>>     -- Table: nlpg.match_data
>>
>>     -- DROP TABLE nlpg.match_data;
>>
>>     CREATE TABLE nlpg.match_data
>>     (
>>        premise_id integer,
>>        usrn bigint,
>>        org text,
>>        sao text,
>>     "level" text,
>>        pao text,
>>     "name" text,
>>        street text,
>>        town text,
>>        pc postcode,
>>        postcode text,
>>        match_data_id integer NOT NULL DEFAULT
>>     nextval('nlpg.match_data_match_data_id_seq1'::regclass),
>>        addr_str text,
>>        tssearch_name tsvector,
>>
>>        CONSTRAINT match_data_pkey1 PRIMARY KEY (match_data_id)
>>     )
>>     WITH (
>>        OIDS=FALSE
>>     );
>>     ALTER TABLE nlpg.match_data OWNER TO postgres;
>>     ALTER TABLE nlpg.match_data ALTER COLUMN "name" SET STATISTICS 
>> 10000;
>>
>>
>>     -- Index: nlpg.index_match_data_mid
>>
>>     -- DROP INDEX nlpg.index_match_data_mid;
>>
>>     CREATE INDEX index_match_data_mid
>>        ON nlpg.match_data
>>        USING btree
>>        (match_data_id);
>>
>>     -- Index: nlpg.index_match_data_pc
>>
>>     -- DROP INDEX nlpg.index_match_data_pc;
>>
>>     CREATE INDEX index_match_data_pc
>>        ON nlpg.match_data
>>        USING btree
>>        (pc);
>>
>>     -- Index: nlpg.index_match_data_pid
>>
>>     -- DROP INDEX nlpg.index_match_data_pid;
>>
>>     CREATE INDEX index_match_data_pid
>>        ON nlpg.match_data
>>        USING btree
>>        (premise_id);
>>
>>     -- Index: nlpg.index_match_data_tssearch_name
>>
>>     -- DROP INDEX nlpg.index_match_data_tssearch_name;
>>
>>     CREATE INDEX index_match_data_tssearch_name
>>        ON nlpg.match_data
>>        USING gin
>>        (tssearch_name);
>>
>>     -- Index: nlpg.index_match_data_usrn
>>
>>     -- DROP INDEX nlpg.index_match_data_usrn;
>>
>>     CREATE INDEX index_match_data_usrn
>>        ON nlpg.match_data
>>        USING btree
>>        (usrn);
>>
>>     As you can see, no FKs or triggers..
>>
>>     I am running: EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org;
>>
>>     However, as it should take around 90mins (if it is linear) then I
>>     thought I would send this now and follow up with the results once it
>>     finishes. (Has taken 2hours so far..)
>>
>>     Thanks very much for your help.
>>
>>     Tom
>>
>>
>>     On 28 May 2010 17:54, "Bill Moran" <wmoran(at)potentialtech(dot)com
>> <mailto:wmoran(at)potentialtech(dot)com>> wrote:
>>>
>>>     In response to Tom Wilcox <hungrytom(at)googlemail(dot)com
>>> <mailto:hungrytom(at)googlemail(dot)com>>:
>>>
>>> > In addition, I have discovered that the update query that runs
>>>     on each row
>>> > of a 27million row ta...
>>>
>>>     You're not liable to get shit for answers if you omit the mailing
>>>     list from
>>>     the conversation, especially since I know almost nothing about 
>>> tuning
>>>     PostgreSQL installed on Windows.
>>>
>>>     Are there multiple queries having this problem?  The original
>>>     query didn't
>>>     have normalise() in it, and I would be highly suspicious that a 
>>> custom
>>>     function may have a memory leak or other memory-intensive
>>>     side-effects.
>>>     What is the code for that function?
>>>
>>>     For example, does:
>>>     UPDATE nlpg.match_data SET org = org WHERE match_data_id;
>>>     finish in a reasonable amount of time or exhibit the same out of
>>>     memory
>>>     problem?
>>>
>>>     It'd be nice to see a \d on that table ... does it have any
>>>     triggers or
>>>     cascading foreign keys?
>>>
>>>     And stop
>>>
>>>     --
>>>
>>>     Bill Moran
>>>     http://www.potentialtech.com
>>>     http://people.collaborativefusion.com/~wmoran/
>>> <http://people.collaborativefusion.com/%7Ewmoran/>
>>>
>>
>
>


In response to

Responses

pgsql-general by date

Next:From: Andy ColsonDate: 2010-05-31 19:23:22
Subject: Re: Nested function invocation, but parameter does not exist
Previous:From: Andy ColsonDate: 2010-05-31 19:13:05
Subject: Re: PG backup performance

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group