Capacity questions

From: Bill House <wch-tech(at)house-grp(dot)net>
To: psycopg(at)postgresql(dot)org
Subject: Capacity questions
Date: 2012-06-17 15:32:29
Message-ID: 4FDDF88D.9000908@house-grp.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: psycopg

Hello all,

I have a written a script which does what I want it to do on a table
with 999 records.

I was wondering if there is anything I need to consider if I run this
script on a table with 1.1 million records?

The purpose of the script is to flag records representing files if they
are redundant (duplicate md5sum).

Later this table will be used to move or remove the files represented by
the flagged records.

The script performs the following actions

1 resets the del column to false for all records

2 creates a temporary table consisting of aggregate records of
identical md5sums which count more than 1.

3 iterates down the cursor and with this information

4 updates the del column to True on all records in a batch of
identical md5sums except for one.

Details are below.

Thanks in advance for your advise.

Regards,

Bill

The server hosting postgresql:

32 bit 2.8GHz cpu with 2Gb ram running openSUSE 12.1.

The workstation executing this script:

64 bit 3.1 GHz cpu with 12Gb ram running openSUSE 12.1.

Table description:

Table "public.files_test"
Column | Type | Modifiers
--------+-----------------+-----------
md5sum | character(33) |
path | character(475) |
file | character(200) |
del | boolean |
recno | integer | not null
Indexes:
"files_test_ord" btree (recno)

Script:

<CODE>
#!/usr/bin/python
#20120609

import sys
import os

import psycopg2
import psycopg2.extras
import psycopg2.extensions

conn = None

sys.path.append('/mnt/nfs/bb/library_1/stuff_mine/wch/z_projects/computer/z_development/languages/python/z_wch_functions')

import z_all

#z_all.zz_clear()

debug = 0
if debug == 1:
z_all.zz_set_trace_on()

def main():
#get the connection information
cxn = z_all.zz_pickle_load('_cxn_bb_wch')

table = "files_test"

try:
#use connection information from the list retrieved from the pickle file
conn = psycopg2.connect(host= cxn['host'],
dbname=cxn['db'],
user=cxn['user'],
password=cxn['password'])


print "Resetting all del fields to false."
working_curs = conn.cursor()
cmd = "working_curs.execute(\"\"\"UPDATE %s SET del = False;\"\"\")"
% table ; exec cmd
conn.commit()
working_curs.close()

print "Creating walking cursor."
walking_curs = conn.cursor("walking_curs")
cmd = "walking_curs.execute(\"\"\"SELECT md5sum, count(md5sum) FROM
%s GROUP BY md5sum HAVING count(md5sum) > 1;\"\"\")" % table ; exec cmd
for row in walking_curs:
mem_md5 = row[0]
mem_cnt = row[1]
row_cnt =+ 1
print mem_md5, mem_cnt, len(mem_md5)

print "creating working_cursora"
working_cursa = conn.cursor()

print "running command on working_cursora with substitution"
cmd = "working_cursa.execute(\"\"\"WITH batch AS (SELECT * FROM %s
WHERE md5sum = \'%s\' ORDER BY path DESC OFFSET 1) UPDATE %s SET del =
True FROM batch WHERE batch.md5sum || batch.path = files_test.md5sum ||
files_test.path;\"\"\")" % (table, mem_md5, table) ; exec
cmd

print "closing working_cursora"
working_cursa.close()

walking_curs.close()

conn.commit()

except psycopg2.DatabaseError, e:
print "psycopg2 reports error: {}".format(e)
sys.exit(1)

finally:

if conn:
conn.close()
return

if __name__ == '__main__':
main()

</CODE>

Responses

Browse psycopg by date

  From Date Subject
Next Message W. Matthew Wilson 2012-06-17 15:55:15 When I select a single column, can I prevent getting a list of one-element tuples?
Previous Message Marc Abramowitz 2012-06-15 15:17:30 Patches to add support for tox and Travis CI tools to psycopg2