From: | Bill House <wch-tech(at)house-grp(dot)net> |
---|---|
To: | psycopg(at)postgresql(dot)org |
Subject: | Capacity questions |
Date: | 2012-06-17 15:32:29 |
Message-ID: | 4FDDF88D.9000908@house-grp.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | psycopg |
Hello all,
I have a written a script which does what I want it to do on a table
with 999 records.
I was wondering if there is anything I need to consider if I run this
script on a table with 1.1 million records?
The purpose of the script is to flag records representing files if they
are redundant (duplicate md5sum).
Later this table will be used to move or remove the files represented by
the flagged records.
The script performs the following actions
1 resets the del column to false for all records
2 creates a temporary table consisting of aggregate records of
identical md5sums which count more than 1.
3 iterates down the cursor and with this information
4 updates the del column to True on all records in a batch of
identical md5sums except for one.
Details are below.
Thanks in advance for your advise.
Regards,
Bill
The server hosting postgresql:
32 bit 2.8GHz cpu with 2Gb ram running openSUSE 12.1.
The workstation executing this script:
64 bit 3.1 GHz cpu with 12Gb ram running openSUSE 12.1.
Table description:
Table "public.files_test"
Column | Type | Modifiers
--------+-----------------+-----------
md5sum | character(33) |
path | character(475) |
file | character(200) |
del | boolean |
recno | integer | not null
Indexes:
"files_test_ord" btree (recno)
Script:
<CODE>
#!/usr/bin/python
#20120609
import sys
import os
import psycopg2
import psycopg2.extras
import psycopg2.extensions
conn = None
sys.path.append('/mnt/nfs/bb/library_1/stuff_mine/wch/z_projects/computer/z_development/languages/python/z_wch_functions')
import z_all
#z_all.zz_clear()
debug = 0
if debug == 1:
z_all.zz_set_trace_on()
def main():
#get the connection information
cxn = z_all.zz_pickle_load('_cxn_bb_wch')
table = "files_test"
try:
#use connection information from the list retrieved from the pickle file
conn = psycopg2.connect(host= cxn['host'],
dbname=cxn['db'],
user=cxn['user'],
password=cxn['password'])
print "Resetting all del fields to false."
working_curs = conn.cursor()
cmd = "working_curs.execute(\"\"\"UPDATE %s SET del = False;\"\"\")"
% table ; exec cmd
conn.commit()
working_curs.close()
print "Creating walking cursor."
walking_curs = conn.cursor("walking_curs")
cmd = "walking_curs.execute(\"\"\"SELECT md5sum, count(md5sum) FROM
%s GROUP BY md5sum HAVING count(md5sum) > 1;\"\"\")" % table ; exec cmd
for row in walking_curs:
mem_md5 = row[0]
mem_cnt = row[1]
row_cnt =+ 1
print mem_md5, mem_cnt, len(mem_md5)
print "creating working_cursora"
working_cursa = conn.cursor()
print "running command on working_cursora with substitution"
cmd = "working_cursa.execute(\"\"\"WITH batch AS (SELECT * FROM %s
WHERE md5sum = \'%s\' ORDER BY path DESC OFFSET 1) UPDATE %s SET del =
True FROM batch WHERE batch.md5sum || batch.path = files_test.md5sum ||
files_test.path;\"\"\")" % (table, mem_md5, table) ; exec
cmd
print "closing working_cursora"
working_cursa.close()
walking_curs.close()
conn.commit()
except psycopg2.DatabaseError, e:
print "psycopg2 reports error: {}".format(e)
sys.exit(1)
finally:
if conn:
conn.close()
return
if __name__ == '__main__':
main()
</CODE>
From | Date | Subject | |
---|---|---|---|
Next Message | W. Matthew Wilson | 2012-06-17 15:55:15 | When I select a single column, can I prevent getting a list of one-element tuples? |
Previous Message | Marc Abramowitz | 2012-06-15 15:17:30 | Patches to add support for tox and Travis CI tools to psycopg2 |