| From: | Bill House <wch-tech(at)house-grp(dot)net> | 
|---|---|
| To: | Federico Di Gregorio <fog(at)dndg(dot)it> | 
| Cc: | psycopg(at)postgresql(dot)org | 
| Subject: | Re: Capacity questions | 
| Date: | 2012-06-19 02:23:46 | 
| Message-ID: | 4FDFE2B2.5050405@house-grp.net | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | psycopg | 
On 06/18/2012 05:04 AM, Federico Di Gregorio wrote:
> On 17/06/12 17:32, Bill House wrote:
>> I have a written a script which does what I want it to do on a table
>> with 999 records.
>>
>> I was wondering if there is anything I need to consider if I run this
>> script on a table with 1.1 million records?
>>
>> The purpose of the script is to flag records representing files if they
>> are redundant (duplicate md5sum).
>>
>> Later this table will be used to move or remove the files represented by
>> the flagged records.
>>
>> The script performs the following actions
>>
>>     1 resets the del column  to false for all records
>>
>>     2 creates a temporary table consisting of aggregate records of
>> identical md5sums which count more than 1.
>>
>>     3 iterates down the cursor and with this information
>>
>>     4 updates the del column to True on all records in a batch of
>> identical md5sums except for one.
> If your recno column is a primary key, you can do this with a single SQL
> UPDATE, without the initial update, temporary table or psycopg (with the
> right indices this is blazing fast even on a gazzilion rows):
>
> UPDATE files_test SET del = T.n > 1
> FROM (SELECT recno,
>              row_number() OVER (PARTITION BY md5sum) AS n
>         FROM files_test) T
> WHERE files_test.recno = T.recno;
>
> Hope this helps,
>
> federico (that *loves* UPDATE/FROM and window functions)
>
Thanks for the input all. 
Part of the purpose of this little project is for me to learn about both
Python and Postgresql; I know very little about either.  That is what
brings me to psycopg2.
I have read in the documents it is recommend that a programmer avoiding
parameter substitution by any means other than the one supplied by
psycopg2.  I understand the reasons and will comply if I create code for
public access.
I appreciate the opportunity to study the alternative more efficient
approach offered above.  As I said, I am new to SQL and have much to learn.
Thanks again,
Bill
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ottavio Campana | 2012-06-19 07:51:52 | connection pooling in web apps | 
| Previous Message | Federico Di Gregorio | 2012-06-18 10:27:02 | Re: Patches to add support for tox and Travis CI tools to psycopg2 |