Re: Deletion Challenge

From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, Berend Tober <btober(at)computer(dot)org>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Deletion Challenge
Date: 2015-12-09 21:45:52
Message-ID: CAEfWYywEBm_zSBVtbZ48BLH96-W4_VJASNL9B4LR+YvCZyBPGg@mail.gmail.com
Lists: pgsql-general

The two general solutions are the "keep the last one" approach proposed by
Adrian and the "keep the last N" approach that I sent.

But it might be worth stepping back a bit. You said you are having
performance problems that you feel would be improved by removing only a
million rows, which doesn't sound like that much to me. It's less than half
of what I *add* to just one of my tables every week, and my database is
dwarfed by those of many of the participants on this list.

This suggests that there may be other issues at play, such as tuning,
indexing, or query optimization. Depending on your requirements,
partitioning might be useful. It wouldn't give you "last N," but
partitioning by date range makes archiving and purging a low-cost
operation.
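To illustrate why date-range partitioning makes purging cheap: dropping a
partition is a metadata operation rather than a row-by-row DELETE. SQLite
has no native partitioning, so this sketch emulates it with one invented
table per month; in PostgreSQL you would drop (or detach) the old child
partition instead:

```python
import sqlite3

# Emulated date-range partitioning: one journal table per month.
# Table names and data are invented for illustration.
con = sqlite3.connect(":memory:")
for month in ("2015_10", "2015_11", "2015_12"):
    con.execute(f"CREATE TABLE journal_{month} (id INTEGER, amount INTEGER)")
    con.execute(f"INSERT INTO journal_{month} VALUES (1, 100), (2, 200)")

# Purging October is a single DROP, not a million-row DELETE.
con.execute("DROP TABLE journal_2015_10")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)  # only the November and December partitions remain
```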

You might want to expand a bit on the core issue you are trying to solve.

Cheers,
Steve

On Wed, Dec 9, 2015 at 12:43 PM, David G. Johnston <
david(dot)g(dot)johnston(at)gmail(dot)com> wrote:

> On Wed, Dec 9, 2015 at 1:31 PM, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
> wrote:
>
>> On 12/09/2015 12:24 AM, Berend Tober wrote:
>>
>>> Adrian Klaver wrote:
>>>
>>>> On 12/05/2015 08:08 AM, Berend Tober wrote:
>>>>
>>>>> /*
>>>>>
>>>>> Deletion Challenge
>>>>>
>>>>> I want to delete all but the most recent transaction, per person, from
>>>>> a
>>>>> table that records a transaction history because at some point the
>>>>> transaction history grows large enough to adversely affect performance,
>>>>> and also becomes less relevant for retention.
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>> test=> delete from cash_journal where ARRAY[click, cash_journal_id]
>>>> NOT in (select max(ARRAY[click,cash_journal_id]) from cash_journal
>>>> group by fairian_id);
>>>> DELETE 7
>>>>
>>>> test=> SELECT * FROM cash_journal order by fairian_id, click,
>>>> cash_journal_id;
>>>>  click | cash_journal_id | fairian_id | debit | credit | balance | description
>>>> -------+-----------------+------------+-------+--------+---------+----------------------------------
>>>>    412 |               1 |          7 |     5 |        |      14 | Sold food quantity 7 units.
>>>>     37 |               7 |          8 |     8 |        |       8 | Ratified contract f1abd670358e03
>>>>     37 |               9 |          9 |     7 |        |       7 | Ratified contract 1574bddb75c78a
>>>>     36 |              14 |         18 |     0 |      0 |       0 | initial cash balance
>>>>    413 |               1 |         25 |       |    995 |       0 | Redeemed bond 7719a1c782a1ba
>>>> (5 rows)
>>>>
>>>>
>>> Nice.
>>>
>>> The idea of a NOT IN query had occurred to me briefly, but I failed to
>>> pursue it because at some point in the distant past I had gained the
>>> impression that NOT IN queries were not computationally efficient.
>>> During one round of testing I had like a million rows. I'll have to run
>>> some EXPLAIN query testing with a larger data sample for comparison.
>>> Thanks!
>>>
>>
>> Plan B:
>>
>> WITH d AS (
>>     SELECT *
>>     FROM cash_journal
>>     LEFT JOIN (
>>         SELECT MAX(ARRAY[click, cash_journal_id]) AS mx
>>         FROM cash_journal
>>         GROUP BY fairian_id
>>     ) AS mxa
>>     ON mxa.mx = ARRAY[click, cash_journal_id]
>>     WHERE mx IS NULL
>> )
>> DELETE FROM cash_journal
>> USING d
>> WHERE d.click = cash_journal.click
>>   AND d.cash_journal_id = cash_journal.cash_journal_id;
>>
>>
> Couldn't the LEFT JOIN relation in the CTE be better written using
> "SELECT DISTINCT ON (click, cash_journal_id) click, cash_journal_id [...]
> ORDER BY click DESC, cash_journal_id" or something similar? It doesn't
> seem like you should need to introduce an array and an aggregate here.
>
> It does have the negative property of only providing a single row, which
> excludes using it for the "last 5" part, but I suspect it will be
> considerably faster for the single-row version.
>
> David J.
>
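The two deletion strategies quoted above, the NOT IN on a composite value
and the LEFT JOIN / IS NULL anti-join of "Plan B", can be cross-checked in
a small self-contained demo (Python with in-memory SQLite; the data is
invented, and SQLite row values plus ROW_NUMBER() stand in for
PostgreSQL's ARRAY[...] trick):

```python
import sqlite3

# Miniature cash_journal with invented rows, for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE cash_journal (
        click           INTEGER,
        cash_journal_id INTEGER,
        fairian_id      INTEGER
    );
    INSERT INTO cash_journal VALUES
        (410, 1, 7),
        (411, 2, 7),
        (412, 3, 7),
        (36,  7, 8),
        (37,  8, 8);
""")

# The newest row per fairian_id, standing in for MAX(ARRAY[...]).
latest = """
    SELECT fairian_id, click, cash_journal_id
    FROM (
        SELECT fairian_id, click, cash_journal_id,
               ROW_NUMBER() OVER (
                   PARTITION BY fairian_id
                   ORDER BY click DESC, cash_journal_id DESC
               ) AS rn
        FROM cash_journal
    )
    WHERE rn = 1
"""

# Strategy 1: NOT IN on a row value (the shape of Adrian's query).
not_in_doomed = con.execute(f"""
    SELECT fairian_id, click, cash_journal_id
    FROM cash_journal
    WHERE (fairian_id, click, cash_journal_id) NOT IN ({latest})
    ORDER BY 1, 2, 3
""").fetchall()

# Strategy 2: LEFT JOIN / IS NULL anti-join (the shape of "Plan B").
anti_join_doomed = con.execute(f"""
    SELECT cj.fairian_id, cj.click, cj.cash_journal_id
    FROM cash_journal cj
    LEFT JOIN ({latest}) AS mx
      ON  mx.fairian_id = cj.fairian_id
      AND mx.click = cj.click
      AND mx.cash_journal_id = cj.cash_journal_id
    WHERE mx.fairian_id IS NULL
    ORDER BY 1, 2, 3
""").fetchall()

print(not_in_doomed == anti_join_doomed)  # True: both pick the same rows
```

Both strategies select everything except each person's newest row, so the
DELETE they drive would leave the same survivors.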
