Re: questions on (parallel) COPY and when to REINDEX

From: andy <andy(at)squeakycode(dot)net>
To: Janet Jacobsen <jsjacobsen(at)lbl(dot)gov>, pgsql-general(at)postgresql(dot)org
Subject: Re: questions on (parallel) COPY and when to REINDEX
Date: 2009-08-03 00:36:45
Message-ID: 4A76311D.4020609@squeakycode.net
Lists: pgsql-general

Janet Jacobsen wrote:
> Thanks for your reply. Responses below to answer your questions,
> and one follow-up question on REINDEX.
>
> Janet
>
>
>>> Both tables have one or two foreign key constraints.
>>>
>>> (2) Should I REINDEX these two tables daily after the pipeline
>>> completes? Is this what other people do in practice?
>>
>> it depends if an index exists on the table when you fill it with
>> data. But I repeat myself :-). If an index exists you would not need
>> to reindex it. It may be faster to fill a table without an index,
>> then add an index later. But that would depend on if you need the
>> index for unique constraints.
>>
>
> Ok. Since data loading occurs daily, and the users query the table
> while the data loading is going on, it seems like I should not drop
> the indexes before the daily loading.
>
> I re-read the REINDEX pages. I see the following statement,
>
> "Also, for B-tree indexes a freshly-constructed index is somewhat
> faster to access than one that has been updated many times, because
> logically adjacent pages are usually also physically adjacent in a
> newly built index. (This consideration does not currently apply to
> non-B-tree indexes.) It might be worthwhile to reindex periodically
> just to improve access speed."
>
> This quote says "has been updated many times" and "worthwhile to index
> periodically". I'm not sure how to interpret "many times" and
> "periodically".
>
> In our case, on a daily basis, 100K rows or so are added to two tables,
> and a later script does 100K updates on one of the table. Does that make
> us a candidate for daily REINDEXing?

It's tough to say. I'd guess not every day. Once a month? The best way to find out is to do some timing: run a few indexed SELECT statements under EXPLAIN ANALYZE and see what the numbers tell you. Then REINDEX and repeat the same test. Then wait a month, try the test again, and see if it's much slower.
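A minimal sketch of that comparison in psql, assuming a hypothetical table `candidates` with an index on an `obs_date` column:

```sql
-- Time a representative indexed query before rebuilding.
EXPLAIN ANALYZE
SELECT * FROM candidates WHERE obs_date = '2009-08-01';

-- Rebuild the table's indexes. Note that REINDEX takes an exclusive
-- lock on the table while it runs, so schedule it outside the
-- daily load window.
REINDEX TABLE candidates;

-- Repeat the identical query and compare the "actual time" figures.
EXPLAIN ANALYZE
SELECT * FROM candidates WHERE obs_date = '2009-08-01';
```

If the before/after numbers are close, the monthly (or never) schedule is probably fine.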

The difference between having an index and not having one is huge, orders of magnitude. The difference between a balanced index and an unbalanced one is minor.

A VACUUM ANALYZE might be more important than a REINDEX. Depending on how many indexes you have, it will update the statistics about the tables and indexes and help the planner pick the best index.
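In psql that's a one-liner, run after the daily load and the 100K updates finish (table name hypothetical):

```sql
-- Reclaim space from dead rows left by the updates and refresh the
-- planner statistics in one pass.
VACUUM ANALYZE candidates;
```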

I can't answer as to what others do; my pg database is 25 meg. Yes, meg. And 260K rows. It's embarrassing. By next month I'll probably be up to 260.5K rows. So I really have no experience with the volume of data you're pushing around.

>>
>>> (3) Currently the pipeline executes in serial fashion. We'd
>>> like to cut the wall clock time down as much as possible.
>>> The data processing and data analysis can be done in parallel,
>>> but can the loading of the database be done in parallel, i.e.,
>>> can I execute four parallel COPY commands from four copies
>>
>> We'd need more specifics. Are you COPY'ing into two different tables
>> at once? (that should work). Or the same table with different data
>> (that should work too, I'd guess) or the same data with a unique key
>> (that'll break)?
>>
>
> We'd like to run four identical scripts in parallel, each of which will
> copy a different file into the same table.
>>> Our initial attempt at doing this failed.
>>
>> What was the error?
>>
>
> If the return status from trying to do the COPY is 7, the script prints
> a failure message, i.e., not the Postgres error. I will set up a test on
> a test case to get more information. (I didn't run the initial try.)

COPY won't return 7; I'm not sure where the 7 comes from. The COPY may fail, and whatever language you're programming in may raise an exception, which might be numbered 7, I suppose. Multiple COPYs into the same table at the same time should work. I think the error was on your side.

Note that while you are in a transaction your clients won't be able to see any of the data until you commit. Since some of them want at the data ASAP, you might want to break up your COPYs with a few commits. I suppose that'd depend on how long it all takes, though.
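A sketch of what that could look like, assuming the daily file is pre-split into chunks (table and file names are hypothetical):

```sql
-- Each chunk loads in its own transaction, so rows from part 1 become
-- visible to querying users while part 2 is still loading.
BEGIN;
COPY candidates FROM '/data/20090803_part1.csv' WITH CSV;
COMMIT;

BEGIN;
COPY candidates FROM '/data/20090803_part2.csv' WITH CSV;
COMMIT;
```

(A bare COPY outside BEGIN/COMMIT auto-commits anyway; the explicit blocks just make the commit points visible in the script.)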

>>> I found one
>>> posting in the archives about parallel COPY, but it doesn't seem
>>> to be quite on point.
>>
>> They have added parallel copy to the pg_restore, but I think that does
>> different tables, not the same table. Was that what you saw?
>
> Yes, I think so. The reference is to "Andrews parallel restore patch":
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01711.php
> The subject line is "lock contention on parallel COPY ?"

Yeah, that's an internal lock on some really small variable deep in the guts of pg core, not an entire table lock.

-Andy
