Skip site navigation (1) Skip section navigation (2)

Re: [Retrieved]RE: backup and recovery

From: "Murthy Kambhampaty" <murthy(dot)kambhampaty(at)goeci(dot)com>
To: "Naomi Walker" <nwalker(at)eldocomp(dot)com>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>,"Tsirkin Evgeny" <tsurkin(at)mail(dot)jct(dot)ac(dot)il>,"Mark M(dot) Huber" <MHuber(at)VMdirect(dot)com>, <pgsql-admin(at)postgresql(dot)org>
Subject: Re: [Retrieved]RE: backup and recovery
Date: 2004-03-26 20:30:19
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-admin
I think you can get both benefits of "multi-statement transactions for INSERT dumps" by doing "subset copies"  ... without any changes in postgresql!  The method I use is developed for handling single table "loads", but is still relatively painless even for database dumps; however, it is limited to text dumps.

Let's say you want to unload-load table "tbl_foo" from schema "sch_bar" from database "db_baz" and reload "sch_bar.tbl_foo" to database "db_quux".  Try the following:

1.) Dump-restore the table schema so you create an empty table in the destination database. e.g.:
	/usr/local/pgsql-7.4/bin/pg_dump -s -t tbl_foo --schema sch_bar db_baz | \
		/usr/local/pgsql-7.4/bin/psql -d db_quux 
This can be adjusted for different hosts, etc.

2.) COPY the records to a file:
	/usr/local/pgsql-7.4/bin/psql -d db_bar \
	 -c "copy sch_bar.tbl_foo to stdout" > sch_bar.tbl_foo.dat
	/usr/local/pgsql-7.4/bin/psql -d db_bar -Aqt \
	 -c "select * from sch_bar.tbl_foo where <some condition>" > sch_bar.tbl_foo.dat
The latter is slower, but selective.  You can also use the -p option to set col and row separators to whatever you like (as with copy options). If your source data came from a dump file, rather than a COPY, you can strip sql commands to leave data only, or modify the commands below.

3. Pipe the data from sch_bar.tbl_foo.dat to psql, with copy commands spliced in at chosen intervals (in numbers of lines) depending on your preferences for speed versus "recoverability".  In the example below, the subset size is 2000 lines:
awk \
  -v SubSize=2000 \
  -v COPYSTMT="copy sch_bar.tbl_foo from stdin;" \
 'BEGIN{ print COPYSTMT } \
  { print $0 } \
  FNR % SubSize == 0 { \
  print "\\.\n\n" ; \
  print "\n"; \
  print COPYSTMT }' "sch_bar.tbl_foo.dat" | \
        /usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -f -

The awk command specifies the chosen subset size ("2000") and a copy statement for putting stdin in the selected table; at the "BEGIN"ning, a copy statmenet is issued and lines are streamed in from the text file containing table rows; after each SubSize number of lines the copy stream is ended (as in text dumps, with a "\."), and a new copy statment inserted.

For a 220,000 row table, times for the simple copy versus the subset copy were:

	Simple copy:
	real    0m21.704s
	user    0m3.790s
	sys     0m0.880s

	Subset copy:
	real    0m24.233s
	user    0m5.710s
	sys     0m1.090s

Over 10% more wall clock time, but the savings from not having to rerun the entire "load" if errors are found could be tremendous.

3a.  Alternately, you can generate a log so you easily know which subset failed (if any):
# LogFile="/home/postgres/load.log"; \
 awk \
  -v SubSize=2000 \
  -v COPYSTMT="copy S2.air from stdin;" \
  -v LogF="$LogFile" \
 'BEGIN{ print "Block Size: " SubSize > LogF; \
   print "Copy Statment: " COPYSTMT > LogF; \
   print "\n\n" > LogF; \
   close(LogF) ; \
   print COPYSTMT } \
  { print $0 } \
  FNR % SubSize == 0 { \
  print "\\.\n\n" ; \
  printf("select \047Processed %d records from line no. %d to line no. %d\047;\n", SubSize, FNR -SubSize +1, FNR) ; \
  print "\n"; \
  print COPYSTMT }
  END{ \
  print "\\.\n\n" ; \
  printf("select \047Processed a grand total of %d lines from %s\047;\n", NR, FILENAME ) }' \
  "sch_bar.tbl_foo.dat" | \
        /usr/local/pgsql-7.4/bin/psql -U gouser -d airfrance -Atnq -f - >> "$LogFile" 2>&1

Errors can be located with:

[postgres(at)mybox postgres]$ cat load.log | grep -B 3 -A 3 "ERROR:"
Processed 2000 records from line no. 192001 to line no. 194000
Processed 2000 records from line no. 194001 to line no. 196000
Processed 2000 records from line no. 196001 to line no. 198000
ERROR:  invalid input syntax for integer: "My0"
CONTEXT:  COPY tbl_foo, line 2000, column oct 02: "My0"
Processed 2000 records from line no. 198001 to line no. 200000
Processed 2000 records from line no. 200001 to line no. 202000


> -----Original Message-----
> From: pgsql-admin-owner(at)postgresql(dot)org
> [mailto:pgsql-admin-owner(at)postgresql(dot)org]On Behalf Of Naomi Walker
> Sent: Wednesday, March 24, 2004 5:58 PM
> To: Tom Lane
> Cc: Bruce Momjian; Tsirkin Evgeny; Naomi Walker; Mark M. Huber;
> pgsql-admin(at)postgresql(dot)org
> Subject: Re: [Retrieved]RE: [ADMIN] backup and recovery
> At 03:54 PM 3/24/2004, Tom Lane wrote:
> >Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > Added to TODO:
> > >       * Have pg_dump use multi-statement transactions for 
> INSERT dumps
> >
> > > For simple performance reasons, it would be good.  I am 
> not sure about
> > > allowing errors to continue loading.   Anyone else?
> >
> >Of course, anyone who actually cares about reload speed shouldn't be
> >using INSERT-style dumps anyway ... I'm not sure why we should expend
> >effort on that rather than just telling people to use the COPY mode.
> Understood.  I would still love this feature for when in the 
> COPY mode.
> >                         regards, tom lane
> --------------------------------------------------------------
> -----------------------------------------------------------
> Naomi Walker                         Chief Information Officer
>                                                Eldorado 
> Computing, Inc.
nwalker(at)eldocomp(dot)com           602-604-3100
> --------------------------------------------------------------
> -----------------------------------------------------------
> Forget past mistakes. Forget failures. Forget everything 
> except what you're 
> going to do now and do it.
> - William Durant, founder of General Motors
> --------------------------------------------------------------
> ----------------------------------------------------------
> This message is intended for the sole use of the individual 
> and entity to whom it is addressed, and may contain 
> information that is privileged, confidential and exempt from 
> disclosure under applicable law. If you are not the intended 
> addressee, nor authorized to receive for the intended 
> addressee, you are hereby notified that you may not use, 
> copy, disclose or distribute to anyone the message or any 
> information contained in the message. If you have received 
> this message in error, please immediately advise the sender 
> by reply email, and delete the message. Thank you.
> ---------------------------(end of 
> broadcast)---------------------------
> TIP 8: explain analyze is your friend

pgsql-admin by date

Next:From: HemapriyaDate: 2004-03-26 20:33:45
Subject: postgres copy command very slow.
Previous:From: Bruno Wolff IIIDate: 2004-03-26 20:29:39
Subject: Re: backup and recovery

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group