ITEM: AB7025L

Problems with "snapshot" backups of LV copies on 7135 system


Question:

Customer is running a database (Recital) and is making snapshot
backups by making an LV copy, breaking the copy, creating a new LV
from the broken copy, and backing it up.  When they create the new
LV, fsck complains that the superblock ID is bad.

Response:

The maximum number of supported PPs for an hdisk is 1016.
Although you don't get an error for creating an hdisk that 
has more than this, the results are unpredictable.  We have
spoken with the developers and they say that you will especially
get unpredictable results when syncing a mirror copy. 

Although this doesn't answer all of the questions here, I
would recommend increasing the PP size to 8 or 16MB.  This
means recreating the vg and restoring the data.
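
As a rough check of this recommendation, the sketch below works out
how many PPs a disk needs at a given PP size and compares that with
the 1016 limit.  The 7664 MB disk size is the figure reported later
in this item (479 PPs of 16MB each); the helper names pp_count and
check_pp_limit are made up for illustration and are not AIX commands.

```shell
#!/bin/sh
# Sketch only: estimate the PP count for a disk at a given PP size and
# compare it with the 1016-PP-per-hdisk limit quoted in the response.
# pp_count and check_pp_limit are hypothetical helper names.

MAX_PPS=1016

# pp_count DISK_MB PP_MB -> number of whole PPs that fit on the disk
pp_count() {
    echo $(( $1 / $2 ))
}

# check_pp_limit DISK_MB PP_MB -> prints a verdict, returns 1 if over
check_pp_limit() {
    n=$(pp_count "$1" "$2")
    if [ "$n" -gt "$MAX_PPS" ]; then
        echo "PP size $2 MB: $n PPs (over the $MAX_PPS limit)"
        return 1
    fi
    echo "PP size $2 MB: $n PPs (within the $MAX_PPS limit)"
}
```

For a 7664 MB hdisk, check_pp_limit 7664 4 reports 1916 PPs, which is
over the limit, while PP sizes of 8MB and 16MB give 958 and 479 PPs.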

DESC: 

The volume group was set up again, this time with a PP size 
of 16MB.  Now the hdisks have about 450 PPs on them, well below
the 1016 limit.  Everything has been running fine with the mirror
backup script until yesterday when the system configuration was
changed.  The customer added some new filesystems to this disk,
leaving 70 free PPs.  The largest logical volume that is being backed
up with this mirror method is 66 PPs, so there should be room.

The error you get now when you try running the mirror backup on any
of the logical volumes is from the fsck part:

0506-416 Not a recognized filesystem type. (TERMINATED)

This is the same error you saw a month ago.  At that time it was
caused by the data not getting synced up at all; the superblocks were
all zeros.  The customer will check whether this is the case now as
well.  It seems to affect the partitions numbered above 400.

NEXT: Customer is putting together this information into a fax 
and I will receive it tomorrow.
Response:

DESC: Customer sent previous owner a FAX with information
      on his problem.

Fax received, summary follows:

He has a 7135 with one hdisk, named hdisk1 (a RAID 5 hdisk with 479
PPs of 16MB each, for a total of 7.664 GB).

 LVs on the disk include:

  loglv00
  histlv      /hist
  prodlv      /home/rcpva/prod1b
  csnaplv     /home/rcpva/csnap
  homelv      /home
  testlv      /home/rcpva/test
  cmlv        /wms_ctrl
  prod1alv    /home/rcpva/prod
  countlv     /home/rcpva/count
  prodsnaplv  /home/rcpva/prod1bsnap

They had no problems using the break LV copy procedure to make a
backup until they added the csnaplv LV.  Then backups of prodlv,
countlv, and testlv all failed "with symptoms similar to those we had
when [they had 4MB PP sizes]."  The snapshot backup of homelv
succeeded.  Note that homelv had contiguous partition numbers for the
LV copy they made, while the other three did not.

If they removed the csnaplv LV and then used the break LV copy
procedure to make a backup, the backups of countlv, testlv, and
homelv succeeded (these had contiguous partition numbers for the
copied LV), but the backup of prodlv failed (this did not have
contiguous partitions).

Customer called in at this point, with a consultant.

The failure is occurring during the fsck of the broken copy LV after

0516-416 Not a recognized filesystem type. (TERMINATED)

They created two trash filesystems using PPs 466-473 and 474-479.
They then made the copy for prodlv using PPs 51-96 and 427-446.
The database Recital (prodlv) uses no raw logical volumes.
They say this works.  This workaround is working for them and
they will monitor it carefully.

Customer does not want to run reorgvg on the RAID box.  Nothing has
shown up in the error log during these procedures.  They have about
6GB of internal disk and at one time moved their application to the
internal disk and had similar problems intermittently.  They believed
that the problems with the internal snapshot copies were due to the
load on the disks.

U432241 and U428409 (latest RAID PTFs) are not on the system.

They are using RAID 5, 7135 with 2 controllers, attached via separate
SCSI-2 busses (not Fast & Wide).

NEXT:

What would be the next step to help isolate this problem?
Customer will send fax with info on their snapshot backup procedure.

Possibilities include: 1) their PP maps are incorrect, 2) they
are doing a sync but not a syncvg.  If their PP maps include
the correct PPs but out of order, fsck will fail.  If they break
the copy with a stale PP, the copy will be bad.  
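
The stale-PP possibility can be checked before the copy is broken.
The following is a minimal sketch, assuming that "lsvg -M" marks a
stale partition with the word "stale" in a trailing field (this
should be confirmed against the output on the system in question).
count_stale is a hypothetical helper, not an AIX command; it reads
"lsvg -M"-style lines on standard input.

```shell
#!/bin/sh
# Sketch only: count the stale PPs that belong to one logical volume.
# Assumed input format, one line per PP:
#   PVname:PPnum  LVname:LPnum[:Copynum]  [stale]
# count_stale is a hypothetical helper name.

# count_stale LVNAME   (reads lsvg -M style output on stdin)
count_stale() {
    awk -v lv="$1" '
        # Field 2 looks like "lvname:lp" or "lvname:lp:copy".
        { split($2, f, ":") }
        f[1] == lv && $3 == "stale" { n++ }
        END { print n + 0 }
    '
}
```

Under that assumption, "lsvg -M $VGNAME | count_stale $LVNAME" should
print 0 after the syncvg and before the rmlvcopy; anything else means
the copy about to be broken off is not fully synced.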

Response:

Here's info on the script used to make the LV copy (parts of it)
received from the fax:

 #!/bin/ksh
 VGNAME=\
 LVNAME=\
 HDISKLIST=\
 #
 mklvcopy $LVNAME 2
 syncvg -l $LVNAME
 numlps=`lslv $LVNAME | egrep "LPs.*PPs" | cut -c21-25`
 lsvg -M $VGNAME | egrep "${LVNAME}:.*:2" | cut -f1 > /tmp/${LVNAME}map
 rmlvcopy $LVNAME 1 $HDISKLIST
 mklv -y ${LVNAME}temp -m /tmp/${LVNAME}map $VGNAME $numlps
 fsck -Vjfs -y -t/tmp/fsck.tmp /dev/${LVNAME}temp
 mount -r /dev/${LVNAME}temp /mnt
 #
 # Code to do backup is here
 #
 umount /mnt
 rmlv -f ${LVNAME}temp

Action: There is one problem that could be occurring that would 
 account for the "not a recognized filesystem" message:
 the map being generated could be incorrect or out of order.
 From the code that was sent in, this could be happening.
 The one case in which this code would not work properly is
 when the logical volume "wraps around" the disk (i.e. one or more
 of the logical partitions have a smaller PP number than a previous
 logical partition).  Example:

 hdisk0:1       hd1:6
 hdisk0:2       hd1:7
 hdisk0:90      hd1:1
 hdisk0:91      hd1:2
 hdisk0:92      hd1:3
 hdisk0:93      hd1:4
 hdisk0:94      hd1:5

lsvg -M $VGNAME | egrep "${LVNAME}:.*:2" | cut -f1 > /tmp/${LVNAME}map

would produce the following map:

 hdisk0:1
 hdisk0:2
 hdisk0:90
 hdisk0:91
 hdisk0:92
 hdisk0:93
 hdisk0:94

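This map is incorrect and would cause the "not a recognized
filesystem" error.  The map lists the PPs in physical order, so the
new LV would start with the PP that held logical partition 6 rather
than the one that held logical partition 1.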

A more correct command would be:

lsvg -M $VGNAME | egrep "${LVNAME}:.*:2" | sort -n -t: -k3 | \
  cut -f1 > /tmp/${LVNAME}map

which would produce the following map:

 hdisk0:90
 hdisk0:91
 hdisk0:92
 hdisk0:93
 hdisk0:94
 hdisk0:1
 hdisk0:2

Customer should try making this change and see if this remedies the
situation.
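
To confirm whether a given map has this problem, the LP order can be
tested directly.  The following is a sketch under the assumption that
the input looks like the "lsvg -M" lines in the example above;
lp_order_ok is a hypothetical helper name, and the optional second
argument selects a copy number when the lines carry one.

```shell
#!/bin/sh
# Sketch only: report whether the LPs of one LV appear in ascending
# order in lsvg -M style output (PVname:PPnum  LVname:LPnum[:Copynum]).
# lp_order_ok is a hypothetical helper name.

# lp_order_ok LVNAME [COPY]   (reads lsvg -M style output on stdin)
# Returns 0 if the LP numbers never decrease, 1 if the LV wraps around.
lp_order_ok() {
    awk -v lv="$1" -v copy="${2:-}" '
        # Field 2 looks like "lvname:lp" or "lvname:lp:copy".
        { split($2, f, ":") }
        f[1] == lv && (copy == "" || f[3] == copy) {
            if (f[2] + 0 < prev) bad = 1
            prev = f[2] + 0
        }
        END { exit bad + 0 }
    '
}
```

Fed the wrapped hd1 example, lp_order_ok hd1 returns 1; fed the same
lines after "sort -n -t: -k3", it returns 0.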

ACT:

Customer called.  I gave him the info on using the sort command.  I
also recommended that he save the LV map into a file, so that when he
does have a failure he can compare the LV map from before breaking
the copy with the LV map after the broken copy is made into a new LV.
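
One way to make the saved maps easy to compare is to reduce each one
to "LP PV:PP" pairs.  The sketch below assumes the saved files hold
"lsvg -M" output; pp_by_lp and the /tmp file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: reduce lsvg -M style output to "LP PV:PP" lines for one
# LV (and one copy, when given), sorted by LP, so that the map taken
# before the break and the map of the new LV can be compared.
# pp_by_lp is a hypothetical helper name.

# pp_by_lp LVNAME [COPY]   (reads lsvg -M style output on stdin)
pp_by_lp() {
    awk -v lv="$1" -v copy="${2:-}" '
        # Field 2 looks like "lvname:lp" or "lvname:lp:copy".
        { split($2, f, ":") }
        f[1] == lv && (copy == "" || f[3] == copy) { print f[2], $1 }
    ' | sort -n
}

# Intended use (file names are placeholders):
#   lsvg -M $VGNAME > /tmp/map.before      # before rmlvcopy
#   lsvg -M $VGNAME > /tmp/map.after       # after mklv
#   pp_by_lp $LVNAME 2      < /tmp/map.before > /tmp/lp.before
#   pp_by_lp ${LVNAME}temp  < /tmp/map.after  > /tmp/lp.after
#   cmp /tmp/lp.before /tmp/lp.after
```

If the two reduced files differ, the new LV was not built from the
same PPs in the same LP order as the copy that was broken off.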


Dated: November 1994 Category: N/A