ITEM: AT0478L

WOCB: getting stale partitions / LVM



Question:

   Env:
        AIX 4.1.3
        Model R30
        Two 7137 RAID 5 arrays mirrored to each other.
        HACMP 4.1

   Desc:
After being left up and running overnight, the system comes up
with stale partitions.  Running syncvg -v <vgname> corrects it.
The customer would like to know why this happens.  Yesterday the
machine was running fine with no stale partitions; this morning
there were stale partitions.  This machine is not in production at
this time.  No applications are installed or running, and no users
are on the system.  This is occurring on a daily basis.
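For reference, a quick way to spot and clear the stale copies
(sharedvg and hdisk2 are hypothetical names for the shared volume
group and one of its disks):

    lsvg -l sharedvg       # LV STATE column shows open/stale for affected LVs
    lspv hdisk2            # STALE PARTITIONS field gives a per-disk count
    syncvg -v sharedvg     # resynchronize all stale partitions in the VG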

The customer stated that the node holding the rotating resource
group gets several DISK_ERR2 errors at 3:00 AM on the shared disks
of that group, but not on the cascading resource group.  He also
stated that if he stops HACMP on this node and manually acquires all
of the resources of the rotating resource group, the system does not
log any errors on those disks at 3:00 AM.  It would appear that
something runs against the shared disks at 3:00 AM while they are
varied on to the system, and that this is causing the problem.
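One way to confirm the 3:00 AM pattern is to pull the error log
entries for one of the shared disks with a start-time filter (hdisk2
is a hypothetical shared disk name; errpt -s takes mmddhhmmyy):

    errpt -N hdisk2                      # summary of entries logged against hdisk2
    errpt -a -N hdisk2 -s 0301020096     # detail for entries since 02:00 on 03/01/96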

Action:
Checked the error log and found the following:
        Label:  PSIGDELIVER
        ID:     A2A97A5F
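The full detail for those entries can be pulled from the error log
by error ID; the detailed stanza should identify the process and
signal involved:

    errpt -a -j A2A97A5F      # full report of the PSIGDELIVER entries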

Customer did not have any core files on the system.  The system
is not coming down.  It remains stable.

Found a few hits in Xmenu, but none that match the customer's
situation closely enough to be definite.  Will research and contact
the customer with additional information.

Desc:
  How to stop diag from running on certain devices

Action:
  diag
    Service Aids
      Periodic Diagnostics Service Aid
         Delete a resource from the periodic test list

That didn't work.

Action:  I had the customer change the time at which hdisk0 is
checked so that it differs on each node: diagnostics now run at 5 AM
on hdisk0 of Node1 and at 6 AM on hdisk0 of Node2.  With these
modifications, there shouldn't be any diagnostics running at 3 AM.

I also recommended that the customer run a system trace from 2:55AM to
3:05AM on the node that owns the shared rotating resource group.  This
should be done with HACMP up and running.
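A rough outline of that trace, started just before 2:55 AM (from
cron, an at job, or by hand) on the node owning the rotating
resources; the file names are only examples:

    trace -a -o /tmp/trace.3am      # start tracing asynchronously
    sleep 600                       # cover the 2:55-3:05 AM window
    trcstop                         # stop the trace
    trcrpt -o /tmp/trace.rpt /tmp/trace.3am     # format the raw trace for reading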

After altering the diagnostic startup time of Node2, the customer
tried to restart HACMP on Node2.  rc.cluster core dumped while
attempting to start up HACMP.
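Since rc.cluster is a shell script, the core most likely comes from
one of the binaries it invokes.  Locating the core file and running
the file command against it should point at the actual program (the
path below is an example):

    find / -name core -print 2>/dev/null     # locate the core file
    file /usr/sbin/cluster/core              # reports which program dumped core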

Therefore, I requested that the customer send me the output of the
getinfo script so that I could examine his HACMP configuration.

That didn't work either.

Action:  Looked at what diagd does when it builds the test list.  It
appears to read the /etc/objrepos/CDiagDev ODM object class to build
the list.  Any device in this object class with a Periodic field
that is not 9999 will be tested.  I called Mark back and told
him this.
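To see which devices are currently on the periodic test list, the
object class can be dumped directly (grep -p prints whole ODM
stanzas; the Periodic field appears to hold the scheduled test time,
with 9999 meaning the device is never tested):

    odmget CDiagDev | grep -p hdisk      # full stanzas for the hdisk entries

He will do the following to test this out: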

    cd /etc/objrepos
    cp CDiagDev CDiagDev.orig
    odmget -q "DType like hdisk*" CDiagDev > Filename
    odmdelete -q "DType like hdisk*" -o CDiagDev
    vi Filename        (set Periodic = 9999 for every hdisk on the 7137s)
    odmadd Filename
    set the system time ahead to test the fix
    reboot the system
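After the odmadd, and before setting the time or rebooting, the edit
can be verified straight from the ODM:

    odmget -q "DType like hdisk*" CDiagDev | grep Periodic      # the 7137 entries should now show 9999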

Action:  Mark called back to say that changing the CDiagDev object
class fixed the problem.  He has not received any error messages
since the change was made.
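If the periodic tests ever need to be re-enabled, the copy saved in
the first step can be put back:

    cd /etc/objrepos
    cp CDiagDev.orig CDiagDev      # restore the original periodic test list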


Support Line: WOCB: getting stale partitions / LVM ITEM: AT0478L
Dated: March 1996 Category: N/A