ITEM: AT0478L
WOCB: getting stale partitions/lvm
Question:
Env:
4.1.3
Model R30
Two 7137 RAID 5 arrays mirrored to each other.
HACMP 4.1
Desc:
After the system is left up and running, stale partitions appear.
Running syncvg -v on the volume group corrects them.
Would like to know why this happens. Yesterday, the machine
was running fine, no stale partitions. This morning, there were
stale partitions. This machine is not in production at this
time. No applications are installed or running. No users are
on the system. This is occurring on a daily basis.
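For reference, stale copies can be spotted and cleared from the command
line. The volume group name sharedvg below is only an example, not the
customer's actual name.
lsvg sharedvg                   check the STALE PPs field
lsvg -l sharedvg                out-of-sync LVs show an LV STATE of stale
syncvg -v sharedvg              resynchronize all stale partitions in the VG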
The customer stated that the node holding the rotating resource group
gets several DISK_ERR2 errors at 3:00AM on the shared disks of that
group, but not on the disks of the cascading resource
group. The customer stated that if he stops HACMP on this node and
manually acquires all of the resources of the rotating resource group,
the system does not give any errors on those disks at 3:00AM. It
would appear that something is running at 3:00AM on the shared disks
while they are varied on to the system and this is causing a problem.
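A quick way to confirm the 3:00AM pattern is to look at the timestamps
on those entries in the error log of the node in question, for example:
errpt | more                    summary with date/time stamps for each entry
errpt -a | more                 full detail; search for LABEL: DISK_ERR2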
Action:
Checked the error log and found the following:
Label: PSIGDELIVER
ID: A2A97A5F
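The full detail of this entry can be pulled by its error ID, for example:
errpt -a -j A2A97A5F | more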
Customer did not have any core files on the system. The system
is not coming down. It remains stable.
Found a few hits in Xmenu, but none that match the customer's
situation closely enough to be definite. Will research and contact
the customer with additional information.
Desc:
How to stop diag from running on certain devices
Action:
diag
Service Aids
Periodic Diagnostics Service Aid
Delete a resource from the periodic test list
That didn't work.
Action: I had the customer modify the
time that hdisk0 was to be checked to be a different time on each
node. I had him set diagnostics to run at 5AM on hdisk0 of Node1 and
at 6AM on hdisk0 of Node2. With these modifications, there shouldn't
be any diagnostics run at 3AM.
I also recommended that the customer run a system trace from 2:55AM to
3:05AM on the node that owns the shared rotating resource group. This
should be done with HACMP up and running.
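One way to capture this (file names and times here are only an example)
is to start the trace just before 3:00AM and stop it just after, for
instance from an at job on that node:
trace -a -o /tmp/trace.0300     start tracing and return to the shell
sleep 600                       let it run for about 10 minutes
trcstop                         stop the trace
trcrpt /tmp/trace.0300 > /tmp/trcrpt.0300    format the report for review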
After altering the diagnostic startup time of Node2, the customer
tried to restart HACMP on Node2. rc.cluster core dumped while
attempting to start up HACMP.
Therefore, I requested that the customer send me the output of the
getinfo script so that I could examine his HACMP configuration.
That didn't work either.
Action: Looking at what diagd does when it builds the test list, it
appears that it looks in the /etc/objrepos/CDiagDev ODM object class
to build the list. Any device in this object class with a Periodic
field that is not 9999 will be tested. I called Mark back and told
him this. He will do the following to test this out.
cd /etc/objrepos
cp CDiagDev CDiagDev.orig
odmget -q "DType like hdisk*" CDiagDev > Filename
odmdelete -q "DType like hdisk*" -o CDiagDev
vi Filename
change all hdisks that are the 7137 so that Periodic = 9999
odmadd Filename
set time to test fix
reboot system.
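Before the reboot, a quick check that the change took would be
something like:
odmget -q "DType like hdisk*" CDiagDev | grep Periodic
The 7137 hdisks should now show Periodic = 9999.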
Action: Mark called back to say that changing the CDiagDev object class
fixed the problem. He has not received any error messages since
this change was made.
Dated: March 1996 Category: N/A