ASM Health Checker Found 1 New Failures
One or more LUNs/disks became inaccessible due to hardware, cable, or storage controller issues.

Write I/O Errors
He dug deeper into the ASM logs. The health checker hadn't flagged a total crash; it had flagged a "Zombie Process" in the health-check script itself. A legacy script, written years ago by an engineer who had long since moved on, had timed out while trying to ping a decommissioned staging server.
Drop unneeded files, purge the ASM volume recycle bin, or immediately add a new LUN to the disk group (replace <diskgroup_name> with the name of the affected disk group):
ALTER DISKGROUP <diskgroup_name> ADD DISK '/dev/mapper/new_lun1';

Scenario C: ASM Parameter Mismatch Across Nodes
The ASM Health Checker is part of the Oracle Check Framework. It runs periodic checks on the ASM instance, disk groups, and metadata to ensure everything is operating within healthy parameters.
# Navigate to your ASM trace directory
cd $ORACLE_BASE/diag/asm/+asm/+ASM1/trace/
tail -n 200 alert_+ASM1.log
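Tailing the last 200 lines can miss an earlier occurrence, so it may help to search the whole alert log for the message. The following is a minimal sketch, not an Oracle-supplied tool: the function name `find_health_checker_failures` and its context-line default are hypothetical, and the search pattern is simply the message text from the title of this article, matched case-insensitively.

```shell
#!/usr/bin/env bash
# Sketch: list "ASM Health Checker found N new failures" entries in an alert log.
# Assumptions: the caller supplies the log path; the message wording is taken from
# this article's title and may differ between Oracle versions.

find_health_checker_failures() {
    local alert_log="$1"
    local context_lines="${2:-3}"   # hypothetical default: lines shown before each match

    if [[ ! -r "$alert_log" ]]; then
        echo "cannot read alert log: $alert_log" >&2
        return 2
    fi

    # -i: ignore case, -n: print line numbers,
    # -B: include preceding lines, which usually carry the timestamp
    grep -i -n -B "$context_lines" \
        'ASM Health Checker found [0-9][0-9]* new failures' "$alert_log"
}
```

For example, `find_health_checker_failures "$ORACLE_BASE/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log" 5` prints each match with the five lines before it. The function returns grep's exit status, so a status of 1 means the message was not found in that log.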
Check your OS system logs (/var/log/messages or dmesg) for SCSI timeout errors or multipath path failures. Adjust your disk timeout configurations if your SAN fabric is experiencing transient load spikes.

4. Post-Resolution Verification
If your disk group uses external redundancy and a disk fails, the group will likely dismount immediately, potentially crashing your database.

Intermediate States
Beyond the technical remediation, the message “found 1 new failure” is a powerful lesson in monitoring philosophy. It underscores the value of proactive over reactive management. A system that never reports failures is either imaginary or poorly monitored. Failures are inevitable in distributed systems. The question is not if a component will fail, but when and how prepared you are. A health checker that reliably reports a single new failure empowers the operations team to perform a planned, low-impact replacement on a Tuesday afternoon, rather than an emergency, middle-of-the-night recovery following a double failure. It transforms a potential disaster into a routine maintenance ticket.
To identify the exact cause, execute the following steps within your environment:
The execution role used by the health checker lost its secretsmanager:GetSecretValue or secretsmanager:DescribeSecret permissions.
To minimize the likelihood of ASM health checker failures: