Dont Just Do Something! (For Junior DBA's - like me )

Synopsis
As Oracle DBAs, our immediate reaction is to do something - right now! - to solve the problem. Moreover, it usually does not help when your supervisor is standing behind you asking what's wrong, and how long will it be until the database is available again. While doing something immediately is certainly a noble goal, in many cases it is just the opposite of the proper action. Here the scenario:-
Scenario
I was called out of a meeting when a fellow IT employee reported that many of our users were having difficulty accessing our company's primary order entry application. Like any DBA, I made a beeline for the server room to check on the status of the database server. What I found sent my heart racing: the database server had rebooted itself for some reason from a Windows 2000 "blue screen of death" (BSOD) error, and in fact was still rebooting when I had arrived on the scene.

Since I knew I had configured the Oracle database service and instance to restart itself automatically in these circumstances, I headed back to my desk to start checking on the status of the applications that depend on the database. However, after a long five minutes or so - more than enough time for the server to reboot - we were still getting reports of inaccessibility from our user community.

Back to the server room! I immediately checked the database alert log and found an error message indicating that one of the disks storing the database's datafiles could not be located. Now my heart is really pounding, and I am already thinking about which Recovery Manager backups would have to be applied to restore the datafile, whether all the archived redo logs I would need to recover were present on the server.

Therefore, I just shut down the database instance, restarted the service and attempted to restart the database. Almost magically, everything was fine. Oracle found the datafile on the expected drive, the database restarted just fine, and SMON performed a perfect instance recovery. The database was back online in a few moments, and my heartbeat returned to normal.

It turned out that during the server reboot, the Windows 2000 operating system had just fallen a bit behind in recognizing the "missing" drive, and when it was time to mount the database, Windows told Oracle the drive "wasn't there." Of course, by the time I performed a shutdown and restart, Windows had finished making the disk drive available, and voila! everything was once again copasetic. Our team eventually found the root cause of the BSOD with some help from Microsoft. We traced it down to a Windows driver conflict between the OS and SNMP-based software that is supposed to monitor the server for hardware failures, including (ironically) disk drive failures.

But what if I hadn't resisted my urge to do something right away to fix the problem? I could have made the situation much worse by performing a time-consuming restoration and recovery in the midst of our company's busiest operational period, and possibly corrupted a datafile that was not even damaged.

No comments:

Post a Comment