9 Recovering from Outages

This chapter describes scheduled and unscheduled outages, along with the Oracle recovery processes and architectural framework that manage each outage and minimize downtime. This chapter contains the following sections:

Recovery Steps for Unscheduled Outages
Recovery Steps for Scheduled Outages

9.1 Recovery Steps for Unscheduled Outages

Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:

Hardware such as host machines, storage, switches, cables, and cards
Software, including the operating system, the Oracle database and application server, and application code
Network infrastructure
Naming services infrastructure
Front-end load balancers
The current production site

The monitoring and HA infrastructure should provide rapid detection of failures and rapid recovery from them. Detection is described in Chapter 8, "Using Oracle Enterprise Manager for Monitoring and Detection", while this chapter focuses on the recovery operations for each outage.

Table 9-1 describes the unscheduled outages that impact the primary or secondary site components.

Table 9-1 Unscheduled Outages

Outage	Description	Examples
Site failure	The entire site where the current production database resides is unavailable. This includes all tiers of the application.	Disaster at the production site such as a fire, flood, or earthquake; power outages (if there are multiple power grids and backup generators for critical systems, then this should affect only part of the data center)
Node failure	A node of the RAC cluster is unavailable or fails.	A database tier node fails or has to be shut down because of bad memory or bad CPU; the database tier node is unreachable; both of the redundant cluster interconnects fail, resulting in another node taking ownership
Instance failure	A database instance is unavailable or fails.	An instance of the RAC database on the data server fails because of a software bug or an operating system or hardware problem
Clusterwide failure	The whole cluster hosting the RAC database is unavailable or fails. This includes failures of nodes in the cluster as well as any other components that result in the cluster being unavailable and the Oracle database and instances on the site being unavailable.	The last surviving node on the RAC cluster fails and cannot be restarted; both of the redundant cluster interconnects fail; database corruption is severe enough to disallow continuity on the current data server; disk storage fails
Data failure	This failure results in unavailability of parts of the database because of media corruptions, inaccessibility, or inconsistencies.	A datafile is accidentally removed or is unavailable; media corruption affects blocks of the database; Oracle block corruption is caused by operating system or other node-related problems
User error	This failure results in unavailability of parts of the database and causes transactional or logical data inconsistencies. It is usually caused by the operator or by bugs in the application code. This is estimated to be the greatest single cause of downtime.	Localized damage (needs surgical repair): user error results in a table being dropped or in rows being deleted from a table. Widespread damage (needs drastic action to avoid downtime): application errors result in logical corruptions in the database; operator error results in a batch job being run more times than specified. Note: This category focuses on user errors that affect database availability and, in particular, cause transactional or logical data inconsistencies.

The rest of this section provides outage decision trees for unscheduled outages on the primary site and the secondary site. The decision trees appear in the following sections:

Recovery Steps for Unscheduled Outages on the Primary Site
Recovery Steps for Unscheduled Outages on the Secondary Site

The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. These descriptions are found in Chapter 10, "Detailed Recovery Steps".

Some outages require multiple recovery steps. For example, when a site failure occurs, the outage decision matrix states that Data Guard failover must occur before site failover. Some outages are handled automatically without any loss of availability. For example, instance failure is managed automatically by RAC. Multiple recovery options for each outage are listed wherever relevant.

9.1.1 Recovery Steps for Unscheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for maximum availability of the system. Only the "Data Guard only" and MAA architectures have a secondary site to protect from site disasters. The estimated recovery times (ERT) are examples only, derived from customer experience and actual testing, and do not represent a guaranteed recovery time.

Table 9-2 summarizes the recovery steps for unscheduled outages on the primary site.

Table 9-2 Recovery Steps for Unscheduled Outages on the Primary Site

Reason for Outage	Recovery Steps for "Database Only" Architecture	Recovery Steps for "RAC Only" Architecture	Recovery Steps for "Data Guard Only" Architecture	Recovery Steps for MAA
Site failure	ERT: hours to days. Restore site. Restore from tape backups. Recover database.	ERT: hours to days. Restore site. Restore from tape backups. Recover database.	ERT: minutes to an hour. Database Failover; Complete or Partial Site Failover	ERT: minutes to an hour. Database Failover; Complete or Partial Site Failover
Node failure	ERT: minutes to an hour. Restart node and restart database. Reconnect users.	ERT: seconds to minutes. Managed automatically by RAC Recovery	ERT: minutes to an hour. Restart node and restart database. Reconnect users. Or, ERT: minutes to an hour. Database Failover; Complete or Partial Site Failover	ERT: seconds to minutes. Managed automatically by RAC Recovery
Instance failure	ERT: minutes. Restart instance. Reconnect users.	ERT: seconds to minutes. Managed automatically by RAC Recovery	ERT: minutes. Restart instance. Reconnect users.	ERT: seconds to minutes. Managed automatically by RAC Recovery
Clusterwide failure	N/A	ERT: hours to days. Restore cluster or restore at least one node. Restore from tape backups. Recover database.	N/A	ERT: minutes to an hour. Database Failover; Complete or Partial Site Failover
Data failure	ERT: minutes to an hour. Recovery Solutions for Data Failures	ERT: minutes to an hour. Recovery Solutions for Data Failures	ERT: minutes to an hour. Recovery Solutions for Data Failures. Or, ERT: minutes to an hour. Database Failover; Complete or Partial Site Failover. Note: For primary database media failures or media corruptions, database failover may minimize data loss.	ERT: minutes to an hour. Recovery Solutions for Data Failures. Or, ERT: minutes to an hour. Database Failover; Complete or Partial Site Failover. Note: For primary database media failures or media corruptions, database failover may minimize data loss.
User error	ERT: minutes. Recovering from User Error with Flashback Technology	ERT: minutes. Recovering from User Error with Flashback Technology	ERT: minutes. Recovering from User Error with Flashback Technology	ERT: minutes. Recovering from User Error with Flashback Technology
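
The lookup that Table 9-2 supports can be sketched in a few lines of Python. This is only an illustrative encoding of the table, not an Oracle tool or API: the names `ARCHITECTURES`, `TABLE_9_2`, and `recovery_options` are hypothetical, and the ERT strings are the example values from the table, not guarantees.

```python
from typing import Dict, Tuple

# (estimated recovery time, recovery steps). ERT values are examples only.
Option = Tuple[str, str]

# Column order of Table 9-2 (hypothetical constant name).
ARCHITECTURES = ("Database Only", "RAC Only", "Data Guard Only", "MAA")

RESTORE_SITE = ("hours to days",
                "Restore site. Restore from tape backups. Recover database.")
RESTORE_CLUSTER = ("hours to days",
                   "Restore cluster or restore at least one node. "
                   "Restore from tape backups. Recover database.")
FAILOVER = ("minutes to an hour",
            "Database Failover; Complete or Partial Site Failover")
RESTART_NODE = ("minutes to an hour",
                "Restart node and restart database. Reconnect users.")
RESTART_INSTANCE = ("minutes", "Restart instance. Reconnect users.")
RAC_AUTO = ("seconds to minutes", "Managed automatically by RAC Recovery")
DATA_REPAIR = ("minutes to an hour", "Recovery Solutions for Data Failures")
FLASHBACK = ("minutes", "Recovering from User Error with Flashback Technology")

# One row per outage, one cell per architecture in ARCHITECTURES order.
# Each cell lists the alternative options; an empty cell means N/A.
TABLE_9_2: Dict[str, Tuple[Tuple[Option, ...], ...]] = {
    "site failure": ((RESTORE_SITE,), (RESTORE_SITE,), (FAILOVER,), (FAILOVER,)),
    "node failure": ((RESTART_NODE,), (RAC_AUTO,),
                     (RESTART_NODE, FAILOVER), (RAC_AUTO,)),
    "instance failure": ((RESTART_INSTANCE,), (RAC_AUTO,),
                         (RESTART_INSTANCE,), (RAC_AUTO,)),
    "clusterwide failure": ((), (RESTORE_CLUSTER,), (), (FAILOVER,)),
    "data failure": ((DATA_REPAIR,), (DATA_REPAIR,),
                     (DATA_REPAIR, FAILOVER), (DATA_REPAIR, FAILOVER)),
    "user error": ((FLASHBACK,),) * 4,
}


def recovery_options(outage: str, architecture: str) -> Tuple[Option, ...]:
    """Return the alternative recovery options for an outage, or () for N/A."""
    column = ARCHITECTURES.index(architecture)  # ValueError if unknown
    return TABLE_9_2[outage.lower()][column]    # KeyError if unknown


# Print every alternative for one cell of the table.
for ert, steps in recovery_options("Node failure", "Data Guard Only"):
    print(f"ERT {ert}: {steps}")
```

Keeping each cell as a tuple of alternatives preserves the "or" choices in the table, such as restarting the node versus failing over in the "Data Guard Only" architecture.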

9.1.2 Recovery Steps for Unscheduled Outages on the Secondary Site

Outages on the secondary site do not directly affect availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may impact the mean time to recover (MTTR) if there are concurrent failures on the primary site. In most cases, outages on the secondary site can be managed with no impact on availability. However, if maximum protection mode is part of the configuration, then an unscheduled outage on the last surviving standby database causes downtime on the production database. After downgrading the data protection mode, you can restart the production database.
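
The rule in the preceding paragraph can be written as a small predicate. The following Python sketch is a simplified model of that rule only, under the assumption that nothing else is failing at the same time. The function name `production_stays_up` and its parameters are hypothetical and do not correspond to any Oracle interface.

```python
PROTECTION_MODES = (
    "maximum protection",
    "maximum availability",
    "maximum performance",
)


def production_stays_up(protection_mode: str, reachable_standbys: int) -> bool:
    """Model whether production keeps running after a standby outage.

    reachable_standbys is the number of standby databases that are still
    available after the outage (a hypothetical input for this sketch).
    """
    mode = protection_mode.lower()
    if mode not in PROTECTION_MODES:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")
    if reachable_standbys < 0:
        raise ValueError("reachable_standbys cannot be negative")
    if mode == "maximum protection":
        # Losing the last surviving standby causes production downtime.
        return reachable_standbys > 0
    # In the other modes a standby outage does not stop production.
    return True


print(production_stays_up("maximum protection", 0))    # False
print(production_stays_up("maximum availability", 0))  # True
```

A monitoring script could use a check of this kind to decide whether a standby outage warrants downgrading the protection mode before the production database is restarted.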

Table 9-3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site.

Table 9-3 Recovery Steps for Unscheduled Outages of the Standby Database on the Secondary Site

Reason for Outage	Recovery Steps for "Data Guard Only" Architecture	Recovery Steps for MAA
Standby apply instance failure	Restart node and standby instance. Restart recovery. If there is only one standby database and if maximum database protection is configured, then the production database will shut down to ensure that there is no data divergence with the standby database.	ERT: seconds. Apply Instance Failover. There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance. Restart node and instance when they are available.
Standby non-apply instance failure	N/A	There is no effect on availability because the primary standby node or instance receives redo logs and applies them with the recovery process. The production database continues to communicate with this standby instance. Restart node and instance when they are available.
Data failure such as media failure or disk corruption	Step 2: Start Recovery	Step 2: Start Recovery
Primary database resets logs because of flashback operations or media recovery	Restoring Fault Tolerance After the Production Database Has Opened Resetlogs	Restoring Fault Tolerance After the Production Database Has Opened Resetlogs

9.2 Recovery Steps for Scheduled Outages

Scheduled outages are planned outages. They are required for regular maintenance of the technology infrastructure that supports the application and include tasks such as hardware maintenance, repair, and upgrades; software upgrades and patching; application changes and patching; and changes to improve performance and manageability of systems. Plan scheduled outages for times that least disrupt application availability.

Table 9-4 describes the scheduled outages that impact either the primary or secondary site components.

Table 9-4 Scheduled Outages

Outage Class	Description	Examples
Site-wide	The entire site where the current production database resides is unavailable. This is usually known well in advance and can be scheduled.	Scheduled power outages; site maintenance; regular planned switchovers to test infrastructure
Hardware maintenance (node impact)	This is scheduled downtime of a database server node for hardware maintenance. The scope of this downtime is restricted to a node of the database cluster.	Repair of a failed component such as a memory card or CPU board; addition of memory or CPU to an existing node in the database tier
Hardware maintenance (clusterwide impact)	This is scheduled downtime of the database server cluster for hardware maintenance.	Some cases of adding a node to the cluster; upgrade or repair of the cluster interconnect; upgrade to the storage tier that requires downtime on the database tier
System software maintenance (node impact)	This is scheduled downtime of a database server node for system software maintenance. The scope of the downtime is restricted to a node.	Upgrade of a software component such as the operating system; changes to the configuration parameters for the operating system
System software maintenance (clusterwide impact)	This is scheduled downtime of the database server cluster for system software maintenance.	Upgrade or patching of the cluster software; upgrade of the volume management software
Oracle patch upgrade for the database	Scheduled downtime for an Oracle patch	Patch Oracle software to fix a specific customer issue
Oracle patch set or software upgrade for the database	Scheduled downtime for Oracle patch set or software upgrade	Patching Oracle software with a patch set; upgrading Oracle software
Database object reorganization	These are changes to the logical structure or the physical organization of Oracle database objects. The primary reason for these changes is to improve performance or manageability. This is always a planned activity. The method and the time chosen to do the reorganization should be planned and appropriate. Using Oracle's online reorganization features enables objects to be available during the reorganization.	Moving an object to a different tablespace; converting a table to a partitioned table; renaming or dropping columns of a table

The rest of this section provides outage decision trees for scheduled outages. They appear in the following sections:

Recovery Steps for Scheduled Outages on the Primary Site
Recovery Steps for Scheduled Outages on the Secondary Site

The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. The detailed descriptions of the recovery operations are found in Chapter 10, "Detailed Recovery Steps".

This section also includes the following topic:

Preparing for Scheduled Secondary Site Maintenance

9.2.1 Recovery Steps for Scheduled Outages on the Primary Site

Table 9-5 shows the recovery steps for scheduled outages on the primary site.

Table 9-5 Recovery Steps for Scheduled Outages on the Primary Site

Scope of Outage	Reason for Outage	Recovery Steps for "Database Only" Architecture	Recovery Steps for "RAC Only" Architecture	Recovery Steps for "Data Guard Only" Architecture	Recovery Steps for MAA
Site	Site shutdown	Downtime for entire duration	Downtime for entire duration	Database Switchover; Complete or Partial Site Failover	Database Switchover; Complete or Partial Site Failover
Primary database	Hardware maintenance (node impact)	Downtime for entire duration	Managed automatically by RAC Recovery	Database Switchover; Complete or Partial Site Failover	Managed automatically by RAC Recovery
Primary database	Hardware maintenance (clusterwide impact)	Downtime for entire duration	Downtime for entire duration	Database Switchover; Complete or Partial Site Failover	Database Switchover; Complete or Partial Site Failover
Primary database	System software maintenance (node impact)	Downtime for entire duration	Managed automatically by RAC Recovery	Database Switchover; Complete or Partial Site Failover	Managed automatically by RAC Recovery
Primary database	System software maintenance (clusterwide impact)	Downtime for entire duration	Downtime for entire duration	Database Switchover; Complete or Partial Site Failover	Database Switchover; Complete or Partial Site Failover
Primary database	Oracle patch upgrade for the database	Downtime for entire duration	RAC Rolling Upgrade	Downtime for entire duration	RAC Rolling Upgrade
Primary database	Oracle patch set or software upgrade for the database	Downtime for entire duration	Downtime for entire duration	Upgrade with Logical Standby Database	Upgrade with Logical Standby Database
Primary database	Database object reorganization	Online Object Reorganization	Online Object Reorganization	Online Object Reorganization	Online Object Reorganization
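
One way to read Table 9-5 is to ask which architectures offer something better than downtime for the entire duration of a given maintenance task. The Python sketch below encodes the table for that purpose. It is an illustration only: the names `TABLE_9_5`, `recovery_step`, and `architectures_without_full_downtime` are hypothetical, and a listed architecture may still incur a brief interruption (for example, during a switchover).

```python
from typing import Dict, List, Tuple

# Column order of Table 9-5 (hypothetical constant name).
ARCHITECTURES = ("Database Only", "RAC Only", "Data Guard Only", "MAA")

DOWNTIME = "Downtime for entire duration"
SWITCHOVER = "Database Switchover; Complete or Partial Site Failover"
RAC_AUTO = "Managed automatically by RAC Recovery"
ROLLING = "RAC Rolling Upgrade"
LOGICAL = "Upgrade with Logical Standby Database"
ONLINE = "Online Object Reorganization"

# One row per reason for outage, one cell per architecture.
TABLE_9_5: Dict[str, Tuple[str, str, str, str]] = {
    "site shutdown": (DOWNTIME, DOWNTIME, SWITCHOVER, SWITCHOVER),
    "hardware maintenance (node impact)":
        (DOWNTIME, RAC_AUTO, SWITCHOVER, RAC_AUTO),
    "hardware maintenance (clusterwide impact)":
        (DOWNTIME, DOWNTIME, SWITCHOVER, SWITCHOVER),
    "system software maintenance (node impact)":
        (DOWNTIME, RAC_AUTO, SWITCHOVER, RAC_AUTO),
    "system software maintenance (clusterwide impact)":
        (DOWNTIME, DOWNTIME, SWITCHOVER, SWITCHOVER),
    "oracle patch upgrade for the database":
        (DOWNTIME, ROLLING, DOWNTIME, ROLLING),
    "oracle patch set or software upgrade for the database":
        (DOWNTIME, DOWNTIME, LOGICAL, LOGICAL),
    "database object reorganization": (ONLINE, ONLINE, ONLINE, ONLINE),
}


def recovery_step(reason: str, architecture: str) -> str:
    """Return the Table 9-5 cell for a reason and an architecture."""
    return TABLE_9_5[reason.lower()][ARCHITECTURES.index(architecture)]


def architectures_without_full_downtime(reason: str) -> List[str]:
    """List architectures whose cell is not 'Downtime for entire duration'."""
    row = TABLE_9_5[reason.lower()]
    return [arch for arch, step in zip(ARCHITECTURES, row) if step != DOWNTIME]


print(architectures_without_full_downtime("Site shutdown"))
# ['Data Guard Only', 'MAA']
```

Read this way, MAA is the only architecture in the table that never falls back to downtime for the entire duration.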

9.2.2 Recovery Steps for Scheduled Outages on the Secondary Site

Outages on the secondary site do not impact availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may affect the MTTR if there are concurrent failures on the primary site. Outages on the secondary site can be managed with no impact on availability. If maximum protection database mode is configured, then downgrade the protection mode before a scheduled outage on the standby instance or database so that there will be no downtime on the production database.

Table 9-6 describes the recovery steps for scheduled outages on the secondary site.

Table 9-6 Recovery Steps for Scheduled Outages on the Secondary Site

Scope of Outage	Reason for Outage	Recovery Steps for "Data Guard Only" Architecture	Recovery Steps for MAA
Site	Site shutdown	Before the outage: "Preparing for Scheduled Secondary Site Maintenance". After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage".	Before the outage: "Preparing for Scheduled Secondary Site Maintenance". After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage".
Standby database	Hardware or software maintenance on the node that is running the managed recovery process (MRP)	Before the outage: "Preparing for Scheduled Secondary Site Maintenance"	Before the outage: "Preparing for Scheduled Secondary Site Maintenance"
Standby database	Hardware or software maintenance on a node that is not running the MRP	N/A	No impact because the primary standby node or instance receives redo logs that are applied with the managed recovery process. After the outage: Restart node and instance when available.
Standby database	Hardware or software maintenance (clusterwide impact)	N/A	Before the outage: "Preparing for Scheduled Secondary Site Maintenance". After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage".
Standby database	Oracle patch and software upgrades	Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode.	Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode.

9.2.3 Preparing for Scheduled Secondary Site Maintenance

To achieve continued service during a secondary site scheduled outage, downgrade the maximum protection mode to maximum availability or maximum performance. When you are scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time the standby lags behind the production database, which lengthens the time to restore fault tolerance.

Table 9-7 shows how to prepare for scheduled secondary site maintenance.

Table 9-7 Preparing for Scheduled Secondary Site Maintenance

Production Database Protection Mode	Reason for Outage	Preparation Steps for "Data Guard Only" Architecture and MAA
Maximum protection	Site shutdown	Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode"
Maximum protection	Hardware maintenance (clusterwide impact)	Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode"
Maximum protection	Software maintenance (clusterwide impact)	Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode"
Maximum protection	Hardware maintenance on the primary node (the node that is running the recovery process)	Apply Instance Failover (MAA only) or switch the production data protection mode to either maximum availability or maximum performance
Maximum protection	Software maintenance on the primary node (the node that is running the recovery process)	Apply Instance Failover (MAA only) or switch the production data protection mode to either maximum availability or maximum performance
Maximum availability or maximum performance	Site shutdown	None; no impact on production database
Maximum availability or maximum performance	Hardware maintenance (clusterwide impact)	None; no impact on production database
Maximum availability or maximum performance	Software maintenance (clusterwide impact)	None; no impact on production database
Maximum availability or maximum performance	Hardware maintenance on the primary node (the node that is running the recovery process)	Apply Instance Failover (MAA only) or None; no impact on production database
Maximum availability or maximum performance	Software maintenance on the primary node (the node that is running the recovery process)	Apply Instance Failover (MAA only) or None; no impact on production database
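
The ten rows of Table 9-7 reduce to two independent checks: whether the outage affects the node running the recovery process in an MAA configuration, and whether the production database runs in maximum protection mode. The following Python sketch expresses that reading of the table. The function `preparation_options` and the constant names are hypothetical, and the shortened reason strings are assumptions made for this example.

```python
from typing import List

APPLY_FAILOVER = "Apply Instance Failover"
DOWNGRADE = ("Switch the production data protection mode to either "
             "maximum availability or maximum performance")
NO_ACTION = "None; no impact on production database"

PROTECTION_MODES = ("maximum protection", "maximum availability",
                    "maximum performance")
# Outages that take down the whole standby site or cluster.
SITE_OR_CLUSTER = ("site shutdown",
                   "hardware maintenance (clusterwide impact)",
                   "software maintenance (clusterwide impact)")
# Outages limited to the node that is running the recovery process.
APPLY_NODE = ("hardware maintenance on the primary node",
              "software maintenance on the primary node")


def preparation_options(protection_mode: str, reason: str,
                        architecture: str) -> List[str]:
    """Return the alternative preparation steps from Table 9-7."""
    mode, why = protection_mode.lower(), reason.lower()
    if mode not in PROTECTION_MODES:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")
    if why not in SITE_OR_CLUSTER + APPLY_NODE:
        raise ValueError(f"unknown reason for outage: {reason!r}")
    if architecture not in ("Data Guard Only", "MAA"):
        raise ValueError(f"no secondary site in architecture: {architecture!r}")

    options: List[str] = []
    # Apply instance failover is listed for MAA only, and only when the
    # outage is limited to the node running the recovery process.
    if why in APPLY_NODE and architecture == "MAA":
        options.append(APPLY_FAILOVER)
    # Otherwise the step depends solely on the protection mode.
    options.append(DOWNGRADE if mode == "maximum protection" else NO_ACTION)
    return options


for step in preparation_options("maximum protection",
                                "hardware maintenance on the primary node",
                                "MAA"):
    print(step)
```

Each returned list holds alternatives, matching the "or" in the table cells, so a caller would choose one entry and not perform all of them.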