Oracle® Database High Availability Architecture and Best Practices 10g Release 1 (10.1) Part Number B10726-02
This chapter describes scheduled and unscheduled outages and the Oracle recovery process and architectural framework that can manage each outage and minimize downtime. It covers unscheduled outages first, followed by scheduled outages.
Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:
Hardware such as host machines, storage, switches, cables, and cards
Software, including the operating system, the Oracle database and application server, and application code
Network infrastructure
Naming services infrastructure
Front-end load balancers
The current production site
The monitoring and HA infrastructure should provide rapid detection of failures and rapid recovery from them. Detection is described in Chapter 8, "Using Oracle Enterprise Manager for Monitoring and Detection", while this chapter focuses on the recovery operations for each outage.
Table 9-1 describes the unscheduled outages that impact the primary or secondary site components.
The rest of this section provides outage decision trees for unscheduled outages, first for the primary site and then for the secondary site.
The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. These descriptions are found in Chapter 10, "Detailed Recovery Steps".
Some outages require multiple recovery steps. For example, when a site failure occurs, the outage decision matrix states that Data Guard failover must occur before site failover. Some outages are handled automatically without any loss of availability. For example, instance failure is managed automatically by RAC. Multiple recovery options for each outage are listed wherever relevant.
If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for maximum availability of the system. Only the "Data Guard only" and MAA architectures have a secondary site to protect against site disasters. The estimated recovery times (ERT) are examples only. They are derived from customer experience and actual testing, and they do not represent a guaranteed recovery time.
Table 9-2 summarizes the recovery steps for unscheduled outages on the primary site.
Table 9-2 Recovery Steps for Unscheduled Outages on the Primary Site
Reason for Outage | Recovery Steps for "Database Only" Architecture | Recovery Steps for "RAC Only" Architecture | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
---|---|---|---|---|
Site failure | ERT: hours to days | ERT: hours to days | ERT: minutes to an hour | ERT: minutes to an hour |
Node failure | ERT: minutes to an hour | ERT: seconds to minutes. Managed automatically by RAC Recovery | ERT: minutes to an hour, or ERT: minutes to an hour | ERT: seconds to minutes. Managed automatically by RAC Recovery |
Instance failure | ERT: minutes | ERT: seconds to minutes. Managed automatically by RAC Recovery | ERT: minutes | ERT: seconds to minutes. Managed automatically by RAC Recovery |
Clusterwide failure | N/A | ERT: hours to days | N/A | ERT: minutes to an hour |
Data failure | ERT: minutes to an hour | ERT: minutes to an hour | ERT: minutes to an hour. Recovery Solutions for Data Failures, or ERT: minutes to an hour. Note: For primary database media failures or media corruptions, database failover may minimize data loss. | ERT: minutes to an hour. Recovery Solutions for Data Failures, or ERT: minutes to an hour. Note: For primary database media failures or media corruptions, database failover may minimize data loss. |
User error | ERT: minutes | ERT: minutes | ERT: minutes | ERT: minutes |
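The estimated recovery times in Table 9-2 can be kept as a small lookup structure, for example to compare architectures in a planning script. The following is a minimal sketch in Python. The dictionary layout and the function name `estimated_recovery_time` are illustrative assumptions, not part of any Oracle tool, and the values are the example ERTs from the table rather than guaranteed times.

```python
# Example ERT values transcribed from Table 9-2.
# Column order: "Database Only", "RAC Only", "Data Guard Only", "MAA".
# None marks a cell that the table lists as N/A.
ARCHITECTURES = ("Database Only", "RAC Only", "Data Guard Only", "MAA")

ERT_PRIMARY_UNSCHEDULED = {
    "Site failure": ("hours to days", "hours to days",
                     "minutes to an hour", "minutes to an hour"),
    "Node failure": ("minutes to an hour", "seconds to minutes",
                     "minutes to an hour", "seconds to minutes"),
    "Instance failure": ("minutes", "seconds to minutes",
                         "minutes", "seconds to minutes"),
    "Clusterwide failure": (None, "hours to days",
                            None, "minutes to an hour"),
    "Data failure": ("minutes to an hour",) * 4,
    "User error": ("minutes",) * 4,
}


def estimated_recovery_time(outage, architecture):
    """Return the example ERT for an outage, or None where the table says N/A."""
    if outage not in ERT_PRIMARY_UNSCHEDULED:
        raise KeyError(f"unknown outage: {outage!r}")
    if architecture not in ARCHITECTURES:
        raise KeyError(f"unknown architecture: {architecture!r}")
    column = ARCHITECTURES.index(architecture)
    return ERT_PRIMARY_UNSCHEDULED[outage][column]
```

For example, `estimated_recovery_time("Node failure", "MAA")` returns the string for the MAA column of the node failure row, which makes the benefit of RAC-managed recovery easy to compare against the "Database Only" column.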
Outages on the secondary site do not directly affect availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may increase the mean time to recover (MTTR) if there are concurrent failures on the primary site. In most cases, outages on the secondary site can be managed with no impact on availability. However, if maximum protection mode is part of the configuration, then an unscheduled outage on the last surviving standby database causes downtime on the production database. After downgrading the data protection mode, you can restart the production database.
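The rule in the preceding paragraph can be written as a small decision function. This is a minimal sketch under stated assumptions: the function name, the mode strings, and the returned messages are hypothetical and only restate the paragraph. It does not query or change a real Data Guard configuration.

```python
PROTECTION_MODES = (
    "maximum protection",
    "maximum availability",
    "maximum performance",
)


def production_impact_of_standby_outage(protection_mode, surviving_standbys):
    """Describe how an unscheduled standby outage affects production.

    protection_mode: one of PROTECTION_MODES.
    surviving_standbys: number of standby databases still available
        after the outage.
    """
    if protection_mode not in PROTECTION_MODES:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")
    if surviving_standbys < 0:
        raise ValueError("surviving_standbys cannot be negative")

    # Losing the last surviving standby in maximum protection mode
    # causes downtime on the production database.
    if protection_mode == "maximum protection" and surviving_standbys == 0:
        return ("production downtime: downgrade the data protection mode, "
                "then restart the production database")
    return "no direct impact on production availability"
```

A call such as `production_impact_of_standby_outage("maximum protection", 0)` reports downtime, while the same outage in maximum availability mode reports no direct impact.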
Table 9-3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site.
Table 9-3 Recovery Steps for Unscheduled Outages of the Standby Database on the Secondary Site
Reason for Outage | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
---|---|---|
Standby apply instance failure | If there is only one standby database and if maximum database protection is configured, then the production database will shut down to ensure that there is no data divergence with the standby database. | ERT: seconds. There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance. Restart node and instance when they are available. |
Standby non-apply instance failure | N/A | There is no effect on availability because the primary node or instance receives redo logs and applies them with the recovery process. The production database continues to communicate with this standby instance. Restart node and instance when they are available. |
Data failure such as media failure or disk corruption | Step 2: Start Recovery | Step 2: Start Recovery |
Primary database resets logs because of flashback operations or media recovery | Restoring Fault Tolerance After the Production Database Has Opened Resetlogs | Restoring Fault Tolerance After the Production Database Has Opened Resetlogs |
Scheduled outages are planned outages. They are required for regular maintenance of the technology infrastructure that supports the application and include tasks such as hardware maintenance, repair, and upgrades; software upgrades and patching; application changes and patching; and changes to improve performance and manageability of systems. Plan scheduled outages for the times that are least disruptive to application availability.
Table 9-4 describes the scheduled outages that impact either the primary or secondary site components.
Table 9-4 Scheduled Outages
Outage Class | Description | Examples |
---|---|---|
Site-wide | The entire site where the current production database resides is unavailable. This is usually known well in advance and can be scheduled. | |
Hardware maintenance (node impact) | This is scheduled downtime of a database server node for hardware maintenance. The scope of this downtime is restricted to a node of the database cluster. | |
Hardware maintenance (clusterwide impact) | This is scheduled downtime of the database server cluster for hardware maintenance. | |
System software maintenance (node impact) | This is scheduled downtime of a database server node for system software maintenance. The scope of the downtime is restricted to a node. | |
System software maintenance (clusterwide impact) | This is scheduled downtime of the database server cluster for system software maintenance. | |
Oracle patch upgrade for the database | Scheduled downtime for an Oracle patch | Patch Oracle software to fix a specific customer issue |
Oracle patch set or software upgrade for the database | Scheduled downtime for Oracle patch set or software upgrade | |
Database object reorganization | These are changes to the logical structure or the physical organization of Oracle database objects. The primary reason for these changes is to improve performance or manageability. This is always a planned activity. The method and the time chosen to do the reorganization should be planned and appropriate. Using Oracle's online reorganization features enables objects to be available during the reorganization. | |
The rest of this section provides outage decision trees for scheduled outages, first for the primary site and then for the secondary site.
The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. The detailed descriptions of the recovery operations are found in Chapter 10, "Detailed Recovery Steps".
This section also includes the topic "Preparing for Scheduled Secondary Site Maintenance".
If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for continued availability of the system.
Table 9-5 shows the recovery steps for scheduled outages on the primary site.
Table 9-5 Recovery Steps for Scheduled Outages on the Primary Site
Scope of Outage | Reason for Outage | Recovery Steps for "Database Only" Architecture | Recovery Steps for "RAC Only" Architecture | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
---|---|---|---|---|---|
Site | Site shutdown | Downtime for entire duration | Downtime for entire duration | | |
Primary database | Hardware maintenance (node impact) | Downtime for entire duration | Managed automatically by RAC Recovery | | Managed automatically by RAC Recovery |
Primary database | Hardware maintenance (clusterwide impact) | Downtime for entire duration | Downtime for entire duration | | |
Primary database | System software maintenance (node impact) | Downtime for entire duration | Managed automatically by RAC Recovery | | Managed automatically by RAC Recovery |
Primary database | System software maintenance (clusterwide impact) | Downtime for entire duration | Downtime for entire duration | | |
Primary database | Oracle patch upgrade for the database | Downtime for entire duration | RAC Rolling Upgrade | Downtime for entire duration | RAC Rolling Upgrade |
Primary database | Oracle patch set or software upgrade for the database | Downtime for entire duration | Downtime for entire duration | Upgrade with Logical Standby Database | Upgrade with Logical Standby Database |
Primary database | Database object reorganization | Online Object Reorganization | Online Object Reorganization | Online Object Reorganization | Online Object Reorganization |
Outages on the secondary site do not directly affect availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may affect the MTTR if there are concurrent failures on the primary site. Otherwise, they can be managed with no impact on availability. If maximum protection mode is configured, then downgrade the protection mode before a scheduled outage on the standby instance or database so that there is no downtime on the production database.
Table 9-6 describes the recovery steps for scheduled outages on the secondary site.
Table 9-6 Recovery Steps for Scheduled Outages on the Secondary Site
Scope of Outage | Reason for Outage | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
---|---|---|---|
Site | Site shutdown | Before the outage: "Preparing for Scheduled Secondary Site Maintenance". After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage" | Before the outage: "Preparing for Scheduled Secondary Site Maintenance". After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage" |
Standby database | Hardware or software maintenance on the node that is running the managed recovery process (MRP) | Before the outage: "Preparing for Scheduled Secondary Site Maintenance" | Before the outage: "Preparing for Scheduled Secondary Site Maintenance" |
Standby database | Hardware or software maintenance on a node that is not running the MRP | N/A | No impact because the primary standby node or instance receives redo logs that are applied with the managed recovery process. After the outage: Restart node and instance when available. |
Standby database | Hardware or software maintenance (clusterwide impact) | N/A | Before the outage: "Preparing for Scheduled Secondary Site Maintenance". After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage" |
Standby database | Oracle patch and software upgrades | Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode. | Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode. |
To achieve continued service during a scheduled outage on the secondary site, downgrade the protection mode from maximum protection to maximum availability or maximum performance. When you schedule secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time that the standby database lags behind the production database, which lengthens the time to restore fault tolerance.
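The relationship between outage duration and the time to restore fault tolerance can be illustrated with simple arithmetic. The sketch below rests on a simplified model that is an assumption and not taken from this guide: redo accumulates at a constant rate while the standby is down, and after restart the standby applies redo at a constant, faster rate while production keeps generating redo. The function name, parameters, and units are hypothetical.

```python
def hours_to_restore_fault_tolerance(outage_hours,
                                     redo_rate_gb_per_hour,
                                     apply_rate_gb_per_hour):
    """Estimate the catch-up time for a standby after a scheduled outage.

    Simplified model (assumption): constant redo generation on production
    and constant redo apply on the standby.
    """
    if outage_hours < 0 or redo_rate_gb_per_hour < 0:
        raise ValueError("outage duration and redo rate must be non-negative")
    if apply_rate_gb_per_hour <= redo_rate_gb_per_hour:
        raise ValueError("the standby cannot catch up unless it applies redo "
                         "faster than production generates it")

    # Redo that accumulated while the standby was unavailable.
    backlog_gb = outage_hours * redo_rate_gb_per_hour
    # The backlog shrinks at the apply rate minus the ongoing redo rate.
    return backlog_gb / (apply_rate_gb_per_hour - redo_rate_gb_per_hour)
```

Under this model, a 4-hour outage at 5 GB of redo per hour with a 25 GB per hour apply rate needs 1 hour to catch up, and doubling the outage to 8 hours doubles the catch-up time to 2 hours. Real apply rates vary, so treat the result as a rough planning figure.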
Table 9-7 shows how to prepare for scheduled secondary site maintenance.
Table 9-7 Preparing for Scheduled Secondary Site Maintenance
Production Database Protection Mode | Reason for Outage | Preparation Steps for "Data Guard Only" Architecture and MAA |
---|---|---|
Maximum protection | Site shutdown | Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode" |
Maximum protection | Hardware maintenance (clusterwide impact) | Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode" |
Maximum protection | Software maintenance (clusterwide impact) | Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode" |
Maximum protection | Hardware maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only), or Switch the production data protection mode to either maximum availability or maximum performance |
Maximum protection | Software maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only), or Switch the production data protection mode to either maximum availability or maximum performance |
Maximum availability or maximum performance | Site shutdown | None; no impact on production database |
Maximum availability or maximum performance | Hardware maintenance (clusterwide impact) | None; no impact on production database |
Maximum availability or maximum performance | Software maintenance (clusterwide impact) | None; no impact on production database |
Maximum availability or maximum performance | Hardware maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only), or None; no impact on production database |
Maximum availability or maximum performance | Software maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only), or None; no impact on production database |
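The preparation steps in Table 9-7 follow a regular pattern, which the following Python sketch restates as a function. It is only an illustration of the table: the function name, the shortened reason strings, and the list-of-alternatives return shape are assumptions introduced here, and nothing in it connects to a database.

```python
MAX_PROTECTION = "maximum protection"
OTHER_MODES = ("maximum availability", "maximum performance")

# Reasons that take the whole standby site or cluster offline.
SITE_OR_CLUSTER_REASONS = {
    "site shutdown",
    "hardware maintenance (clusterwide impact)",
    "software maintenance (clusterwide impact)",
}
# Reasons that affect only the node running the recovery process.
RECOVERY_NODE_REASONS = {
    "hardware maintenance on the primary node",
    "software maintenance on the primary node",
}

DOWNGRADE = ("Switch the production data protection mode to either "
             "maximum availability or maximum performance")
APPLY_FAILOVER = "Apply Instance Failover (MAA only)"
NO_ACTION = "None; no impact on production database"


def preparation_steps(protection_mode, reason):
    """Return the alternative preparation steps that Table 9-7 lists."""
    mode = protection_mode.strip().lower()
    why = reason.strip().lower()
    if mode != MAX_PROTECTION and mode not in OTHER_MODES:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")

    # In maximum protection mode the fallback is a downgrade; otherwise
    # the production database is not affected.
    fallback = DOWNGRADE if mode == MAX_PROTECTION else NO_ACTION
    if why in SITE_OR_CLUSTER_REASONS:
        return [fallback]
    if why in RECOVERY_NODE_REASONS:
        return [APPLY_FAILOVER, fallback]
    raise ValueError(f"unknown reason for outage: {reason!r}")
```

For example, `preparation_steps("Maximum protection", "Site shutdown")` returns only the downgrade step, while maintenance on the node that runs the recovery process returns two alternatives, with apply instance failover listed first.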