
Kernel Extensions and Device Support Programming Concepts

FCP and iSCSI Error Recovery

If the device is in initiator mode, the error-recovery process varies depending on whether the device supports command queuing. Some devices might also support NACA=1 error recovery. Error recovery therefore involves the following two concepts.

Autosense Data

When a device returns a check condition or command terminated status (the scsi_buf.scsi_status field has the value SC_CHECK_CONDITION or SC_COMMAND_TERMINATED, respectively), it also returns the request sense data.

Note
Subsequent commands to the device will clear the request sense data.

If the device driver has specified a valid autosense buffer (scsi_buf.autosense_length > 0 and the scsi_buf.autosense_buffer_ptr field is not NULL), the adapter device driver copies the returned autosense data into the buffer referenced by scsi_buf.autosense_buffer_ptr and sets the SC_AUTOSENSE_DATA_VALID flag in the scsi_buf.adap_set_flags field.

When the device driver receives a SCSI status of check condition or command terminated (the scsi_buf.scsi_status field has the value SC_CHECK_CONDITION or SC_COMMAND_TERMINATED, respectively), it should check whether the SC_AUTOSENSE_DATA_VALID flag is set in the scsi_buf.adap_set_flags field. If it is, the device driver should process the autosense data and should not send a SCSI request sense command.
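The device-driver check described above can be sketched in C as follows. This is not AIX kernel code: `struct scsi_buf_sketch` is a hypothetical, trimmed-down stand-in for `struct scsi_buf`, the `sense_action` enum and `sense_action_for()` function are invented for illustration, and although the status and flag names come from this section, their numeric values are placeholders rather than the values in the AIX headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: names follow the text, but the values are
 * placeholders, not the definitions in the AIX headers. */
#define SC_CHECK_CONDITION      0x02
#define SC_COMMAND_TERMINATED   0x22
#define SC_AUTOSENSE_DATA_VALID 0x01

struct scsi_buf_sketch {
    unsigned char  scsi_status;          /* SCSI status returned by the device */
    unsigned int   adap_set_flags;       /* flags set by the adapter driver */
    unsigned int   autosense_length;     /* size of the autosense buffer */
    unsigned char *autosense_buffer_ptr; /* buffer the adapter copies sense into */
};

enum sense_action {
    SENSE_NONE,                /* neither check condition nor command terminated */
    SENSE_USE_AUTOSENSE,       /* process the data in autosense_buffer_ptr */
    SENSE_ISSUE_REQUEST_SENSE  /* no valid autosense data: send REQUEST SENSE */
};

enum sense_action sense_action_for(const struct scsi_buf_sketch *sb)
{
    assert(sb != NULL);
    if (sb->scsi_status != SC_CHECK_CONDITION &&
        sb->scsi_status != SC_COMMAND_TERMINATED)
        return SENSE_NONE;
    /* The adapter driver sets this flag only after copying sense data into
     * a valid buffer (autosense_length > 0 and pointer not NULL). */
    if (sb->adap_set_flags & SC_AUTOSENSE_DATA_VALID)
        return SENSE_USE_AUTOSENSE;
    return SENSE_ISSUE_REQUEST_SENSE;
}
```

If the result is SENSE_ISSUE_REQUEST_SENSE while the adapter driver's queue is halted, the request sense must be sent as an error-recovery transaction, as described under Initiator-Mode Recovery During Command Tag Queuing.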

NACA=1 error recovery

Some devices support setting the NACA (Normal Auto Contingent Allegiance) bit to one (NACA=1) in the control byte of the SCSI command. If a device returns a check condition or command terminated status (the scsi_buf.scsi_status field has the value SC_CHECK_CONDITION or SC_COMMAND_TERMINATED, respectively) for a command sent with NACA=1, the device requires a Clear ACA task management request to clear the error condition. The device driver can issue a Clear ACA task management request by sending a transaction with the SC_CLEAR_ACA flag set in the scsi_buf.flags field. The SC_CLEAR_ACA flag can be combined with the SC_Q_CLR or SC_Q_RESUME flag in the scsi_buf.flags field to clear or resume, respectively, the queue of transactions for this device. For more information, see Initiator-Mode Recovery During Command Tag Queuing.
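The Clear ACA request can be sketched in C as follows. This is a hypothetical illustration: the structure, the `aca_queue_action` enum, and the `prep_clear_aca()` helper are invented, the flag values are placeholders rather than the AIX header values, and filling in scsi_id and lun_id for a Clear ACA transaction is an assumption modeled on the SC_Q_CLR and SC_Q_RESUME requirements described later in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder values; the real constants live in the AIX
 * SCSI headers and will differ. */
#define SC_CLEAR_ACA 0x0100
#define SC_Q_CLR     0x0200
#define SC_Q_RESUME  0x0400

struct scsi_buf_sketch {
    unsigned int       flags;   /* stands in for scsi_buf.flags */
    unsigned long long scsi_id; /* target device */
    unsigned long long lun_id;  /* logical unit */
};

enum aca_queue_action {
    ACA_QUEUE_NONE,   /* Clear ACA only */
    ACA_QUEUE_CLEAR,  /* also flush the device's queued transactions */
    ACA_QUEUE_RESUME  /* also resume the device's queued transactions */
};

/* Prepare a Clear ACA task management transaction after a NACA=1 command
 * ended in check condition or command terminated. Setting scsi_id and
 * lun_id here is an assumption, not documented behavior. */
void prep_clear_aca(struct scsi_buf_sketch *sb, unsigned long long scsi_id,
                    unsigned long long lun_id, enum aca_queue_action q)
{
    assert(sb != NULL);
    sb->flags = SC_CLEAR_ACA;
    if (q == ACA_QUEUE_CLEAR)
        sb->flags |= SC_Q_CLR;
    else if (q == ACA_QUEUE_RESUME)
        sb->flags |= SC_Q_RESUME;
    sb->scsi_id = scsi_id;
    sb->lun_id = lun_id;
}
```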

FCP and iSCSI Initiator-Mode Recovery When Not Command Tag Queuing

If an error such as a check condition or hardware failure occurs, the transaction that was active during the error is returned with the scsi_buf.bufstruct.b_error field set to EIO. Other transactions in the queue might be returned with the scsi_buf.bufstruct.b_error field set to ENXIO. If the adapter driver decides not to return the other outstanding commands queued to it, it returns the failed transaction to the device driver with the SC_DID_NOT_CLEAR_Q flag set in the scsi_buf.adap_q_status field, indicating that the queue for this device was not cleared. The device driver should process or recover from the condition, reissuing any mode select commands or device reservations needed to recover properly. After this recovery, it should reschedule the transaction that had the error. In many cases, the device driver only needs to retry the unsuccessful operation.
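The completion handling described above can be sketched as a classification step in the device driver's iodone routine. The names below (`classify_completion()`, `enum dd_action`, and the `_sketch` structures) are hypothetical, the B_ERROR and SC_DID_NOT_CLEAR_Q values are placeholders rather than the AIX definitions, and EIO and ENXIO come from the standard `errno.h`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder values, not the AIX header definitions. */
#define B_ERROR            0x0004
#define SC_DID_NOT_CLEAR_Q 0x01

struct buf_sketch { int b_flags; int b_error; };
struct scsi_buf_sketch {
    struct buf_sketch bufstruct;     /* stands in for scsi_buf.bufstruct */
    unsigned int      adap_q_status; /* stands in for scsi_buf.adap_q_status */
};

enum dd_action {
    DD_DONE,           /* B_ERROR not set */
    DD_RECOVER,        /* EIO; other queued commands may come back with ENXIO */
    DD_RECOVER_Q_HELD, /* EIO with SC_DID_NOT_CLEAR_Q: queue not cleared */
    DD_REQUEUE,        /* ENXIO: flushed; reissue after recovery */
    DD_OTHER_ERROR     /* any other errno value */
};

enum dd_action classify_completion(const struct scsi_buf_sketch *sb)
{
    assert(sb != NULL);
    if (!(sb->bufstruct.b_flags & B_ERROR))
        return DD_DONE;
    if (sb->bufstruct.b_error == ENXIO)
        return DD_REQUEUE;
    if (sb->bufstruct.b_error == EIO)
        /* Recover (mode selects, reservations), then reschedule this one. */
        return (sb->adap_q_status & SC_DID_NOT_CLEAR_Q) ? DD_RECOVER_Q_HELD
                                                        : DD_RECOVER;
    return DD_OTHER_ERROR;
}
```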

The adapter device driver should never retry a SCSI command on error after the command has been successfully given to the adapter. The consequences of retrying a command at this point range from minimal to catastrophic, depending on the type of device. Commands for certain devices, such as tapes and other sequential access devices, cannot be retried immediately after a failure. If such an error occurs, the failed command is returned to the device driver with an appropriate error status through an iodone call, and the device driver performs the error recovery. Only the device driver that originally issued the command knows whether the command can be retried on the device. The adapter device driver must retry only commands that were never successfully transferred to the adapter. In this case, if the retries are successful, the scsi_buf status should not reflect an error. However, the adapter device driver should log the retried condition.
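The retry rule for adapter device drivers reduces to a single predicate. In this hypothetical sketch, `accepted_by_adapter` records whether the command was successfully given to the adapter; the retry limit MAX_XFER_RETRIES is an assumption (the text does not specify one), and fprintf() merely stands in for the driver's real error-logging call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-command bookkeeping kept by an adapter device driver. */
struct adap_cmd_sketch {
    bool accepted_by_adapter; /* true once the adapter has taken the command */
    int  retries;             /* transfer retries already attempted */
};

#define MAX_XFER_RETRIES 3    /* hypothetical limit, not from the text */

/* Returns true if the adapter device driver may retry this command.
 * A command the adapter already accepted is never retried: it is
 * returned through iodone so the issuing device driver decides. */
bool adapter_may_retry(struct adap_cmd_sketch *cmd)
{
    assert(cmd != NULL);
    if (cmd->accepted_by_adapter)
        return false;
    if (cmd->retries >= MAX_XFER_RETRIES)
        return false;
    cmd->retries++;
    /* Log the retried condition even if the retry later succeeds;
     * a real driver would use its error-logging facility instead. */
    fprintf(stderr, "adapter: retrying transfer, attempt %d\n", cmd->retries);
    return true;
}
```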

The first transaction passed to the adapter device driver during error recovery must include a special flag. This SC_RESUME flag in the scsi_buf.flags field must be set to inform the adapter device driver that the device driver has recognized the fatal error and is beginning recovery operations. Any transactions passed to the adapter device driver, after the fatal error occurs and before the SC_RESUME transaction is issued, should be flushed; that is, returned with an error type of ENXIO through an iodone call.

Note
If a device driver continues to pass transactions to the adapter device driver after the adapter device driver has flushed the queue, these transactions are also flushed with an error return of ENXIO through the iodone service. This gives the device driver a positive indication of all transactions flushed.
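The flushing behavior described above amounts to a per-device gate in the adapter device driver. A minimal sketch, assuming a hypothetical per-device `in_error` state and a placeholder SC_RESUME value: once a fatal error is recorded, every incoming transaction without SC_RESUME is rejected with ENXIO (the caller would then complete it through iodone), and the first SC_RESUME transaction reopens the device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SC_RESUME 0x0008  /* placeholder value, not the AIX definition */

/* Hypothetical per-device state kept by the adapter device driver. */
struct dev_state_sketch { bool in_error; };

/* Called when a fatal error occurs on the device. */
void adapter_fatal_error(struct dev_state_sketch *dev)
{
    assert(dev != NULL);
    dev->in_error = true;
}

/* Returns 0 if the transaction may be queued to the device, or ENXIO if
 * it must be flushed back to the device driver through iodone. */
int adapter_accept(struct dev_state_sketch *dev, unsigned int sb_flags)
{
    assert(dev != NULL);
    if (!dev->in_error)
        return 0;
    if (sb_flags & SC_RESUME) {
        dev->in_error = false; /* device driver has begun recovery */
        return 0;
    }
    return ENXIO;              /* flushed, however many arrive */
}
```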

Initiator-Mode Recovery During Command Tag Queuing

If the device driver is queuing multiple transactions to the device and either a check condition error or a command terminated error occurs, the adapter driver does not clear all transactions in its queues for the device. It returns the failed transaction to the device driver with an indication that the queue for this device is not cleared by setting the SC_DID_NOT_CLEAR_Q flag in the scsi_buf.adap_q_status field. The adapter driver halts the queue for this device awaiting error recovery notification from the device driver. The device driver then has three options to recover from this error:

  1. When the adapter driver's queue is halted, the device driver can get sense data from a device by setting the SC_RESUME flag in the scsi_buf.flags field and the SC_NO_Q flag in the scsi_buf.q_tag_msg field of the request-sense scsi_buf. This notifies the adapter driver that this is an error-recovery transaction that should be sent to the device while the remainder of the queue for the device remains halted. When the request sense completes, the device driver must either clear or resume the adapter driver's queue for this device.

  2. The device driver can notify the adapter driver to clear its halted queue by sending a transaction with the SC_Q_CLR flag set in the scsi_buf.flags field. This transaction must not contain a command, because it is cleared from the adapter driver's queue without being sent to the adapter. However, it must have the SCSI ID field (scsi_buf.scsi_id) and the LUN field (scsi_buf.lun_id) filled in with the device's SCSI ID and logical unit number (LUN), respectively. Upon receiving an SC_Q_CLR transaction, the adapter driver flushes all transactions for this device and sets their scsi_buf.bufstruct.b_error fields to ENXIO. The device driver must wait until the scsi_buf with the SC_Q_CLR flag set is returned before it resumes issuing transactions. The first transaction that the device driver sends after it receives the returned SC_Q_CLR transaction must have the SC_RESUME flag set in the scsi_buf.flags field.

  3. If the device driver wants the adapter driver to resume its halted queue, it must send a transaction with the SC_Q_RESUME flag set in the scsi_buf.flags field. This transaction can contain an actual command, but is not required to. However, it must have the SCSI ID field (scsi_buf.scsi_id) and the LUN field (scsi_buf.lun_id) filled in with the device's SCSI ID and LUN. If this is the first transaction that the device driver issues after receiving the error (that is, while the adapter driver's queue is halted), both the SC_RESUME and SC_Q_RESUME flags must be set.
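The three recovery transactions can be prepared with small helpers such as the following. This is a hypothetical sketch: the helper names, the `has_command` field, and the flag values are invented for illustration; only the flag and field names come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values for the names used in the text. */
#define SC_RESUME   0x0008
#define SC_Q_CLR    0x0200
#define SC_Q_RESUME 0x0400
#define SC_NO_Q     0x00   /* q_tag_msg value: untagged */

struct scsi_buf_sketch {
    unsigned int       flags;       /* stands in for scsi_buf.flags */
    unsigned char      q_tag_msg;   /* stands in for scsi_buf.q_tag_msg */
    unsigned long long scsi_id;     /* stands in for scsi_buf.scsi_id */
    unsigned long long lun_id;      /* stands in for scsi_buf.lun_id */
    bool               has_command; /* hypothetical: a CDB is attached */
};

/* Option 1: mark a request sense as the error-recovery transaction that
 * may pass the halted queue. */
void prep_recovery_sense(struct scsi_buf_sketch *sb)
{
    assert(sb != NULL);
    sb->flags |= SC_RESUME;
    sb->q_tag_msg = SC_NO_Q;
}

/* Option 2: ask the adapter driver to flush the halted queue. No command
 * may be attached. After it returns, the next transaction needs SC_RESUME. */
void prep_q_clr(struct scsi_buf_sketch *sb, unsigned long long scsi_id,
                unsigned long long lun_id)
{
    assert(sb != NULL);
    sb->flags = SC_Q_CLR;
    sb->has_command = false;
    sb->scsi_id = scsi_id;
    sb->lun_id = lun_id;
}

/* Option 3: ask the adapter driver to resume the halted queue. A command
 * may be attached; SC_RESUME is added if this is the first transaction
 * since the error. */
void prep_q_resume(struct scsi_buf_sketch *sb, unsigned long long scsi_id,
                   unsigned long long lun_id, bool first_after_error)
{
    assert(sb != NULL);
    sb->flags |= SC_Q_RESUME;
    if (first_after_error)
        sb->flags |= SC_RESUME;
    sb->scsi_id = scsi_id;
    sb->lun_id = lun_id;
}
```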

Analyzing Returned Status

Device drivers should analyze the returned status in the following order of precedence:

  1. If the scsi_buf.bufstruct.b_flags field has the B_ERROR flag set, then an error has occurred and the scsi_buf.bufstruct.b_error field contains a valid errno value.

    If the b_error field contains the ENXIO value, either the command needs to be restarted or it was canceled at the request of the device driver.

    If the b_error field contains the EIO value, then at most one flag is set in the scsi_buf.status_validity field. If a flag is set, the cause is an error reported in either the scsi_status or the adapter_status field.

    If the status_validity field is 0, then the scsi_buf.bufstruct.b_resid field should be examined to see if the command issued was in error. The b_resid field can have a value without an error having occurred. To decide whether an error has occurred, the device driver must evaluate this field with regard to the command being sent and the device being driven.

    If SC_CHECK_CONDITION or SC_COMMAND_TERMINATED is set in scsi_status, the device driver must examine scsi_buf.adap_set_flags to determine whether the device returned autosense data.

    If the SC_AUTOSENSE_DATA_VALID flag is set in the scsi_buf.adap_set_flags field, the device returned autosense data in the buffer referenced by scsi_buf.autosense_buffer_ptr. In this situation, the device driver does not need to issue a SCSI request sense to determine the appropriate error recovery for the device.

    If the device driver is queuing multiple transactions to the device and if either SC_CHECK_CONDITION or SC_COMMAND_TERMINATED is set in scsi_status, then the value of scsi_buf.adap_q_status must be analyzed to determine if the adapter driver has cleared its queue for this device. If the adapter driver has not cleared its queue after an error, then it holds that queue in a halted state.

    If scsi_buf.adap_q_status is set to 0, the adapter driver has cleared its queue for this device and any transactions outstanding are flushed back to the device driver with an error of ENXIO.

    If the SC_DID_NOT_CLEAR_Q flag is set in the scsi_buf.adap_q_status field, the adapter driver has not cleared its queue for this device. When this condition occurs, the adapter driver allows the device driver to send one error-recovery transaction (request sense) that has the scsi_buf.q_tag_msg field set to SC_NO_Q and the SC_RESUME flag set in the scsi_buf.flags field. The device driver can then notify the adapter driver to clear or resume its queue for the device by sending an SC_Q_CLR or SC_Q_RESUME transaction.

    If the device driver does not queue multiple transactions to the device (that is, SC_NO_Q is set in scsi_buf.q_tag_msg), the adapter clears its queue on error and sets scsi_buf.adap_q_status to 0.

  2. If the scsi_buf.bufstruct.b_flags field does not have the B_ERROR flag set, then no error is being reported. However, the device driver should examine the b_resid field to check for cases where less data was transferred than expected. For some commands, this occurrence might not represent an error. The device driver must determine if an error has occurred.

    If a nonzero b_resid field does represent an error condition, then the device queue is not halted by the adapter device driver. It is possible for one or more succeeding queued commands to be sent to the adapter (and possibly the device). Recovering from this situation is the responsibility of the device driver.

  3. In any of the above cases, if the scsi_buf.bufstruct.b_flags field has the B_ERROR flag set, the queue of the device in question has been halted. The first scsi_buf structure sent to recover from the error (or to continue operations) must have the SC_RESUME bit set in the scsi_buf.flags field.
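The order of precedence above can be condensed into one analysis function that a device driver's iodone routine might call. The sketch is hypothetical: the structures, `struct status_analysis`, and `analyze_status()` are invented for illustration, the flag values are placeholders rather than the AIX header values, and the follow-up decisions (how to treat a residual count, which recovery transaction to send) remain device specific.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder values; not the AIX header definitions. */
#define B_ERROR                 0x0004
#define SC_CHECK_CONDITION      0x02
#define SC_COMMAND_TERMINATED   0x22
#define SC_AUTOSENSE_DATA_VALID 0x01  /* in adap_set_flags */
#define SC_DID_NOT_CLEAR_Q      0x01  /* in adap_q_status */

struct buf_sketch { int b_flags; int b_error; long b_resid; };
struct scsi_buf_sketch {
    struct buf_sketch bufstruct;
    unsigned char     status_validity;
    unsigned char     scsi_status;
    unsigned int      adap_set_flags;
    unsigned int      adap_q_status;
};

struct status_analysis {
    int error;             /* 0, or the errno from b_error */
    int check_resid;       /* driver must judge b_resid for this command */
    int contingent;        /* check condition or command terminated */
    int autosense_valid;   /* sense data already in the autosense buffer */
    int queue_not_cleared; /* SC_DID_NOT_CLEAR_Q: adapter queue held */
    int need_resume;       /* rule 3: next transaction needs SC_RESUME */
};

struct status_analysis analyze_status(const struct scsi_buf_sketch *sb)
{
    struct status_analysis a = {0};
    assert(sb != NULL);

    if (!(sb->bufstruct.b_flags & B_ERROR)) {
        /* Rule 2: no error reported, but a short transfer may matter. */
        a.check_resid = (sb->bufstruct.b_resid != 0);
        return a;
    }

    /* Rule 1: b_error holds a valid errno value. */
    a.error = sb->bufstruct.b_error;
    a.need_resume = 1;  /* rule 3: the device queue has been halted */
    if (a.error == EIO) {
        if (sb->status_validity == 0)
            a.check_resid = 1;
        a.contingent = (sb->scsi_status == SC_CHECK_CONDITION ||
                        sb->scsi_status == SC_COMMAND_TERMINATED);
        if (a.contingent) {
            a.autosense_valid =
                (sb->adap_set_flags & SC_AUTOSENSE_DATA_VALID) != 0;
            a.queue_not_cleared =
                (sb->adap_q_status & SC_DID_NOT_CLEAR_Q) != 0;
        }
    }
    return a;
}
```

Returning a set of independent indicators, rather than a single verdict, keeps each rule from the list visible and leaves the device-specific choices to the caller.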
