Equipment & System Failure - Data Centre

By Samit Banerjee
02-04-2019

Equipment and System Failure

Equipment and systems form the basis for monitoring and performing complicated tasks related to calculation, data storage and information management. Organisations use data centers to build large networks of servers for storing, processing and managing huge amounts of data. These data centers give the employees, users, clients and customers of the organisation access to its data and information from anywhere in the world at any time.

 

Equipment and system failure can range from the failure of a single system to the failure of a complete network of systems, such as the data center described earlier. It results in loss of data and information, along with wasted time and resources.

 

Fault tolerance is the property of a system that allows it to continue operating properly when one or more of its components fail. The same definition applies to the individual components of a system. Fault tolerance is generally required in systems that need high availability or are life-critical.

 

 

How to provide fault tolerance for a single system?

Be it a single system or a network of systems, equipment and system failure causes great loss. A system with the fault tolerance property is protected against loss of data, information and resources.

 

A system with a fault-tolerant design continues to perform, possibly with reduced efficiency, rather than failing completely when an error or failure arises in any of its parts. Such a system keeps operating even when software or hardware fails.

 

Fault tolerance of an individual system can be achieved by anticipating exceptional conditions and designing the system to stabilize itself, so that it automatically moves back towards an error-free state.

 

If the cause of the failure is serious and unavoidable, measures such as duplication, returning the system to a safe mode, or rollback recovery can help the system recover from the failure.
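
Rollback recovery can be illustrated with a minimal Python sketch. The class CheckpointedCounter and its fields are hypothetical names chosen for this example, not part of any real library: the component saves a copy of its state before each update and restores that copy if the update fails partway through.

```python
import copy


class CheckpointedCounter:
    """Hypothetical component that checkpoints its state before each update."""

    def __init__(self):
        self.state = {"total": 0, "count": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def apply(self, value):
        # Save a known-good copy before touching the live state.
        self._checkpoint = copy.deepcopy(self.state)
        try:
            self.state["count"] += 1
            self.state["total"] += int(value)  # raises on bad input
        except (TypeError, ValueError):
            # Roll back so that no partial update survives the failure.
            self.state = copy.deepcopy(self._checkpoint)
            return False
        return True


counter = CheckpointedCounter()
results = [counter.apply(v) for v in (5, "oops", 7)]
print(results, counter.state)
```

The second update fails after "count" has already been incremented, and the rollback undoes that partial change, so only the two valid updates are reflected in the final state.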

 

To provide fault tolerance to individual systems, the following properties must be incorporated in them:

a. There should be no single point of failure. No component should exist whose failure alone brings down the entire system.

b. Closely related is the idea of no single point of repair. The system continues to operate while any failed part is being repaired.

c. The system must be able to identify and isolate the fault. It must detect the faulty component and prevent it from interfering with the normal operation and efficiency of the rest of the system.

d. Failure of one element of a system may cause successive collateral damage to other elements, and this chain of damage can lead to failure of the whole system. Fault containment prevents a fault from spreading in this way.

e. The system must be able to vary its mode of operation in a controlled way. When an element fails, the system changes the scope and pattern of its operation so that its essential functions continue. This is sometimes referred to as a reversion state, fallback or limp-along operation.
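
Several of these properties can be sketched together in a few lines of Python. The names below (ReplicatedService, broken, healthy, degraded) are hypothetical and exist only for this illustration: redundant replicas avoid a single point of failure, a failed replica is isolated so that it is not called again, and a reduced-function fallback takes over when no replica is left.

```python
class ReplicatedService:
    """Hypothetical wrapper that tries redundant replicas in order."""

    def __init__(self, replicas, fallback):
        self.replicas = list(replicas)  # redundancy: no single point of failure
        self.isolated = set()           # indices of replicas marked as faulty
        self.fallback = fallback        # reduced-function "limp-along" mode

    def call(self, request):
        for index, replica in enumerate(self.replicas):
            if index in self.isolated:
                continue  # fault isolation: skip replicas known to be bad
            try:
                return replica(request)
            except Exception:
                # Fault containment: record the failure and try the next one.
                self.isolated.add(index)
        # Every replica has failed, so revert to the fallback mode.
        return self.fallback(request)


def broken(request):
    raise RuntimeError("disk failure")


def healthy(request):
    return f"processed {request}"


def degraded(request):
    return f"queued {request}"


service = ReplicatedService([broken, healthy], fallback=degraded)
first = service.call("job-1")   # broken fails, healthy answers
second = service.call("job-2")  # broken is skipped this time
third = ReplicatedService([broken], fallback=degraded).call("job-3")
print(first, second, third, sep="\n")
```

This is only one possible arrangement. A real system would also need a way to bring a repaired replica back into service, which corresponds to property b above.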

 

How to provide fault tolerance to a data center?

A data center always handles huge amounts of data for storage, processing and transfer. Equipment and system failure can occur for the following reasons:

a. Overheating of system,

b. Incorrect cabling layout,

c. Human error at the time of installation and maintenance,

d. Absence of proper monitoring tools.

 

To achieve fault tolerance in a data center, the following methods can be used:

a. Provide proper technical and analytical training to the maintenance staff of the data center to prevent failure due to human error.

b. Data server monitoring utilities should be used to watch the data center for problems. They also track system temperature and help prevent faults due to overheating.

c. Cabling layouts must be simple and easy to trace. This prevents faults caused by incorrect cabling.

d. Take preventive and precautionary measures to maintain the data center. These include 24/7 monitoring and analysis of the data center and recruitment of technically competent staff for its installation, monitoring, operation and maintenance.
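
The temperature monitoring mentioned in point b can be sketched in Python as follows. This is only an illustration under stated assumptions: the Reading type, the server names, the 75 degree threshold and the three-reading window are all hypothetical values chosen for the example, not figures from any monitoring product. A server is reported only when its last three readings all reach the threshold, so that a single spike does not raise an alert.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    server: str
    celsius: float


def overheating(readings, warn_at=75.0, window=3):
    """Return servers whose last `window` readings all reach `warn_at`."""
    history = {}
    for reading in readings:
        history.setdefault(reading.server, []).append(reading.celsius)
    alerts = []
    for server, temps in history.items():
        recent = temps[-window:]
        # Require a full window of hot readings before raising an alert.
        if len(recent) == window and all(t >= warn_at for t in recent):
            alerts.append(server)
    return sorted(alerts)


# Hypothetical sample data, in the order the readings arrived.
samples = [
    Reading("rack1-a", 62.0), Reading("rack1-b", 76.5),
    Reading("rack1-a", 64.0), Reading("rack1-b", 78.0),
    Reading("rack1-a", 80.0), Reading("rack1-b", 79.5),
]
hot = overheating(samples)
print(hot)
```

Here "rack1-a" has one hot reading and is not reported, while "rack1-b" stays above the threshold for the whole window and is.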

 


 

 

 


