
Reliability Centered Maintenance

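This topic gives a brief overview of Reliability Centered Maintenance and covers reliability characteristics, the effects of failures and system performance.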

Under extreme pressure to reduce the costs of "doing business", there was a period during which economists targeted maintenance as a cost to be minimized. Maintenance personnel, without an argument or the science to defend their "black art", were denied the money and resources to do what they believed was needed.

In the early 1970s, under the auspices of the US Air Transport Association, a committee was formed to try to rationalize the spiraling costs and ever-increasing requirements for aircraft maintenance. The committee's second report (the second report of the Maintenance Steering Group) presented a concept that became known as MSG 2. Oversimplified, it espoused that designers must design for easy maintenance (maintainability) and that maintenance requirements should have a basis in traceable logic, not personal experience and opinion.

A couple of major aircraft manufacturers accepted the philosophy, and military aircraft operators began to seek maintainability criteria in their purchase agreements. Life Cycle Costs became a catchword, closely followed by Design to Life Cycle Costs. Soon thereafter, the US Navy, the Royal Air Force and the Royal Australian Air Force instituted programs to implement this maintenance approach, with great success.

Today, maintenance personnel are able to determine their lowest long-term cost strategies and prove their argument. This development has been termed by one learned expert as "The Graying of a Black Art". Maintenance can now be scientifically and logically justified.

The most widely accepted term for this science is Reliability Centered Maintenance (RCM) and indeed that is what it is - the determination of what maintenance will provide the required reliability.

RCM can be applied at the design stage. This is ideal, but because the science is relatively new and because it means designers have to do more (i.e. more cost), few designers design for minimal maintenance cost. However, whatever the design, maintenance personnel can make sure they get the best out of what they have for the least cost.

Reliability Characteristics

Reliability: "The probability that an item will perform a specified function under specified conditions for a specified length of time"

There are three factors relating to reliability characteristics:

  • The failure pattern experienced by the item
  • The failure rate for the item
  • The failure progression rate.

Failure Pattern

Most items begin to wear or deteriorate either from the time they are new, or following some maintenance action designed to restore the item to a serviceable condition. This wear or deterioration will eventually lead to failure of the item, i.e. the item will cease to operate at a specified level of performance. Every item of functional equipment follows a characteristic failure pattern. This failure pattern can be determined mathematically; however, the data required to do so is generally not available for most items. Instead, it is necessary to establish this characteristic in general terms, which is adequate for most maintenance analysis purposes. Most items follow one of three general types of failure:

  • Early life failure
  • Random failure
  • Wear out failure.

Early Life Failures

These failures occur in the early life of the item or after overhauls and are often referred to as "infant mortality". They are usually caused by poor design, faulty materials, manufacturing defects or incorrect assembly. The failure rate drops rapidly as the length of time in service increases and the item enters its useful life.

Random Failures

A large number of individual failure modes are possible and the occurrence of any particular failure mode is independent of total time in operation. The failure pattern is random in that a consistent cause of failure cannot be established nor can the time that a particular item will last between failures be predicted. Such items do not "wear out" in the short term, even though some degree of general deterioration will eventually occur, often at an interval that is greater than the useful life of the item. Most electronic items follow a random failure pattern and a light bulb is a classic example. Some factors that affect random failures are:

  • Insufficient/inadequate design margins
  • Incorrect application
  • Improper environment use
  • Intermittent failures.

For items that follow a random failure pattern it is rarely appropriate to prescribe scheduled maintenance in any form. A preventive task cannot be identified, since a constant cause of failure, that can be prevented or delayed by maintenance, does not exist. Similarly, a condition-monitoring task cannot normally be effective in predicting or detecting that a failure is about to occur. For items that exhibit this characteristic, corrective maintenance following failure is usually the only appropriate course of action.

Wear Out Failures

Many items, particularly those of mechanical construction, follow a distinct wear pattern. This is usually attributable to relative motion in some form, e.g. rubbing, turning or alternating stresses. This wear pattern is usually predictable, may occur over any time span, and may be repeated a number of times during the useful life of an item, with some form of restoration between successive wear out cycles. Some items within a group of items exhibiting this wear out characteristic still experience random failures before actual wear out takes place. However, for maintenance purposes, the existence of an underlying wear out characteristic is of primary importance. Some wear out conditions that affect failures are:

  • Material wear
  • Scratching
  • Corrosion
  • Fatigue
  • Stress
  • Inadequate or improper maintenance
  • Misalignment.

Items that exhibit a distinct wear pattern are candidates for some form of preventive maintenance, since a particular cause of failure exists which may be prevented or delayed by a maintenance action. Alternatively, a specific condition-monitoring task may be employed to establish the condition of the item, followed by a preventive task if necessary. In either case, the task is potentially effective in delaying or preventing failure of the item.
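
The link between failure pattern and maintenance approach described above can be summarized in a small lookup. This is only a sketch: the names FailurePattern and candidate_strategy are hypothetical, and the text prescribes no task for early life failures, so that branch simply flags the causes listed earlier for review.

```python
from enum import Enum


class FailurePattern(Enum):
    EARLY_LIFE = "early life"
    RANDOM = "random"
    WEAR_OUT = "wear out"


def candidate_strategy(pattern: FailurePattern) -> str:
    """Return the maintenance approach the text associates with a pattern."""
    if pattern is FailurePattern.RANDOM:
        # No consistent cause of failure exists to prevent or to monitor.
        return "corrective maintenance after failure"
    if pattern is FailurePattern.WEAR_OUT:
        # A particular cause exists that a task can delay or prevent.
        return "preventive maintenance or condition monitoring"
    # Assumption: the text prescribes no task here, so flag the causes.
    return "no task prescribed; review design, materials and assembly"


for pattern in FailurePattern:
    print(pattern.value, "->", candidate_strategy(pattern))
```

A real analysis would also weigh the effect of the failure and the cost of the task before anything is added to a maintenance program.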

Failure Rate

One of the characteristics of a particular type of item is the rate at which failure occurs. The failure rate may be expressed in a number of forms; however, the most common indicator is the Mean Time Between Failures (MTBF), which is defined as the mean or average life achieved by a group of items of a certain type between maintenance and repair.

The MTBF is of little direct significance in determining whether any maintenance should be performed on an item, since it only provides an indication of the level of reliability being achieved, and has no necessary relationship to the failure pattern.
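
A small numerical illustration of this point, using made-up data: two hypothetical item types can achieve exactly the same MTBF while their lives are distributed very differently. The figures and the helper name mtbf are assumptions for illustration only.

```python
from statistics import mean, pstdev

# Hypothetical operating hours between failures for two item types.
item_a = [480, 510, 495, 505, 510]  # lives tightly grouped
item_b = [50, 1200, 300, 900, 50]   # lives widely scattered


def mtbf(hours_between_failures):
    """Mean time between failures as a plain average of observed lives."""
    return mean(hours_between_failures)


for name, lives in (("A", item_a), ("B", item_b)):
    print(name, "MTBF:", mtbf(lives), "spread:", round(pstdev(lives), 1))
```

Both lists average 500 hours, yet only the tightly grouped one hints at a predictable life. The MTBF figure alone does not distinguish the two.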

Consequently, the fact that an item exhibits a low MTBF does not necessarily imply that some form of scheduled maintenance is warranted. It may simply indicate low inherent reliability, or inherent reliability degraded by some ineffective maintenance action or bad operation.

Where an item exhibits a high MTBF, i.e. it is very reliable, and the high reliability cannot be attributed to an existing effective maintenance action, it may not be efficient to perform any scheduled maintenance on the item.

Maintenance frequency should never be set at achieved MTBF. The Bath Tub Curve shown below is derived from the sum of the failure curves for a number of failure types:
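
The idea of summing failure curves can be sketched numerically. The Weibull form of each curve and all parameter values below are assumptions chosen for illustration; the text does not say which mathematical form the individual curves take.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate at time t (t > 0), in failures per hour."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_rate(t):
    """Sum of three hypothetical failure curves at time t (hours)."""
    early = weibull_hazard(t, shape=0.5, scale=2000.0)    # falling rate
    random_ = weibull_hazard(t, shape=1.0, scale=1000.0)  # constant rate
    wear = weibull_hazard(t, shape=5.0, scale=3000.0)     # rising rate
    return early + random_ + wear


for hours in (10, 1500, 4000):
    print(hours, round(bathtub_rate(hours), 5))
```

With these values the combined rate is high at 10 hours, lower at 1500 hours and high again at 4000 hours, which is the bath tub shape.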

[Figure: Bath Tub Curve]

Failure Progression Rate

The rate at which a failure develops from the time that an unsatisfactory condition could first be detected by some form of condition monitoring to the time that complete failure occurs is referred to as the "progression rate" (PR).

This is significant in determining whether specific condition monitoring could be effective, and as a guide to determining an appropriate inspection interval.

Although an accurate PR cannot usually be defined, certain classes of items exhibit characteristically high or low PRs. For example, the impending failure of electronic items is normally very difficult to detect and in many cases occurs almost instantaneously, i.e. the PR is high.

Consequently, any form of scheduled condition monitoring to detect failures of electronic items will normally be ineffective.

Corrosion normally develops over a long period, i.e. the PR is low, and a scheduled condition-monitoring task to detect corrosion is potentially effective.

An estimate of PR may also be used as a guide to the frequency with which a task should be performed. In general, if failure takes place over a short interval, i.e. the PR is high, frequent condition monitoring would be necessary to provide a reasonable chance of detecting an impending failure. Conversely, if the PR is relatively low, i.e. the failure develops over a long period, a longer interval can be tolerated to detect impending failures.
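
One hedged way to turn a PR estimate into an inspection interval is sketched below. It assumes the PR is expressed as a lead time (hours between the first detectable deterioration and complete failure) and that the analyst wants a chosen number of inspection opportunities within that window. Both the function name and the rule itself are illustrative assumptions, not a method given in the text.

```python
def inspection_interval(lead_time_hours, opportunities=2):
    """Interval giving `opportunities` inspections within the lead time."""
    if lead_time_hours <= 0 or opportunities < 1:
        raise ValueError("lead time must be positive, opportunities >= 1")
    return lead_time_hours / opportunities


# Hypothetical lead times: slow corrosion versus a fast electronic failure.
print(inspection_interval(2000))        # low PR: long interval tolerable
print(inspection_interval(0.01, 2))     # high PR: interval is impractical
```

The second case shows why scheduled condition monitoring is normally ineffective when failure is almost instantaneous: the interval needed is too short to be practical.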

There are two methods of component configuration that affect plant reliability:

  • Series connection components
  • Parallel connection components.

Series connected components

In this situation both components must function for the system to function. Assuming that the failure characteristics of each component are not influenced by the other, the survival probability (P) for the system for a given time (t) is the product of the survival probabilities P1(t) and P2(t) of each component in the system. For a system of n components in series:

P(t) = P1(t) P2(t) ... Pn(t)
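
The series formula can be sketched in a few lines. The function name series_survival and the example probabilities are illustrative assumptions, and the components are taken to fail independently, as stated above.

```python
from math import prod


def series_survival(probabilities):
    """Survival probability of independent components connected in series."""
    return prod(probabilities)


# Two hypothetical components with survival probabilities 0.9 and 0.8.
print(series_survival([0.9, 0.8]))
```

Because every factor is at most 1, adding a component in series can never raise the system survival probability.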

Parallel-connected components

This system fails only if both components fail - the probability that both will fail (F) before time (t) is derived from the product of the two separate failure probabilities.

F(t) = F1(t) F2(t)

System survival probability is then

P(t) = 1 - F1(t) F2(t)
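
A matching sketch for the two-component parallel case, again assuming independent failures. The name parallel_survival and the example values are hypothetical.

```python
def parallel_survival(p1, p2):
    """Survival probability of two independent components in parallel."""
    f1 = 1 - p1  # failure probability of component 1
    f2 = 1 - p2  # failure probability of component 2
    return 1 - f1 * f2


# The same hypothetical components as a parallel pair.
print(parallel_survival(0.9, 0.8))
```

For components with survival probabilities 0.9 and 0.8 the parallel arrangement gives about 0.98, compared with 0.72 when the same two components are connected in series.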

If a system is designed for a specific level of reliability, the time required for maintenance can be calculated. The cost of achieving a given level of AVAILABILITY can be predicted.

Example: Required availability (Av) = 0.95

Predicted mean time to system failure (MTTF) = 250 HRS

Mean time to repair = M (to be determined)

Availability (Av) = MTTF / (MTTF + M)

Av x (MTTF + M) = MTTF

Av x MTTF + Av x M = MTTF

Av x M = MTTF - Av x MTTF = MTTF x (1 - Av)

i.e. M = MTTF x (1 - Av) / Av

If Av = 0.95 and MTTF = 250 HRS, then

M = 250 x (1 - 0.95) / 0.95 = 12.5 / 0.95 = 13.2 HRS (approximately)
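
The rearranged formula M = MTTF x (1 - Av) / Av can be checked numerically. The function name allowable_repair_time is a hypothetical label for this sketch.

```python
def allowable_repair_time(mttf_hours, availability):
    """Mean time to repair that yields the required availability."""
    if not 0 < availability <= 1:
        raise ValueError("availability must be in the range (0, 1]")
    return mttf_hours * (1 - availability) / availability


m = allowable_repair_time(250, 0.95)
print(round(m, 2))               # mean time to repair, in hours
print(round(250 / (250 + m), 2)) # substituting back gives the availability
```

With MTTF = 250 hours and Av = 0.95 this evaluates to about 13.16 hours, and substituting that value back into MTTF / (MTTF + M) returns the required 0.95.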

Effects of Failures

The effect of every failure must be evaluated to determine its potential effect on:

  • Safety
  • Environment
  • Operation
  • Damage.

If the failure cannot be detected or prevented by an effective maintenance task then the inclusion of a task in the maintenance program cannot be justified regardless of its effect. The solution lies in redesign or modification.

Effect on Safety

The effect on personnel, occupational health and safety and equipment must always be considered. While safety may be adversely affected by material failures, a direct relationship can only exist if the failure of an item has clear safety implications.

In many cases, failure of an item does not affect safety in any way, and in such cases there is no relationship between the reliability of the equipment and the operating safety of the plant.

Consequently, while the potential effect of material failures on safety must always be considered during development of maintenance requirements, a fixed relationship should not be assumed. Instead, the effect of failure must be established in each case before a valid relationship between the reliability of an item and safety can be identified.

Environmental and Operation Effect

The relationship between the ability of the plant to meet environmental standards and to perform its assigned function and the reliability of items of equipment fitted is very similar to that already defined for “safety” and “reliability”.

If failure of an item does not have a direct adverse effect on the environment or operation, there is no direct relationship between the reliability of the item and plant reliability.

Other Damage

The resultant damage to other plant and equipment caused by a failure is a factor when considering effective maintenance. Items whose failure will cause significant damage should be considered for inclusion in the maintenance program. In this case, the cost implications must also be considered.

System Performance

One of the key impacts on system performance in the electrical and telecommunications industry has been the introduction of automatic reclosers or redundancy switchovers.

In the case of the electricity industry, the reclosers open after a major failure that affects a large area, then automatically re-close for the intact assets, so that only the immediate area affected by the failure remains isolated.

In most cases, this glitch or system failure lasts less than 2 seconds and averages less than 1 second. Unless it occurs at night or people are actually using appliances, the failure may go totally unnoticed except for its impact on electronic equipment. With many households now owning many items of electronic equipment that contain electronic clocks or memories, this minor glitch can have a significant impact on those items.

In some instances, it can take a single person 30 minutes to reprogram all equipment, so these minute system failures can have a significant customer impact.

Many small businesses now operate computer-controlled equipment that is also affected by this type of failure, with costly consequences.

Although complaints are improving in this area, few electricity distributors are able to keep records or accounts of the number of these activities or the size of the customer base affected. These anomalous failures are therefore not reported in customer minutes off supply (CMOS) type indicators.
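
To give a feel for the scale involved, the sketch below compares the supply time lost in a momentary interruption with the reprogramming time it can cause. It assumes that customer minutes off supply is the number of customers affected multiplied by the outage duration in minutes, and it uses a hypothetical 10,000 customers with the worst case of 30 minutes of reprogramming each.

```python
def customer_minutes_off_supply(customers, outage_seconds):
    """Customers affected multiplied by the outage duration in minutes."""
    return customers * outage_seconds / 60


customers = 10_000  # hypothetical number of customers on the affected area
glitch = customer_minutes_off_supply(customers, outage_seconds=1)
reprogramming = customers * 30  # worst case: 30 minutes per customer

print(round(glitch, 1), "customer minutes off supply")
print(reprogramming, "customer minutes spent reprogramming")
```

Under these assumptions the customer effort is several orders of magnitude larger than the supply time lost, which is why an indicator based on minutes off supply says little about this type of failure.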

This is a key area of interest to both regulators and the distribution companies. The ways in which these types of failures are monitored and recorded require significant input and analysis to ensure that a cost-effective, practical method is adopted.

The same can be said for telecommunications, both fixed and mobile, where glitches of this nature result in the loss of signal or communication, with a subsequent impact on customer service and, in some cases, cost.

