
Tuesday, August 15, 2017

The Theory of Network Troubleshooting & Root Cause Analysis

Misconfigurations, software bugs, and hardware failures cause problems in networks. To troubleshoot them, an engineer can take several approaches, and some are more effective than others. This post draws on the experience I have gathered over my career.

After a problem is reported, you start the troubleshooting / root cause analysis process. A structured troubleshooting approach goes through the following steps.

Problem Report
You learn about a problem either through a user complaint or through an event triggered by a monitoring tool. In both cases we call this a "Problem Report".

Collect Information
A problem report may describe only the visible symptom. To troubleshoot it, you may need to gather more technical information.
You can gather it by asking your users detailed questions or by using network tools, and it feeds the root cause analysis that follows.

Analyze Information
Once we have gathered all the information, we analyze it to understand what is wrong. We can compare it against previously collected information (a baseline) or against other devices with similar configurations.
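
As a rough illustration of comparing current readings against previously collected information, here is a minimal Python sketch. The metric names, the `find_deviations` helper and the 20% tolerance are all assumptions made for this example, not values from any particular monitoring tool.

```python
def find_deviations(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.2,  # assumed threshold: 20% of the baseline value
) -> dict[str, tuple[float, float]]:
    """Return metrics whose current value differs from the baseline
    by more than `tolerance` (relative to the baseline value)."""
    deviations = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # nothing collected for this metric, so nothing to compare
        allowed = abs(base_value) * tolerance
        if abs(current[name] - base_value) > allowed:
            deviations[name] = (base_value, current[name])
    return deviations


# Hypothetical readings for one interface
baseline = {"latency_ms": 10.0, "crc_errors": 0.0}
current = {"latency_ms": 11.0, "crc_errors": 5.0}
print(find_deviations(baseline, current))
```

With these made-up readings only `crc_errors` is flagged, which would point the analysis toward the physical link rather than the latency.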

Eliminate Possible Causes 
We need to think about the possible causes and isolate the suspected ones so that we can form hypotheses. This requires thorough knowledge of the network and all the protocols that are involved.
There are 6 approaches to this step, covered under "How to Eliminate Possible Causes?" later in this post.

Hypothesize
After eliminating possible causes, you will end up with a couple of candidates that could be the problem. We select the most likely one.

Verify Hypothesis
We test our hypothesis to see whether we are right or wrong. If we are right, the case is closed.
If we are wrong, we test the other possible causes.
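
The Hypothesize and Verify Hypothesis steps can be sketched as a small loop in Python. This is only an illustration: the `verify_hypotheses` name, the likelihood scores and the test callables are hypothetical, and in practice the "test" is usually a manual check or a change you apply and observe.

```python
from typing import Callable, Iterable, Optional

# Each candidate: (description, estimated likelihood, test that returns True if confirmed)
Candidate = tuple[str, float, Callable[[], bool]]


def verify_hypotheses(candidates: Iterable[Candidate]) -> Optional[str]:
    """Test candidate causes from most to least likely.

    Returns the first confirmed cause, or None if every hypothesis was wrong.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for description, _likelihood, test in ranked:
        if test():
            return description  # hypothesis confirmed, case closed
    return None  # all wrong: go back and collect more information


# Hypothetical candidates left over after elimination
candidates = [
    ("duplex mismatch", 0.3, lambda: True),
    ("ACL blocking traffic", 0.6, lambda: False),
]
print(verify_hypotheses(candidates))
```

Here the most likely cause (the ACL) is tested first and rejected, so the loop moves on to the next candidate.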


In live troubleshooting, you may have to find the solution as soon as possible. So, relying on your experience with the network, you skip the Analyze Information and Eliminate Possible Causes steps and go directly to Hypothesize after collecting information, to reduce the downtime. This is called the "Shoot from the Hip" approach.

But after applying the solution, you may later have to do a root cause analysis to record it in the KEDB (Known Error Database). For that you will have to go through all the steps above. Also, if your hypothesis did not work, you will have to go through the Eliminate Possible Causes step to find the exact solution.

How to Eliminate Possible Causes?

There are 6 identified approaches for this. Based on your experience with the types of networks you run (e.g. WAN, passive fiber), you select the best approach. It helps to create a chart for that, so that novice engineers can also identify the problem easily. If you are not sure which approach to take, you can start with the first one.

(1) Top-down
(2) Bottom-up
(3) Divide & Conquer
(4) Follow the traffic path
(5) Spot the difference
(6) Replace components
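
The chart mentioned above could be as simple as a lookup table. The sketch below is one possible shape for it in Python. The symptom keys are names I made up for this example, and the mapping just restates the guidance given in the sections that follow, so adjust it to your own network.

```python
# Hypothetical symptom categories mapped to the approaches described in this post
APPROACH_CHART = {
    "application_complaint": "Top-down",
    "uplink_or_unreachable": "Bottom-up",
    "suspected_layer_known": "Divide & Conquer",
    "faulty_hop_unknown": "Follow the traffic path",
    "suspected_misconfiguration": "Spot the difference",
    "suspected_hardware": "Replace components",
}


def choose_approach(symptom: str) -> str:
    """Pick an approach for a symptom; fall back to the first one when unsure."""
    return APPROACH_CHART.get(symptom, "Top-down")


print(choose_approach("suspected_hardware"))  # Replace components
print(choose_approach("no_idea"))             # Top-down
```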

Top-down Approach

This means starting from the uppermost layer of the OSI model (the Application layer). If peer-to-peer communication at the first layer you check is not OK, you move down to the next layer.
This approach is listed first because if you can verify that one layer is working, all the layers below it are working too. It is much faster in most cases, such as user complaints about desktop applications. But you need access to the application in order to use this approach.
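
To make the walk concrete, here is a minimal Python sketch of the top-down idea. The `is_working` callable is an assumption: it stands in for whatever check you actually run per layer (opening the application, a TCP connect, a ping, checking link lights and so on).

```python
from typing import Callable, Optional

OSI_TOP_DOWN = [
    "Application", "Presentation", "Session",
    "Transport", "Network", "Data Link", "Physical",
]


def top_down(is_working: Callable[[str], bool]) -> Optional[str]:
    """Walk the OSI layers from the top.

    Stops at the first layer that works, since everything below it is then
    assumed to work too. Returns the lowest layer that failed (the suspect),
    or None if the Application layer itself works.
    """
    suspect = None
    for layer in OSI_TOP_DOWN:
        if is_working(layer):
            return suspect
        suspect = layer
    return suspect  # every layer failed, so the suspect is Physical


# Hypothetical situation: layers 1 to 3 are fine, everything above fails
healthy = {"Physical", "Data Link", "Network"}
print(top_down(lambda layer: layer in healthy))  # Transport
```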

Bottom-up Approach

It is the opposite of the above approach. Here we start with the Physical layer and work our way up. If layer 1 is OK, we move to layer 2. It is better suited to uplink and unreachability issues.

Divide & Conquer

This means starting in the middle of the OSI layers. If you know it is a routing issue, you start at the network layer and verify the functionality of the other layers from there. For a suspected firewall block, you start with the transport layer, and so on.
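
A minimal Python sketch of this idea follows. As before, `is_working` is a hypothetical stand-in for your real per-layer checks. The direction rule (move up when the starting layer works, move down when it fails) is how I usually apply it, not a fixed standard.

```python
from typing import Callable, Optional

OSI_BOTTOM_UP = [
    "Physical", "Data Link", "Network", "Transport",
    "Session", "Presentation", "Application",
]


def divide_and_conquer(
    is_working: Callable[[str], bool], start: str = "Network"
) -> Optional[str]:
    """Start at a middle layer and search in one direction only.

    If the starting layer works, the layers below are assumed fine and we
    look upward for the first failure. If it fails, we look downward for
    the lowest failing layer. Returns the suspect layer, or None.
    """
    index = OSI_BOTTOM_UP.index(start)
    if is_working(start):
        for layer in OSI_BOTTOM_UP[index + 1:]:
            if not is_working(layer):
                return layer
        return None
    suspect = start
    for layer in reversed(OSI_BOTTOM_UP[:index]):
        if is_working(layer):
            return suspect
        suspect = layer
    return suspect


# Hypothetical routing problem: layers 1 and 2 are fine, layer 3 is not
healthy = {"Physical", "Data Link"}
print(divide_and_conquer(lambda layer: layer in healthy))  # Network
```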

Follow the traffic path

This is useful when you cannot isolate the device with the routing issue. You analyze traceroutes to see where traffic is dropping, and then examine the routing / forwarding table at that point.
Sometimes a firewall in the path blocks ICMP, so you have to access the devices or read through their configurations and think as the device (e.g. a router) to figure out what decision it makes for the packet.
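
"Thinking as the device" for a router mostly means doing the longest-prefix-match lookup yourself. The Python sketch below does that with the standard `ipaddress` module. The routing table entries and next-hop addresses are invented documentation values, and a real router also weighs things such as administrative distance and metrics, which this sketch ignores.

```python
import ipaddress
from typing import Optional


def lookup_next_hop(
    routing_table: list[tuple[str, str]], destination: str
) -> Optional[str]:
    """Return the next hop of the most specific route matching `destination`.

    `routing_table` is a list of (prefix, next_hop) pairs. Returns None when
    no route matches, in which case the router would drop the packet.
    """
    dest = ipaddress.ip_address(destination)
    best_net = None
    best_hop = None
    for prefix, next_hop in routing_table:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, next_hop
    return best_hop


# Hypothetical routing table copied from a device in the path
table = [
    ("0.0.0.0/0", "192.0.2.1"),
    ("10.0.0.0/8", "192.0.2.2"),
    ("10.1.0.0/16", "192.0.2.3"),
]
print(lookup_next_hop(table, "10.1.2.3"))  # 192.0.2.3
```

Repeating this lookup hop by hop along the path shows where the packet is sent in an unexpected direction or has no route at all.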

Spot the difference

You examine the configuration file of a similar device and find out what is different. This is useful for identifying misconfigurations in devices that use identical technologies.
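
If you have both configurations as text, the comparison can be automated. Below is a minimal Python sketch using the standard `difflib` module. The two configuration snippets are invented for the example, and `config_differences` is just a name I picked.

```python
import difflib


def config_differences(reference: str, suspect: str) -> list[str]:
    """Return only the lines that differ between two configurations.

    Lines prefixed with "- " exist only in the reference (working) device,
    lines prefixed with "+ " exist only in the suspect device.
    """
    diff = difflib.ndiff(reference.splitlines(), suspect.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical configurations of a working device and a faulty one
working = "hostname r1\nmtu 1500\nno shutdown"
faulty = "hostname r1\nmtu 1400\nno shutdown"
for line in config_differences(working, faulty):
    print(line)
```

Expect some noise from lines that legitimately differ between devices, such as hostnames and IP addresses. Those have to be filtered out by eye.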

Replace components

This approach is suitable when you suspect a hardware issue. Most of the time it is used for known hardware issues.
