When packets carrying application traffic are dropped, the client or the server has to retransmit them. TCP retransmissions take time, and application performance suffers as a result. The tough part is finding the links and paths on the network that are dropping traffic and, where possible, resolving the packet loss. Here is a quick list of things to look for when trying to root out network packet loss.
1. Identify the path between client and server.
This means more than simply running a traceroute. We need to know exactly which interfaces are responsible for the network path between the two endpoints, including layer two switches and firewalls. After identifying these interfaces, we can then begin to look for symptoms of packet loss.
2. Comb for Ethernet/Layer two Errors
If a frame is corrupted somewhere on the path, the next switch or router along the way will drop it. The nice thing for us as troubleshooters is that the device records the drop, counting it as an FCS error, late collision, overrun, alignment error, or something similar. These counters can help us find bad cables, faulty interfaces, bad terminations, duplex problems, and other layer two issues. Problems like these can bring an application to its knees, so it is important to check the network for them regularly.
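Because these counters only ever count up, a single reading says little; what matters is how fast they grow. Here is a minimal Python sketch of that idea, comparing two snapshots of the same interface. The `CounterSample` structure and its field names are hypothetical, and the sketch assumes the packet counter excludes errored frames, so check how your device reports them.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One snapshot of an interface's input counters (hypothetical structure)."""
    in_packets: int  # frames received without error (assumed to exclude errored frames)
    in_errors: int   # FCS errors, alignment errors, overruns, etc. combined


def error_rate(earlier: CounterSample, later: CounterSample) -> float:
    """Fraction of received frames that were errored between two snapshots."""
    packets = later.in_packets - earlier.in_packets
    errors = later.in_errors - earlier.in_errors
    if packets < 0 or errors < 0:
        # Counters were cleared or wrapped between samples; the delta is meaningless.
        raise ValueError("counter went backwards; take a fresh pair of samples")
    total = packets + errors
    return errors / total if total else 0.0


if __name__ == "__main__":
    before = CounterSample(in_packets=1_000, in_errors=5)
    after = CounterSample(in_packets=10_990, in_errors=15)
    print(f"input error rate: {error_rate(before, after):.4%}")
```

An interface whose error count is large but no longer growing was probably fixed long ago; one that adds errors between every pair of samples deserves a closer look at its cable and duplex settings.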
3. Dig for Discards
First, let's get something straight - discards and errors are two different things. Discards may increment on a link that is perfectly healthy in terms of errors. Just because we see a discard does not mean that we will also see errors incrementing on that interface. Finding an interface that has discards does not always mean we have found the root cause of the problem. For example, an interface may be configured as a trunk carrying several VLANs. If a frame is received with a VLAN ID that is not configured on the receiving port, this frame will be discarded.
This kind of configuration problem typically doesn't affect application performance, though. If these discards were hitting the application's traffic, the user would experience connectivity problems rather than slowness, since they would not be able to contact the server in the first place. So when troubleshooting slowness, don't be quick to blame every interface that you find with discards.
The discards we want to focus on are the ones caused by input or output congestion. When a link fills, the device can buffer only so many of the waiting packets before it starts dropping them. Along with interface discards, we will often see a spike in router CPU utilization as it tries to keep up with the increase in traffic. If a link is persistently becoming congested, make sure to use a flow-based technology to identify the type and source of the traffic to determine if it is acceptable network usage. It could be that a backup or peer-to-peer file transfer is clogging the network, which can impact application performance.
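To make the flow-based step concrete, here is a small Python sketch that totals bytes per source and application so the heaviest users of a congested link float to the top. The `(source, application, bytes)` record layout is an assumption made for illustration; real flow exports carry many more fields and need a collector to decode them.

```python
from collections import Counter


def top_talkers(flows, n=3):
    """Return the n heaviest (source, application) pairs by total bytes.

    flows: iterable of (src_ip, application, byte_count) tuples
           (a hypothetical, simplified flow record).
    """
    totals = Counter()
    for src_ip, application, byte_count in flows:
        totals[(src_ip, application)] += byte_count
    return totals.most_common(n)


if __name__ == "__main__":
    sample_flows = [
        ("10.0.0.5", "backup", 6_000_000),
        ("10.0.0.9", "web", 1_000_000),
        ("10.0.0.5", "backup", 3_000_000),
        ("10.0.0.7", "p2p", 4_000_000),
    ]
    for (src_ip, application), total in top_talkers(sample_flows, n=2):
        print(f"{src_ip:<12} {application:<8} {total:>12,} bytes")
```

Whether the top entry is acceptable usage is still a judgment call; the sketch only tells us who to ask.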
Looking for these types of issues can be very time consuming if we don't have the right tools. In fact, unless they are impacting a critical application, discards and errors can go on for months or even years without our knowledge. It is important to resolve them before they start affecting something important.
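Even without a dedicated tool, a periodic sweep can keep these problems from hiding for long. The Python sketch below takes per-interface counter deltas for one polling interval and labels each interface using the distinctions drawn above: layer two errors, discards on a busy link, and discards on a quiet link. The input layout, the field names, and the 80% utilization threshold are all assumptions for illustration, and average utilization over a long interval can hide short bursts.

```python
def utilization(delta_octets: int, seconds: float, speed_bps: int) -> float:
    """Average link utilization (0.0 to 1.0) over the polling interval."""
    return (delta_octets * 8) / (seconds * speed_bps)


def findings(sample: dict, seconds: float, busy_threshold: float = 0.8) -> list:
    """Label one interface from its counter deltas (hypothetical field names)."""
    labels = []
    if sample["delta_errors"] > 0:
        labels.append("layer two errors")
    if sample["delta_discards"] > 0:
        util = utilization(sample["delta_octets"], seconds, sample["speed_bps"])
        if util >= busy_threshold:
            labels.append("congestion discards")
        else:
            # Could be a VLAN mismatch or similar; not necessarily a performance issue.
            labels.append("discards without congestion")
    return labels


def sweep(samples: dict, seconds: float) -> dict:
    """Return only the interfaces that have something to report."""
    report = {}
    for name, sample in samples.items():
        labels = findings(sample, seconds)
        if labels:
            report[name] = labels
    return report


if __name__ == "__main__":
    deltas = {
        "Gi0/1": {"delta_errors": 12, "delta_discards": 0,
                  "delta_octets": 1_000_000, "speed_bps": 1_000_000_000},
        "Gi0/2": {"delta_errors": 0, "delta_discards": 40,
                  "delta_octets": 33_750_000_000, "speed_bps": 1_000_000_000},
        "Gi0/3": {"delta_errors": 0, "delta_discards": 0,
                  "delta_octets": 5_000_000, "speed_bps": 1_000_000_000},
    }
    for name, labels in sweep(deltas, seconds=300).items():
        print(name, "->", ", ".join(labels))
```

Running something like this on a schedule and reviewing only the flagged interfaces is far less work than walking every device by hand.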
As a troubleshooter, a tool that I regularly use to spotlight discards, errors, and high utilization problems is the OptiView XG. I promise I'm not just saying that to plug the tool. It just makes these types of issues much easier to find on a network. In fact, with its latest bump in software versions (v14) the XG got a handy new feature called Interface Health. This part of the tool shows the top 1000 interfaces network-wide that are experiencing errors, discards, and high utilization. This feature makes it very easy to find interfaces that are responsible for packet drops, which is a huge help in tracking down application performance problems. For sure, there are other ways to find these issues, but they often require quite a bit of "show interface detail"!