Network Troubleshooting: Made Simple for Complex Network Issues
The most efficient way to troubleshoot a network issue is to take a simple, systematic approach: collect the information, then troubleshoot by following the OSI (Open Systems Interconnection) networking model.
Collect and Understand the Basic Information
It is critical to obtain a complete picture of the issue. Carefully consider how the problem appears. For example, is it related to inbound traffic, outbound traffic, or both? Try to identify whether the problem is constant or intermittent, and whether it is reproducible. If so, how?
Troubleshoot Up Through the OSI Reference Model:
As noted earlier, the best and simplest way to troubleshoot a network issue is to follow the OSI reference model. Start with the Physical Layer and work up to the Application Layer. In most cases the problem lies in the first three layers.
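The bottom-up order described above can be sketched as a small checklist runner. This is only an illustration: the function name `first_failing_layer` and the idea of passing one check function per layer are assumptions made for the example, not part of any standard tool.

```python
from typing import Callable, Dict, Optional

# OSI layers in the order the article recommends checking them (bottom-up).
OSI_LAYERS = [
    "Physical", "Data Link", "Network", "Transport",
    "Session", "Presentation", "Application",
]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the supplied per-layer checks from the Physical Layer upward.

    `checks` maps a layer name to a zero-argument function that returns
    True when that layer looks healthy. Layers without a check are skipped.
    Returns the name of the first layer whose check fails, or None if
    every supplied check passes.
    """
    for layer in OSI_LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None


if __name__ == "__main__":
    # Hypothetical results: cabling and ARP are fine, routing is broken.
    example = {
        "Physical": lambda: True,
        "Data Link": lambda: True,
        "Network": lambda: False,
        "Transport": lambda: False,
    }
    print(first_failing_layer(example))
```

Stopping at the lowest failing layer reflects the reasoning in the text: a fault at a lower layer usually explains the symptoms seen at the layers above it.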
Physical Layer Troubleshooting.
The physical layer is one of the easiest to troubleshoot, yet it is frequently overlooked. If there is a connectivity issue, check the following.
- Ensure the devices are powered on. Examine the cable with a LAN Ethernet Cat 5 tester or a fiber optic test instrument, and closely examine the connectors as well. Ensure that each cable connector clicks into place when inserted into the network port. Check the network port indicator lights on each side.
- Ensure that the proper type of cabling is in use. Between a network device and a computer, use a straight-through cable, which has an identical wiring layout on both ends. For computer-to-computer or switch-to-switch links, use a crossover cable.
- At the OS level, troubleshoot the network card by checking whether its driver is properly loaded, or run hardware diagnostics on it. Useful tools include ethtool, modprobe (to load the driver), and getmib -l.
- Finally, check the system log and the hardware log for any errors reported for the NIC.
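On Linux, the kernel also reports each interface's link state under sysfs, which gives a quick software-side view of the physical checks above. The sketch below assumes a Linux host; the function name `link_state` and the `sysfs_root` parameter (there only so the function can be pointed at a test directory) are made up for this example.

```python
from pathlib import Path


def link_state(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the kernel's operstate for `iface` (for example 'up' or 'down').

    Returns 'missing' when the interface has no entry under sysfs_root,
    which usually means the NIC or its driver is not present.
    """
    state_file = Path(sysfs_root) / iface / "operstate"
    try:
        return state_file.read_text().strip()
    except FileNotFoundError:
        return "missing"


if __name__ == "__main__":
    # Interface names are examples; substitute the ones on your system.
    for name in ("eth0", "eth1"):
        print(name, link_state(name))
```

A state of `down` with the cable plugged in points back at the cable, connector, or switch port; `missing` points at the driver or the card itself.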
Data Link Layer Troubleshooting:
The data link layer transfers data between two nodes on a LAN segment, or between adjacent network nodes in a wide area network, and handles the delivery of frames. This section discusses only Ethernet (IEEE 802.3).
Note: Data link layer protocols include FDDI, Token Ring, AppleTalk, ATM, Cisco Discovery Protocol, Frame Relay, Ethernet, Multiprotocol Label Switching (MPLS), Point-to-Point Protocol (PPP), Serial Line Internet Protocol (SLIP), and Spanning Tree Protocol.
At the data link layer, local communication takes place using the hardware addresses of the network ports, also known as Media Access Control (MAC) addresses. Improper configuration leads to failures at the data link layer.
- If there is a network connectivity issue, first check the Address Resolution Protocol (ARP) table with the arp command.
- From the arp output, determine whether the MAC address matches that of the remote network port hosting that IP address. If the MAC address is incorrect, delete the offending ARP entry.
# arp -d <IP address>
The ARP entry is repopulated automatically when network traffic arrives for that IP address. In most cases this occurs almost immediately. If the incorrect ARP entry appears again, there is a duplicate IP address on the network.
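The comparison of ARP entries against known MAC addresses can be automated. The following is a minimal sketch that assumes a Linux host, where the kernel exposes its ARP table as text in /proc/net/arp. The function names and the sample addresses are invented for illustration.

```python
def parse_proc_net_arp(text: str) -> dict:
    """Map IP address -> MAC address from /proc/net/arp-style text.

    Expected columns: IP address, HW type, Flags, HW address, Mask, Device.
    The first line is a header and is skipped.
    """
    table = {}
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 6:
            table[fields[0]] = fields[3].lower()
    return table


def mismatched_entries(table: dict, expected: dict) -> dict:
    """Return {ip: (seen_mac, expected_mac)} for every IP whose ARP entry
    disagrees with the MAC address we expect it to have."""
    return {
        ip: (table[ip], mac.lower())
        for ip, mac in expected.items()
        if ip in table and table[ip] != mac.lower()
    }


# Made-up sample in the /proc/net/arp layout.
SAMPLE_ARP = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.1      0x1         0x2         00:11:22:33:44:55     *        eth0
192.168.1.20     0x1         0x2         AA:BB:CC:DD:EE:FF     *        eth0
"""

if __name__ == "__main__":
    seen = parse_proc_net_arp(SAMPLE_ARP)
    # The expected MAC for 192.168.1.20 is hypothetical inventory data.
    print(mismatched_entries(seen, {"192.168.1.20": "aa:bb:cc:dd:ee:00"}))
```

On a live system the text would come from reading /proc/net/arp. An entry that keeps coming back with the wrong MAC after deletion is the duplicate-IP symptom described above.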
- The NICs at both ends of a link must be configured either to auto-negotiate or to use the same speed and duplex settings. Otherwise there may be network performance issues or intermittent loss of connectivity. Make sure the network ports at each end of the wire are configured in the same manner (for example, auto-negotiate or 100 Mbps full duplex).
- If there are intermittent or constant connectivity problems, use the netstat command to check the status of the network interfaces.
# netstat -in
Errors in the RX-ERR and TX-ERR columns are usually caused by defective network hardware. Nonzero entries in the error columns indicate that the network is very busy or that there is an issue with the network hardware.
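The same receive and transmit error counters can be read programmatically. This sketch assumes a Linux host, where the counters are available as text in /proc/net/dev. The function names and the sample counters are made up for the example.

```python
def interface_errors(proc_net_dev_text: str) -> dict:
    """Return {iface: (rx_errs, tx_errs)} from /proc/net/dev-style text.

    After two header lines, each line is '<iface>: <16 counters>'.
    The first 8 counters are receive-side, the last 8 transmit-side;
    'errs' is the third counter in each group (indexes 2 and 10).
    """
    result = {}
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        values = counters.split()
        if len(values) >= 11:
            result[name.strip()] = (int(values[2]), int(values[10]))
    return result


def interfaces_with_errors(proc_net_dev_text: str) -> dict:
    """Keep only the interfaces that report at least one RX or TX error."""
    return {
        name: errs
        for name, errs in interface_errors(proc_net_dev_text).items()
        if any(errs)
    }


# Made-up sample in the /proc/net/dev layout.
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000    4000   12    0    0     3          0         0   300000    3500    4    0    0     0       0          0
"""

if __name__ == "__main__":
    print(interfaces_with_errors(SAMPLE_PROC_NET_DEV))
```

Because the counters are cumulative since boot, sampling them twice a few seconds apart shows whether errors are still increasing or are left over from an earlier fault.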
Also make sure that no errors are reported for the NIC in the logs and that the interface is up within the OS. Use the ifconfig command to determine the status of each interface.
An interface that is up and operational is flagged UP in the ifconfig output. For example, if eth0 is listed as UP but eth1 is not, then eth1 is down. If eth1 is not up, use ifconfig to bring it online.
#ifconfig eth1 up
Check the MTU (Maximum Transmission Unit) size as well:
# ifconfig -a | grep -i MTU
The MTU should be 1500 bytes, which is the default for Ethernet. If the MTU is set to some other value, the network may run very slowly.
To set the default MTU size, use the ifconfig command:
#ifconfig eth0 mtu 1500
Note: If you are using a VPN over a DSL or cable connection, try setting the client workstation's MTU to 1400.
The DrTCP utility can be used for this purpose (http://www.dslreports.com/drtcp).
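One way to confirm that a given MTU works end to end is to send a ping that is exactly as large as the MTU allows, with fragmentation forbidden. The arithmetic is sketched below, assuming IPv4 without options (20-byte header) and an 8-byte ICMP echo header. The function names are invented for this example, and the `-M do` flag belongs to the Linux iputils version of ping.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(host: str, mtu: int) -> str:
    """Build a Linux iputils ping command that forbids fragmentation
    (-M do) and uses the largest payload the given MTU allows."""
    return f"ping -c 3 -M do -s {max_ping_payload(mtu)} {host}"


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; replace it with a real host.
    print(ping_command("192.0.2.1", 1500))  # payload 1472
    print(ping_command("192.0.2.1", 1400))  # payload 1372
```

If the 1472-byte ping fails while a smaller one succeeds, some link on the path has an MTU below 1500, which is the situation the VPN note above works around.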
Network Layer Troubleshooting.
To understand network layer behavior, it helps to know the network layer protocols, for example IP (v4 and v6), IPX, IPsec, ICMP, and ARP.
The main functions of the network layer are host addressing, network addressing, and message forwarding, which includes populating the routing table and handling static routes. It queues incoming and outgoing data and forwards packets according to the quality-of-service constraints set for them, and it provides both connection-oriented and connectionless mechanisms. It is therefore very important to understand IP (Internet Protocol).
To communicate across a network, each system needs an IP address, a subnet mask (netmask), and a default gateway.
- Each node on the network must have a unique IP address. If a system boots and announces an IP address that another system is already using, it may shut down its own networking to avoid a conflict.
- Every system sends traffic destined for remote networks to its default gateway. If the default gateway is incorrect or missing, that traffic will not flow unless static route entries are configured.
You can verify that the default gateway is correct with the netstat command (for example, netstat -rn).
- The subnet mask (netmask) tells the system whether a given IP address is on its local network or on a remote network. All traffic for anything other than the local network has to go through a gateway, also known as a router. You can find the system's netmask with the ifconfig command:
# ifconfig -a | grep -i mask
Note: On Linux, all traffic for non-local destinations is sent to the default gateway by default. If you configure a route for a specific network or host address, only traffic for that network or host is sent to that route's gateway. If the system has multiple NICs, it is better for each NIC to have its own gateway.
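The local-versus-remote decision that the netmask drives can be reproduced with Python's standard `ipaddress` module. This is a simplified sketch: the function name `next_hop` and the addresses are illustrative, and it models a host with a single interface and a default gateway only.

```python
import ipaddress


def next_hop(local_ip: str, netmask: str, gateway: str, destination: str) -> str:
    """Return where a packet for `destination` is sent first.

    If the destination falls inside the local subnet (derived from the
    local IP and netmask) it is delivered directly; otherwise it goes
    to the default gateway.
    """
    local_net = ipaddress.ip_interface(f"{local_ip}/{netmask}").network
    if ipaddress.ip_address(destination) in local_net:
        return destination
    return gateway


if __name__ == "__main__":
    # Example host 192.168.1.10/255.255.255.0 with gateway 192.168.1.1.
    print(next_hop("192.168.1.10", "255.255.255.0", "192.168.1.1", "192.168.1.77"))
    print(next_hop("192.168.1.10", "255.255.255.0", "192.168.1.1", "8.8.8.8"))
```

A wrong netmask shows up here directly: with 255.255.0.0 configured by mistake, the host would try to reach 192.168.2.5 locally instead of through the gateway.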
Before starting a ping test, make sure that ICMP is not blocked by a firewall in the environment. If ICMP is blocked, you will not get a reply from the remote host. On the server side, you can use the netstat command to check whether ICMP packets are arriving:
# netstat -s -p icmp
- If there is an issue with the routing table, inbound or outbound traffic will not flow. Examine the routing table.
Use the lookup feature of the route command to determine how the system will route traffic to a given IP address.
# route -n lookup <IP address>
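A route lookup picks the most specific routing table entry that contains the destination (longest prefix match). The sketch below illustrates that rule on a made-up table. It is not how the kernel implements it, and the name `lookup_route` is an assumption for this example. On Linux systems with iproute2, `ip route get <IP address>` performs the real lookup.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[str, str, str]  # (destination network, gateway, interface)


def lookup_route(routes: List[Route], destination: str) -> Optional[Route]:
    """Return the matching route with the longest prefix, or None."""
    dest = ipaddress.ip_address(destination)
    best: Optional[Route] = None
    best_len = -1
    for network, gateway, iface in routes:
        net = ipaddress.ip_network(network)
        if dest in net and net.prefixlen > best_len:
            best, best_len = (network, gateway, iface), net.prefixlen
    return best


# Hypothetical routing table: default route, local subnet, one static route.
EXAMPLE_ROUTES: List[Route] = [
    ("0.0.0.0/0", "192.168.1.1", "eth0"),
    ("192.168.1.0/24", "0.0.0.0", "eth0"),
    ("10.0.0.0/8", "192.168.1.254", "eth0"),
]

if __name__ == "__main__":
    for ip in ("10.1.2.3", "192.168.1.50", "8.8.8.8"):
        print(ip, "->", lookup_route(EXAMPLE_ROUTES, ip))
```

If the lookup returns the default route for a destination that should use a static route, the static entry is missing or its network or mask is wrong.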
- In most environments traffic simply flows through the gateway; only rarely will you find NAT and port forwarding configured. If NAT, port forwarding, or a proxy is configured, you need to follow a different troubleshooting methodology.
Transport Layer Troubleshooting.
For Layers 1 to 3, some of the problem determination (PD) is done on the host (server side). At the transport layer, most problems are resolved at the firewall or router. If you do not have access to the firewall or router, you will need the network team's assistance.
To understand the transport layer, it helps to understand UDP and TCP. The primary responsibility of the transport layer is to transport data between hosts, so UDP and TCP are the transport vehicles, or packet carriers. Most application and service data is carried over UDP or TCP. TCP is the reliable vehicle, also known as a connection-oriented protocol, while UDP is unreliable, also known as a connectionless protocol.
TCP communication between two remote hosts is addressed by port numbers (TSAPs). Port numbers range from 0 to 65535 and are further classified as follows:
- System Ports (Well-Known Ports) (0-1023)
- User Ports (1024-49151)
- Private or Dynamic Ports (49152-65535)
- Users often report: "I can ping the host, but I cannot connect to FTP, HTTP, telnet, or other services." A likely cause is port blocking at the firewall. The firewall team needs to allow the UDP or TCP protocol as well as the respective port, for example 20 and 21 for FTP or 23 for telnet. A typical firewall rule reads: source port any, destination port 80, together with the host or network address that is allowed.
Note: Most applications and services use TCP as the transport, and some services use UDP as well, so make sure both protocols are allowed at the firewall where the service needs them.
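The "ping works but the service does not" case can be confirmed from the client with a plain TCP connection attempt. This is a minimal sketch using Python's standard `socket` module. The helper name `tcp_port_open` is made up, and the check covers TCP only, since UDP has no handshake to test in this way.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, False if it is refused, filtered, or times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and ports are placeholders; use the server being debugged.
    for port in (21, 23, 80):
        print(port, tcp_port_open("127.0.0.1", port, timeout=1.0))
```

A quick refusal usually means nothing is listening on that port, while a timeout is more typical of a firewall silently dropping the packets.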
Why is the source port "any"? Most client applications, such as telnet, FTP clients, and web browsers, use an ephemeral source port (not fixed, it varies with each session), while the server uses a fixed port: httpd uses 80, sshd uses 22, and ftpd uses 20 and 21. A few rare applications use a port from the ephemeral range as their fixed port.
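The three port ranges listed earlier are easy to encode, which helps when reading firewall logs or packet captures. The function name `classify_port` and its return labels are assumptions made for this sketch.

```python
def classify_port(port: int) -> str:
    """Classify a TCP/UDP port number into the three ranges listed above."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port <= 1023:
        return "system"   # System (Well-Known) Ports, 0-1023
    if port <= 49151:
        return "user"     # User Ports, 1024-49151
    return "dynamic"      # Private or Dynamic Ports, 49152-65535


if __name__ == "__main__":
    for p in (22, 80, 8080, 40000, 50000):
        print(p, classify_port(p))
```

Note that an operating system's ephemeral range does not have to coincide with the dynamic range. A source port such as 40000 can be an ephemeral port on Linux even though it classifies as a user port here.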
Ephemeral port exhaustion:
This problem is a form of resource starvation in which a machine can no longer use its TCP subsystem because it has no available connection slots. In our experience, this most often occurs on proxies.
- TCP is the bedrock of web services: it forms the basis for all inter-node communication. Every TCP connection can be represented by a tuple of (source IP, source port, destination IP, destination port). For a given machine communicating with a single upstream host, three of these tuple elements (source IP, destination IP, and destination port) are fixed. This means the number of connections a single machine can make to a single web service is limited to the number of source ports it has available. On Linux, the source port for an outgoing connection is selected by the kernel from the ephemeral range.
On Linux, the currently configured ephemeral port range can be read from the net.ipv4.ip_local_port_range sysctl (/proc/sys/net/ipv4/ip_local_port_range).
This gives a range of about 28,232 connections. That seems like a number that would give plenty of buffer room when designing for high-scale web services. When would one ever have 28,232 active connections?
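The arithmetic behind that figure, and one way the limit can be reached, can be worked through in a few lines. Two assumptions are brought in for this sketch and are not stated in the text above: the range 32768 to 61000 (an older Linux default that is consistent with the 28,232 figure), and a hold time of 60 seconds per source port, which is how long Linux keeps a closed connection in TIME_WAIT. The function names are made up.

```python
from typing import Tuple


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse the 'low high' pair as stored in
    /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = (int(value) for value in text.split())
    return low, high


def available_ports(low: int, high: int) -> int:
    """Approximate number of ephemeral ports in the range.

    Uses high - low to match the article's figure; counting both
    endpoints would add one.
    """
    return high - low


def max_connection_rate(ports: int, hold_seconds: float) -> float:
    """New connections per second to one (destination IP, destination port)
    that can be sustained if each source port stays unavailable for
    `hold_seconds` after use."""
    return ports / hold_seconds


if __name__ == "__main__":
    low, high = parse_port_range("32768\t61000\n")  # assumed range
    ports = available_ports(low, high)
    print(ports)                                   # 28232
    print(round(max_connection_rate(ports, 60.0)))  # about 471 per second
```

Under these assumptions the limit is not 28,232 simultaneous connections but roughly 470 new connections per second to a single upstream, which a busy proxy can reach.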
- Ephemeral Source port Strategy
| Operating System | Port Number Range | Selection Strategy |
| HP Tru64 UNIX | 1024-5000 | NA |
| MS Windows 8 | NA | Global, Sequential |