Last Hop, Second Packet : The Mystery

The Back-story
============

Often I’ve noticed that when doing a trace-route from one Cisco router to another, the last hop will always show one lost packet. Like so;

Router1#traceroute 34.0.0.4

Type escape sequence to abort.
Tracing the route to 34.0.0.4

1 12.0.0.2 4 msec 4 msec 4 msec
2 23.0.0.3 20 msec 16 msec 16 msec
3 34.0.0.4 16 msec * 16 msec

Now at first sight, it seems there would be something wrong at the final hop, as each time you do this trace, you will see the second packet being dropped.

So, thinking it was a bug or something, I opened a TAC Case at Cisco asking if this was a known issue.

Their first reply was asking whether I already tried debugging the traces to find the root cause of these drops, nothing about whether or not they are aware of this issue.

Which is weird, because even in their on-line documentation you can see the same behavior, Which I also pointed out to the technician handling the case.

There was no reaction on my comment…

The Objective
===========

It’s my intention to investigate exactly how and why we’re seeing this behavior and probably learn some more about the inner workings of a trace-route.

I will log everything in this update as I go through my investigation, this should give you a nice overview of my troubleshooting process and train of thought.

Hopefully there won’t be any derailment!

The Scene
========= 

I have set up the a simple topology in GNS3 to test this.
Each router has EIGRP running in order to advertise the networks and everything responds to ping tests as it should.

R1 Lo 1 will be our starting point, R4 Lo 1 our destination.

Let’s verify we indeed have the same issue on this network as I described above. I will do a trace-route from R1 to R4’s loopback address.

R1#traceroute 4.4.4.4

Type escape sequence to abort.
Tracing the route to 4.4.4.4

1 10.1.1.2 68 msec 48 msec 0 msec
2 10.2.2.3 92 msec 52 msec 88 msec
3 10.3.3.4 244 msec * 140 msec
R1#

As I expected, the second packet on the last hop failed. Yet, an extensive ping test shows no issues. The response times are a bit high but this is due to the heavy load GNS has on my computer.

To even begin with the investigation, we must first understand how a trace-route actually works.

The Facts
========

What IS a trace-route anyway?

A Trace-route is often used as a troubleshooting tool to find out which path a packet takes when travelling to its destination. When a remote device becomes unreachable, for example, trace-route can be used as one of the tools to find out where the problem may lay.

How does it work?

When you start a default trace-route, three UDP packets are sent to the specified destination with an invalid port address. Each packet with a Time-To-Live value of ‘one’.

The TTL value of ‘one’ causes the packet to timeout as soon as it reaches the first hop in its path, this hop (or router) then replies with an ICMP Type 11 message, meaning “Time Exceeded”

Another 3 UDP packets are then sent, but this time with a TTL value of ‘two’. which will cause the second hop to respond with ICMP Type 11 messages. After this it will send another 3 packets with a TTL value of 3, and so on ..

Documenting the source and response times of these Type 11 messages is how a trace-route is able to show you information about every hop between your source and destination.

Once your UDP packets arrive at its final destination, that target device will respond with a Type 3, code 3 message (Port Unreachable). Once the source device receives this message, it will know the trace-route process has arrived at its destination.

This process does not sound that complicated, so why are we seeing that one dropped packet ..

The Investigation
===============

Let’s roll.

I will start out by enabling debugging on R1 and checking what kind of output I get when doing the trace-route.

I will also create an access-list to identify the trace-route traffic between R1 and R4 so my debug output is filtered a bit and I do not get too much information about irrelevant traffic.

R1(config)#access-list 100 permit icmp any any
R1#debug ip packet detail 100
IP packet debugging is on (detailed) for access list 100

And to keep things nice and neat, I will turn off console logging and enable buffer logging for debug type messages. This will keep my CLI from getting cluttered with debug output.

R1(config)#no logging console
R1(config)#logging buffered 7

I run another trace-route;

R1#traceroute 4.4.4.4 source 1.1.1.1

Type escape sequence to abort.
Tracing the route to 4.4.4.4

1 10.1.1.2 56 msec 72 msec 44 msec
2 10.2.2.3 100 msec 96 msec 132 msec
3 10.3.3.4 188 msec * 160 msec
R1#

And let’s see what is in the log now;

R1#show logging | begin Log Buffer
Log Buffer (4096 bytes):

*Mar 1 03:45:40.183: IP: tableid=0, s=10.1.1.2 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:40.183: IP: s=10.1.1.2 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:40.183: ICMP type=11, code=0
*Mar 1 03:45:40.259: IP: tableid=0, s=10.1.1.2 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:40.259: IP: s=10.1.1.2 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:40.259: ICMP type=11, code=0
*Mar 1 03:45:40.303: IP: tableid=0, s=10.1.1.2 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:40.303: IP: s=10.1.1.2 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:40.303: ICMP type=11, code=0
*Mar 1 03:45:40.399: IP: tableid=0, s=10.2.2.3 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:40.399: IP: s=10.2.2.3 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:40.403: ICMP type=11, code=0
*Mar 1 03:45:40.495: IP: tableid=0, s=10.2.2.3 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:40.495: IP: s=10.2.2.3 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:40.499: ICMP type=11, code=0
*Mar 1 03:45:40.631: IP: tableid=0, s=10.2.2.3 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:40.631: IP: s=10.2.2.3 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:40.631: ICMP type=11, code=0
*Mar 1 03:45:40.819: IP: tableid=0, s=10.3.3.4 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:40.819: IP: s=10.3.3.4 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:40.819: ICMP type=3, code=3
*Mar 1 03:45:43.975: IP: tableid=0, s=10.3.3.4 (FastEthernet0/0), d=1.1.1.1 (Loopback1), routed via RIB
*Mar 1 03:45:43.975: IP: s=10.3.3.4 (FastEthernet0/0), d=1.1.1.1, len 56, rcvd 4
*Mar 1 03:45:43.975: ICMP type=3, code=3

Looking at the output above, we can see that for each hop in the path, we receive three type 11 code 0 messages. For the destination 10.3.3.4, we receive only two type 3 code 3 messages, the third response is completely missing.

I add UDP traffic to my access-list and check the debug output again for the last hop.

Extended IP access list 100
10 permit icmp any any
20 permit udp any any

Looks like we have sent the following three datagrams;

IP: tableid=0, s=1.1.1.1 (local), d=4.4.4.4 (FastEthernet0/0), routed via FIB
IP: s=1.1.1.1 (local), d=4.4.4.4 (FastEthernet0/0), len 28, sending
UDP src=49340, dst=33440

IP: tableid=0, s=1.1.1.1 (local), d=4.4.4.4 (FastEthernet0/0), routed via FIB
IP: s=1.1.1.1 (local), d=4.4.4.4 (FastEthernet0/0), len 28, sending
UDP src=49341, dst=33441

IP: tableid=0, s=1.1.1.1 (local), d=4.4.4.4 (FastEthernet0/0), routed via FIB
IP: s=1.1.1.1 (local), d=4.4.4.4 (FastEthernet0/0), len 28, sending
UDP src=49342, dst=33442

These look fine, so the problem can not be here at the source. Let’s investigate further.

I want to see what kind of debug output we can on our destination, R4, so I add the following config;

R4#conf t
Enter configuration commands, one per line. End with CNTL/Z.
R4(config)#access-list 100 permit icmp any any
R4(config)#access-list 100 permit udp any any

R4#debug ip packet detail 100
IP packet debugging is on (detailed) for access list 100

I run the trace again from R1 and an see the following debug output on R4;

When I go over this I can see that all three UDP packets arrive, each with one higher destination and source port number.

Yet, the ICMP Type 3 code 3 message is.. *cue dramatic music, move camera to close-up*

Never sent out…

So, it seems we have identified the perpetrator. It is R4.

The Motive
=========

Now I have identified exactly where this is going wrong, I will try to discover why.

The first thing I notice is that the Type 11 messages seem to go out fine, but somehow there is a problem with outgoing Type 3 messages.

This gets me thinking, are these messages limited somehow by default? Is there rate limiting like you have for some other kinds of services. (DHCP Snooping comes to mind)

It seems kind of silly, three outgoing type 3 message is the default for any device that is the destination of a trace-route.

Yet …

R4(config)#ip icmp ?
rate-limit rate limit ICMP messages generated
redirect Redirect type

No, could they really.. ?

Let’s try configuring a higher amount. I’m unsure what the default setting is here.

R4(config)#ip icmp rate-limit ?
unreachable ICMP type 3, Destination Unreachable

R4(config)#ip icmp rate-limit unreachable ?
<1-4294967295> Once per milliseconds
DF code 4, fragmentation needed and DF set

R4(config)#ip icmp rate-limit unreachable 50

And now let’s try the trace-route again on R1 *fingers crossed*

R1#traceroute 4.4.4.4 source 1.1.1.1

Type escape sequence to abort.
Tracing the route to 4.4.4.4

1 10.1.1.2 60 msec 124 msec 76 msec
2 10.2.2.3 124 msec 140 msec 92 msec
3 10.3.3.4 216 msec 196 msec 256 msec
R1#

Amazing.

The Conclusion
=============

It seems this Cisco Router (and who knows which others) have a default ICMP rate-limit that is too low and this might cause issues when troubleshooting by using Trace-Route.

On the router I’m testing this, the following IOS is running;

R4#sh ver
Cisco IOS Software, 2600 Software (C2691-ADVENTERPRISEK9-M), Version 12.4(15)T14, RELEASE SOFTWARE (fc2)

Which is the most recent T release I could find, You would think this would be solved in more recent releases.

I checked some show commands and found the command “Show ip icmp rate-limit” which is exactly what we need in this case.

The current output is;

A rate limit of 50, just like I configured earlier. Now I will remove that configuration and will check again to find out the default value.

R4(config)#no ip icmp rate-limit unreachable

And another “Show ip icmp rate-limit” shows..

What, it does not show anything? Okay, I’m used to Cisco devices showing nothing in the actual configuration if set to its default values, but this is some output that would actually be nice to see.

In the previous screenshot, I can see a rate-limit of 20 for “All other unreachables” but this still does not tell me anything. 20 per milli-second would still be plenty of our tests.

Another fun thing I just noticed, I went back to R1 and did the traceroute again after removing the configuration on R4 like you can see before.

The traceroute is still working perfectly, there is no lost packet on the last hop.

So it seems that even when removing the rate-limit on R4, something was left in its configuration.

Let’s try rebooting R4;

R4#wr
Building configuration…
[OK]
R4#reload
Proceed with reload? [confirm]

And ..

It crashed. The router crashed on boot.

Now, this might be because I’m using GNS3, so I’m very curious to see what effect this could have on a production router !

After I try again a few times, the router boots but the startup config is now lost. Not entirely sure if this is related but it’s all very suspicious!

I think now is a good time to stop my investigation and ask you, the reader, to comment on this update with any experience, insight or questions you might have about this.

This was very interesting to try out and I can not wait to get to troubleshoot more complex stuff .

References:
ICMP Types.