Router is sending false ICMP "Host unreachable" messages

Discussion:

Todd Pytel

2004-08-02 20:47:03 UTC

Hi misc@,

What originally looked like a Samba problem has led back to my router
running 3.4. The crux of the problem is that, shortly after my
workstations mount an NFS share from a server, other IP traffic to that
same server is briefly denied with an ICMP "Host unreachable" message.

The details: Two workstations (in 192.168.0.x) running Debian Linux
use various services (including NFS and SMB) from an OBSD 3.5 server
(69.x.x.x). Between the workstations and the server is a three-legged
OBSD 3.4 router which manages traffic between the external link, the
69.x.x.x server, and the workstation network (for which it acts as a NAT
gateway).

The behavior I first noticed was that the workstations would sometimes
(roughly 50% of the time) fail to mount the SMB share at bootup, failing
with a "No route to host" error. However, doing a "mount -a -t smbfs"
after bootup worked perfectly every time. Strange. Some more poking and
debugging revealed that traffic between the workstations and the server
works correctly for a while, including DNS lookups, NFS connections, and
even a successful NetBIOS query. But after the NetBIOS query, the
connection attempt is rejected by the router with an ICMP "Host
unreachable" (which the client reports as "No route to host"). Here's
some capture output - 69.x.x.x is the server IP throughout this
discussion:

# The successful NetBIOS name query
14:35:12.278208 192.168.0.1.32769 > 69.x.x.x.137: udp 50 (DF)
14:35:12.278544 69.x.x.x.137 > 192.168.0.1.32769: udp 62

# Client attempts the SMB connection
14:35:12.279082 192.168.0.1.32768 > 69.x.x.x.139:
    S 3874542792:3874542792(0) win 5840
    <mss 1460,sackOK,timestamp 4294709239 0,nop,wscale 0> (DF)

# The router (192.168.1.1) returns a "host unreachable" ICMP reply
14:35:12.279155 192.168.1.1 > 192.168.0.1:
    icmp: host 69.x.x.x unreachable

The capture goes on to show a CUPS/IPP connection denied in the same
way at 14:35:13.9. But by 14:35:17.8, DNS requests are routed correctly.
So to summarize, it appears that the router decides there is no route
to the server for a few seconds.
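
To measure how long that window lasts, a saved capture can be reduced to
just the rejections. This is a minimal sketch, assuming tcpdump's usual
one-packet-per-line text output; the function name unreachable_times and
the file name capture.pcap are made up for illustration.

```shell
#!/bin/sh
# Print the timestamp and the unreachable host for every ICMP
# "host unreachable" line in tcpdump text output read from stdin.
# Assumption: one packet per line, e.g.
#   14:35:12.279155 192.168.1.1 > 192.168.0.1: icmp: host 69.x.x.x unreachable
unreachable_times() {
    awk '/icmp: host .* unreachable/ {
        for (i = 1; i < NF; i++)
            if ($i == "host") { print $1, $(i + 1); break }
    }'
}

# Possible usage (capture.pcap is a hypothetical file written with -w):
#   tcpdump -n -r capture.pcap | unreachable_times
```

The gap between the first and last timestamp printed gives a rough lower
bound on how long the router refuses to forward.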

Sorry if this was all a bit long-winded, but the whole situation
seems strange to me, and I'm not sure what details are significant.
Since this whole exchange occurs immediately after NFS shares are
mounted on the client, my first thought was that some component just
needed a bit of extra time to get itself together. But adding a 5 second
sleep between the NFS mount and the SMB mount in the workstation's init
script doesn't change anything, not even the frequency with which the
problem appears (again, about half the time). And, to repeat, the SMB mount
completes flawlessly once the workstation has booted - I've done at
least a hundred trials without a failure. I've also tried setting some
of the recommended Linux socket options for smbmount (TCP_NODELAY,
SO_SNDBUF, SO_RCVBUF), but that had no effect either.

So at this point, I'm stumped. If I can provide any more details about
the setup, just ask. Many thanks for any help you can provide.

--
Todd Pytel

Todd Pytel

2004-08-02 21:18:54 UTC

Two more bits of information I realized are probably relevant:

1) While the router NATs the workstations for Internet access, it does
not do so for local client-server interactions. The server talks to the
workstations via the 192.168.0.x addresses. So it does not appear that
NAT has anything to do with my problems.

2) The captures I posted were done on the interface of the router facing
the workstations. A simultaneous capture on the server-facing interface
shows that no traffic passes through for the denied SMB and IPP
connections. That is, the capture on the server-facing interface shows
the successful NetBIOS name query and then a successful DNS query a few
seconds later, but nothing in between. From this, I infer that the
router is not "reacting" to anything on the server-side of the network.
Whatever is causing the ICMP messages comes from the router itself.

On Mon, 2 Aug 2004 15:47:03 -0500

Post by Todd Pytel
What originally looked like a Samba problem has led back to my router
running 3.4. The crux of the problem is that, shortly after my
workstations mount an NFS share from a server, other IP traffic to
that same server is briefly denied with an ICMP "Host unreachable"
message.
(Details snipped)

--
Todd Pytel

Jason Opperisano

2004-08-03 13:16:41 UTC

Post by Todd Pytel
1) While the router NATs the workstations for Internet access, it does
not do so for local client-server interactions. The server talks to the
workstations via the 192.168.0.x addresses. So it does not appear that
NAT has anything to do with my problems.
2) The captures I posted were done on the interface of the router facing
the workstations. A simultaneous capture on the server-facing interface
shows that no traffic passes through for the denied SMB and IPP
connections. That is, the capture on the server-facing interface shows
the successful NetBIOS name query and then a successful DNS query a few
seconds later, but nothing in between. From this, I infer that the
router is not "reacting" to anything on the server-side of the network.
Whatever is causing the ICMP messages comes from the router itself.
On Mon, 2 Aug 2004 15:47:03 -0500

ICMP Host Unreachables from a router mean that the router did not get a
reply to its ARP request on the segment attached to the server for the
server's MAC address.

Verify this on the router with:

tcpdump -n -nn -p -i <SERVER SEGMENT IF> arp

My guess is that you'll see a bunch of:

arp who-has <SERVER IP> tell <ROUTER IP>

Without seeing:

arp reply <SERVER IP> is-at <SERVER MAC>

As to *why* this is happening--i dunno. Could be related to load on
that network segment--maybe high rate of collisions or packet loss?

You could simultaneously run this on the server:

tcpdump -n -nn -p -i <SERVER SEGMENT IF> arp

And see if the server is even seeing the ARP requests. If not--i'd
point the finger at the layer 2 device connecting the router and
server. If it is seeing the ARP request and just not responding (or if
something else is responding first; i.e. duplicate IPs)...maybe there's
a buggy NIC driver involved...
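
The comparison of requests against replies can be automated on a saved
capture. This is only a sketch: it assumes the "arp who-has X tell Y" and
"arp reply X is-at M" line format shown above (other tcpdump versions
print ARP differently), and the function name unanswered_arp and the file
name arp.pcap are made up for illustration.

```shell
#!/bin/sh
# Read tcpdump ARP text output on stdin and list every requested IP that
# never received a reply, followed by the number of requests seen.
unanswered_arp() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "who-has") asked[$(i + 1)]++
                if ($i == "reply")   answered[$(i + 1)]++
            }
        }
        END {
            for (ip in asked)
                if (!(ip in answered)) print ip, asked[ip]
        }
    ' | sort
}

# Possible usage (arp.pcap is a hypothetical file written with -w):
#   tcpdump -n -r arp.pcap arp | unanswered_arp
```

No output means every who-has in the capture was answered at least once.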

-j

=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~
I base my fashion taste on what doesn't itch. -- Gilda Radner
=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~

Todd Pytel

2004-08-03 15:48:19 UTC

Jason,

On Tue, 03 Aug 2004 09:16:41 -0400

Post by Jason Opperisano
ICMP Host Unreachables from a router mean that the router did not get
a reply to its ARP request on the segment attached to the server for
the server's MAC address.
tcpdump -n -nn -p -i <SERVER SEGMENT IF> arp
arp who-has <SERVER IP> tell <ROUTER IP>
arp reply <SERVER IP> is-at <SERVER MAC>

Reasonable suggestion, but no. Captures on both the server interface and
the server-side router interface show no ARPs occurring. This is, in
some sense, what I'd expect - as my first captures show (not sure you
saw my original post), there is successful IP traffic to the server less
than a second before the router issues the ICMPs. Checking the router's
ARP tables (via "arp -na") immediately before the workstation boots also
verifies that the server has an entry.

So, the ARP tables on the router are correct, no ARPs are occurring on
the wire, and yet the router returns ICMP Host Unreachable. I must be
missing something here...

--
Todd Pytel

Todd Pytel

2004-08-03 18:00:22 UTC

This keeps getting stranger. Someone reminded me off-list to disable PF
on the router, which I hadn't earlier (yes, I realize I should have).
Sure enough, it seems that PF is involved in this - disabling it solves
the problem. But I'll be damned if I can see why. My pf.conf follows -
the only bits I've snipped are the interface and variable definitions at
the top and a bunch of external interface rules at the bottom (again,
none of this traffic is passing through the external interface).

For those just tuning in, the short form of the problem is this - client
workstations (on the $priv segment below) fail an SMB mount at bootup
roughly 50% of the time with a "No route to host" error. Indeed, the
router does send an ICMP Host Unreachable message to the workstation,
even though IP connections to the server's IP (on the $pub segment
below) succeeded just a split-second before and will succeed again just
a few seconds later. The router does not ARP for the server before
sending the ICMP, and the server already had an ARP table entry before
the problematic exchange occurs. Also, for whatever reason, the same SMB
connection always completes successfully if started by hand (mount -a -t
smbfs) after the workstation has finished booting. Disabling PF solves
the problem, but I can't see where the conflict is.

pf.conf:

######################################################################
# Global settings

set block-policy drop

scrub in on $ext all fragment reassemble
scrub in on $pub proto udp fragment reassemble

######################################################################
# NAT settings

# Map the private network to an unused public IP
nat on $ext from $priv:network to any -> $natip

# Rewrite packets from this machine to get a routable address
nat on $ext from ($ext) to any -> $gateway

# Redirect Bittorrent connections to the desktop
rdr on $ext proto tcp from any to $natip port 6881:6889 -> $desktop

######################################################################
# Default policies

# Default block and log incoming traffic
block in log on $ext

# Default block outgoing traffic
block out on $ext

# Default pass on loopback
pass quick on lo0

# Block network and broadcast addresses in either direction on the
# external interface
block quick on $ext from any to $broadcast
block quick on $ext from any to $network

######################################################################
# Internal policies

# We keep state on $ext and $pub, so everything can pass on $priv
pass quick on $priv

# We'll filter outgoing traffic on the external interface, so default
# pass anything to or from the public machines...
pass in on $pub
pass out on $pub

# ...but the public machines cannot initiate connections to the
# private network
block in log on $pub from any to $natnet

# Uncomment the following if we need a nameserver in the lab
pass in on $pub proto tcp from $server to $natnet port = 53 \
    flags S/SAFR keep state
pass in on $pub proto udp from $server to $natnet port = 53 keep state

# We need state table entries to allow private machines to talk to
# public ones - external connections already have this
pass out on $pub from $natnet to any keep state

# Windows file sharing and communication between the server and
# private network
pass in on $pub proto tcp from $server to $natnet port = 139 \
    flags S/SAFR keep state
pass in on $pub proto tcp from $server to $natnet port = 445 \
    flags S/SAFR keep state

# Pass FTP controls to the desktop for Dreamweaver
pass in on $pub proto tcp from $server port 21 to $desktop keep state

######################################################################

Now, the only thing that looks even vaguely suspicious to me is the
scrub on udp over $pub (placed there to solve some Linux NFS problems).
But disabling that line has no effect. Everything else looks harmless.
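
Since turning PF off makes the symptom go away, one way to narrow things
down is to look at what PF is holding for the server at the moment a
mount fails. This is a rough sketch, not a diagnosis. "pfctl -s state"
prints the state table and "pfctl -s info" prints the filter counters.
The states_for helper is a made-up name, and it only matches whole
address:port fields because the exact column layout differs between PF
versions. 69.x.x.x stands for the server IP, as before.

```shell
#!/bin/sh
# Filter state-table text (as printed by `pfctl -s state`) read from
# stdin, keeping only lines with a field exactly equal to HOST:PORT.
# Usage: states_for HOST PORT
states_for() {
    awk -v want="$1:$2" '{
        for (i = 1; i <= NF; i++)
            if ($i == want) { print; break }
    }'
}

# Possible usage on the router, run as root right after a failed mount:
#   pfctl -s state | states_for 69.x.x.x 139
#   pfctl -s info
```

Comparing the output from a failing boot with the output from a
successful manual mount may show whether PF treats the two connection
attempts differently.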

Also, I should have confirmed before - this is the generic 3.4 kernel.

--
Todd Pytel

ste5an

2004-08-04 10:36:27 UTC

hi Todd,

Post by Todd Pytel
For those just tuning in, the short form of the problem is this - client
workstations (on the $priv segment below) fail an SMB mount at bootup
roughly 50% of the time with a "No route to host" error.

I don't get the point either.

Is your routing done via rdr only or also as native routing? What does
your routing table look like?

--> stefan <--