Black Hole Routers Issue on Internet - mtu, mss, pmtud

Jephe Wu -

Objective: to clear all doubts regarding mtu, pmtu, mss etc

A PMTUD problem typically looks like this: everything works perfectly from your firewall/router, but the local hosts behind the firewall can't exchange large packets. Mail servers can send small mails but not large ones, web browsers connect but then hang with no data received, and ssh connects properly but scp hangs after the initial handshake. In other words, anything that uses large packets fails.

Part I - Terms explanation: MTU, MSS, PMTUD, black hole, ICMP
Maximum Transmission Unit (MTU) is the maximum IP packet size that can be transmitted without fragmentation - including IP headers but excluding headers from lower levels in the protocol stack.

The smallest MTU in general use is 576 bytes, so you should be able to safely start with an ICMP buffer of 548, then work up from there. For example, if Ping -f -l 972 returns packets and Ping -f -l 973 fails, the largest MTU that can be used over that route is 1000 (972+28).
ping computer_name or IP_address -f -l 1472 

For normal Ethernet the MTU is 1500 bytes. This is the maximum amount of data available to IP, TCP, and the application; it excludes the bytes for the Ethernet header and trailer.

The Maximum Segment Size (MSS) is the largest amount of data, specified in bytes, that a computer or communications device can handle in a single, unfragmented piece. For TCP, it is the maximum TCP data size the computer can handle without fragmentation.

MSS describes the size of the payload of a layer 4 (TCP) segment that fits inside one IP packet, and therefore inside one layer 2 (Ethernet) frame. Since there are 20 bytes of TCP headers and 20 bytes of IPv4 headers, the MSS should typically be set to 40 less than your MTU. (Note that the IPv6 header takes up an extra 20 bytes.) When negotiating a TCP connection this number is usually sent in the SYN packet to the remote host to essentially say "please don't send me packets bigger than X."
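The MTU-to-MSS arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name mss_for_mtu is made up for this example, and the header sizes assume no IP or TCP options.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed IPv6 header only
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """Return the TCP payload size that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


print(mss_for_mtu(1500))             # plain Ethernet, IPv4
print(mss_for_mtu(1500, ipv6=True))  # plain Ethernet, IPv6
print(mss_for_mtu(1492))             # PPPoE link, IPv4
```

With TCP options in use (for example timestamps) the usable payload per segment is smaller than the value returned here.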

The way MSS now works is that each host will first compare its outgoing interface MTU with its own buffer and choose the lowest value as the MSS to send. The hosts will then compare the MSS size received against their own interface MTU and again choose the lower of the two values.
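The two comparisons can be sketched as follows. This is a simplified model, not an actual TCP stack: the function names and the buffer and MTU values are assumptions chosen for illustration, and the interface MTU is reduced by 40 bytes of IPv4 and TCP headers before it is compared.

```python
HEADERS = 40  # 20 bytes IPv4 + 20 bytes TCP, assuming no options


def mss_to_advertise(buffer_size: int, interface_mtu: int) -> int:
    """MSS a host puts in its SYN: the lower of its buffer and what its interface can carry."""
    return min(buffer_size, interface_mtu - HEADERS)


def mss_to_use(received_mss: int, interface_mtu: int) -> int:
    """Segment size a host actually sends: the lower of the peer's MSS and its own limit."""
    return min(received_mss, interface_mtu - HEADERS)


# Hypothetical hosts: A sits on Ethernet (MTU 1500), B on a link with MTU 4462.
a_advertises = mss_to_advertise(buffer_size=16384, interface_mtu=1500)
b_advertises = mss_to_advertise(buffer_size=8192, interface_mtu=4462)

print(a_advertises, b_advertises)
print(mss_to_use(a_advertises, 4462))  # what B sends to A
print(mss_to_use(b_advertises, 1500))  # what A sends to B
```

Note that neither host learns anything about smaller MTUs in the middle of the path from this exchange; that is what path MTU discovery is for.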

When a network router receives a packet that is larger than the size of the Maximum Transmission Unit (MTU) of the next segment of a communications network, and that packet's IP layer "don't fragment" bit is flagged, the router is expected to send an ICMP "destination unreachable" message back to the sending host.

MSS Clamping:
A workaround for the black hole router issue, used by some routers, is to change the maximum segment size (MSS) of all connections passing through links with an MTU lower than the Ethernet default of 1500. This is known as MSS clamping.

Path MTU:
The path MTU is the largest packet size that can traverse a path without fragmentation, that is, the minimum MTU of all hops in the path.

Path MTU Discovery:
Path MTU discovery is a simple protocol that aims to automatically find the optimal MTU for a TCP connection path. This helps to achieve optimum performance and network utilization.

Path MTU discovery overcomes a significant performance bottleneck in TCP - fragmentation. Originating and terminating ends of a TCP connection and the intermediate routers along the path between the two ends have the ability to break up TCP packets into fragments, if the data is too large to fit into the MTU.

The older practice was to use the lesser of 576 and the first-hop MTU as the PMTU for any destination that is not connected to the same network or subnet as the source. In many cases, this resulted in the use of smaller datagrams than necessary, because many paths have a PMTU greater than 576. A host sending datagrams much smaller than the Path MTU allows is wasting Internet resources and probably getting suboptimal throughput due to not fully utilizing the maximum data payload possible.

Path MTU Discovery is how hosts are supposed to find out how much information they can send in a packet from one host to another without having it be fragmented along the way. The way it works is, your machine (we'll call it Host A) sends a request to a webserver (we'll call it Host B). Host B then attempts to respond with what it feels is an appropriately sized packet (usually by looking at the MSS Host A sent and the known MTU of its first hop) with the DF (Don't Fragment) bit set. If it's too large for a router somewhere in between (or even for Host A), the router or host that can't handle it will send back an ICMP message type 3 code 4 (destination unreachable, fragmentation needed and DF set) telling Host B that it was too big (newer routers will also include the MTU of the next hop). Now Host B must resend with smaller packets. Newer systems will just use the MTU provided in the ICMP message (if it was provided), but hosts that don't support this information will continue to send smaller packets until the packets reach their destination... at least for a while. If Host B does not support using the MTU in the ICMP error message it may periodically attempt to discover MTU increases by sending larger packets.
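The exchange described above can be modelled with a toy simulation. This is a sketch under simplifying assumptions, not real networking code: the path is just a list of hop MTUs, every router reports its MTU in the ICMP message, and the function names are made up for this example.

```python
from typing import List, Optional


def first_too_small_hop(packet_size: int, hop_mtus: List[int]) -> Optional[int]:
    """Return the MTU of the first hop that cannot forward the packet, or None."""
    for mtu in hop_mtus:
        if packet_size > mtu:
            return mtu  # this hop would send ICMP type 3 code 4 with its MTU
    return None


def discover_path_mtu(first_hop_mtu: int, hop_mtus: List[int]) -> List[int]:
    """Return the packet sizes the sender tries; the last one gets through."""
    size = first_hop_mtu
    attempts = [size]
    while True:
        reported = first_too_small_hop(size, hop_mtus)
        if reported is None:
            return attempts
        size = reported  # shrink to the MTU reported in the ICMP message
        attempts.append(size)


# Hypothetical path: Ethernet, a PPPoE link, a tunnel, Ethernet again.
print(discover_path_mtu(1500, [1500, 1492, 1400, 1500]))
```

The last size tried equals the smallest MTU on the path, which is the path MTU.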

Having path MTU discovery enabled in Red Hat Enterprise Linux results in the 'Don't Fragment' flag in the IP header being set for all outgoing TCP packets. When a router along the path of a connection cannot handle the size of the packet, it should drop it and send a special ICMP 'Fragmentation Needed' packet back to the sender. The sender then rebuilds the packet with a smaller payload and resends the smaller packet. This process repeats until the routers successfully pass the packet on to the destination.

All TCP/IP packets from the host will have the DF bit set. A host usually "remembers" the MTU value for a destination by creating a "host" (/32) entry in its routing table with this MTU value.

Problem with PMTUD

A router can generate and send an ICMP message, but the ICMP message gets blocked by a router or firewall between this router and the sender. (Common)

The following explanation is from the Cisco website:

So, fragmentation is bad, but PMTUD solves all your problems, right? Unfortunately, no.
As previously described, PMTUD relies on a number of factors. It is worth just briefly reiterating these:
  • The DF bit must be set in packets sent by a host or network device.
  • If a device in the path between a packet source and destination determines that a packet needs to be fragmented but that the DF bit is set, it must send an ICMP unreachable message (type 3, code 4) back to the packet source.
  • Devices such as firewalls must not block ICMP unreachable messages (type 3, code 4).
  • The host or device that sends packets must dynamically reduce the size of packets that it sends in response to ICMP unreachable (type 3, code 4) messages that it receives.
If just one of the factors listed does not operate as described, PMTUD will not work correctly, and large IPsec packets will be dropped.
The most common cause of PMTUD breaking is a misconfigured firewall dropping ICMP unreachable (type 3, code 4) messages. Figure 7-77 illustrates a scenario in which a misconfigured firewall causes PMTUD to break.

Figure 7-77. Misconfigured Firewall Causes PMTUD to Break

In Figure 7-77, Host A sends a 1442-byte packet (with the DF bit set) to the Paris IPsec VPN gateway. The Paris gateway encapsulates the packet in IPsec and forwards the packet to firewall, which then forwards the packet on to the ISP1 router. The ISP1 drops the IPsec packet because it is too large (larger than its outgoing interface MTU) and the DF bit is set. At this point, the ISP1 router sends an ICMP unreachable message to the Paris gateway, but this ICMP unreachable message is blocked by the firewall.
Because the ICMP message from the ISP1 router is blocked by the firewall, the Paris gateway is unaware that it should reduce the size of packets that it sends. And, because the Paris gateway is unaware that it should reduce the size of packets it sends, it does not in turn inform Host A (via an ICMP unreachable message) that Host A should reduce the size of packets that Host A sends. Host A, therefore, continues sending packets that are too large, the Paris gateway continues to encapsulate them in IPsec, and the ISP1 router continues to drop them. Not good!

PMTU Black Hole:
A PMTU black hole is where the ICMP message doesn't reach the sending host to inform it that it needs to adjust its MTU. This can be down to the router not sending the ICMP message (due to misconfiguration or software bugs) or the ICMP message being blocked on the way back to the sender.

Black Hole Routers:
A number of vendors sell routers and other intermediate devices that are not compliant. Instead of returning ICMP destination unreachable messages to the originating host, they may silently discard IP datagrams that are too large to be passed on to the next media in a path. These devices are referred to as "Black Hole Routers."
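A small sketch shows why a black hole produces the "small packets work, large packets hang" symptom. It is a toy model with made-up names and values: the path is a list of hop MTUs, and a single flag stands in for "the ICMP message is never sent or is filtered on the way back".

```python
from typing import List


def send_df_packet(size: int, hop_mtus: List[int], icmp_blocked: bool) -> str:
    """Describe what the sender observes for one packet sent with DF set."""
    for mtu in hop_mtus:
        if size > mtu:
            if icmp_blocked:
                return "lost silently"  # sender keeps retransmitting the same size
            return f"icmp frag-needed mtu={mtu}"
    return "delivered"


path = [1500, 1400, 1500]  # hypothetical path with one narrow link

print(send_df_packet(576, path, icmp_blocked=True))    # small packets still get through
print(send_df_packet(1500, path, icmp_blocked=False))  # healthy PMTUD
print(send_df_packet(1500, path, icmp_blocked=True))   # black hole
```

In the black hole case the sender gets no feedback at all, so the connection stalls as soon as the first full-sized segment is sent.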

Part II - solutions for fixing pmtu black hole or black hole router issue
There are some possible solutions:

   1.  If the problem happens somewhere on the Internet, where you have no control over it:

The first thing to do is to enable PMTU Black Hole Detection.
For Windows:
   Enable PMTU Black Hole Detection on the Windows-based hosts that will be communicating over a WAN connection, as documented in Microsoft Knowledge Base article 136970. In this case, Windows NT 3.51 Service Pack 2 or later or Windows NT 4.0 should be used. Follow these steps:

   1. Start Registry Editor (Regedit.exe).
   2. Locate the following key in the registry:
   3. On the Edit menu, click Add Value, and then add the following registry value:
      Value Name: EnablePMTUBHDetect
      Data Type: REG_DWORD
      Value: 1
   4. Quit Registry Editor, and then restart the computer.
   Microsoft Product Support engineers have encountered a number of routers and other intermediate devices that silently drop large frames, even when the Don't Fragment bit is not set. Because the existing Windows NT 3.5 and 3.51 TCP/IP PMTU Black Hole Detection algorithm does not detect and adapt for these devices, customers who encountered problems had no choice but to disable PMTU detection. Therefore, Microsoft has made the following change:
   When PMTUBHDetect is enabled, after a TCP segment is retransmitted 1/2 of TCPMaxDataRetransmissions (another registry parameter, default=5) times without being acknowledged, the Don't Fragment bit will be cleared on the remainder of the retransmission attempts. If the segment is acknowledged as a result, the MSS will be decreased, and the Don't Fragment bit will be set in future IP datagrams sent on that connection.
   This change should result in more reliable transfer of large files over wide-area networks with a mixture of intermediate devices, such as the Internet. 
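The retransmission rule described above can be sketched as a schedule of DF settings. This is only an illustration of the description, not Microsoft's implementation; in particular, rounding "1/2 of TCPMaxDataRetransmissions" down to a whole number is an assumption, and the function name is made up.

```python
from typing import List


def df_bit_schedule(max_data_retransmissions: int = 5) -> List[bool]:
    """Return the DF bit (True = set) for retransmission attempts 1..max."""
    # Assumption: "half" is rounded down, so with the default of 5 the
    # first 2 retransmissions keep DF set and the remaining 3 clear it.
    threshold = max_data_retransmissions // 2
    return [attempt <= threshold
            for attempt in range(1, max_data_retransmissions + 1)]


print(df_bit_schedule())
```

Once a retransmission without DF is acknowledged, the MSS is lowered and DF is set again for later datagrams on that connection, as described above.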

For Linux:

    tcp_mtu_probing - INTEGER
    Controls TCP Packetization-Layer Path MTU Discovery.  Takes three values:
          0 - Disabled
          1 - Disabled by default, enabled when an ICMP black hole detected
          2 - Always enabled, use initial MSS of tcp_base_mss.

   2.  You can also disable PMTU discovery on the hosts that communicate over troublesome routes. On Windows this sets the MTU to 576 bytes for all non-local destinations. This could cause significant degradation in network performance.
For Windows:
   PMTU discovery is enabled by default, but can be controlled by adding the following value to the registry:

\EnablePMTUDiscovery (REG_DWORD, 0=disabled, 1=enabled)

When PMTU discovery is disabled, an MTU of 576 bytes is used for all non-local destination IP addresses. (The TCP MSS=536).

For Linux:

Disable PMTUD:

Path MTU Discovery is enabled or disabled by setting the content of the file /proc/sys/net/ipv4/ip_no_pmtu_disc to '0' (default) or '1' respectively. To disable PMTUD, use the command:

    # echo  1  >/proc/sys/net/ipv4/ip_no_pmtu_disc


For Solaris 10 (and Earlier Versions)

Disable PMTUD:
$ ndd -set /dev/ip ip_path_mtu_discovery 0
Set Maximum MSS to 1460:
$ ndd -set /dev/tcp tcp_mss_max 1460
   3.  Make sure your own network has no problem: configure intermediate routers to send ICMP type 3 code 4 (destination unreachable, fragmentation needed and don't fragment (DF) bit set) messages, and configure firewalls to let them through. This may require upgrading router software or firmware, router configuration or router replacement.

For Cisco Router:
access-list 101 permit icmp any any unreachable
access-list 101 permit icmp any any time-exceeded
access-list 101 deny icmp any any
access-list 101 permit ip any any

For Linux:

Ipchains - Allow ICMP type 3, code 4, then block all other type 3
ipchains -A output -s <$external_FW_interface_IP> 3 -d 4 -p ICMP -j ACCEPT
ipchains -A output -s <$internal_network_CIDR> 3 -d 4 -p ICMP -j ACCEPT
ipchains -A output -s <$external_FW_interface_IP> 3 -p ICMP -j DENY
ipchains -A output -s <$internal_network_CIDR> 3 -p ICMP -j DENY
Iptables - Allow only ICMP type 3, code 4 to be passed through in the FORWARD chain, and drop all other ICMP. Remember, the FORWARD chain has no effect on packets destined for this firewall, only on packets traveling through it. For packets destined for this firewall, add the same rules to the INPUT chain.
    iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j ACCEPT
    iptables -A FORWARD -p icmp -j DROP
    iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
    iptables -A INPUT -p icmp -j DROP 


The iptables state module can allow ICMP error messages for an existing connection by using the RELATED keyword after the --state option:
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

If you can't or don't want to use the state module, you can allow the required ICMP packets manually.

For iptables to pass "destination unreachable, fragmentation needed but don't fragment bit set" ICMP messages, execute the following command for each active chain, replacing CHAIN-name with the name of your chain.
iptables -I CHAIN-name -p ICMP --icmp-type 3/4 -j ACCEPT

For Linux iptables, consider allowing the following ICMP types and codes:

#Rules for icmp
#  Firstly echo reply/request 
$IPTABLES -A icmpin -p ICMP --icmp-type 0 -j ACCEPT
$IPTABLES -A icmpin -p ICMP --icmp-type 8 -j ACCEPT
#  Secondly time exceeded (for traceroute)
$IPTABLES -A icmpin -p ICMP --icmp-type 11 -j ACCEPT
#  Thirdly want to get dest unreachable because 3 error codes important:
#    1 is "Host unreachable"
$IPTABLES -A icmpin -p ICMP --icmp-type 3/1 -j ACCEPT
#    3 is "Port unreachable"
$IPTABLES -A icmpin -p ICMP --icmp-type 3/3 -j ACCEPT
#    4 is "Fragmentation Required but DF Bit Is Set", for PMTUD.
$IPTABLES -A icmpin -p ICMP --icmp-type 3/4 -j ACCEPT

For BSD:

Packet Filter(PF) (fxp0 is internal NIC, fxp1 is facing Internet)

pass in log quick on fxp1 inet proto icmp all icmp-type unreach code 4         
pass in log quick on fxp0 inet proto icmp all icmp-type unreach code 4        
pass out log quick on fxp1 inet proto icmp all icmp-type unreach code 4       
pass out log quick on fxp0 inet proto icmp all icmp-type unreach code 4      

block in log quick on fxp1 proto icmp from any to any
block in log quick on fxp0 proto icmp from any to any
block out log quick on fxp1 proto icmp from any to any
block out log quick on fxp0 proto icmp from any to any

IP Filter

IP Filter will automatically accept ICMP error messages belonging to an existing connection if the keep state option is used:
pass in quick proto tcp from any to any port = 80 flags S keep state
This rule will allow people to access a webserver on or behind the firewall and will allow all traffic related to that TCP session (including related ICMPs) in and out of your network.

If you can't or don't want to use IP Filter's state machine, you can allow the required ICMP packets manually.
For IP Filter to pass destination unreachable, fragmentation needed but don't fragment bit set ICMP messages, put the following lines high enough in your ipf.conf file (before any lines that might block ICMP).
pass in quick proto icmp from any to any icmp-type 3 code 4
pass out quick proto icmp from any to any icmp-type 3 code 4 


   4.  If you already know the path MTU, set the MTU of the host interface to the largest value the black hole router can handle. This guarantees the largest possible packet size will be sent over that connection, but will cause local traffic, and traffic over routed connections without problems, to use smaller packets than they would otherwise. This workaround assumes that you have determined the MTU and the state of all possible links that could be used by the host in question.

For Windows:
Set Interface MTUs to 1500:

These parameters for TCP/IP are specific to individual network adapter cards. These appear under this Registry path, where "adapter ID" refers to the Services subkey for the specific adapter card:

    Interfaces\[Adapter ID]
    MTU: Set it to equal the required MTU size in decimal (default 1500)
    Data Type:  DWORD

This parameter overrides the default MTU for a network interface. The MTU is the maximum packet size in bytes that the transport transmits over the underlying network. The size includes the transport header. Note that an IP datagram can span multiple packets. Values larger than the default for the underlying network result in the transport using the network default MTU. Values smaller than 68 result in the transport using an MTU of 68.
Note: for the complete list of TCP/IP parameters, refer to "TCP/IP and NBT configuration parameters for Windows XP" (Microsoft Knowledge Base article 314053).

For Linux:

The MTU value of the interface can be modified by editing the ifcfg-<interface-name> file and changing the 'MTU' parameter, where <interface-name> refers to the name of the device that the configuration file controls. For example, to modify the configuration for the first Ethernet interface, edit the file named 'ifcfg-eth0'. This file controls the first network interface card (NIC) in the system.

Source: RedHat Linux manual -

  5. If you already know the path mtu - you can also change the maximum segment size (MSS) of all connections passing through links with MTU lower than the Ethernet default of 1500. This is known as MSS clamping.  This can be done on router or firewall.

The MSS option may be used only at the time a connection is established, to indicate the maximum size TCP segment that can be accepted on that connection. This Maximum Segment Size announcement is sent from the data receiver to the data sender and says "I can accept TCP segments up to size X". The size (X) may be larger or smaller than the default. The process of rewriting the maximum segment size carried in the MSS option is known as MSS clamping. With the MSS option being part of TCP, no ICMP traffic is needed to adjust the segment size between peers. The MSS can be used completely independently in each direction of data flow; as a result there can be different maximum sizes in the two directions.

MSS clamping requires the firewall to modify TCP packets. If you know something in your own network requires a smaller MTU (<1500), you can manually enable MSS clamping on the firewall; this does not require ICMP to work.

Sometimes you know the path MTU is less than 1500 between your company network and the destination host, and somehow path MTU discovery doesn't work. You may not want everyone on your network to change something on their PC to make this connection work, or you may not know which device along the path has the black hole MTU issue; the path might be within your company, or on the Internet where you have no control. In that case you can use MSS clamping on one of the firewalls along the path, which modifies the TCP header during the 3-way handshake.

# iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu

    The first rule calculates the proper size of the MSS based on the path MTU and lowers the MSS in every forwarded SYN packet accordingly. The second rule does the same, but only for packets that are traveling through the FORWARD chain and have an original MSS within the 1400 to 1536 range.
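What the clamp does to the MSS value in a SYN packet can be sketched as plain arithmetic. This models only the IPv4 case (path MTU minus 40) and the 1400:1536 match from the second rule; the function names are made up for this example and nothing here touches a real packet.

```python
def clamp_mss_to_pmtu(original_mss: int, path_mtu: int) -> int:
    """Lower the advertised MSS to path MTU - 40 (IPv4); never raise it."""
    return min(original_mss, path_mtu - 40)


def matches_mss_range(original_mss: int, low: int = 1400, high: int = 1536) -> bool:
    """Mimic the '-m tcpmss --mss 1400:1536' match (inclusive range)."""
    return low <= original_mss <= high


# A host on Ethernet advertises 1460; the firewall's uplink is PPPoE (MTU 1492).
print(clamp_mss_to_pmtu(1460, 1492))
# A SYN that already advertises a small MSS is left alone.
print(clamp_mss_to_pmtu(536, 1492))
print(matches_mss_range(1460), matches_mss_range(536))
```

Because only the SYN is rewritten, both ends simply agree on smaller segments and never need the ICMP message that the black hole swallows.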

If you are feeling brave, or think that you know best, you can also do something like this:

# iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 800

You can also use INPUT and OUTPUT rules to set mss to a fixed value.

For Cisco routers:
To adjust the maximum segment size (MSS) value of TCP SYN packets going through a router, use the ip tcp adjust-mss command in interface configuration mode:

1. enable
2. configure terminal
3. interface type number
4. ip tcp adjust-mss max-segment-size
5. ip mtu bytes
6. end

Part III - how to find the maximum MTU
You can do the following to find it manually.

Windows 2000/XP users:

Go to Start/ Programs/ Accessories/ Command Prompt and type the following:

ping -f -l 1472
(That is a dash lower case "L," not a dash "1." Also note the spaces in between the sections.)

Linux users:

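ping -M do -s 1472
(-M do sets the Don't Fragment bit; without it the packet may simply be fragmented and the test tells you nothing.)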

OS X users:

ping -D -s 1472

Linux and OS X commands are case sensitive.

Press Enter. Then reduce 1472 by 10 until you no longer get the "packet needs to be fragmented" error message. Then increase by 1 at a time until the error message returns; the last value that worked is your largest ping payload.

Add 28 more to this (since you specified ping packet size, not including IP/ICMP header of 28 bytes), and this is your MaxMTU.

Note:If you can ping through with the number at 1472, you are done! Stop right there. Add 28 and your MaxMTU is 1500.
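The manual search above steps down by 10 and then up by 1. The same answer can be found with fewer probes using a binary search, which is what the sketch below does; it is a different technique from the manual procedure, not a transcription of it. The probe here is simulated: simulated_probe and its path_mtu value are assumptions standing in for a real "ping with DF set" that either succeeds or fails.

```python
from typing import Callable

IP_ICMP_HEADERS = 28  # 20 bytes IP + 8 bytes ICMP


def largest_payload(probe: Callable[[int], bool],
                    low: int = 548, high: int = 1472) -> int:
    """Binary-search the largest ping payload for which probe() succeeds.

    Assumes probe(low) succeeds and that success is monotonic in size.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def simulated_probe(payload: int, path_mtu: int = 1492) -> bool:
    """Stand-in for a real DF ping: succeeds if the packet fits the path MTU."""
    return payload + IP_ICMP_HEADERS <= path_mtu


payload = largest_payload(simulated_probe)
print(payload, payload + IP_ICMP_HEADERS)  # largest payload and resulting MaxMTU
```

To use it against a real network you would replace simulated_probe with a function that runs the ping command for your platform and reports whether a reply came back.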

For PPPoE, your MaxMTU should be no more than 1492 to allow space for the 8 byte PPPoE "wrapper," but again, experiment to find the optimal value. For PPPoE, the stakes are high: if you get your MTU wrong, you may not just be sub-optimal, things like UPLOADING or web pages may stall or not work at all!

(TCP, IP, MTU and MSS magic numbers)
1500 - The biggest-sized IP packet that can normally traverse the Internet without getting fragmented. Typical MTU for non-PPPoE, non-VPN connections.
1492 - The maximum MTU recommended for Internet PPPoE implementations.
1472 - The maximum ping data payload before fragmentation errors are received on non-PPPoE, non-VPN connections.
1460 - TCP data size (MSS) when MTU is 1500 and not using PPPoE. When TCP timestamps are used, the effective MSS is 1448 (1460-12=1448), since the TCP timestamp option uses 12 bytes.
1464 - The maximum ping data payload before fragmentation errors are received when using a PPPoE-connected machine.
1452 - TCP data size (MSS) when MTU is 1492 and using PPPoE.
576 - Typically recommended as the MTU for dial-up type applications, leaving 536 bytes of TCP data.
48 - The sum of IP, TCP and PPPoE headers.
40 - The sum of IP and TCP headers.
28 - The sum of IP and ICMP headers.

Which tools can help you find the maximum MTU?
mturoute


   1. How to Troubleshoot Black Hole Router Issues
   2. Adjusting IP MTU, TCP MSS, and PMTUD on Windows and Sun Systems (change MTU size and enable/disable PMTUD)
   3. MTU/MSS Hints
   4. Path MTU Discovery Problems
   5. How To Set Up An IP Filter Without Breaking PMTUD
   6. MSS talk PDF powerpoint (demo a typical firewall blocking icmp type 3 code 4 diagram)
   7. Over-Zealous Security Administrators Are Breaking the Internet