Path MTU Discovery and Filtering ICMP
Marc Slemko <marcs@znep.com>
Created: Thursday, January 18 1998
Last Modified:
This document explains the details of how path MTU discovery (PMTU-D)
combined with filtering ICMP messages can result in connectivity problems. If
you are familiar with the terms discussed
Let's start by defining what we are talking about
- MTU
- The maximum transmission unit is a link layer restriction on the maximum
number of bytes of data in a single transmission (i.e. a frame, cell, or packet,
depending on the terminology). The table below shows some typical values for
MTUs, taken from RFC-1191:
MTU | Where Commonly Used
65535 | Hyperchannel
17914 | 16 Mbit/sec token ring
8166 | Token Bus (IEEE 802.4)
4464 | 4 Mbit/sec token ring (IEEE 802.5)
1500 | Ethernet
1500 | PPP (typical; can vary widely)
576 | X.25 Networks
- Path MTU
- The smallest MTU of any link on the current path between two hosts. This
may change over time since the route between two hosts, especially on the
Internet, may change over time. It is not necessarily symmetric and can even
vary for different types of traffic from the same host.
- Fragmentation
- When a packet is too large to be sent across a link as a single unit, a
router can fragment the packet. This means that it splits it into multiple
parts which contain enough information for the receiver to glue them together
again. Note that this is not done on a hop-by-hop basis, but once fragmented a
packet will not be put back together until it reaches its destination.
Fragmentation is undesirable for numerous reasons, including:
- If any one fragment from a packet is dropped, the entire packet needs to
be retransmitted. This is a very significant problem.
- It imposes extra processing load on the routers that have to split the
packets.
- In some configurations, simpler firewalls will block all fragments
because they don't contain the header information for a higher layer
protocol (e.g. TCP) needed for filtering.
- DF (Don't Fragment) bit
- This is a bit in the IP header that can be set to indicate that the packet
should not be fragmented by routers. A router that would otherwise have to
fragment the packet instead drops it and sends an ICMP "can't fragment" error
back to the sender.
- ICMP Can't Fragment Error
- This error (type 3 (destination unreachable), code 4 (fragmentation needed
but don't-fragment bit set)) is returned by a router when it receives a packet
that is too large for it to forward and the DF bit is set. The packet is
dropped and the ICMP error is sent back to the origin host. Normally, this
tells the origin host that it needs to reduce the size of its packets if it
wants to get through. Recent systems also include the MTU of the next hop in
the ICMP message so the source knows how big its packets can be. Note that
this error is only sent if the DF bit is set; otherwise, packets are just
fragmented and passed through.
- MSS
- The MSS is the maximum segment size. It can be announced during the
establishment of a TCP connection to tell the other end the largest amount of
data that the remote system should send in one packet. Normally the packet
generated will be 40 bytes larger than this: 20 bytes for the IP header and 20
for the TCP header. Most systems announce an MSS that is determined from the
MTU of the interface through which traffic to the remote system leaves.
- Path MTU Discovery (PMTU-D)
- Now you know that Path MTUs vary. You know that fragmentation is bad. The
solution? Well, one solution is Path MTU Discovery. The idea behind it is to
send packets that are as large as possible while still avoiding fragmentation.
A host starts by sending packets whose maximum size is the lesser of the local
MTU or the packet size implied by the MSS announced by the remote system. These
packets are sent with the DF bit set. If there is some MTU between the two
hosts which is too small to pass the packet successfully, then an ICMP can't
fragment error will be sent back to the source. It will then know to lower the
size; if the ICMP message includes the next hop MTU, it can pick the correct
size for that link immediately, otherwise it has to guess.
The exact process that systems go through is somewhat more complicated to
account for special circumstances. For full details, see RFC-1191.
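The basic loop described above can be sketched in a few lines of Python. This is only a toy model, not the full procedure from RFC-1191: the path is represented as a plain list of link MTUs, every router is assumed to report the next hop MTU in its ICMP error, and the names `send_df_packet` and `discover_path_mtu` are made up for illustration.

```python
from typing import List, Optional

IP_TCP_HEADERS = 40  # 20-byte IP header + 20-byte TCP header, no options


def send_df_packet(size: int, link_mtus: List[int]) -> Optional[int]:
    """Model one packet sent with the DF bit set.

    Returns None if the packet fits through every link, otherwise the MTU
    of the first link that is too small (standing in for the next hop MTU
    carried in the ICMP "can't fragment" error).
    """
    for mtu in link_mtus:
        if size > mtu:
            return mtu
    return None


def discover_path_mtu(local_mtu: int, remote_mss: int, link_mtus: List[int]) -> int:
    """Lower the packet size until a DF packet crosses every link."""
    # Start with the lesser of the local MTU and the packet size implied
    # by the MSS the remote system announced.
    size = min(local_mtu, remote_mss + IP_TCP_HEADERS)
    while True:
        next_hop_mtu = send_df_packet(size, link_mtus)
        if next_hop_mtu is None:
            return size
        # Assumption: the error always includes the next hop MTU, so the
        # sender never has to guess a smaller size.
        size = next_hop_mtu


# Hypothetical path: Ethernet, Ethernet, an X.25 link, Ethernet.
path = [1500, 1500, 576, 1500]
print(discover_path_mtu(1500, 1460, path))  # prints 576
```

A sender that receives an error without a next hop MTU would have to replace the `size = next_hop_mtu` step with a guess, which is one of the special circumstances the RFC deals with.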
A good way to tell whether a system is trying to do PMTU-D is to watch the
packets it is sending with something like tcpdump or snoop and see if they
have the DF bit set; if so, it is most likely trying to do PMTU-D.
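If you have raw packet bytes rather than decoded tool output, the same check can be done by hand. The sketch below assumes you already have the IPv4 header as a Python `bytes` object; the function name `has_df_bit` and the sample header are invented for illustration.

```python
import struct

DF_FLAG = 0x4000  # "don't fragment" bit in the 16-bit flags/fragment offset field


def has_df_bit(ip_header: bytes) -> bool:
    """Return True if the IPv4 header has the DF bit set."""
    if len(ip_header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    # Bytes 6-7 of the header hold 3 flag bits followed by the
    # 13-bit fragment offset, in network byte order.
    (flags_and_offset,) = struct.unpack_from("!H", ip_header, 6)
    return bool(flags_and_offset & DF_FLAG)


# Made-up 20-byte header: version 4, header length 5 words, total length 40,
# DF set, TTL 64, protocol 6 (TCP), zeroed checksum and addresses.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, DF_FLAG, 64, 6, 0,
                     bytes(4), bytes(4))
print(has_df_bit(sample))  # prints True
```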
Now, to the problem with ICMP filtering and PMTU-D
Many network administrators have decided to filter ICMP at a router or
firewall. There are valid (and many invalid) reasons for doing this; however, it
can cause problems. ICMP is an integral part of the Internet and cannot be
filtered without due consideration for the effects.
In this case, if the ICMP can't fragment errors cannot get back to the
source host due to a filter, the host will never know that the packets it is
sending are too large. This means it will keep trying to send the same large
packet, and it will keep being dropped--silently dropped from the view of any
system on the other side of the filter. While a small handful of systems that
implement PMTU-D also implement a way to detect such situations, most don't, and
even for those that do, it has a negative impact on performance and the network.
If this is happening, the typical symptom is that small transfers (e.g. a
request for a very small web page) get through, but larger ones (e.g. a large
web page) simply hang. This situation can be confusing to the novice
administrator because they obviously have some connectivity to the host, but it
just stops working for no obvious reason on certain transfers.
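The symptom can be reproduced with a small toy model. The sketch below is not how a real TCP stack behaves in detail: the function `transfer` and the `max_tries` cutoff are invented for illustration, and the path is again just a list of link MTUs.

```python
from typing import List


def transfer(size: int, link_mtus: List[int], icmp_filtered: bool,
             max_tries: int = 5) -> str:
    """Try to get one DF packet of `size` bytes across the path.

    Returns "delivered" or "hung". `max_tries` is an arbitrary cutoff that
    stands in for the sender eventually giving up.
    """
    for _ in range(max_tries):
        # First link too small for the packet, or None if it fits everywhere.
        bottleneck = next((mtu for mtu in link_mtus if size > mtu), None)
        if bottleneck is None:
            return "delivered"
        if icmp_filtered:
            # The router drops the packet and sends an ICMP error, but the
            # filter discards it, so the sender retries at the same size.
            continue
        # The error arrives, so the sender lowers its packet size.
        size = bottleneck
    return "hung"


path = [1500, 576, 1500]
print(transfer(300, path, icmp_filtered=True))    # prints delivered
print(transfer(1500, path, icmp_filtered=True))   # prints hung
print(transfer(1500, path, icmp_filtered=False))  # prints delivered
```

The small transfer never triggers an ICMP error, so the filter makes no difference to it; only the large one depends on the error getting back.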
There is one solution, and several workarounds, for this problem. They
include:
- Fix your filters! The real problem here is filtering ICMP messages without
understanding the consequences. Many packet filters will allow you to set up
filters that only allow certain types of ICMP messages through. If you
reconfigure them to let ICMP can't fragment (type 3, code 4) messages through,
the problem should disappear. If the filter is somewhere between you and the
other end, contact the administrator of that machine and try to convince them
to fix the problem.
- Reduce the MTU on the machines at one end or the other. This is a
workaround and should not be done unless necessary. If you reduce the MTU on
the system trying to do path MTU discovery to a point where it is less than or
equal to the former path MTU, it will no longer try sending packets large
enough to cause problems. Similarly, if you change the MTU on the system on
the other end, it will advertise a lower MSS so the sending system will only
send packets with data that fits into that MSS.
- Disable PMTU-D. If you control access to the machine that is trying to do
PMTU-D, and are unable to get the person administering the bogus filter to fix
it, disabling PMTU-D will fix the problem for data sent by that machine. Data
being received by the machine, however, can still run into the problem. As
HTTP requests grow in size, this could become a problem more and more often;
historically, HTTP requests have nearly always been small enough to fit
through links with small MTUs in one packet. Disabling PMTU-D is simply a
workaround, and should not generally be done unless necessary or you know what
you are doing.
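For the first option in the list above, the rule a filter needs is small. The sketch below expresses it as a Python predicate rather than in any particular packet filter's syntax; the names `ALLOWED_ICMP` and `icmp_allowed` are hypothetical, and a real filter would probably let other ICMP types through as well.

```python
from typing import Optional, Set, Tuple

# (type, code) pairs the filter lets through; a code of None means "any code".
ALLOWED_ICMP: Set[Tuple[int, Optional[int]]] = {
    (3, 4),  # destination unreachable / fragmentation needed but DF bit set
}


def icmp_allowed(icmp_type: int, icmp_code: int) -> bool:
    """Return True if an ICMP message with this type and code may pass."""
    return ((icmp_type, icmp_code) in ALLOWED_ICMP
            or (icmp_type, None) in ALLOWED_ICMP)


print(icmp_allowed(3, 4))  # prints True
print(icmp_allowed(8, 0))  # prints False
```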
So how can using RFC 1918 addresses for router links cause problems?
On many routers, a separate IP address in the same subnet is required for
each end of a point to point link. This can use up address space if there are a
large number of such links. Since the actual addresses of the links do not
appear to matter much, many people use RFC 1918 private address space
for such links. The blocks included in this are:
10.0.0.0 - 10.255.255.255 (10/8 prefix)
172.16.0.0 - 172.31.255.255 (172.16/12 prefix)
192.168.0.0 - 192.168.255.255 (192.168/16 prefix)
If you are using such addresses, then ICMP messages (including "can't
fragment" errors) will normally be generated with those addresses as their
source. Since many networks filter incoming traffic from such reserved
addresses, the net result is the same as if all ICMP were being filtered, and
it can cause the same problems.
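To check whether the source address of an ICMP error falls into one of these blocks, something like the following sketch would do. It uses Python's standard `ipaddress` module; the function name `is_rfc1918` and the sample addresses are made up for illustration.

```python
import ipaddress

# The three private blocks listed above.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address lies in one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)


print(is_rfc1918("10.1.2.3"))    # prints True
print(is_rfc1918("172.32.0.1"))  # prints False
```

An ICMP error whose source address passes this test is likely to be dropped by any network that filters incoming traffic from reserved addresses.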