Linux SysAdmin and Oracle/MySQL DBA: Understanding TCP/IP headers and options

Jephe Wu - http://linuxtechres.blogspot.com

1. TCP and IP headers

20 bytes for the TCP header and 20 bytes for the IP header, this is standard header , total 40 bytes.

ICMP has 8 bytes header.

IP header:
a. An IP packet is composed of the IP header (with or without options) and the data.
b. the length of the IP header must be a multiple of 4 bytes. The minimum length of IP headers is 20 bytes which means IP header length part hex digit is 5. (indicates ip header length is 5 4-byte chunks)

TCP header:
a. the TCP header may also contain options. And like the IP header, the TCP header's length must be a multiple of 4 bytes
b. the TCP options change the length of the TCP header, the length is set in the header,

e.g. 7 - means that the TCP header is made of seven 4-byte chunks, or a total of 28 bytes: 20 bytes of standard TCP header and 8 bytes of TCP options.

c. the maximum value that can be set is hex F, which is decimal 15. That means that fifteen 4-byte chunks, or 60 bytes, is the maximum length for the entire TCP header, including TCP options.

TCP encapsulation:
1. the data package at the Application Layer is called a message, while the same data package at the Internet Layer is called a datagram.
2. Transport Layer may have one of two names - a segment or a datagram. If the TCP protocol is being used, it is called a segment. If the UDP protocol is being used, it is called a Datagram.
3. Onto the Network Access Layer, where a frame is created

IP fragmentation and assembly
An application message is typically fragmented into a consecutive sequence of TCP segments and all except the last segment is of size MSS.
http://www.cisco.com/en/US/tech/tk827/tk369/technologies_white_paper09186a00800d6979.shtml

2. Maximum Segment Size (MSS)[SYN or SYN/ACK]
some concepts:
a. Every host is required to be able to handle an MSS of at least 536 bytes.
b. For most computer users, the MSS option is set automatically by the operating system on the SYN packet during the TCP handshake. The MSS can be used completely independently in each direction of data flow.
c. When IPv4 is used as the network protocol, the MSS is calculated as the maximum size of an IPv4 datagram minus 40 bytes. When IPv6 is used as the network protcol, the MSS is calculated as the maximum packet size minus 60 bytes. An MSS of 65535 should be interpreted as infinity.
d. The MSS is only avertised during the SYN and SYN/ACK packets in the TCP three-way handshake, so the MSS option should only be seen in those packets. MSS value is not negotiated between hosts. The sending host is required to limit the size of data in a single TCP segment to a value less than or equal to the MSS reported by the receiving host.
d. When a connection is established, the two ends agree to use the smaller of each end's value. Because headers are typically 40 bytes, MSS is usually 40 less than the MTU (Maximum Transmission Unit).
e. The way MSS now works is that each host will first compare its outgoing interface MTU with its own buffer and choose the lowest value as the MSS to send. The hosts will then compare the MSS size received against their own interface MTU and again choose the lower of the two values.

3. tcp timestamp [throughout a TCP session if both end use it]
a. Timestamps serve two main purposes:
– To allow for more accurate RTT calculations
– For Protection Against Wrapped Sequence numbers (PAWS)

According to microsoft KB at http://support.microsoft.com/kb/224829
The TCP sequence number field is limited to 32 bits, which limits the number of sequence numbers available. With high capacity networks and a large data transfer, it is possible to wrap sequence numbers before a packet traverses the network. If sending data on a 1 Giga-byte per second (Gbps) network, the sequence numbers could wrap in as little as 34 seconds. If a packet is delayed, a different packet could potentially exist with the same sequence number. To avoid confusion in the event of duplicate sequence numbers, the TCP timestamp is used as an extension to the sequence number. Packets have current and progressing time stamps. An old packet has an older time stamp and is discarded.

b. All popular Operating Systems implement Timestamps, although Windows does not like to use them by default.
c. Timestamps add 12 bytes to the TCP header of each packet, reducing the space available for useful data. Typical packets have a payload of 1448 bytes, coupled with 20 byte TCP header and a 10 byte TCP timestamp option header, a 2 byte padding field and a 20 byte IPv4 packet header.
d. The Timestamp option has two timestamp fields: the Timestamp Value and the Timestamp Echo Reply. When a host first timestamps a packet in a connection, it puts the timestamp in the Timestamp Value field and leaves the Timestamp Echo Reply field set to 0 (generally). When the second host receives that packet and prepares to acknowledge it, it transfers the timestamp from the old packet's Timestamp Value field to the new packet's Timestamp Echo Reply field, and it puts a new timestamp in the Timestamp Value field. So the Timestamp Value field always contains the latest timestamp, while the Timestamp Echo Reply field contains the previous timestamp.
e. Why when tcp timestamp is used, the mss is still advertised as 1460, not 1448(1460-12) since 12 bytes will be used by timestamp option
MSS is not a negotiated value, it is an advertisement. It means "do not
send me a TCP segment larger than this". It excludes TCP and IP headers parts.

MSS is advertised only in SYN or SYN/ACK packet, only one direction of SYN packet was sent when MSS is advertised. Because Timestamps can only be used if both ends of the connection agree to use them. So when MSS is advertised in the SYN or SYN/ACK packet, it cannot be adjusted by tcp options used later. Even more, the size of the TCP header may change on each packet due to SACK and TS.

So, the MSS advertised in the SYN is not the MSS that is eventually used which is sometimes called "effective MSS". Unlike the MSS and WSCALE options, Timestamp options are typically used throughout a TCP session, so if they are being used, you should see them in most of the packets.

Summary:
•12 byte option –drops MSS of 1460 to 1448•Allows better Round Trip Time Measurement (RTTM)
•Prevention Against Wrapped Sequence numbers (PAWS)
–32 bit sequence number wraps in 17 sec at 1 Gbps
–TCP assumes a Maximum Segment Lifetime (MSL) of 120 seconds

4. Window Scale (WSCALE) [SYN or SYN/ACK during handshake]
The Window Scale value is used to shift the window size field's value up to a maximum value of approximately a gigabyte. Like the MSS option, the WSCALE option should only appear in SYN and SYN/ACK packets during the handshake. However, if both ends of the connection do not use the WSCALE option, the window sizes will remain unchanged.

For more efficient use of high bandwidth networks, a larger TCP window size may be used. The TCP window size field controls the flow of data and is limited to 2 bytes, or a window size of 65,535 bytes.

Since the size field cannot be expanded, a scaling factor is used. TCP window scale is an option used to increase the maximum window size from 65,535 bytes to 1 Gigabyte.

The window scale value represents the number of bits to left-shift the 16-bit window size field. The window scale value can be set from 0 (no shift) to 14.

To calculate the true window size, multiply the window size by 2^S where S is the scale value.
For Example:
If the window size is 65,535 bytes with a window scale factor of 3.
True window size = 65535*2^3
True window size = 524280

if windows scale is 14, the true windows size will be 65535*2^14=2^16*2^14=2^30=1G (2^32=4G)

5. No Operation (NOP):
NOP is used to provide padding around other options. The length of the TCP header must be a multiple of 4 bytes; however, most of the options are not 4 bytes long. If the total length of the options is not a 4-byte multiple, one or more 1-byte NOPs will be added to the options in order to adjust the overall length. For example, if there were 6 bytes of options, two NOPs would be added. NOPs are sometimes used between options, particularly if an option needs to start on a certain byte boundary, so it is not unusual to see several NOPs throughout a set of TCP options.

6. Selective Acknowledgments (SACK): [2 parts, SackOK in SYN and SYN/ACK, Sack data in established connection]
This technique allows the data receiver to inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost. This extension uses two TCP options. The first is an enabling option, SACK permitted, which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given.

Normally, when a host acknowledges data, it can only acknowledge the packets up to and including the sequence number immediately before a missing packet. This means that if a thousand packets are received but the second one is missing, the host can only acknowledge the receipt of the first packet, so the sender would have to resend all packets from number 2 through 1000. By using selective acknowledgments, the receiver could acknowledge the receipt of the packets from 3 through 1000, so the sender would only have to resend packet 2.

This two-byte option may be sent in a SYN by a TCP. It MUST NOT be sent on non-SYN segments.

The SACK option is not mandatory and it is used only if both parties support it. This is negotiated when connection is established. SACK uses the optional part of the TCP header

There are two TCP options for selective acknowledgments:
Selective Acknowledgment Permitted (SackOK): This option simply says that selective acknowledgments are permitted for this connection. SackOK must be included in the TCP options in both the SYN and SYN/ACK packets during the TCP three-way handshake, or it cannot be used. SackOK should not appear in any other packets.

Selective Acknowledgment Data: This option contains the actual acknowledgment information for a selective acknowledgment. It lists one or more pairs of sequence numbers, which define ranges of packets that are being acknowledged.

Summary:
•TCP originally could only ACK the last in sequence byte received (cumulative ACK)
• RFC2018 SACK allows ranges of bytes to be ACK’d
–Sender can fill in holes without resending everything
–Up to four blocks fit in the TCP option space (three with the timestamp option)
•Surveys have shown that SACK implementationoften have errors
– RFC3517 addresses how to respond to SACK

References:

TCP options explained - http://support.microsoft.com/kb/224829

Linux SysAdmin and Oracle/MySQL DBA

Understanding TCP/IP headers and options

1 comment:

Total Pageviews

Bookmark

Search This Blog

Followers

keywords

All Posts

About me