HTTP packet reconstruction

Posted 2019-01-14 14:37

Question:

If I have a large HTTP packet which has been split up into a number of TCP packets, how can I reconstruct them back into a single HTTP packet? Basically, where in the packet do I look to tell when an HTTP packet is starting or ending? I can't see any flags or fields in the TCP header that denote the start or end of the HTTP packet.

EDIT: Following up on the responses: if TCP manages the stream, how does it know when the stream starts and ends? Is that determined by the socket opening and closing? Some protocol, at some level, must be able to know when the HTTP stream/packet has started and ended. That is what I would like to know.

My situation is that I am using a packet sniffer in C# which reads in TCP packets, and I would like to be able to reconstruct the HTTP requests/responses/etc. going through the interface, the way Wireshark and various other sniffers manage to. Alternatively, are there any C# libraries that let you tap into the HTTP streams at a higher level, saving me from having to reconstruct the HTTP stream/packets myself?

Thanks.

Answer 1:

OK I worked out how to do this (dodgy but it gets the job done).

It is simple to strip away the Ethernet, IP, and TCP headers, leaving you with the 'raw' data message. Looking inside the message, it is easy to detect whether it is the start of an HTTP message: a response begins with a status line such as "HTTP/1.1 ...", while a request begins with a method such as "GET" and has "HTTP/1.1" at the end of its first line. This indicates the packet is the start of an HTTP stream/larger packet/whatever. You can also do some simple parsing to read the "Content-Length" field, which is the length of the message body (everything after the blank line that ends the headers), not of the entire HTTP message.
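Here is a rough sketch of that detection step, written in Python rather than C# to keep it short. The function names and the method list are my own assumptions for illustration, and it only looks at a single TCP payload that is assumed to contain the whole header block.

```python
import re

# Request methods to look for; this list is an assumption and is not exhaustive.
METHODS = (b"GET", b"POST", b"PUT", b"DELETE", b"HEAD", b"OPTIONS", b"PATCH")


def looks_like_http_start(payload: bytes) -> bool:
    """Return True if a TCP payload appears to begin an HTTP/1.x message."""
    if payload.startswith(b"HTTP/1."):
        return True  # response status line, e.g. "HTTP/1.1 200 OK"
    first_line = payload.split(b"\r\n", 1)[0]
    parts = first_line.split(b" ")
    # Request line has the form: METHOD SP target SP HTTP/1.x
    return (
        len(parts) == 3
        and parts[0] in METHODS
        and parts[2].startswith(b"HTTP/1.")
    )


def content_length(payload: bytes):
    """Return the Content-Length value from the header block, or None if absent."""
    head = payload.split(b"\r\n\r\n", 1)[0]
    match = re.search(rb"(?im)^content-length:[ \t]*(\d+)[ \t]*\r?$", head)
    return int(match.group(1)) if match else None
```

If the headers themselves are split across several TCP packets, the payloads would need to be joined first before calling `content_length`.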

You can also use the source/destination IP and port numbers to form a unique ID for the link. So after receiving the header packet, take note of these 4 things (SRCIP, SRCPORT, DESTIP, DESTPORT). Next time you receive a packet matching this port/IP combo, you can check whether it's the next part of the HTTP message. You can use the sequence numbers to do some validation and probably other stuff; packets usually arrive in order, but reordering and retransmissions do happen, so skipping that check can fail. Each TCP connection gets its own client port, so you shouldn't receive random packets that aren't part of the stream, but a keep-alive connection can carry several requests and responses one after another, so this could be an area prone to error.

Anyway, once you have received this packet, once again strip away the headers and get the raw message. Add it onto the already known part of the message. If the length of the body received so far (not counting the HTTP headers) is equal to the length read from the "Content-Length" field, the message is complete!
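The bookkeeping described above could look roughly like the following Python sketch. `FlowCollector` and its method are hypothetical names, and the sketch assumes what this answer assumes: payloads arrive in order with no loss or duplicates, and the message carries a valid Content-Length header.

```python
HEADER_END = b"\r\n\r\n"


class FlowCollector:
    """Accumulates TCP payloads per (src_ip, src_port, dst_ip, dst_port)."""

    def __init__(self):
        self.buffers = {}  # 4-tuple -> bytearray of payload bytes seen so far

    def add(self, src_ip, src_port, dst_ip, dst_port, payload):
        """Append a payload; return a complete HTTP message or None."""
        key = (src_ip, src_port, dst_ip, dst_port)
        buf = self.buffers.setdefault(key, bytearray())
        buf += payload

        head_end = buf.find(HEADER_END)
        if head_end < 0:
            return None  # header block not complete yet
        body_start = head_end + len(HEADER_END)

        length = None
        # Skip the first line (request or status line), then scan header fields.
        for line in bytes(buf[:head_end]).split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            return None  # this sketch only handles Content-Length framing

        total = body_start + length
        if len(buf) < total:
            return None  # body still incomplete
        message = bytes(buf[:total])
        del buf[:total]  # keep any bytes belonging to the next message
        return message
```

Usage would be one `add(...)` call per captured packet, in capture order. Note that each direction of a connection has its own 4-tuple, so requests and responses end up in separate buffers.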

This method is obviously prone to a huge number of errors, but I am not after an extremely robust way of doing it. I thought I would answer my own question in case someone else comes across this same issue in the future! Good luck with your sniffing :D



Answer 2:

You should not be using any information from the TCP level to determine HTTP request boundaries. TCP provides a reliable byte stream service; you can't see any fields or flags in TCP that help with this because they are not there.

To determine where the boundaries are in an HTTP request you should follow RFC 2616 (since superseded by RFC 7230 and later RFC 9112). The boundaries are well-defined, and you can determine them by parsing the data you receive.
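As a rough illustration of what "parsing the data" means, here is a Python sketch that finds the end of one message in a reassembled byte stream using the two most common framing rules: Content-Length and chunked transfer encoding. The function name is hypothetical, and the sketch deliberately ignores other cases the RFC covers, such as responses delimited by the connection closing, or responses that never have a body (replies to HEAD, and 1xx, 204 and 304 statuses).

```python
def find_message_end(data: bytes):
    """Return the offset just past one complete HTTP/1.1 message, or None."""
    head_end = data.find(b"\r\n\r\n")
    if head_end < 0:
        return None  # header block incomplete

    headers = {}
    for line in data[:head_end].split(b"\r\n")[1:]:  # skip start line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    pos = head_end + 4  # first byte of the body

    if b"chunked" in headers.get(b"transfer-encoding", b"").lower():
        while True:
            line_end = data.find(b"\r\n", pos)
            if line_end < 0:
                return None  # chunk-size line incomplete
            # Chunk size is hexadecimal; ignore any ";extension" part.
            size = int(data[pos:line_end].split(b";")[0], 16)
            pos = line_end + 2
            if size == 0:
                break  # last chunk reached
            if len(data) < pos + size + 2:
                return None  # chunk data (plus its CRLF) incomplete
            pos += size + 2
        # Skip optional trailer fields up to the terminating empty line.
        while True:
            line_end = data.find(b"\r\n", pos)
            if line_end < 0:
                return None
            if line_end == pos:
                return pos + 2
            pos = line_end + 2

    # Assumption: no Content-Length means an empty body (true for requests).
    end = pos + int(headers.get(b"content-length", b"0"))
    return end if len(data) >= end else None
```

Anything after the returned offset belongs to the next message on the same connection.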



Answer 3:

In each TCP packet, the start of the payload data is immediately after the TCP header, and the end of the payload data is the end of the IP packet.

The end of the TCP header is easily found - the Data Offset is a 4-bit field in the header that contains the length of the header in 32-bit words (so multiply it by 4 to get the length in 8-bit bytes).
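A minimal Python sketch of that arithmetic, assuming the input is a raw IPv4 packet (Ethernet header already removed) that carries TCP. The function name is hypothetical; the field offsets are the standard IPv4 and TCP header layouts.

```python
import struct


def tcp_payload(ip_packet: bytes):
    """Return (sequence_number, payload) for an IPv4 packet carrying TCP."""
    ihl = (ip_packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    total_length = struct.unpack("!H", ip_packet[2:4])[0]  # whole IP packet
    if ip_packet[9] != 6:
        raise ValueError("not a TCP packet")  # protocol number 6 is TCP

    # Slice by the IP total length so trailing link-layer padding is dropped.
    tcp = ip_packet[ihl:total_length]
    seq = struct.unpack("!I", tcp[4:8])[0]
    data_offset = (tcp[12] >> 4) * 4  # TCP header length: 32-bit words * 4
    return seq, tcp[data_offset:]
```

Using the IP total length rather than the length of the captured frame matters for small packets, because Ethernet pads short frames with extra bytes.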

Use the TCP sequence numbers from the Sequence field to string the payloads together in the right order. Note that there might be duplicates, in the case of retransmissions.
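One way to sketch this in Python, assuming all segments for one direction of a connection have already been captured and that `initial_seq` is the sequence number of the first data byte (the SYN's sequence number plus one). The function name and the offline, sort-then-merge approach are my own choices for illustration.

```python
def reassemble(segments, initial_seq):
    """Join (seq, payload) pairs into the contiguous byte stream they cover.

    Stops at the first gap, so a lost segment truncates the result.
    """
    stream = bytearray()
    # Offsets are taken modulo 2**32 so sequence number wraparound is handled
    # (assumes the stream is shorter than 4 GiB).
    ordered = sorted(
        ((seq - initial_seq) % 2**32, payload) for seq, payload in segments
    )
    for offset, payload in ordered:
        if offset > len(stream):
            break  # gap: a segment in between was not captured
        overlap = len(stream) - offset  # bytes already present (retransmission)
        if overlap < len(payload):
            stream += payload[overlap:]
    return bytes(stream)
```

For example, `reassemble([(1005, b"world"), (1000, b"hello"), (1000, b"hello")], 1000)` yields `b"helloworld"`: the out-of-order segment is put in place and the duplicate is ignored.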



Answer 4:

TCP is a stream protocol, not a packet protocol. The application layer (i.e. you) gets a stream of data, not a bunch of packets. You just keep reading bytes in from the stream and you'll get your entire HTTP payload, while TCP does the error checking, resends, etc. underneath.



Answer 5:

You can use the code from the open-source project Xplico: http://www.xplico.org



Answer 6:

We had to solve the same problem, and we were able to extract some of the core functionality into an open-source project.

http://code.google.com/p/pcap-reconst/

Please check it out and let me know if it helps.