Question:
I want to capture raw traffic using pcap and WinPcap. Since I will be testing it against a neural network trained on the NSL-KDD dataset, I want to know how to get those 41 attributes from the raw data. If that is not possible, can I at least obtain features such as src_bytes, dst_host_same_srv_rate, diff_srv_rate, count, dst_host_serror_rate, and wrong_fragment from live packets captured with pcap?
Answer 1:
The 1999 KDD Cup Data is flawed and should not be used anymore
Even this "cleaned up" version (NSL KDD) is not realistic.
Furthermore, many of the "cleanups" they did are not sensible. Real data has duplicates, and the frequencies of such records are important. By removing duplicates, you bias your data towards the rarer observations. You must not do this blindly "just because", or even worse, to reduce the data set size.
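To make the bias concrete, here is a minimal sketch on made-up records (the services, flags, and labels are hypothetical and not taken from KDD'99). It compares the share of each label before and after dropping exact duplicates.

```python
from collections import Counter

# Hypothetical toy records: (service, flag, label). Not taken from KDD'99.
records = (
    [("http", "SF", "normal")] * 90
    + [("http", "S0", "attack")] * 8
    + [("ftp", "REJ", "attack")] * 2
)


def label_shares(rows):
    """Return the fraction of rows carrying each label."""
    counts = Counter(label for _, _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


before = label_shares(records)
# dict.fromkeys drops exact duplicates while keeping first-seen order.
after = label_shares(list(dict.fromkeys(records)))

print("before deduplication:", before)
print("after deduplication: ", after)
```

In this toy set, normal traffic makes up 90 of 100 records before deduplication but only one of the three distinct records afterwards, so the rare records dominate what the model sees.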
The biggest issue however remains:
KDD99 is not realistic in any way
It wasn't realistic even in 1999, and the internet has changed a lot since then.
It's not reasonable to use this data set for machine learning. The attacks in it are best detected by simple packet-inspection firewall rules. The attacks are well understood, and appropriate detectors (highly efficient, with 100% detection rate and 0% false positives) should be available in many cases on modern routers. They are so omnipresent that these attacks have virtually not existed since 1998 or so.
If you want real attacks, look for SQL injections and similar. But these won't show up in pcap files, yet the largely undocumented way the KDDCup'99 features were extracted from this...
Stop using this data set.
Seriously, it's useless data. Labeled, large, often used, but useless.
Answer 2:
If someone would like to experiment with KDD '99 features despite the bad reputation of the dataset, I created a tool named kdd99extractor that extracts a subset of the KDD features from live traffic or from a .pcap file.
This tool was created as part of a university project. I haven't found detailed documentation of the KDD '99 features, so the resulting values may differ slightly from the original KDD data. Some of the sources used are mentioned in the README. The implementation is also incomplete. For example, the content features that deal with the payload are not implemented.
It is available in my GitHub repository.
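For readers who want a feel for what the time-based features involve, here is a minimal sketch. It is not the code of kdd99extractor. It assumes connection records have already been reassembled from packets, and it follows the usual description of the KDD '99 features: `count` is the number of connections to the same destination host in the past two seconds, and `serror_rate` and `diff_srv_rate` are fractions of those connections. The `Conn` class, the set of flags treated as SYN errors, and the choice to include the current connection in the window are all assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Conn:
    ts: float       # connection timestamp, in seconds
    dst_host: str   # destination IP address
    service: str    # e.g. "http", derived from the destination port
    flag: str       # connection status, e.g. "SF" (normal) or "S0" (SYN, no reply)


SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}  # assumption: flags counted as SYN errors
WINDOW_SECONDS = 2.0


def window_features(conns):
    """Yield (count, serror_rate, diff_srv_rate) per connection, given time-ordered input."""
    window = deque()
    for cur in conns:
        window.append(cur)
        # Drop connections older than the two-second window.
        while cur.ts - window[0].ts > WINDOW_SECONDS:
            window.popleft()
        # Assumption: the current connection is counted, so count is at least 1.
        same_host = [c for c in window if c.dst_host == cur.dst_host]
        count = len(same_host)
        serror_rate = sum(c.flag in SYN_ERROR_FLAGS for c in same_host) / count
        diff_srv_rate = sum(c.service != cur.service for c in same_host) / count
        yield count, serror_rate, diff_srv_rate


# Hypothetical connection records for illustration.
conns = [
    Conn(0.0, "10.0.0.5", "http", "SF"),
    Conn(0.5, "10.0.0.5", "http", "S0"),
    Conn(1.0, "10.0.0.5", "ssh", "S0"),
    Conn(4.0, "10.0.0.5", "http", "SF"),
]
features = list(window_features(conns))
for conn, feat in zip(conns, features):
    print(conn.ts, feat)
```

The host-based features such as dst_host_serror_rate are usually described as the same kind of ratio computed over the last 100 connections rather than a two-second window, so a similar loop with a bounded deque would be one way to approximate them.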
Answer 3:
It seems that I am late to reply, but as other people have already answered, the KDD99 dataset is outdated.
I don't know about the usefulness of the NSL-KDD dataset. However, there are a couple of things to keep in mind:
- When extracting information from network traffic, the best you can do is collect statistical information (content-based information is usually encrypted). What you can do is create your own dataset that describes the behaviors you want to consider "normal", and then train the neural network to detect deviations from that "normal" behavior.
- Be careful: even the definition of "normal" behavior changes from network to network and from time to time.
You can have a look at this work, which I was involved in. Besides the statistical features of the original KDD, it takes additional features from a real network environment.
The software is available on request and is free for academic purposes! Here are two links to publications:
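As a rough illustration of the "learn normal, flag deviations" idea, here is a minimal sketch. It swaps the neural network for a simple per-feature z-score, which is a stand-in technique and not the method used in the work linked in this answer. The feature rows and the threshold of 3 are hypothetical.

```python
from statistics import mean, stdev


def fit_baseline(normal_rows):
    """Learn per-feature mean and standard deviation from traffic deemed normal."""
    columns = list(zip(*normal_rows))
    return [(mean(col), stdev(col)) for col in columns]


def anomaly_score(row, baseline):
    """Largest absolute z-score across features; higher means further from normal."""
    scores = []
    for value, (mu, sigma) in zip(row, baseline):
        sigma = sigma or 1e-9  # avoid dividing by zero for constant features
        scores.append(abs(value - mu) / sigma)
    return max(scores)


# Hypothetical per-window features: (connections per 2 s, SYN error rate).
normal = [(10, 0.00), (12, 0.05), (9, 0.00), (11, 0.02), (10, 0.01)]
baseline = fit_baseline(normal)

THRESHOLD = 3.0  # assumption: flag anything more than 3 standard deviations away
for row in [(11, 0.01), (250, 0.90)]:
    score = anomaly_score(row, baseline)
    print(row, "anomalous" if score > THRESHOLD else "normal")
```

Because "normal" drifts, the baseline would need to be refitted per network and periodically over time.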
- http://link.springer.com/chapter/10.1007/978-94-007-6818-5_30
- http://www.iaeng.org/publication/WCECS2012/WCECS2012_pp30-35.pdf
Thanks!