Feature Selection Technique In The Network Traffic Dataset: Solution Essays


Introduction

Security is now a major concern in the digital world. The use of the internet, computers, mobile phones and tablets has become ubiquitous, and cyber-attacks have grown rapidly. There are various kinds of cyber-attacks, such as spoofing, sniffing, denial-of-service, phishing, evil twins, pharming, click fraud and malware. Malicious software is harmful to both computers and networks. Cyber-attacks have increased drastically; they compromise systems, steal valuable information and destroy important infrastructure, producing vast losses that average $345 per incident.

The growth in internet use is not the only reason for this digital threat; the number of new malware samples is another. More than 317 million new pieces of malware were created in 2014. Conventional anti-virus and intrusion detection systems cannot detect zero-day attacks. According to the Symantec Internet Security Threat Report 2010, more than 5 million malware samples were circulating on the internet. As a result, security specialists are devoting considerable effort to developing efficient malware detection methods. In this work we describe several feature selection techniques for detecting malware in network traffic datasets using machine learning algorithms, because feature selection is a very important task in malware detection. Malware can be detected through static and dynamic features. Although anti-virus software is built on malware signatures, it fails when a zero-day malware attack occurs. A malware detection system captures network traffic data to distinguish between malware and goodware (suspicious and normal activity).

A network traffic dataset contains many packets with a large number of features. Some features may be very important, while others may not be relevant to the decision. Irrelevant features increase processing time and decrease the efficiency of the malware detection system. The main purpose of a feature selection technique is therefore to reduce the dimensionality of the feature space by removing redundant and irrelevant features from the network traffic dataset.
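As a rough illustration of removing irrelevant and redundant features, the sketch below drops near-constant columns and then drops any column that is almost perfectly correlated with one already kept. This is only one simple filter among many; the feature names, the toy flow values and both thresholds are assumptions made for the example and do not come from any study discussed here.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def filter_features(columns, min_variance=1e-9, max_corr=0.95):
    """Return the names of the features worth keeping.

    columns maps a feature name to its list of values, one per flow.
    Both thresholds are illustrative assumptions, not recommended settings.
    """
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # irrelevant: the feature is (nearly) constant
        if any(abs(pearson(values, columns[k])) >= max_corr for k in kept):
            continue  # redundant: duplicates a feature already kept
        kept.append(name)
    return kept


# Hypothetical per-flow features, invented for illustration.
flows = {
    "duration": [1.0, 2.0, 3.0, 4.0],
    "bytes": [100.0, 250.0, 90.0, 400.0],
    "kilobytes": [0.1, 0.25, 0.09, 0.4],        # same information as "bytes"
    "protocol_version": [4.0, 4.0, 4.0, 4.0],   # constant across all flows
}
selected = filter_features(flows)
print(selected)
```

In this toy dataset the constant column and the rescaled copy of the byte count are discarded, leaving two features for the classifier.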

Many approaches have been developed to cope with the growing number of malware samples that emerge every day. Hansen et al. introduced an approach based on a Random Forests classifier for detecting and classifying the vast amount of malware that comes from known or unknown malware families. This approach reduces the feature space significantly. Cuckoo sandbox was also used to collect behavioral traces of the analyzed samples, achieving a high malware detection rate and family classification.

Tian et al. used logs of API calls to distinguish malware from cleanware by scrutinizing behavioral features. Their work addressed both malware family classification and detection by applying pattern recognition algorithms in a virtual environment. They achieved approximately 97% accuracy using a dataset of 1,368 malware and 456 cleanware samples. Another study discussed the applicability of a sandbox environment for obtaining the run-time behavior of malware. That work differentiates malware using a heuristic method termed N-gram analysis and adopts the Information Gain feature selection technique to choose the best features for classification. Cuckoo sandbox examines the behavior of malware running on a virtual machine. The authors found that SPegasos achieved the highest accuracy and a better detection rate across feature lengths of 200, 400 and 600.
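To make the combination of N-gram analysis and Information Gain more concrete, here is a minimal sketch that extracts n-grams from API call traces and ranks them by information gain with respect to the malware/benign label. It is not the cited authors' implementation; the API call names, the four toy traces and the choice of bigrams are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def ngrams(seq, n):
    """Set of contiguous n-grams (as tuples) found in a call sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(present, labels):
    """Entropy reduction obtained by splitting on a binary feature."""
    with_f = [l for p, l in zip(present, labels) if p]
    without_f = [l for p, l in zip(present, labels) if not p]
    total = len(labels)
    conditional = (len(with_f) / total) * entropy(with_f) \
        + (len(without_f) / total) * entropy(without_f)
    return entropy(labels) - conditional


def rank_ngrams(traces, labels, n=2):
    """Score every n-gram by information gain, best first."""
    sets = [ngrams(t, n) for t in traces]
    vocab = set().union(*sets)
    scores = {g: information_gain([g in s for s in sets], labels) for g in vocab}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical traces: 1 = malware, 0 = benign.
traces = [
    ["open", "write", "connect"],
    ["open", "write", "send"],
    ["open", "read", "close"],
    ["open", "read", "close"],
]
labels = [1, 1, 0, 0]
ranked = rank_ngrams(traces, labels)
for gram, score in ranked:
    print(gram, round(score, 3))
```

Keeping only the top-ranked n-grams (for example the best 200, 400 or 600) gives a feature vector of fixed length, which is the role the selection step plays in the study described above.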

Other authors proposed a bilayer abstraction method based on the dynamic analysis of API sequences for malware detection. Behavioral features are abstracted into low-layer and high-layer behavior. They also proposed an enhanced support vector machine named OC-SVM Neg, which makes use of the available benign software samples to improve the false alarm rate. A total of 14,863 malware and 2,623 benign programs were collected from VXHeaven and Malheur. This work reported good results in detecting unknown malware.

Santos et al., on the other hand, developed a hybrid malware detector that detects unknown malware by obtaining features both statically and dynamically. To test their proposed system, they collected malware and benign programs from two different sources: VXHeaven for the malware samples and their own setup for the benign programs. For the feature vector they used opcode sequences, system calls, exceptions and similar attributes. This hybrid approach is efficient at extracting features both statically and dynamically.

In another study, a supervised system was introduced for detecting malware. The authors extracted 972 behavioral features from different observation areas. They used naïve Bayes, decision tree (J48) and random forest as machine learning algorithms to reach a decision. According to this paper, unknown malware could be detected within one month when static rules are predefined by the Snort or Suricata systems.

Fukishima et al. implemented a prototype for malware detection. The authors evaluated suspicious process behavior on Windows OS in order to avoid false positives. This behavior-based method achieved about 60% accuracy in detecting malware without false positives. For the evaluation they used 83 malware and 41 goodware samples.

Nari et al. proposed an automated method for classifying malware based on its network activity. They created a behavioral graph that characterizes not only a sample's network behavior but also the dependencies between its network flows. This method was efficient for classifying malware samples.

Other authors presented a data mining technique to detect new malicious executables. Three different types of features were used for feature extraction: Portable Executable (PE) features, byte-sequence n-grams and string features. Their dataset consisted of 3,265 malware and 1,001 clean programs, for a total of 4,266 programs. For malware classification they also used a multi-naïve Bayes method, which achieved the highest detection accuracy of 97.76% on unfamiliar programs.

In another study, the authors developed an efficient malware classification technique based on string information contained in executables. They extracted printable strings from 1,367 samples containing viruses, unpacked Trojans and clean files. They achieved 97% classification accuracy on unpacked malicious files using k-fold cross validation, and also found Random forest to be an effective classifier.

R. Islam et al. introduced a classification system that integrates static and dynamic features. For this work they assembled two datasets: the first collected between 2003 and 2007, and the second collected between 2009 and 2010. Using a Random forest classifier they achieved an accuracy of 97%.

Ahmed et al. combined two different dynamic features (from spatial and temporal information) in a sandbox to detect malware from run-time API calls. They achieved a classification accuracy of 96.3% using 516 executable files. Similarly, Wagener et al. executed a small number of malware files (104) to generate lists of API calls and then calculated the similarity between pairs of API call sequences using a similarity matrix. They achieved a detection accuracy of 93%.
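The idea of comparing API call sequences pairwise can be sketched as follows. The similarity measure here is a normalized edit distance, chosen only because it is easy to state; the text above does not say which measure Wagener et al. used, so this should be read as a stand-in under that assumption. The call names and the three sample sequences are likewise invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of API call names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if the calls match
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1.0 for identical sequences, 0.0 for sequences with nothing aligned."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def similarity_matrix(sequences):
    """Symmetric matrix of pairwise similarities."""
    return [[similarity(a, b) for b in sequences] for a in sequences]


# Hypothetical API call sequences recorded from three samples.
samples = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["connect", "send"],
]
matrix = similarity_matrix(samples)
for row in matrix:
    print([round(v, 2) for v in row])
```

Samples whose rows in the matrix are close to each other behave alike, which is what allows a small set of executed files to be grouped or compared against known malware.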
