ROLE OF CONFUSION MATRIX IN CYBER CRIME


👩‍💻What is a Confusion Matrix?

Well, it is a performance measurement for machine learning classification problems, where the output can be two or more classes. For a binary problem, it is a table with 4 different combinations of predicted and actual values.

It is extremely useful for measuring Recall, Precision, Specificity, Accuracy and, most importantly, the AUC-ROC Curve.

Let’s understand TP, FP, FN, and TN in terms of a pregnancy analogy.

◼True Positive:

Interpretation: You predicted positive and it’s true.

You predicted that a woman is pregnant and she actually is.

◼True Negative:

Interpretation: You predicted negative and it’s true.

You predicted that a man is not pregnant and he actually is not.

◼False Positive: (Type 1 Error)

Interpretation: You predicted positive and it’s false.

You predicted that a man is pregnant but he actually is not.

◼False Negative: (Type 2 Error)

Interpretation: You predicted negative and it’s false.

You predicted that a woman is not pregnant but she actually is.

Just remember: Positive and Negative describe the predicted values, while True and False describe whether those predictions match the actual values.
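To make the four cells concrete, here is a minimal sketch using scikit-learn with made-up binary labels (1 = positive, 0 = negative):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = pregnant (positive), 0 = not pregnant (negative)
actual    = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

# For labels=[0, 1], scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]  (rows = actual, columns = predicted)
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=4, TN=3, FP=2, FN=1
```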

👩‍💻Is it necessary to check recall or precision if you already have high accuracy?

Yes. When classes are imbalanced, as they usually are in attack detection, accuracy alone can look high even while the model misses most of the positives, which is exactly what recall and precision reveal.

The term recall refers to the proportion of genuine positive examples that a predictive model has identified. To put that another way, it is the number of true positives divided by the sum of true positives and false negatives: Recall = TP / (TP + FN).

Recall is the percentage of positive examples, out of your entire set of positive examples, that your model was able to identify. Recall is also sometimes called the hit rate, sensitivity, or the true positive rate.

📌Recall answers this question: What percentage of all the positive examples in your dataset did your model identify?

Precision:

Precision is similar to recall in that it is also concerned with your model’s predictions of positive examples. However, precision measures something a little different.

Precision is interested in the number of genuinely positive examples your model identified out of all the examples it labeled positive. Mathematically, it is the number of true positives divided by the sum of true positives and false positives: Precision = TP / (TP + FP). If the distinction between recall and precision is still a little fuzzy to you, just think of it like this. Precision answers this question: What percentage of the examples your model labeled positive are genuinely positive?

Specificity:

If sensitivity/recall is concerned with the true positive rate, specificity is concerned with tracking the true negative rate.

Specificity is the number of genuinely negative examples your model identified, divided by the total number of negative examples in the dataset. Mathematically, it is the proportion of true negatives to the sum of true negatives and false positives: Specificity = TN / (TN + FP).

Accuracy:

Accuracy is the simplest metric. It measures the fraction of predictions over the whole dataset that are correct: the sum of true positives and true negatives divided by the total of true positives, false positives, true negatives, and false negatives, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN).
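Putting the four formulas together, here is a minimal sketch in plain Python, reusing the hypothetical cell counts from the earlier example:

```python
# Hypothetical cell counts (same as the sketch above)
tp, tn, fp, fn = 4, 3, 2, 1

recall      = tp / (tp + fn)                   # a.k.a. sensitivity / hit rate
precision   = tp / (tp + fp)                   # how trustworthy a positive label is
specificity = tn / (tn + fp)                   # true negative rate
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall fraction correct

print(f"Recall={recall:.2f}  Precision={precision:.2f}  "
      f"Specificity={specificity:.2f}  Accuracy={accuracy:.2f}")
```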

Confusion matrices are used to visualize important predictive analytics like recall, specificity, accuracy, and precision. They are useful because they give direct comparisons of values like true positives, false positives, true negatives, and false negatives. In contrast, a single summary metric like accuracy gives less useful information on its own, as it is simply the number of correct predictions divided by the total number of predictions.

👩‍💻Why do we need a Confusion matrix?

  • It shows how a classification model gets confused when it makes predictions.
  • It gives you insight not only into the errors your classifier makes but also into the types of errors being made.
  • This breakdown helps you overcome the limitation of using classification accuracy alone.
  • Every column of the confusion matrix represents the instances of a predicted class, while each row represents the instances of an actual class (see the labeled example below).
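To make the row/column convention concrete, here is a small sketch using pandas with an invented three-class traffic example; it prints a confusion matrix with actual classes as rows and predicted classes as columns:

```python
import pandas as pd

# Invented three-class traffic example
actual    = ["normal", "dos", "probe", "normal", "dos", "probe", "normal"]
predicted = ["normal", "dos", "dos", "normal", "probe", "probe", "dos"]

# Rows = actual class, columns = predicted class
cm = pd.crosstab(pd.Series(actual, name="Actual"),
                 pd.Series(predicted, name="Predicted"))
print(cm)
```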

💻Cyber Attack Detection and Classification Using Parallel Support Vector Machine💻

Cyber attacks are becoming a critical issue for organizational information systems. A number of cyber attack detection and classification methods have been introduced, with different levels of success, as countermeasures to preserve data integrity and system availability. Classifying attacks against computer networks is becoming a harder problem to solve in the field of network security.

The rapid increase in connectivity and accessibility of computer systems has resulted in frequent opportunities for cyber attacks, and attacks on computer infrastructure are becoming an increasingly serious problem. At its core, cyber attack detection is a classification problem, in which we separate the normal patterns from the abnormal (attack) patterns of the system. The subset selection decision fusion (SDF) method plays a key role in cyber attack detection, since it has been shown that redundant and/or irrelevant features may severely affect the accuracy of learning algorithms. SDF is a powerful and popular data mining approach for decision-making and classification problems, and it has been used in many real-life applications such as medical diagnosis, radar signal classification, weather prediction, credit approval, and fraud detection.
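The paper this section draws on trains a parallel Support Vector Machine on connection features; as a simplified, single-machine sketch (using scikit-learn and random placeholder data rather than the real dataset), a basic attack-vs-normal SVM classifier might look like this:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Assumed inputs: X is an (n_samples, n_features) array of numeric
# connection features, y is 1 for attack and 0 for normal.
# Random data stands in here purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# SVMs are sensitive to feature scale, so standardize before fitting
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Rows = actual (normal, attack), columns = predicted
print(confusion_matrix(y_test, model.predict(X_test)))
```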

KDD Cup ’99 Data Set Description:

To check the performance of the proposed algorithm for distributed cyber attack detection and classification, we can evaluate it practically using the KDD’99 intrusion detection datasets. In KDD’99, the four attack classes (DoS, U2R, R2L, and Probe) are divided into 22 different attack types, tabulated in Table I. The 1999 KDD data is split into two parts, a training dataset and a testing dataset, where the testing dataset contains not only known attacks from the training data but also unknown attacks.

Since 1999, KDD’99 has been the most widely used data set for the evaluation of anomaly detection methods. It was prepared by Stolfo et al. and built from the data captured in the DARPA’98 IDS evaluation program. DARPA’98 comprises about 4 gigabytes of compressed raw (binary) tcpdump data from 7 weeks of network traffic, which can be processed into about 5 million connection records of roughly 100 bytes each. For each TCP/IP connection, 41 features were extracted, both quantitative (continuous) and qualitative (discrete); of the 41 features, 34 are numeric and 7 are symbolic.

To analyze the results, standard metrics have been developed for evaluating network intrusion detection. Detection Rate (DR) and false alarm rate are the two most widely used. DR is computed as the ratio between the number of correctly detected attacks and the total number of attacks, while the false alarm (false positive) rate is computed as the ratio between the number of normal connections incorrectly classified as attacks and the total number of normal connections.
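In binary attack-vs-normal terms, both metrics fall directly out of the confusion-matrix cells; a minimal sketch with made-up counts:

```python
# Made-up counts for an attack-vs-normal detector
tp = 90    # attacks correctly detected
fn = 10    # attacks missed
tn = 880   # normal connections correctly passed
fp = 20    # normal connections flagged as attacks (false alarms)

detection_rate   = tp / (tp + fn)  # correctly detected attacks / all attacks
false_alarm_rate = fp / (fp + tn)  # misclassified normals / all normals

print(f"DR={detection_rate:.1%}, false alarm rate={false_alarm_rate:.1%}")
```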

In KDD Cup ’99, the criterion used to evaluate the participant entries is the Cost Per Test (CPT), computed using the confusion matrix and a given cost matrix. A Confusion Matrix (CM) is a square matrix in which each column corresponds to a predicted class, while each row corresponds to an actual class. The entry at row i and column j, CM(i, j), counts the instances that actually belong to class i but were identified as members of class j; the entries on the main diagonal, CM(i, i), stand for the number of properly detected instances, and everything off the diagonal is a misclassification. The cost matrix is defined the same way: entry C(i, j) represents the cost penalty for misclassifying an instance belonging to class i into class j.
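Given those two matrices, CPT is the cost-weighted sum of the confusion-matrix entries averaged over all test instances. A sketch with invented 3-class numbers:

```python
import numpy as np

# Invented 3-class example; rows = actual class, columns = predicted class
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 39]])

# Hypothetical cost matrix: C[i, j] = penalty for classifying an
# instance of actual class i as class j (zero on the diagonal)
cost = np.array([[0, 1, 2],
                 [2, 0, 1],
                 [3, 2, 0]])

# Cost Per Test = total misclassification cost / number of test instances
cpt = (cm * cost).sum() / cm.sum()
print(f"CPT = {cpt:.3f}")
```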

  • True Positive (TP): the number of attack instances correctly detected as attacks.
  • True Negative (TN): the number of normal instances correctly identified as normal.
  • False Positive (FP): the number of normal instances incorrectly detected as attacks (false alarms).
  • False Negative (FN): the number of attack instances incorrectly identified as normal.

In such a confusion matrix, rows correspond to the actual categories, while columns correspond to the predicted categories, following the convention described above.

A confusion matrix contains information about the actual and predicted classifications produced by a classifier, and the performance of a cyber attack detection system is commonly evaluated using the data in this matrix.

THANK YOU!!!
