Data Cleaning and Visualisation
Scenario
You have been provided with an export from the security information and event management (SIEM) system used by DCE’s incident response team. The team extracted alert data from the SIEM platform and supplied it as a .CSV file (MLData2023.csv) containing 500,000 event records, of which approximately 3,000 have been ‘tagged’ as malicious.
The goal is to integrate machine learning into their SIEM platform so that suspicious events can be investigated in real time.
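The figures in the scenario imply a heavily imbalanced target: roughly 3,000 of 500,000 records are tagged as malicious. A minimal sketch of how the class proportions could be checked is given here. The column name "Class" and the label values "Normal" and "Malicious" are assumptions for illustration only, and the four-row sample is a made-up stand-in for MLData2023.csv, not real data.

```python
import csv
import io
from collections import Counter


def class_proportions(csv_file, label_column):
    """Return {label: fraction of rows} for one column of a CSV file object."""
    counts = Counter(row[label_column] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Tiny in-memory stand-in for MLData2023.csv. The column name "Class" and the
# label values are assumptions, not taken from the real export.
sample = io.StringIO("Class\nNormal\nNormal\nNormal\nMalicious\n")
proportions = class_proportions(sample, "Class")
print(proportions)

# Figures quoted in the scenario: about 3,000 malicious out of 500,000 records.
print(f"{3_000 / 500_000:.2%}")  # 0.60%
```

With the real file, the in-memory sample would be replaced by an opened file handle, for example `open("MLData2023.csv", newline="")`, and the actual name of the label column.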
Data description
Each event record is a snapshot triggered by an individual network ‘packet’. The exact triggering conditions for the snapshot are unknown, but it is known that multiple packets are exchanged in a ‘TCP conversation’ between the source and the target before an event is triggered and a record created. It is also known that each event record is anomalous in some way (the SIEM logs many events that may be suspicious).
A very small proportion of the data are known to be corrupted by their source systems, and some data are incomplete or incorrectly tagged. The incident response team indicated that this is likely to affect fewer than a few hundred records. A list of the relevant features in the data is given below.
Assembled Payload Size (continuous)
The total size of the inbound suspicious payload. Note: This would contain the data sent by the attacker in the “TCP conversation” up until the event was triggered
DYNRiskA Score (continuous)
An untested built-in risk score assigned by a new SIEM plug-in
IPV6 Traffic (binary)
A flag indicating whether the triggering packet was using IPV6 or IPV4 protocols (True = IPV6)
Response Size (continuous)
The total size of the reply data in the TCP conversation prior to the triggering packet
Source Ping Time (ms) (continuous)
The ‘ping’ time to the IP address which triggered the event record. This is affected by network structure, number of ‘hops’ and even physical distances.
Operating System (categorical)
Connection State (categorical)
Connection Rate (continuous)
Ingress Router (binary)
Server Response Packet Time (ms) (continuous)
Packet Size (continuous)
Packet TTL (continuous)
Source IP Concurrent Connection (continuous)
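Because a small number of records are known to be corrupted or incomplete, the feature types listed above suggest some simple validity checks as a first cleaning pass. The sketch below is one possible approach, not a prescribed method: it flags continuous values that are missing, non-numeric, non-finite or negative, and binary values outside an expected set. The column names (`Packet.Size`, `Source.Ping.Time`, `IPV6.Traffic`), the TRUE/FALSE encoding, the rule that sizes and times cannot be negative, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import math


def find_invalid_rows(csv_file, continuous_columns, binary_columns,
                      binary_values=("TRUE", "FALSE")):
    """Return (row_number, column, raw_value) for each value failing a basic check."""
    problems = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        for column in continuous_columns:
            raw = (row.get(column) or "").strip()
            try:
                value = float(raw)
            except ValueError:
                # Empty or non-numeric text in a continuous feature.
                problems.append((row_number, column, raw))
                continue
            # Assumption: sizes and times should be finite and non-negative.
            if not math.isfinite(value) or value < 0:
                problems.append((row_number, column, raw))
        for column in binary_columns:
            raw = (row.get(column) or "").strip()
            # Assumption: binary flags are encoded as TRUE/FALSE text.
            if raw.upper() not in binary_values:
                problems.append((row_number, column, raw))
    return problems


# Made-up rows with hypothetical column names, standing in for the real export.
sample = io.StringIO(
    "Packet.Size,Source.Ping.Time,IPV6.Traffic\n"
    "1500,120,FALSE\n"
    "-1,95,TRUE\n"
    "820,,FALSE\n"
    "640,101,-\n"
)
problems = find_invalid_rows(
    sample, ["Packet.Size", "Source.Ping.Time"], ["IPV6.Traffic"]
)
for problem in problems:
    print(problem)
```

Flagged values are reported rather than dropped, so that each one can be inspected before deciding whether to treat it as missing, correct it, or remove the record.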