Large Language Models
Large Language Models (LLMs) are a hot research area in the Natural Language Processing (NLP) community. With the release of ChatGPT, LLMs have brought generative AI into the mainstream more rapidly than any previous technical approach. Starting with TRAM 1.3, we have developed and included an ATT&CK labeling engine based on the LLM known as BERT. LLMs are pre-trained on large amounts of text, and in the process they learn to perform useful tasks such as filling in missing words or finishing sentences given a prompt. The successful completion of these tasks indicates that LLMs learn important semantic features of language, such as distinguishing parts of speech or identifying words with similar meanings. Previous versions of TRAM used linear models or decision trees based on simple text features (e.g. n-grams) that cannot learn rich representations of language, especially with small datasets. (And, since annotated data is very expensive to produce, we must accept that we will be working with a limited amount of data.)
Our research goal for using LLMs in TRAM is to try to leverage an LLM's understanding of language and then "fine tune" the model on our specific problem, i.e. labeling ATT&CK techniques in text. Our hypothesis is that the model will be able to quickly identify synonymous usage of words. For example, "forking a process" and "spawning a process" are two ways of saying the same thing, and both are relevant to understanding how malware might behave. An ideal LLM will be able to quickly learn that forking and spawning are synonyms in this context.
To focus our research efforts, we selected a subset of 50 ATT&CK techniques from the complete set of over 600 techniques and subtechniques. The criteria for choosing these 50 techniques are:
- The most commonly found techniques in the TRAM 1.0 data
- The most commonly discovered techniques from the Sightings Report
- The most common techniques as defined by Actionability, Choke Point, and Prevalence from Top ATT&CK Techniques
Table of 50 ATT&CK Techniques | ||||
---|---|---|---|---|
T1548.002 Abuse Elevation Control Mechanism: Bypass User Account Control | T1484.001 Domain Policy Modification: Group Policy Modification | T1070.004 Indicator Removal: File Deletion | T1566.001 Phishing: Spearphishing Attachment | T1518.001 Software Discovery: Security Software Discovery |
T1557.001 Adversary-in-the-Middle: LLMNR/NBT-NS Poisoning and SMB Relay | T1573.001 Encrypted Channel: Symmetric Cryptography | T1105 Ingress Tool Transfer | T1057 Process Discovery | T1218.011 System Binary Proxy Execution: Rundll32 |
T1071.001 Application Layer Protocol: Web Protocols | T1041 Exfiltration Over C2 Channel | T1056.001 Input Capture: Keylogging | T1055 Process Injection | T1082 System Information Discovery |
T1547.001 Boot or Logon Autostart Execution: Registry Run Keys / Startup Folder | T1190 Exploit Public-Facing Application | T1570 Lateral Tool Transfer | T1090 Proxy | T1016 System Network Configuration Discovery |
T1110 Brute Force | T1068 Exploitation for Privilege Escalation | T1036.005 Masquerading: Match Legitimate Name or Location | T1012 Query Registry | T1033 System Owner/User Discovery |
T1059.003 Command and Scripting Interpreter: Windows Command Shell | T1210 Exploitation of Remote Services | T1112 Modify Registry | T1219 Remote Access Software | T1569.002 System Services: Service Execution |
T1543.003 Create or Modify System Process: Windows Service | T1083 File and Directory Discovery | T1106 Native API | T1021.001 Remote Services: Remote Desktop Protocol | T1552.001 Unsecured Credentials: Credentials In Files |
T1074.001 Data Staged: Local Data Staging | T1564.001 Hide Artifacts: Hidden Files and Directories | T1095 Non-Application Layer Protocol | T1053.005 Scheduled Task/Job: Scheduled Task | T1204.002 User Execution: Malicious File |
T1005 Data from Local System | T1574.002 Hijack Execution Flow: DLL Side-Loading | T1003.001 OS Credential Dumping: LSASS Memory | T1113 Screen Capture | T1078 Valid Accounts |
T1140 Deobfuscate/Decode Files or Information | T1562.001 Impair Defenses: Disable or Modify Tools | T1027 Obfuscated Files or Information | T1072 Software Deployment Tools | T1047 Windows Management Instrumentation |
Large language models benefit from being pre-trained on vast amounts of data. These models start off having a rich understanding of human language, and therefore require a much smaller amount of domain-specific training data to learn domain-specific tasks. A main benefit of using LLMs is the ability to predict text not included in the training data. This means that LLM-based models are more robust to unseen words and are capable of perceiving subtle relationships between words that are indicative of an ATT&CK technique.
We considered three different LLMs across two architectures. While other models and LLM architectures exist, these three:
- are open access
- have appropriate licenses for our use case
- are associated with reputable labs
The two architectures considered were BERT and GPT-2. Both were designed for tasks other than text classification, but they can be adapted to it during fine-tuning. We considered two BERT models: the original BERT model (Devlin et al.) and SciBERT, a variation trained on scientific literature. BERT is designed to predict masked words in text, while GPT-2 is designed to generate text: it produces a sequence of words by choosing, at each step, a next word that makes sense given the words it has already produced.
To test our hypothesis that LLMs could perform better, we needed a way to analyze and compare results. Precision, recall, and F1 score are common metrics for comparing the performance of models. Precision penalizes false positives (a score of 1 indicates no false positives), and recall penalizes false negatives (a score of 1 indicates no false negatives). F1 is the harmonic mean of precision and recall, which means that instead of being halfway between precision and recall (as you would get from adding them and dividing by two), the F1 score is skewed towards the lower of the two scores.
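As a minimal sketch of these definitions, the three metrics can be computed from counts of true positives, false positives, and false negatives. The function name `precision_recall_f1` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something taken from TRAM.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one technique from raw counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
arithmetic_mean = (precision + recall) / 2
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} arithmetic_mean={arithmetic_mean:.3f}")
```

With these counts, recall is the lower of the two scores, and F1 falls between recall and the arithmetic mean, which illustrates the skew towards the lower score.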
Each of these three metrics is calculated for each individual ATT&CK technique. To describe the aggregate performance of the whole model, we can take the micro or macro average. The micro average treats each instance the same: precision, recall, and F1 are calculated from the true positives, false positives, and false negatives summed across every technique. The macro average treats each technique the same (even if it appears more or less often than other techniques): we take the per-technique precision, recall, and F1 scores that are already calculated and average each of them.
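The difference between the two averages can be sketched as follows. The helper names and the per-technique counts are hypothetical and chosen only to show how a rare, poorly predicted technique pulls the macro average down more than the micro average. Treating a zero denominator as 0.0 is an assumed convention.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from counts (0.0 when undefined, by assumption)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def micro_average(counts):
    """Pool the counts across all techniques, then compute the metrics once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


def macro_average(counts):
    """Compute the metrics per technique, then average each metric."""
    scores = [prf(*c) for c in counts.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Made-up (tp, fp, fn) counts: one frequent technique, one rare technique.
counts = {
    "T1105": (90, 10, 10),
    "T1113": (1, 1, 3),
}
print("micro:", micro_average(counts))
print("macro:", macro_average(counts))
```

Here the frequent technique dominates the micro average, while the macro average gives the rare technique equal weight.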
Figure: Typical metrics for machine learning performance. Source: "The Role of Machine Learning in Cybersecurity", https://doi.org/10.1145/3545574
To compare the performance of each model, all three (SciBERT, BERT, GPT-2) were trained for ten epochs to perform single-label classification on a dataset that combined the TRAM tool’s embedded training data with the annotated CTI reports.
The results show SciBERT performs best during the first epoch and reaches peak performance more quickly than the other two. This is likely because its vocabulary is more aligned with the vocabulary of our data and, by extension, with the kinds of documents to which the final model will be applied. As a result, we selected SciBERT as the best-performing LLM to integrate into TRAM.
The fine-tuned SciBERT model shows improvement over the logistic regression model in all but one area where we measured precision, recall, and F1 score. For TRAM users, this means our new LLM identified the correct ATT&CK technique 88 times out of 100 and missed the technique in 12 of 100 samples. The F1 score indicates the balance between the precision and recall scores.
The LLM functionality has been built into a Jupyter notebook that you can run locally or online through Google Colab. With Colab, you can import your own data and run our LLM training code on Google’s GPU-enabled systems using either the paid or free tier. This alternative approach offers advanced users a step-by-step process for executing the code behind the text classifier. You can use it to create your own model weights, train on additional data, or even set up training for ATT&CK techniques not included in our subset of 50 described above.
To use the notebook, follow the comments in each cell to download the model, set up the analysis parameters, and then upload a report. Machine learning engineers can customize the configuration to further refine the results.
The TRAM notebook divides uploaded reports into partially overlapping n-grams. An n-gram is a sequence of n adjacent words. Extracting n-grams from each document produces segments that may resemble the segments the model was trained on more closely than complete sentences do. The notebook lets you specify the value of n: because the model was trained on segments of varying length, adjusting n may allow the model to make predictions that it would not make on longer or shorter segments.
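The segmentation idea can be illustrated with a short sketch. This is not the notebook's actual code: the function name `overlapping_ngrams`, the whitespace tokenization, and the `stride` parameter (how many words the window advances each step) are all assumptions made for illustration. A stride smaller than n yields partially overlapping segments.

```python
def overlapping_ngrams(text: str, n: int, stride: int) -> list[str]:
    """Split text into word n-grams whose start positions are `stride` words apart.

    Assumptions: words are separated by whitespace, and the final segment
    may be shorter than n when the text does not divide evenly.
    """
    if n <= 0 or stride <= 0:
        raise ValueError("n and stride must be positive")
    words = text.split()
    segments = []
    for start in range(0, len(words), stride):
        segments.append(" ".join(words[start:start + n]))
        # Stop once the current window has reached the end of the text.
        if start + n >= len(words):
            break
    return segments


# Hypothetical report sentence, for illustration only.
report = "The malware spawns a child process and then deletes the original file"
for segment in overlapping_ngrams(report, n=5, stride=3):
    print(segment)
```

Each segment would then be passed to the classifier on its own. Changing `n` changes how much context each prediction sees.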