Exploring Fine-Tuning Meta LLaMA 2 for Malicious Traffic Detection: A Proof of Concept
Fine-tuning Meta LLaMA 2 for malicious traffic detection showcased promising results, highlighting the importance of resource selection in model training. Explore more in this detailed proof of concept.
Project Overview
The goal of this POC was to fine-tune Meta LLaMA 2, a pre-trained language model, to analyze a chain of network logs and detect malicious traffic.
A Quick Explanation of Fine-Tuning
Fine-tuning a model involves taking a pre-trained model and further training it on a specific dataset to adapt it to a particular task. This is particularly useful when you have a large, pre-trained model that has learned general language patterns but needs to be tailored to a specific domain or task.
Why Use Fine-Tuning?
- Customization: It allows you to adapt the model to specific business needs, making it more effective for your particular use case.
- Domain-Specific Language: Fine-tuning helps the model understand and correctly interpret domain-specific jargon and vocabulary.
- Enhanced Performance: It improves the model’s performance on specific tasks by focusing on relevant data.
Key Components in Fine-Tuning:
- Epochs: The number of times the model goes through the entire training dataset. More epochs can improve the model’s performance but also increase the training time.
- Instance Type: The type of computational resource used for training. More powerful instances can handle larger datasets and more complex models but are also more expensive.
- Training Jobs: The actual process of training the model on the dataset. This involves configuring the training parameters and running the training process.
In our context, we used SageMaker for fine-tuning the Meta LLaMA 2 model. We had to create a training dataset in JSONL format and a template file to define the structure of the input and output for the model.
How It Works:
- Dataset Preparation: We collected logs and structured them in a JSONL format, which the model uses for training. Each log entry was converted into a JSON object with fields like 'instruction', 'context', and 'response' (see the sample record after this list).
- Template Creation: The template file defines how the logs are presented to the model. It includes placeholders for the 'instruction' and 'context' and specifies the format of the expected 'response'.
- Training: Using SageMaker, we configured the training job with the dataset and template. SageMaker handles the process of feeding the data to the model, training it, and storing the fine-tuned model.
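For illustration, a single line of train.jsonl might look like the following. The log fields shown here are a hypothetical subset; real entries carry many more fields:
{"instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.", "context": "{\"date\": \"2024-05-01\", \"c-ip\": \"000.010.00.1\", \"cs-method\": \"GET\", \"sc-status\": 200, \"ThreatLevel\": \"Good Traffic\"}", "response": "This log entry is identified as 'Good Traffic' based on its score and other parameters."}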
Data Collection, Structure, and Preparation
We collected network logs stored in S3 buckets. These logs included various fields such as date, time, c-ip, cs-method, sc-status, ssl-protocol, ThreatLevel, and score. The aim was to use these logs as training data to teach the model to recognize patterns indicative of malicious activities.
The dataset we used for training was structured as follows:
| Date | Time | c-ip | cs-method | sc-status | ssl-protocol | score | ThreatLevel |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-05-01 | 12:00 | 000.010.00.1 | GET | 200 | TLSv1.3 | -2.4700 | Good Traffic |
For this project, we utilized various tools and libraries including:
- Amazon S3: For storing and retrieving network logs.
- SageMaker: For model training and deployment.
- Pandas: For data manipulation.
- Boto3: The AWS SDK for Python, used to interact with S3 and SageMaker.
- JSON: For data serialization.
First Attempt: For the initial quick test with a smaller dataset, we used ml.g5.24xlarge. This instance type is less powerful and less expensive, making it suitable for smaller datasets. This test covered 29 days of logs, with approximately 330 CSV files per day, each containing about 200 lines. This resulted in roughly 9,570 CSV files in total.
Second Attempt: For the more extensive test with a larger dataset, we used ml.g5.48xlarge. This instance type has more computational power and is better suited for larger datasets. This test covered 60 days of logs, with approximately 660 CSV files per day, each containing about 200 lines. This resulted in roughly 39,600 CSV files in total.
The rationale behind choosing these instance types was to balance the cost and efficiency of training. The larger instance type was necessary to handle the increased data volume in the second attempt.
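As a quick sanity check on those numbers, here is a back-of-the-envelope sketch in Python (the file and line counts are the approximations given above):
import json

# Rough dataset sizing for the two attempts (all counts are approximate).
first_files = 29 * 330    # 29 days x ~330 CSV files/day = 9,570 files
second_files = 60 * 660   # 60 days x ~660 CSV files/day = 39,600 files
lines_per_file = 200

print(f"First attempt:  ~{first_files:,} files, ~{first_files * lines_per_file:,} log lines")
print(f"Second attempt: ~{second_files:,} files, ~{second_files * lines_per_file:,} log lines")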
Initial Setup and First Attempt
We started by setting up the environment and preparing the dataset. The initial attempt involved using a subset of logs from our S3 bucket (XPTO/xyz-com). Here is a simplified version of the code used for data processing:
import json
import time
from io import BytesIO

import boto3
import pandas as pd


class LogAnalyzer:
    def __init__(self, bucket_name, prefix, max_logs=5000):
        self.bucket_name = bucket_name
        self.prefix = prefix
        self.max_logs = max_logs
        self.s3 = boto3.client('s3')
        self.start_time = time.time()
        self.max_execution_time = 350  # seconds

    def list_csv_files(self, year, month, day, max_items):
        # List up to max_items CSV keys under the given day's prefix.
        prefix = f"{self.prefix}/{year}/{month:02d}/{day:02d}/"
        paginator = self.s3.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(
            Bucket=self.bucket_name,
            Prefix=prefix,
            PaginationConfig={'MaxItems': max_items}
        )
        csv_files = []
        for page in page_iterator:
            csv_files.extend(
                content['Key'] for content in page.get('Contents', [])
                if content['Key'].endswith('.csv')
            )
            if len(csv_files) >= max_items:
                break
        return csv_files

    def download_and_read_csv(self, csv_file):
        # Fetch a CSV object from S3 and load it into a DataFrame.
        response = self.s3.get_object(Bucket=self.bucket_name, Key=csv_file)
        csv_data = response['Body'].read()
        return pd.read_csv(BytesIO(csv_data))

    def analyze_logs(self, df):
        # Convert each row into a dict so it can be serialized as JSON.
        return df.to_dict('records')

    def run(self):
        all_logs = []
        year = 2024
        months = [4, 5]
        days = list(range(1, 32))
        for month in months:
            for day in days:
                if time.time() - self.start_time > self.max_execution_time:
                    print("Execution time exceeded 350 seconds, restarting notebook.")
                    return
                csv_files = self.list_csv_files(year, month, day, self.max_logs)
                for csv_file in csv_files:
                    df = self.download_and_read_csv(csv_file)
                    all_logs.extend(self.analyze_logs(df))
                    if len(all_logs) >= self.max_logs:
                        break
                # Stop scanning further days/months once the cap is reached.
                if len(all_logs) >= self.max_logs:
                    break
            if len(all_logs) >= self.max_logs:
                break

        # Write each log as an instruction/context/response record in JSONL.
        with open('train.jsonl', 'w') as f:
            for log in all_logs:
                log_entry = {
                    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
                    "context": json.dumps(log),
                    "response": f"This log entry is identified as '{log.get('ThreatLevel', 'Unknown')}' based on its score and other parameters."
                }
                f.write(json.dumps(log_entry) + '\n')
        print("Training data saved to train.jsonl")


log_analyzer = LogAnalyzer(bucket_name='XPTO', prefix='xyz-com')
log_analyzer.run()
Fine-Tuning the Model
We fine-tuned the Meta LLaMA 2 model using SageMaker with the following configuration:
Training Time: 2 hours
Epochs: 5
Instance Type: ml.g5.24xlarge
Template for Fine-Tuning:
template = {
    "prompt": (
        "Below is a set of logs. Analyze the logs to identify any malicious "
        "IP addresses or suspicious activities.\n\n"
        "### Logs:\n{context}\n\n"
    ),
    "completion": " {response}",
}

with open("template.json", "w") as f:
    json.dump(template, f)
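To make the mechanics concrete, here is a rough sketch of how a template like this combines with one train.jsonl record. The record values are illustrative, and the .format call only approximates what SageMaker JumpStart does internally:
import json

# One illustrative record from train.jsonl (values are hypothetical).
record = {
    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
    "context": '{"c-ip": "000.010.00.1", "cs-method": "GET", "sc-status": 200, "ThreatLevel": "Good Traffic"}',
    "response": "This log entry is identified as 'Good Traffic' based on its score and other parameters.",
}

with open("template.json") as f:
    template = json.load(f)

# The {context} placeholder in the prompt is filled from the record; the
# model is trained to emit the {response} text as its completion.
prompt = template["prompt"].format(**record)
completion = template["completion"].format(**record)
print(prompt + completion)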
Fine-Tuning Code:
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id, model_version = "meta-textgeneration-llama-2-7b", "4.*"

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},
    disable_output_compression=True,
    instance_type="ml.g5.24xlarge"  # matches the first-attempt configuration above
)

estimator.set_hyperparameters(
    instruction_tuned="True",
    epoch="5",  # matches the first-attempt configuration above
    max_input_length="1024",
    per_device_train_batch_size="2"
)

# S3 URI where train.jsonl and template.json were uploaded (illustrative path).
train_data_location = "s3://XPTO/training-data/"

print(f"Training with data located at: {train_data_location}")
estimator.fit({"training": train_data_location})
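The snippet below is a rough sketch of how the fine-tuned model can then be queried. The endpoint instance type, generation parameters, and payload shape are assumptions based on the standard JumpStart flow, not taken from our project code:
# Deploy the fine-tuned model to a real-time endpoint and send one prompt.
# Instance type and generation parameters here are illustrative assumptions.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)

payload = {
    "inputs": (
        "Below is a set of logs. Analyze the logs to identify any malicious "
        "IP addresses or suspicious activities.\n\n"
        "### Logs:\n{\"c-ip\": \"000.010.00.1\", \"cs-method\": \"GET\", \"sc-status\": 200}\n\n"
        "### Response:\n"
    ),
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
}

# LLaMA 2 on JumpStart typically requires accepting the EULA per request.
response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response)

# Delete the endpoint when finished to avoid idle costs.
predictor.delete_endpoint()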
Initial Results
The initial responses from the model were not very accurate:
[
{
"generated_text": "d25lthpqnccel7.cloudfront.net\n\n\n\n### Logs:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Response:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Logs:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Response"
}
]
After some small adjustments to the prompt structure, the responses improved slightly, but they still weren't what we needed from LLaMA:
[
{
"generated_text": "Below is a set of logs. Analyze the logs to identify any malicious IP addresses or suspicious activities, tell me which one looks suspicious and why. If not, just write YES on the response\n\n### Logs:\n{\"date\": \"2024-04-30\", \"time\": \"23:56:10\", \"x-edge-location\": \"HEL51-P1\", \"sc-bytes\": 16766, \"c-ip\": \"196.244.191.18\", \"cs-method\": \"GET\", \"cs(Host)\": \"d28muy3dhw3j8o.cloudfront.net\", \"cs-uri-stem\": \"/\", \"sc-status\": 200, \"cs(Referer)\": \"-\", \"cs(User-Agent)\": \"Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)\", \"cs-uri-query\": \"-\", \"cs(Cookie)\": \"-\", \"x-edge-result-type\": \"Miss\", \"x-edge-request-id\": \"g0v9W73cZfrsBYru0hG80StZZxNAok5WOx42CbG_cTLDAvHdV2SdvA==\", \"x-host-header\": \"www.metaltoad.com\", \"cs-protocol\": \"https\", \"cs-bytes\": 150, \"time-taken\": 0.101, \"x-forwarded-for\": \"-\", \"ssl-protocol\": \"TLSv1.3\", \"ssl-cipher\": \"TLS_AES_128_GCM_SHA256\", \"x-edge-response-result-type\": \"Miss\", \"cs-protocol-version\": \"HTTP/1.1\", \"fle-status\": \"-\", \"fle-encrypted-fields\": \"-\", \"c-port\": 39792, \"time-to-first-byte\": 0.099, \"x-edge-detailed-result-type\": \"Miss\", \"sc-content-type\": \"text/html; charset=UTF-8\", \"sc-content-len\": \"-\", \"sc-range-start\": \"-\", \"sc-range-end\": \"-\", \"customer_id\": \"Bxy4m9XiyC4\", \"Autoblock\": false}\n\n### Response:\n N/A"
}
]
Second Attempt: Larger Dataset and More Powerful Instance
In the second attempt, we expanded our dataset to include two months (April and May 2024) from the bucket (XPTO/xyz-com). Each day's folder contained approximately 660 CSV files, each with around 200 lines. This amounted to about 39,600 CSV files.
Training Configuration:
Training Time: 6 hours
Epochs: 5
Instance Type: ml.g5.48xlarge
Additional information: The use of different instance types (ml.g5.24xlarge for the first attempt and ml.g5.48xlarge for the second) was crucial in balancing cost and computational efficiency. The larger instance type allowed us to process a much larger dataset in a reasonable time frame, highlighting the importance of selecting the right resources for fine-tuning tasks.
Enhanced Template:
{
    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
    "context": json.dumps(log),
    "response": f"This log entry is identified as '{log.get('ThreatLevel', 'Unknown')}' based on its score and other parameters."
}
Although we increased both the dataset size and the instance type on the second attempt, we decided to halt the process because it significantly raised operational costs, making our proof of concept unviable.
Conclusion
This POC demonstrated the potential of fine-tuning Meta LLaMA 2 for malicious traffic detection. The initial results showed some promise, but there is still room for improvement. Future work will focus on optimizing the training process, refining the model's accuracy, and exploring more advanced configurations.
For further reference and detailed examples, you can check out the AWS SageMaker Example Notebook.