Fine-tuning Meta LLaMA 2 for malicious traffic detection showed promising results and highlighted the importance of resource selection in model training. Explore more in this detailed proof of concept.
Fine-tuning a model involves taking a pre-trained model and further training it on a specific dataset to adapt it to a particular task. This is particularly useful when you have a large, pre-trained model that has learned general language patterns but needs to be tailored to a specific domain or task.
Why Use Fine-Tuning?
Rather than training a model from scratch, fine-tuning reuses the general language knowledge of a large pre-trained model and adapts it to a specialized domain, in our case, classifying network traffic logs as benign or malicious.
Key Components in Fine-Tuning:
In our context, we used SageMaker for fine-tuning the Meta LLaMA 2 model. We had to create a training dataset in JSONL format and a template file to define the structure of the input and output for the model.
How It Works:
Each log entry is a row of access-log fields together with a score and a ThreatLevel label, for example:

Date | Time | c-ip | cs-method | sc-status | ssl-protocol | score | ThreatLevel
2024-05-01 | 12:00 | 000.010.00.1 | GET | 200 | TLSv1.3 | -2.4700 | Good Traffic
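Each row is then flattened into one instruction-tuning example. As an illustration (using the sample row above; the full record layout appears in the data-processing code later), the mapping looks like this:

import json

# Hypothetical log dict built from the sample table row above.
log = {
    "Date": "2024-05-01", "Time": "12:00", "c-ip": "000.010.00.1",
    "cs-method": "GET", "sc-status": 200, "ssl-protocol": "TLSv1.3",
    "score": -2.47, "ThreatLevel": "Good Traffic",
}

example = {
    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
    "context": json.dumps(log),
    "response": f"This log entry is identified as '{log['ThreatLevel']}' based on its score and other parameters.",
}
print(json.dumps(example))  # one line of train.jsonl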
For this project, we utilized various tools and libraries including:
Amazon S3: For storing and retrieving network logs.
SageMaker: For model training and deployment.
Pandas: For data manipulation.
Boto3: AWS SDK for Python to interact with S3 and SageMaker.
JSON: For data serialization.
First attempt: For the initial quick test with a smaller dataset, we used ml.g5.24xlarge. This instance type consumes fewer resources and is suitable for smaller datasets. This test covered 29 days of logs, with approximately 330 CSV files per day, each containing about 200 lines, roughly 9,570 CSV files in total.
Second attempt: For the more extensive test with a larger dataset, we used ml.g5.48xlarge. This instance type has more computational power and is better suited for larger datasets. This test covered 60 days of logs, with approximately 660 CSV files per day, each containing about 200 lines, roughly 39,600 CSV files in total.
The rationale behind choosing these instance types was to balance the cost and efficiency of training. The larger instance type was necessary to handle the increased data volume in the second attempt.
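To put the two dataset sizes in perspective, a quick back-of-the-envelope calculation using the figures above:

# Rough scale of the two training sets (numbers from the text above).
first_attempt_files = 29 * 330     # ~9,570 CSV files
second_attempt_files = 60 * 660    # ~39,600 CSV files

print(first_attempt_files * 200)   # ~1,914,000 log lines
print(second_attempt_files * 200)  # ~7,920,000 log lines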
We started by setting up the environment and preparing the dataset. The initial attempt involved using a subset of logs from our S3 bucket (XPTO/xyz-com). Here is a simplified version of the code used for data processing:
import json
import time
from io import BytesIO

import boto3
import pandas as pd


class LogAnalyzer:
    def __init__(self, bucket_name, prefix, max_logs=5000):
        self.bucket_name = bucket_name
        self.prefix = prefix
        self.max_logs = max_logs
        self.s3 = boto3.client('s3')
        self.start_time = time.time()
        self.max_execution_time = 350  # seconds; guards against runaway notebook sessions

    def list_csv_files(self, year, month, day, max_items):
        # List up to max_items CSV keys under the date-partitioned prefix.
        prefix = f"{self.prefix}/{year}/{month:02d}/{day:02d}/"
        paginator = self.s3.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(
            Bucket=self.bucket_name,
            Prefix=prefix,
            PaginationConfig={'MaxItems': max_items},
        )
        csv_files = []
        for page in page_iterator:
            csv_files.extend(
                content['Key']
                for content in page.get('Contents', [])
                if content['Key'].endswith('.csv')
            )
            if len(csv_files) >= max_items:
                break
        return csv_files

    def download_and_read_csv(self, csv_file):
        # Stream a single CSV object from S3 straight into a DataFrame.
        response = self.s3.get_object(Bucket=self.bucket_name, Key=csv_file)
        return pd.read_csv(BytesIO(response['Body'].read()))

    def analyze_logs(self, df):
        # Convert each row to a plain dict so it can be serialized to JSON.
        return df.to_dict('records')

    def run(self):
        all_logs = []
        year = 2024
        months = [4, 5]
        days = list(range(1, 32))  # non-existent dates simply return no objects
        for month in months:
            for day in days:
                if time.time() - self.start_time > self.max_execution_time:
                    print("Execution time exceeded 350 seconds, stopping early.")
                    return
                csv_files = self.list_csv_files(year, month, day, self.max_logs)
                for csv_file in csv_files:
                    df = self.download_and_read_csv(csv_file)
                    all_logs.extend(self.analyze_logs(df))
                    if len(all_logs) >= self.max_logs:
                        break
                if len(all_logs) >= self.max_logs:
                    break
            if len(all_logs) >= self.max_logs:
                break

        # Write one instruction-tuning example per log entry, in JSONL format.
        with open('train.jsonl', 'w') as f:
            for log in all_logs:
                log_entry = {
                    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
                    "context": json.dumps(log),
                    "response": f"This log entry is identified as '{log.get('ThreatLevel', 'Unknown')}' based on its score and other parameters."
                }
                f.write(json.dumps(log_entry) + '\n')
        print("Training data saved to train.jsonl")


log_analyzer = LogAnalyzer(bucket_name='XPTO', prefix='xyz-com')
log_analyzer.run()
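Before moving on, it is worth sanity-checking the generated file; a quick sketch (not part of the original pipeline):

import json

# Confirm the first line of train.jsonl is valid JSON with the expected fields.
with open('train.jsonl') as f:
    first = json.loads(f.readline())

print(list(first))        # ['instruction', 'context', 'response']
print(first['response'])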
Fine-Tuning the Model
We fine-tuned the Meta LLaMA 2 model using SageMaker with the following configuration:
Training Time: 2 hours
Epochs: 5
Instance Type: ml.g5.24xlarge
Template for Fine-Tuning:
template = { "prompt": "Below is a set of logs. Analyze the logs to identify any malicious IP addresses or suspicious activities.\n\n" "### Logs:\n{context}\n\n",
"completion": " {response}", }
with open("template.json", "w") as f:
json.dump(template, f)
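Before launching the training job, both train.jsonl and template.json need to live in the same S3 prefix, which then becomes the train_data_location passed to the estimator below. A minimal upload sketch, assuming the default SageMaker bucket and an illustrative prefix:

import boto3
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()          # or your own bucket
prefix = "llama2-finetune/traffic-logs"    # illustrative prefix

s3 = boto3.client("s3")
s3.upload_file("train.jsonl", bucket, f"{prefix}/train.jsonl")
s3.upload_file("template.json", bucket, f"{prefix}/template.json")

train_data_location = f"s3://{bucket}/{prefix}"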
Fine-Tuning Code:
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id, model_version = "meta-textgeneration-llama-2-7b", "4.*"

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},  # required to accept the Llama 2 EULA
    disable_output_compression=True,
    instance_type="ml.g5.48xlarge",
)

estimator.set_hyperparameters(
    instruction_tuned="True",
    epoch="2",
    max_input_length="1024",
    per_device_train_batch_size="2",
)

print(f"Training with data located at: {train_data_location}")
estimator.fit({"training": train_data_location})
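Once training completes, the fine-tuned model can be deployed to a real-time endpoint and queried. A minimal inference sketch; the placeholder log, generation parameters, and cleanup calls are illustrative rather than the exact ones we used:

# Deploy the fine-tuned model behind a SageMaker endpoint.
predictor = estimator.deploy()

payload = {
    "inputs": (
        "Below is a set of logs. Analyze the logs to identify any malicious "
        "IP addresses or suspicious activities.\n\n"
        "### Logs:\n<one log entry as JSON>\n\n### Response:\n"  # placeholder log
    ),
    "parameters": {"max_new_tokens": 64, "temperature": 0.1},
}

response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response)

# Tear the endpoint down afterwards to avoid idle costs.
predictor.delete_model()
predictor.delete_endpoint()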
Initial Results
The initial responses from the model were not very accurate:
[
{
"generated_text": "d25lthpqnccel7.cloudfront.net\n\n\n\n### Logs:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Response:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Logs:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Response"
}
]
After some small adjustments to the prompt structure, the responses improved slightly, but they still weren't what we needed from LLaMA:
[
{
"generated_text": "Below is a set of logs. Analyze the logs to identify any malicious IP addresses or suspicious activities, tell me which one looks suspicious and why. If not, just write YES on the response\n\n### Logs:\n{\"date\": \"2024-04-30\", \"time\": \"23:56:10\", \"x-edge-location\": \"HEL51-P1\", \"sc-bytes\": 16766, \"c-ip\": \"196.244.191.18\", \"cs-method\": \"GET\", \"cs(Host)\": \"d28muy3dhw3j8o.cloudfront.net\", \"cs-uri-stem\": \"/\", \"sc-status\": 200, \"cs(Referer)\": \"-\", \"cs(User-Agent)\": \"Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)\", \"cs-uri-query\": \"-\", \"cs(Cookie)\": \"-\", \"x-edge-result-type\": \"Miss\", \"x-edge-request-id\": \"g0v9W73cZfrsBYru0hG80StZZxNAok5WOx42CbG_cTLDAvHdV2SdvA==\", \"x-host-header\": \"www.metaltoad.com\", \"cs-protocol\": \"https\", \"cs-bytes\": 150, \"time-taken\": 0.101, \"x-forwarded-for\": \"-\", \"ssl-protocol\": \"TLSv1.3\", \"ssl-cipher\": \"TLS_AES_128_GCM_SHA256\", \"x-edge-response-result-type\": \"Miss\", \"cs-protocol-version\": \"HTTP/1.1\", \"fle-status\": \"-\", \"fle-encrypted-fields\": \"-\", \"c-port\": 39792, \"time-to-first-byte\": 0.099, \"x-edge-detailed-result-type\": \"Miss\", \"sc-content-type\": \"text/html; charset=UTF-8\", \"sc-content-len\": \"-\", \"sc-range-start\": \"-\", \"sc-range-end\": \"-\", \"customer_id\": \"Bxy4m9XiyC4\", \"Autoblock\": false}\n\n### Response:\n N/A"
}
]
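For reference, the adjusted prompt visible in the output above corresponds to a template roughly like this (reconstructed from the generated text, not the exact file we used):

template = {
    "prompt": (
        "Below is a set of logs. Analyze the logs to identify any malicious "
        "IP addresses or suspicious activities, tell me which one looks "
        "suspicious and why. If not, just write YES on the response\n\n"
        "### Logs:\n{context}\n\n"
    ),
    "completion": " {response}",
}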
In the second attempt, we expanded our dataset to include two months (April and May 2024) from the bucket (XPTO/xyz-com). Each day's folder contained approximately 660 CSV files, each with around 200 lines. This amounted to about 39,600 CSV files.
Training Configuration:
Training Time: 6 hours
Epochs: 5
Instance Type: ml.g5.48xlarge
Additional information: The use of different instance types (ml.g5.24xlarge for the first attempt and ml.g5.48xlarge for the second) was crucial in balancing cost and computational efficiency. The larger instance type allowed us to process a much larger dataset in a reasonable time frame, highlighting the importance of selecting the right resources for fine-tuning tasks.
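As a rough illustration of how the cost gap grows (the hourly rates below are placeholders, not actual SageMaker pricing; check the current AWS price list):

# Hypothetical hourly rates -- substitute current SageMaker training prices.
rate_g5_24xlarge = 10.0   # USD/hour, placeholder
rate_g5_48xlarge = 20.0   # USD/hour, placeholder

first_attempt_cost = 2 * rate_g5_24xlarge    # 2 hours of training
second_attempt_cost = 6 * rate_g5_48xlarge   # 6 hours of training

print(first_attempt_cost, second_attempt_cost)  # the gap widens quickly with scale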
Enhanced Template:
{
    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
    "context": json.dumps(log),
    "response": f"This log entry is identified as '{log.get('ThreatLevel', 'Unknown')}' based on its score and other parameters."
}
Although the second attempt increased both the dataset size and the instance type, we decided to halt the process because the operational costs rose significantly, making our proof of concept unviable.
This POC demonstrated the potential of fine-tuning Meta LLaMA 2 for malicious traffic detection. The initial results showed some promise, but there is still room for improvement. Future work will focus on optimizing the training process, refining the model's accuracy, and exploring more advanced configurations.
For further reference and detailed examples, you can check out the AWS SageMaker Example Notebook.