Fine-tuning

Exploring Fine-Tuning Meta LLaMA 2 for Malicious Traffic Detection: A Proof of Concept

Fine-tuning Meta LLaMA 2 for malicious traffic detection showed promising early results and highlighted the importance of resource selection in model training. Explore the details in this proof of concept.


Project Overview

The goal of this POC was to fine-tune Meta LLaMA 2, a pre-trained language model, to analyze a series of network logs and detect malicious traffic.

A Quick Explanation of Fine-Tuning

Fine-tuning a model involves taking a pre-trained model and further training it on a specific dataset to adapt it to a particular task. This is particularly useful when you have a large, pre-trained model that has learned general language patterns but needs to be tailored to a specific domain or task.

Why Use Fine-Tuning?

  • Customization: It allows you to adapt the model to specific business needs, making it more effective for your particular use case.
  • Domain-Specific Language: Fine-tuning helps the model understand and correctly interpret domain-specific jargon and vocabulary.
  • Enhanced Performance: It improves the model’s performance on specific tasks by focusing on relevant data.

Key Components in Fine-Tuning (mapped to concrete SageMaker settings in the sketch after this list):

  • Epochs: The number of times the model goes through the entire training dataset. More epochs can improve the model’s performance but also increase the training time.
  • Instance Type: The type of computational resource used for training. More powerful instances can handle larger datasets and more complex models but are also more expensive.
  • Training Jobs: The actual process of training the model on the dataset. This involves configuring the training parameters and running the training process.
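As a rough illustration, these components map directly onto a SageMaker estimator's settings. A minimal sketch with placeholder values (not the exact configuration used in this POC; the S3 path is hypothetical):

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Placeholder values for illustration only
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",  # pre-trained model to adapt
    environment={"accept_eula": "true"},
    instance_type="ml.g5.24xlarge",  # instance type: the computational resource
)
estimator.set_hyperparameters(epoch="5")  # epochs: passes over the training dataset
estimator.fit({"training": "s3://your-bucket/training-data/"})  # training job (hypothetical S3 path)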

In our context, we used SageMaker for fine-tuning the Meta LLaMA 2 model. We had to create a training dataset in JSONL format and a template file to define the structure of the input and output for the model.

How It Works:

  1. Dataset Preparation: We collected logs and structured them in a JSONL format, which the model uses for training. Each log entry was converted into a JSON object with fields like 'instruction', 'context', and 'response' (an example record is shown after this list).
  2. Template Creation: The template file defines how the logs are presented to the model. It includes placeholders for the 'instruction' and 'context' and specifies the format of the expected 'response'.
  3. Training: Using SageMaker, we configured the training job with the dataset and template. SageMaker handles the process of feeding the data to the model, training it, and storing the fine-tuned model.
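To make steps 1 and 2 concrete, here is what a single line of the JSONL training file looked like; the field values below are illustrative, taken from the sample row shown later in this post:

{"instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.", "context": "{\"date\": \"2024-05-01\", \"time\": \"12:00\", \"c-ip\": \"000.010.00.1\", \"cs-method\": \"GET\", \"sc-status\": 200, \"ssl-protocol\": \"TLSv1.3\", \"score\": -2.47, \"ThreatLevel\": \"Good Traffic\"}", "response": "This log entry is identified as 'Good Traffic' based on its score and other parameters."}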


Data Collection, Structure, and Preparation

We collected network logs stored in S3 buckets. These logs included various fields such as date, time, c-ip, cs-method, sc-status, ssl-protocol, ThreatLevel, and score. The aim was to use these logs as training data to teach the model to recognize patterns indicative of malicious activities.

The dataset we used for training was structured as follows:

Date        Time   c-ip          cs-method  sc-status  ssl-protocol  score    ThreatLevel
2024-05-01  12:00  000.010.00.1  GET        200        TLSv1.3       -2.4700  Good Traffic

For this project, we utilized various tools and libraries, including:

  • Amazon S3: For storing and retrieving network logs.
  • SageMaker: For model training and deployment.
  • Pandas: For data manipulation.
  • Boto3: AWS SDK for Python, used to interact with S3 and SageMaker.
  • JSON: For data serialization.

First Attempt: For the initial quick test with a smaller dataset, we used ml.g5.24xlarge. This instance type consumes fewer resources and is suitable for smaller datasets. The test covered 29 days of logs, with approximately 330 CSV files per day, each containing about 200 lines, for roughly 9,570 CSV files in total.

Second Attempt: For the more extensive test with a larger dataset, we used ml.g5.48xlarge. This instance type has more computational power and is better suited for larger datasets. The test covered 60 days of logs, with approximately 660 CSV files per day, each containing about 200 lines, for roughly 39,600 CSV files in total.
The rationale behind choosing these instance types was to balance the cost and efficiency of training. The larger instance type was necessary to handle the increased data volume of the second attempt.


Initial Setup and First Attempt


We started by setting up the environment and preparing the dataset. The initial attempt involved using a subset of logs from our S3 bucket (XPTO/xyz-com). Here is a simplified version of the code used for data processing:


import boto3
import json
import time
import pandas as pd
from io import BytesIO

class LogAnalyzer:
    def __init__(self, bucket_name, prefix, max_logs=5000):
        self.bucket_name = bucket_name
        self.prefix = prefix
        self.max_logs = max_logs  # cap on the number of log entries to collect
        self.s3 = boto3.client('s3')
        self.start_time = time.time()
        self.max_execution_time = 350  # seconds; guard against long-running notebook sessions

    def list_csv_files(self, year, month, day, max_items):
        prefix = f"{self.prefix}/{year}/{month:02d}/{day:02d}/"
        paginator = self.s3.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(Bucket=self.bucket_name, Prefix=prefix, PaginationConfig={'MaxItems': max_items})
        csv_files = []
        for page in page_iterator:
            csv_files.extend([content['Key'] for content in page.get('Contents', []) if content['Key'].endswith('.csv')])
            if len(csv_files) >= max_items:
                break
        return csv_files

    def download_and_read_csv(self, csv_file):
        response = self.s3.get_object(Bucket=self.bucket_name, Key=csv_file)
        csv_data = response['Body'].read()
        df = pd.read_csv(BytesIO(csv_data))
        return df

    def analyze_logs(self, df):
        logs = df.to_dict('records')
        return logs

    def run(self):
        all_logs = []
        year = 2024
        months = [4, 5]
        days = list(range(1, 32))

        for month in months:
            for day in days:
                if time.time() - self.start_time > self.max_execution_time:
                    print("Execution time exceeded 350 seconds, restarting notebook.")
                    return

                csv_files = self.list_csv_files(year, month, day, self.max_logs)
                for csv_file in csv_files:
                    df = self.download_and_read_csv(csv_file)
                    logs_filtered = self.analyze_logs(df)
                    all_logs.extend(logs_filtered)
                    if len(all_logs) >= self.max_logs:
                        break

                # Stop scanning further days once the cap on collected logs is reached
                if len(all_logs) >= self.max_logs:
                    break

        with open('train.jsonl', 'w') as f:
            for log in all_logs:
                log_entry = {
                    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
                    "context": json.dumps(log),
                    "response": f"This log entry is identified as '{log.get('ThreatLevel', 'Unknown')}' based on its score and other parameters."
                }
                f.write(json.dumps(log_entry) + '\n')
        print("Training data saved to train.jsonl")

log_analyzer = LogAnalyzer(bucket_name='XPTO', prefix='xyz-com')
log_analyzer.run()


Fine-Tuning the Model


We fine-tuned the Meta LLaMA 2 model using SageMaker with the following configuration:

  • Training Time: 2 hours
  • Epochs: 5
  • Instance Type: ml.g5.24xlarge

Template for Fine-Tuning:

template = {
    "prompt": (
        "Below is a set of logs. Analyze the logs to identify any malicious "
        "IP addresses or suspicious activities.\n\n"
        "### Logs:\n{context}\n\n"
    ),
    "completion": " {response}",
}

with open("template.json", "w") as f:
    json.dump(template, f)
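During training, SageMaker JumpStart fills the {context} and {response} placeholders in this template from the fields of each JSONL record. Conceptually, the rendered training example looks like this (a simplified sketch, not JumpStart's internal code):

# Sketch: how one JSONL record is merged into the template
record = json.loads(open("train.jsonl").readline())
prompt = template["prompt"].format(context=record["context"])
completion = template["completion"].format(response=record["response"])
print(prompt + completion)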

Fine-Tuning Code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id, model_version = "meta-textgeneration-llama-2-7b", "4.*"
estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},
    disable_output_compression=True,
    instance_type="ml.g5.24xlarge"  # first attempt; the second attempt used ml.g5.48xlarge
)

estimator.set_hyperparameters(
    instruction_tuned="True",
    epoch="5",  # matches the 5 epochs listed in the training configuration above
    max_input_length="1024",
    per_device_train_batch_size="2"
)

print(f"Training with data located at: {train_data_location}")
estimator.fit({"training": train_data_location})
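The responses shown in the next section come from querying the fine-tuned model. A minimal deployment-and-inference sketch follows; the generation parameters and the sample log are illustrative assumptions, and the payload format follows the SageMaker JumpStart text-generation interface:

# Deploy the fine-tuned model to a real-time endpoint
predictor = estimator.deploy()

# Hypothetical log record for illustration
sample_log = {"c-ip": "000.010.00.1", "cs-method": "GET", "sc-status": 200}
payload = {
    "inputs": template["prompt"].format(context=json.dumps(sample_log)),
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
}
# Llama 2 endpoints require the EULA flag at inference time as well
response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response)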


Initial Results

The initial responses from the model were not very accurate:

[
  {
    "generated_text": "d25lthpqnccel7.cloudfront.net\n\n\n\n### Logs:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Response:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Logs:\nd25lthpqnccel7.cloudfront.net\n\n\n\n### Response"
  }
]

After some small adjustments to the prompt structure, the responses improved slightly, but they still weren't what we needed from LLaMA:

[
  {
    "generated_text": "Below is a set of logs. Analyze the logs to identify any malicious IP addresses or suspicious activities, tell me which one looks suspicious and why. If not, just write YES on the response\n\n### Logs:\n{\"date\": \"2024-04-30\", \"time\": \"23:56:10\", \"x-edge-location\": \"HEL51-P1\", \"sc-bytes\": 16766, \"c-ip\": \"196.244.191.18\", \"cs-method\": \"GET\", \"cs(Host)\": \"d28muy3dhw3j8o.cloudfront.net\", \"cs-uri-stem\": \"/\", \"sc-status\": 200, \"cs(Referer)\": \"-\", \"cs(User-Agent)\": \"Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)\", \"cs-uri-query\": \"-\", \"cs(Cookie)\": \"-\", \"x-edge-result-type\": \"Miss\", \"x-edge-request-id\": \"g0v9W73cZfrsBYru0hG80StZZxNAok5WOx42CbG_cTLDAvHdV2SdvA==\", \"x-host-header\": \"www.metaltoad.com\", \"cs-protocol\": \"https\", \"cs-bytes\": 150, \"time-taken\": 0.101, \"x-forwarded-for\": \"-\", \"ssl-protocol\": \"TLSv1.3\", \"ssl-cipher\": \"TLS_AES_128_GCM_SHA256\", \"x-edge-response-result-type\": \"Miss\", \"cs-protocol-version\": \"HTTP/1.1\", \"fle-status\": \"-\", \"fle-encrypted-fields\": \"-\", \"c-port\": 39792, \"time-to-first-byte\": 0.099, \"x-edge-detailed-result-type\": \"Miss\", \"sc-content-type\": \"text/html; charset=UTF-8\", \"sc-content-len\": \"-\", \"sc-range-start\": \"-\", \"sc-range-end\": \"-\", \"customer_id\": \"Bxy4m9XiyC4\", \"Autoblock\": false}\n\n### Response:\n N/A"
  }
]


Second Attempt: Larger Dataset and More Powerful Instance

In the second attempt, we expanded our dataset to include two months (April and May 2024) from the bucket (XPTO/xyz-com). Each day's folder contained approximately 660 CSV files, each with around 200 lines. This amounted to about 39,600 CSV files.

Training Configuration:

  • Training Time: 6 hours
  • Epochs: 5
  • Instance Type: ml.g5.48xlarge

Additional information: The use of different instance types (ml.g5.24xlarge for the first attempt and ml.g5.48xlarge for the second) was crucial in balancing cost and computational efficiency. The larger instance type allowed us to process a much larger dataset in a reasonable time frame, highlighting the importance of selecting the right resources for fine-tuning tasks.

Enhanced Template:

{
    "instruction": "Analyze the log entry to identify any malicious IP addresses or suspicious activities.",
    "context": json.dumps(log),
    "response": f"This log entry is identified as '{log.get('ThreatLevel', 'Unknown')}' based on its score and other parameters."
}


Although we increased both the dataset size and the instance type in the second attempt, we decided to halt the process because it significantly raised operational costs, making our proof of concept unviable.


Conclusion

This POC demonstrated the potential of fine-tuning Meta LLaMA 2 for malicious traffic detection. The initial results showed some promise, but there is still room for improvement. Future work will focus on optimizing the training process, refining the model's accuracy, and exploring more advanced configurations.
For further reference and detailed examples, you can check out the AWS SageMaker Example Notebook.
