Fine-Tuned Sentiment Analysis with LoRA and DistilBERT
Introduction
This project implements a movie review sentiment analysis system using a fine-tuned DistilBERT transformer model. The model is trained and evaluated on the IMDb dataset using Hugging Face’s Transformers library, with GPU acceleration via CUDA on an NVIDIA GeForce RTX 4060 Laptop GPU.
How Does Fine-Tuning Work?
Large Language Models (LLMs) are pre-trained on massive, general-purpose datasets to learn broad linguistic patterns. This breadth, however, often comes at the cost of the specialization required for domain-specific applications such as sentiment analysis, customer support automation, or legal document review.
Traditional full fine-tuning updates all model parameters, which can be computationally expensive and time-consuming, often demanding hours or days of training and large amounts of GPU memory (VRAM).
Parameter-Efficient Fine-Tuning with LoRA
To address these limitations, this project uses LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) technique. Instead of updating all model weights, LoRA inserts small, trainable low-rank matrices into selected attention layers of the transformer.
What was trained: The base DistilBERT weights remained frozen throughout training. Only the LoRA adapter parameters were trained, and they are stored separately from the base model. During inference, the adapters are applied on top of the frozen base model, as sketched below.
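As a concrete illustration, here is a minimal sketch of this adapter setup using the PEFT library. The rank, scaling factor, dropout, and target modules (`q_lin` and `v_lin`, DistilBERT's attention query and value projections) are illustrative assumptions rather than this project's exact configuration.

```python
# Minimal LoRA setup sketch with Hugging Face PEFT (hyperparameters are
# illustrative assumptions, not this project's exact values).
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification head
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,                      # scaling factor for the LoRA update
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
)

# Wraps the frozen base model; only the injected adapter weights are trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # trainable params are a small fraction of the total
```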
QLoRA (Conceptual Extension)
Quantized Low-Rank Adaptation (QLoRA) combines LoRA with model quantization, typically loading the frozen base model in 4-bit precision. This further reduces the memory footprint and allows large models to be fine-tuned on consumer GPUs.
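QLoRA was not used in this project, but a conceptual sketch shows how it would extend the setup above: the base model is loaded in 4-bit precision via bitsandbytes before the LoRA adapters are attached. All settings here are illustrative, and a CUDA GPU plus the bitsandbytes package are assumed.

```python
# Conceptual QLoRA sketch: 4-bit base model + LoRA adapters (illustrative).
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, introduced by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the GPU
)
model = prepare_model_for_kbit_training(model)  # casts norms/head for training stability
model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16))
```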
Why Hugging Face Trainer and Pipeline Were Used
The Hugging Face Trainer API was used to standardize and simplify the training of transformer models. It abstracts away much of the boilerplate required for training, including the forward and backward passes, gradient updates, batching, GPU/CPU device placement, logging, checkpointing, and evaluation scheduling. Without the Trainer, a custom PyTorch training loop would be required, significantly increasing complexity. The pipeline API plays the complementary role at inference time, bundling tokenization, model execution, and label decoding into a single call.
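A minimal sketch of such a Trainer-based setup is shown below, reusing the PEFT-wrapped `model` from the earlier LoRA sketch. The output path, batch size, and epoch count are illustrative assumptions, not the project's exact values.

```python
# Sketch of Trainer-based fine-tuning on IMDb (settings are illustrative).
from transformers import AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
imdb = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = imdb.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./results",       # checkpoints and logs (illustrative path)
    per_device_train_batch_size=16,
    num_train_epochs=2,
    eval_strategy="epoch",        # named "evaluation_strategy" in older releases
    logging_steps=50,
)

trainer = Trainer(
    model=model,                  # the PEFT-wrapped model from the LoRA sketch
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()  # handles batching, device placement, gradients, checkpoints
```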
Model Performance Evaluation Workflow
After fine-tuning the DistilBERT sentiment classification model, an evaluation pipeline objectively measures its performance on unseen data. Evaluation was conducted using the IMDb test split, ensuring that the model was assessed on reviews it had not encountered during training.
The metrics below indicate strong generalization to these unseen reviews.
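A hedged sketch of how these metrics can be computed, via a `compute_metrics` callback passed to the Trainer, is shown here; the exact averaging settings used in this project are assumptions.

```python
# Sketch of the metric computation for binary sentiment classification.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class per review
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"  # positive sentiment treated as label 1
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Attach via Trainer(..., compute_metrics=compute_metrics), then:
# metrics = trainer.evaluate()  # runs on the eval_dataset (IMDb test split)
```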
Performance Metrics
- Accuracy: 0.8919
- Precision: 0.882
- Recall: 0.9044
- F1 Score: 0.8933
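Beyond aggregate metrics, individual reviews can be spot-checked with the pipeline API. The sketch below assumes the trained adapters were saved to a local `./lora-imdb` directory (an illustrative path) and merges them into the base weights for plain inference.

```python
# Sketch of single-review inference with the fine-tuned model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
model = PeftModel.from_pretrained(base, "./lora-imdb")  # load trained adapters
model = model.merge_and_unload()                        # fold adapters into base weights

clf = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased"),
    device=0,  # first CUDA GPU; use device=-1 to run on CPU
)
print(clf("A beautifully shot film with a forgettable plot."))
```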
Fine-Tuning vs. Retrieval-Augmented Generation (RAG)
RAG is best used for question-answering systems that require access to external or frequently changing knowledge. In a RAG setup, an LLM retrieves relevant documents from an external knowledge base and uses them as additional context when generating responses, reducing hallucination risks.
In contrast, fine-tuning modifies the model’s parameters to improve performance on a specialized task or domain. This approach has several drawbacks: model performance depends heavily on the quality and size of the training data; the model may forget knowledge acquired during pretraining (catastrophic forgetting); and it may lack real-world knowledge beyond what is present in the fine-tuning dataset.
For very large models (e.g., GPT-4-scale systems with hundreds of billions or trillions of parameters), RAG is generally preferred over fine-tuning due to the computational cost and risks associated with modifying such large models. In this project, fine-tuning was the appropriate choice because distilbert-base-uncased is a relatively small model with approximately 66 million parameters.