
Transformer-Based NLP on Energy & Gas News

Overview

An applied NLP project using Energy & Gas news articles to explore modern transformer-based techniques, including sentiment classification, summarization, parameter-efficient fine-tuning, and Retrieval-Augmented Generation (RAG). The focus was on building and evaluating NLP systems end-to-end while understanding their practical limitations on real, domain-specific text data.

Key Contributions

  • Built transformer-based models in PyTorch, including encoder-only and encoder–decoder architectures, to analyze energy and gas news articles
  • Applied parameter-efficient fine-tuning (LoRA, adapters, prefix-tuning) to adapt a pretrained language model to domain-specific text
  • Implemented a Retrieval-Augmented Generation (RAG) pipeline to ground model responses using retrieved news content
  • Designed preprocessing pipelines for noisy, real-world news data using pandas and scikit-learn
  • Evaluated model outputs qualitatively and quantitatively, analyzing where fine-tuning and retrieval improved or failed to improve results
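The parameter-efficient fine-tuning idea above can be sketched as follows. This is a minimal NumPy illustration of the LoRA update rule (the project used PyTorch; layer sizes, rank, and the `lora_forward` helper are invented for illustration): the pretrained weight W stays frozen, and only two small low-rank factors A and B are trained, with the adapted layer computing W·x plus a scaled low-rank correction.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 512, 512, 8           # hidden sizes and LoRA rank (illustrative)
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor, small init
B = np.zeros((d_out, r))               # trainable factor, zero init -> no-op at start
alpha = 16                             # LoRA scaling hyperparameter

def lora_forward(x):
    # Frozen path plus scaled low-rank update: (W + (alpha/r) * B @ A) @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the adapted layer matches the frozen layer exactly,
# so fine-tuning starts from the pretrained model's behavior.
assert np.allclose(lora_forward(x), W @ x)

# Only A and B are trained: a small fraction of the full weight's parameters.
full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4f}")  # 0.0312 for r=8, d=512
```

This is why the approach is cheap: for rank 8 on a 512x512 layer, the trainable factors hold about 3% of the layer's parameters, and the frozen weight never needs gradients.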

Technical Implementation

  • Implemented and trained transformer models using PyTorch, focusing on attention mechanisms, sequence modeling, and training stability
  • Used Skip-gram embeddings and tokenized text representations for downstream NLP tasks
  • Built semantic retrieval using vector similarity to support RAG-based generation
  • Compared extractive and abstractive summarization approaches on energy-related news articles
  • Analyzed trade-offs between fine-tuning and retrieval-based approaches under limited data
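The semantic-retrieval step can be sketched with a tiny example. This is a hedged stand-in: the toy corpus and the `retrieve` helper are invented, and TF-IDF vectors substitute for whatever embedding model the project actually used, but the ranking-by-cosine-similarity mechanics are the same ones that feed a RAG pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the news archive (articles invented for illustration).
docs = [
    "Natural gas prices surged after pipeline maintenance cut supply.",
    "The utility announced a new offshore wind farm investment.",
    "Crude oil inventories fell for the third consecutive week.",
]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)  # one sparse vector per article

def retrieve(query, k=1):
    # Embed the query in the same vector space and rank documents by cosine similarity;
    # the top-k articles would then be passed to the generator as grounding context.
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, doc_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [(docs[i], float(scores[i])) for i in top]

hits = retrieve("why did gas prices rise?")
print(hits[0][0])  # the pipeline/supply article ranks first
```

Swapping TF-IDF for dense sentence embeddings changes only the vectorization step; the cosine-ranking and top-k selection stay identical.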

Results & Observations

  • Transformer-based classifiers captured sequence-level patterns better than embedding-only approaches
  • Extractive summarization was more reliable than abstractive summarization given limited training data
  • Parameter-efficient fine-tuning enabled domain adaptation with minimal computational cost
  • RAG improved transparency and grounding but depended heavily on retrieval quality
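The extractive approach that proved more reliable above can be illustrated with a minimal baseline. This sketch is an assumption, not the project's actual scorer: it splits on periods and ranks sentences by mean TF-IDF weight, whereas a real pipeline would use a proper sentence tokenizer and a stronger scoring scheme. The point is that extractive summarization only selects existing sentences, so it cannot hallucinate, which is what makes it robust under limited training data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, n_sentences=1):
    # Naive sentence split; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Score each sentence by the mean TF-IDF weight of its terms.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    # Keep the top-scoring sentences in their original document order.
    top = sorted(np.argsort(scores)[::-1][:n_sentences])
    return ". ".join(sentences[i] for i in top) + "."

article = (
    "Gas futures climbed sharply on Monday. Analysts cited an unexpected pipeline "
    "outage that reduced supply. Weather forecasts also pointed to higher heating demand."
)
summary = extractive_summary(article, n_sentences=1)
print(summary)
```

Because every output sentence is copied verbatim from the source article, the summary is grounded by construction, unlike an abstractive decoder that can drift on small fine-tuning sets.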

Tech stack

Python, PyTorch, pandas, scikit-learn, NLP, Transformers, LLM Fine-Tuning, RAG