LLM-Driven Data Labeling for Training Machine Learning Models

Keywords: large language models (LLMs), networking, datasets

Personnel: Bigyan Karki, Kemal Akkaya, Hadi Sahin

Grant: US National Science Foundation Cybersecurity Innovation for Cyberinfrastructure (CICI)

Project Summary

The LLMDaL project leverages generative Artificial Intelligence (AI) and data from the AmLight international research and education (R&E) network to provide an essential, previously unavailable building block for automated network defense. The growing complexity and sophistication of modern networks are driving the need for automated cybersecurity and network management. Operators of critical infrastructure must increasingly rely on AI to cope with the sheer scale of information and with adversaries' growing use of AI. However, effective AI defenses depend on both the quantity and the quality of training data, and the lack of high-quality, labeled datasets from production environments presents a significant barrier. Without access to such datasets, advanced models often remain untested in real-world scenarios and fail to learn the complexity and uncertainty of production environments, which limits their effectiveness. Consequently, AI models intended for critical infrastructure defense are prone to fail when deployed.

LLMDaL uses Large Language Models (LLMs) to automatically label packet-level data collected from AmLight, which is maintained at Florida International University. Providing such data poses substantial technical, financial, and privacy challenges. To label this real-world data accurately and quickly, open-source LLMs are fine-tuned on data gathered from AmLight, known threat signatures, and expert-annotated cybersecurity events. The labels are validated through a Retrieval-Augmented Self-Refinement process, cross-checking with an ensemble of LLMs, and human-in-the-loop verification. LLMDaL fills a critical gap in automating dataset labeling, enabling effective testing of AI models for real-world environments. LLMDaL will release datasets from AmLight in batches to reflect the evolving threat landscape.
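
The ensemble cross-check and human-in-the-loop steps can be illustrated with a minimal Python sketch. This is not the project's implementation: the function cross_check, the 0.6 agreement threshold, the record format, and the three rule-based stand-in labelers are all hypothetical, chosen only to show how disagreement among several labelers might route a record to a human reviewer. In practice each labeler would wrap a call to a fine-tuned LLM.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical labeler type: takes a text summary of a flow, returns a label.
# In a real pipeline each labeler would wrap one fine-tuned LLM.
Labeler = Callable[[str], str]


def cross_check(record: str, labelers: List[Labeler],
                min_agreement: float = 0.6) -> Dict[str, object]:
    """Majority-vote a label and flag low-agreement records for human review.

    min_agreement is an assumed threshold, not a value from the project.
    """
    if not labelers:
        raise ValueError("at least one labeler is required")
    votes = [labeler(record) for labeler in labelers]
    # most_common(1) returns the label with the highest vote count.
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return {
        "label": label,
        "agreement": agreement,
        "needs_human_review": agreement < min_agreement,
        "votes": votes,
    }


# Rule-based stand-ins for LLM labelers (illustrative only).
def always_benign(record: str) -> str:
    return "benign"


def syn_rule(record: str) -> str:
    return "scan" if "SYN" in record else "benign"


def fanout_rule(record: str) -> str:
    return "scan" if "many_ports" in record else "benign"


if __name__ == "__main__":
    ensemble = [always_benign, syn_rule, fanout_rule]
    result = cross_check("SYN many_ports", ensemble)
    print(result["label"], result["needs_human_review"])
```

Records flagged with needs_human_review would go to an expert annotator, while the rest would be accepted with the majority label. A stricter threshold sends more records to reviewers in exchange for fewer mislabeled ones.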