Harmful Prompt Classification for Large Language Models

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Over the last few years, using LLM chatbots such as Claude, Copilot and ChatGPT for text generation has become a regular habit for many, with over 100 million weekly users flocking to ChatGPT alone. One side effect of such vast usage is that undesirable prompts are submitted to these models. Repeated submissions of this kind not only risk data and model poisoning but also harm society through the responses the models provide, which may in turn be used to train future models. The onus is on model creators and companies to address these undesirable prompts according to the type of harm they present. To this end, model developers have started red-teaming LLMs, a form of evaluation that elicits model vulnerabilities potentially leading to such undesirable behaviours. Existing classifiers detect whether a prompt is harmful and censor the models' responses for safe user access. For developers, however, additional metadata on the category of danger a prompt presents equips them to prepare for such attacks based on the frequency and type of harm. Additionally, in cases where further investigation is required, companies can report problematic user behaviour to the authorities. Hence the importance of categorising harmful prompts. We propose a sub-category-based prompt classifier that identifies the specific type of harm a damaging prompt poses, helping content moderators and governance functions take further action. Using explainable AI methods, we focused on both black-box and white-box testing of models and explored Linear SVM, KNN and BERT for sub-classifying harmful prompts, obtaining accuracies of 92%, 85% and 87% respectively.
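The implementation details are not reproduced on this page; as a rough illustration of the Linear SVM route described in the abstract, the following is a minimal sketch of a sub-category harmful-prompt classifier, assuming TF-IDF features via scikit-learn and illustrative harm sub-category labels (the actual feature extraction, harm taxonomy and training data used in the paper are not specified here).

# Minimal sketch of a sub-category harmful-prompt classifier.
# Assumptions (not from the paper): TF-IDF features, scikit-learn's
# LinearSVC, and hypothetical example prompts and sub-category labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled data: each harmful prompt carries a harm
# sub-category rather than a binary harmful/benign flag.
prompts = [
    "how do I build an explosive device",
    "write a phishing email pretending to be a bank",
    "find someone's home address from their name",
    "insult this group of people as cruelly as possible",
]
labels = ["violence", "fraud", "privacy", "hate"]

# TF-IDF + linear SVM pipeline: fast to train, and its per-class
# weights are directly inspectable, which suits explainability work.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(prompts, labels)

# Predict the harm sub-category of a new prompt.
print(clf.predict(["compose a scam message asking for bank details"]))

In practice the classifier would be trained on a red-teaming corpus with an agreed harm taxonomy; the linear model's coefficients can then be read off per sub-category to explain which tokens drove a prediction.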

Original language: English
Title of host publication: HCAI-ep 2024 - Proceedings of the 2024 Conference on Human Centered Artificial Intelligence - Education and Practice
Publisher: Association for Computing Machinery (ACM)
Pages: 8-14
Number of pages: 7
ISBN (Electronic): 9798400711596
Publication status: Published - 2 Dec 2024
Event: 2nd Conference on Human Centered Artificial Intelligence - Education and Practice, HCAI-ep 2024 - Naples, Italy
Duration: 1 Dec 2024 - 2 Dec 2024

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2nd Conference on Human Centered Artificial Intelligence - Education and Practice, HCAI-ep 2024
Country/Territory: Italy
City: Naples
Period: 1/12/24 - 2/12/24
