TY - GEN
T1 - Harmful Prompt Classification for Large Language Models
AU - Gupta, Ojasvi
AU - De La Cuadra Lozano, Marta
AU - Busalim, Abdelsalam
AU - R Jaiswal, Rajesh
AU - Quille, Keith
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/12/2
Y1 - 2024/12/2
N2 - Over the last few years, using LLM chatbots such as Claude, Copilot and ChatGPT for text generation has become a regular habit for many, with over 100 million weekly users flocking to ChatGPT alone. One side effect of such widespread usage is undesirable prompts being submitted to these models. Repeated submission of such prompts not only risks data and model poisoning but also causes harm to society through the responses the models provide, on which future models may in turn be trained. The onus is on model creators and companies to address these undesirable prompts according to the type of harm they present. To this end, model developers have started red-teaming LLMs, a form of evaluation that elicits model vulnerabilities potentially leading to such undesirable behaviours. Existing classifiers detect whether a prompt is harmful and censor the models' responses for safe user access. However, richer metadata on the category of danger a prompt presents equips developers to prepare for such attacks based on the frequency and type of harm. Additionally, in cases where further investigation is required, companies can report problematic user behaviour to the authorities. Hence the importance of categorising harmful prompts. We propose a sub-category-based prompt classifier that identifies the specific type of harm a damaging prompt represents, helping content moderators and governance functions take further action. Using explainable AI methods, we performed both black-box and white-box testing of models and explored Linear SVM, KNN and BERT for sub-classifying harmful prompts, obtaining accuracies of 92%, 85% and 87%, respectively.
AB - Over the last few years, using LLM chatbots such as Claude, Copilot and ChatGPT for text generation has become a regular habit for many, with over 100 million weekly users flocking to ChatGPT alone. One side effect of such widespread usage is undesirable prompts being submitted to these models. Repeated submission of such prompts not only risks data and model poisoning but also causes harm to society through the responses the models provide, on which future models may in turn be trained. The onus is on model creators and companies to address these undesirable prompts according to the type of harm they present. To this end, model developers have started red-teaming LLMs, a form of evaluation that elicits model vulnerabilities potentially leading to such undesirable behaviours. Existing classifiers detect whether a prompt is harmful and censor the models' responses for safe user access. However, richer metadata on the category of danger a prompt presents equips developers to prepare for such attacks based on the frequency and type of harm. Additionally, in cases where further investigation is required, companies can report problematic user behaviour to the authorities. Hence the importance of categorising harmful prompts. We propose a sub-category-based prompt classifier that identifies the specific type of harm a damaging prompt represents, helping content moderators and governance functions take further action. Using explainable AI methods, we performed both black-box and white-box testing of models and explored Linear SVM, KNN and BERT for sub-classifying harmful prompts, obtaining accuracies of 92%, 85% and 87%, respectively.
UR - https://www.scopus.com/pages/publications/85216580620
U2 - 10.1145/3701268.3701271
DO - 10.1145/3701268.3701271
M3 - Conference contribution
AN - SCOPUS:85216580620
T3 - ACM International Conference Proceeding Series
SP - 8
EP - 14
BT - HCAI-ep 2024 - Proceedings of the 2024 Conference on Human Centered Artificial Intelligence - Education and Practice
PB - Association for Computing Machinery (ACM)
T2 - 2nd Conference on Human Centered Artificial Intelligence - Education and Practice, HCAI-ep 2024
Y2 - 1 December 2024 through 2 December 2024
ER -