Evaluating Large Language Model Versus Human Performance in Islamophobia Dataset Annotation

Rafizah Daud; Nurlida Basir; Nur Fatin Nabila Mohd Rafei Heng; Meor Mohd Shahrulnizam Meor Sepli; Melinda Melinda

Date Issued

2025

Author(s)

Rafizah Daud

Universiti Sains Islam Malaysia

Nurlida Basir

Universiti Sains Islam Malaysia

Nur Fatin Nabila Mohd Rafei Heng

Universiti Sains Islam Malaysia

Meor Mohd Shahrulnizam Meor Sepli

Melinda Melinda

Abstract

Manual annotation of large datasets is a time-consuming and resource-intensive process. Hiring annotators or outsourcing to specialized platforms can be costly, particularly for datasets requiring domain-specific expertise. Additionally, human annotation may introduce inconsistencies, especially when dealing with complex or ambiguous data, as interpretations can vary among annotators. Large Language Models (LLMs) offer a promising alternative by automating data annotation, potentially improving scalability and consistency. This study evaluates the performance of ChatGPT compared to human annotators in annotating an Islamophobia dataset. The dataset consists of fifty tweets collected from the X platform using the keywords Islam, Muslim, hijab, stop islam, jihadist, extremist, and terrorism. Human annotators, including experts in Islamic studies, linguistics, and clinical psychology, serve as a benchmark for accuracy. Cohen’s Kappa was used to measure agreement between the LLM and the human annotators. The results show substantial agreement between the LLM and language experts (0.653) and clinical psychologists (0.638), while agreement with Islamic studies experts was fair (0.353). Overall, the LLM demonstrated substantial agreement (0.632) with all human annotators. ChatGPT achieved an overall accuracy of 82%, a recall of 69.5%, an F1-score of 77.2%, and a precision of 88%, indicating strong effectiveness in identifying Islamophobia-related content. The findings suggest that LLMs can effectively detect Islamophobic content and serve as valuable tools for preliminary screening or as complementary aids to human annotation. Through this analysis, the study seeks to understand the strengths and limitations of LLMs in handling nuanced and culturally sensitive data, contributing to the broader discussion on the integration of generative AI in annotation tasks. While LLMs show great potential in sentiment analysis, challenges remain in interpreting context-specific nuances. This study underscores the role of generative AI in enhancing human annotation efforts while highlighting the need for continuous improvements to optimize performance.
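
For readers unfamiliar with the measures named in the abstract, the following is a minimal sketch of how Cohen's Kappa and the accuracy, precision, recall, and F1 figures are conventionally computed for a binary labelling task. It is not the authors' procedure or code. The function names `cohen_kappa` and `binary_metrics` are hypothetical, and the ten toy labels are invented for illustration; they are not taken from the study's fifty tweets. Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


def binary_metrics(reference, predicted, positive=1):
    """Accuracy, precision, recall and F1, treating `reference` as truth."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented toy labels (1 = Islamophobic, 0 = not); NOT the study's data.
human_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
llm_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohen_kappa(human_labels, llm_labels)
metrics = binary_metrics(human_labels, llm_labels)
print(f"kappa={kappa:.3f}", metrics)
```

On the toy labels the two raters agree on 8 of 10 items, while chance agreement is 0.52, giving a Kappa of about 0.58. Under the commonly used Landis and Koch bands, values from 0.21 to 0.40 are read as fair and values from 0.61 to 0.80 as substantial, which matches the wording the abstract uses for its reported scores.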

Subjects

Large Language Model

generative AI

human intelligence

automatic data annota...

sentiment analysis

islamophobia

ChatGPT

