MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models
Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, Sophia Ananiadou
This February 2024 paper addresses the problem of automatically analysing mental health conditions from social media posts in an interpretable manner using large language models (LLMs).
The introduction provides an overview of the current state of mental health analysis on social media and the limitations of existing methods.
Traditional discriminative methods, such as pre-trained language models (PLMs), achieve state-of-the-art performance in mental health-related text classification tasks. However, these methods often struggle with poor generalisation to unseen tasks, lack robustness in multi-task scenarios, and provide predictions with low interpretability.
To overcome these limitations, the authors explore the use of recent LLMs, such as ChatGPT and GPT-4, for interpretable mental health analysis on social media.
These models have demonstrated superior generalisation capabilities and can provide detailed explanations for their decisions.
However, closed-source LLMs like ChatGPT still fail to achieve performance comparable to state-of-the-art supervised methods in zero-shot or few-shot settings, and these unreliable predictions significantly degrade the quality of the generated explanations.
The authors identify two key challenges in improving LLMs for interpretable mental health analysis through fine-tuning:
Lack of high-quality supervised training data that provides detailed and reliable explanations for detection results.
Absence of open-source LLMs specifically designed for interpretable mental health analysis.
To address these challenges, the authors formally model interpretable mental health analysis as a text-generation task, aiming to detect evidence of mental health conditions in social media posts and generate explanations for the predictions.
They build the first multi-task and multi-source Interpretable Mental Health Instruction (IMHI) dataset with 105K data samples to support LLM instruction tuning and evaluation.
The dataset is created by collecting raw data from various sources, using ChatGPT to generate explanations, and transforming the data into instruction-based query-answer pairs.
The authors collect raw data from 10 existing mental health analysis datasets spanning multiple social media sources, including Reddit, Twitter, and SMS texts.
These datasets come with high-quality annotations, which are crucial for explanation generation and AI-generated content evaluation.
The tasks covered in these datasets include:
Binary mental health detection: Identifying symptoms of a single mental health condition, with binary labels. Datasets used: Depression_Reddit (DR), CLPsych15 (CLP), Dreaddit (stress detection), and a loneliness symptom detection dataset.
Multi-class mental health detection: Identifying symptoms of one mental health condition from a given list of multiple conditions, modelled as a multi-class single-label classification task. Datasets used: T-SID and SWMH, covering depression, PTSD, anxiety, etc.
Mental health cause/factor detection: Assigning a label to a post showing a mental health condition to identify a possible cause/factor from a given list. Datasets used: SAD (stress cause detection) and CAMS (depression/suicide cause detection).
Mental health risk/wellness factors detection: Identifying psychological risk/wellness factors from social media posts, modelled as a classification task. Datasets used: IRF (interpersonal risk factors) and MultiWD (mental wellness dimensions).
Due to the lack of open-source data providing detailed explanations for the annotations, the authors leverage ChatGPT to generate explanations.
They ask domain experts to write 1 task-specific instruction and 35 explanation examples for each task, resulting in a gold explanation set G with 350 samples.
The explanations follow a template: "[label] Reasoning: [explanation]".
For each dataset, they randomly sample 2 explanations per class from G as few-shot examples and include supervised annotations from the raw datasets to construct prompts for ChatGPT to generate explanations.
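As a rough illustration of this prompt-construction step (the wording of the instruction and of the response-format request below is hypothetical; the paper's expert-written instructions are not reproduced here), a minimal sketch in Python might look like this:

```python
import random

def build_explanation_prompt(task_instruction, gold_explanations, post, gold_label,
                             shots_per_class=2):
    """Assemble a few-shot prompt asking ChatGPT to explain a known annotation.

    gold_explanations maps each class label to the expert-written examples from
    the gold set G, each following the "[label] Reasoning: [explanation]" template.
    """
    shots = []
    for label, examples in gold_explanations.items():
        shots.extend(random.sample(examples, shots_per_class))

    prompt = task_instruction + "\n\n" + "\n\n".join(shots)
    # The supervised label from the raw dataset is included so that ChatGPT
    # explains the annotated label rather than predicting one from scratch.
    prompt += (f"\n\nPost: {post}\nLabel: {gold_label}\n"
               f"Respond as: \"{gold_label} Reasoning: <your explanation>\"")
    return prompt
```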
Explanation Evaluation
The authors perform automatic and human evaluations to ensure the quality of the ChatGPT-generated explanations.
Automatic Evaluation
Three criteria are used:
Correctness: Explanations should make correct label predictions.
Consistency: Explanations should provide consistent analyses with the predicted labels.
Quality: Explanations should provide supportive evidence with high reliability and professionality.
For correctness, they compare dataset annotations with ChatGPT responses.
For consistency, they train classifiers using the explanation-label pairs and evaluate them on test splits and the gold explanation set G.
For quality, they use BARTScore to compare explanations generated with zero-shot, few-shot, and expert-written prompting strategies.
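Because every generated explanation follows the "[label] Reasoning: [explanation]" template, the correctness check can be reduced to parsing the label prefix and comparing it with the dataset annotation. A minimal sketch follows; the string-matching rule is an assumption, as the paper does not spell out its exact parsing logic:

```python
def parse_label(explanation: str) -> str:
    """Extract the predicted label from a '[label] Reasoning: [explanation]' response."""
    head, _, _ = explanation.partition("Reasoning:")
    return head.strip().rstrip(".").lower()

def correctness(explanations, gold_labels):
    """Fraction of generated explanations whose parsed label matches the annotation."""
    hits = sum(parse_label(e) == g.lower() for e, g in zip(explanations, gold_labels))
    return hits / len(gold_labels)
```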
Human Evaluation
In addition, 200 randomly selected explanations are assessed by domain experts on four aspects: consistency, reliability, professionality, and overall effectiveness, using a 0-3 rating scale.
The IMHI dataset is constructed using the posts from raw datasets and the evaluated ChatGPT-generated explanations.
Simplified instructions are used to adapt to less powerful LLMs. The training split consists of 72,095 samples, while the validation split has 14,346 samples.
An IMHI-completion dataset is also created using a different template for baseline models with poor instruction-following ability.
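To make the difference between the two formats concrete, the hypothetical pair below contrasts an instruction-style IMHI sample with its IMHI-completion counterpart (the exact templates in the released data may be worded differently):

```python
# Instruction-based query-answer pair (IMHI): suited to instruction-following LLMs.
imhi_sample = {
    "query": ("Consider this post: \"<post text>\" "
              "Question: Does the poster suffer from depression?"),
    "answer": ("Yes, the poster suffers from depression. "
               "Reasoning: <evaluated ChatGPT-generated explanation>"),
}

# Completion-based sample (IMHI-completion): the model simply continues the text,
# which suits baselines such as BART-large and T5-large with weak
# instruction-following ability.
imhi_completion_sample = (
    "Post: \"<post text>\"\n"
    "Depression detection and reasoning: Yes, the poster suffers from depression. "
    "Reasoning: <evaluated ChatGPT-generated explanation>"
)
```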
In this section, the authors describe the training process for their MentaLLaMA models using the IMHI dataset and the LLaMA2 models as the base.
They finetune the LLaMA2-7B model on the IMHI training set for 10 epochs.
The best model is selected based on the validation results on the IMHI validation set.
Training hyperparameters:
Batch size: 32
Gradient accumulation steps: 8 (leading to an effective batch size of 256)
Optimizer: AdamW
Max learning rate: 1e-5
Warm-up ratio: 3%
Max model input length: 2048
They use Flash-Attention to speed up the training process.
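A rough sketch of how this fine-tuning run could be reproduced with the Hugging Face transformers Trainer is shown below; the file paths, field names, and the flash-attention flag are assumptions, and the authors' actual training script may differ:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

BASE = "meta-llama/Llama-2-7b-hf"  # LLaMA2-chat-7B/13B for the chat variants

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Flash-Attention speed-up
)

# Hypothetical local copies of the IMHI splits as JSON lines with
# "query" and "answer" fields.
raw = load_dataset("json", data_files={"train": "imhi_train.jsonl",
                                       "validation": "imhi_valid.jsonl"})

def tokenize(example):
    # Concatenate query and answer into one sequence for causal LM training,
    # truncated to the 2048-token input limit.
    text = example["query"] + "\n" + example["answer"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

data = raw.map(tokenize, remove_columns=raw["train"].column_names)

args = TrainingArguments(
    output_dir="mentallama-7b",         # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=8,      # 8 per GPU x 4 A100s = batch size 32
    gradient_accumulation_steps=8,      # effective batch size 256
    learning_rate=1e-5,
    warmup_ratio=0.03,
    optim="adamw_torch",                # AdamW
    bf16=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the best IMHI-validation checkpoint
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```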
The MentaLLaMA-chat-7B and MentaLLaMA-chat-13B variants are trained on LLaMA2-chat-7B and LLaMA2-chat-13B, respectively.
LLaMA2-chat models are optimised with instruction tuning and are the first open-source LLMs tuned with reinforcement learning from human feedback (RLHF).
The training process and experimental settings are the same as for MentaLLaMA-7B.
To enable fair comparisons with baseline models that are fine-tuned in a completion-based manner, they train another LLaMA2-7B model on the IMHI-completion dataset.
All models are trained on 4 Nvidia Tesla A100 GPUs, each with 80GB of memory.
In this section, the authors present the experimental results and analysis of their proposed MentaLLaMA models in comparison to various baseline models on the IMHI test set.
The authors select the following baseline models for comparison:
Discriminative methods: Classification models that finetune PLMs like BERT and RoBERTa, including SOTA methods MentalBERT and MentalRoBERTa.
Zero-shot/few-shot methods: Open-source LLMs LLaMA2-7B and LLaMA2-13B for zero-shot prompting, and closed-source LLMs ChatGPT and GPT-4 for zero-shot and few-shot prompting.
Completion-based fine-tuning methods: SOTA generative PLMs BART-large and T5-large finetuned on the IMHI-completion dataset, along with a LLaMA2-7B model trained on the same data for fair comparison.
MentalBERT and MentalRoBERTa achieve SOTA performance on 8 out of 10 test sets among discriminative methods.
ChatGPT significantly outperforms LLaMA2 models in zero-shot settings, and few-shot learning further improves its performance.
Fine-tuning methods show significant improvement over LLaMA2 zero-shot results.
MentaLLaMA-7B outperforms completion-based LLaMA2-7B on 8 out of 10 test sets, showing the efficiency of domain-specific instruction tuning.
MentaLLaMA-chat-13B surpasses or closely matches MentalRoBERTa in 7 out of 10 test sets.
In completion-based methods, LLaMA2-7B outperforms its zero-shot counterpart, and BART-large is recommended for building a completion-based interpretable mental health analysis model.
In instruction tuning methods, MentaLLaMA greatly outperforms zero-shot results on LLaMA2-7B, and MentaLLaMA-chat models further improve the quality of explanations.
MentaLLaMA models achieve comparable performance to ChatGPT and GPT-4 on the expert-written gold set with much smaller model sizes.
MentaLLaMA models significantly outperform LLaMA2-13B and ChatGPT in zero-shot settings on unseen tasks.
MentaLLaMA-chat models generate higher quality explanations compared to T5 and BART on unseen tasks, especially in mental health conditions/cause detection and high-level mental health factors.
MentaLLaMA-chat-13B further improves the explanation quality compared to MentaLLaMA-chat-7B, showing the benefit of model size expansion.
MentaLLaMA-chat-13B achieves high scores on consistency and reliability, comparable to ChatGPT.
However, MentaLLaMA underperforms ChatGPT in professionality, indicating a lack of domain-specific knowledge. Continual pre-training on high-quality mental health-related data is suggested as a solution.
Overall, the results demonstrate the effectiveness of the MentaLLaMA models in achieving high correctness, generating quality explanations, and exhibiting strong generalisability to unseen tasks in the domain of interpretable mental health analysis on social media.
In this paper, the authors introduce the novel task of interpretable mental health analysis and present the first multi-task and multi-source dataset, IMHI, which contains 105K data samples for instruction tuning.
They used ChatGPT to generate the training data and perform rigorous automatic and human evaluations to ensure its reliability.
Building upon the IMHI dataset, the authors propose MentaLLaMA, the first open-source large language model series designed for interpretable mental health analysis with instruction-following capabilities.
Evaluations on the IMHI benchmark demonstrate that MentaLLaMA achieves performance comparable to state-of-the-art discriminative methods in terms of correctness and generates explanations on par with human-level quality. Additionally, MentaLLaMA exhibits strong generalisability to unseen tasks.
However, the authors acknowledge that MentaLLaMA still lacks domain-specific knowledge compared to powerful models like ChatGPT.
Future work will explore continual pre-training of MentaLLaMA on large-scale, high-quality mental health-related data to enhance the professionality of its explanations.
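For readers who want to try the released checkpoints, here is a minimal generation sketch; the Hugging Face model identifier is assumed from the public MentaLLaMA release and should be checked against the official repository, and the prompt wording is illustrative rather than the official evaluation template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "klyang/MentaLLaMA-chat-7B"  # verify against the official release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16,
                                             device_map="auto")

query = ("Consider this post: \"I haven't slept properly in weeks and nothing "
         "feels worth doing anymore.\" "
         "Question: Does the poster suffer from depression? "
         "Explain your reasoning.")

inputs = tokenizer(query, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Print only the newly generated tokens (the model's label and explanation).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```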