Large Language Model Based Text Augmentation Enhanced Personality Detection Model
Last updated
Copyright Continuum Labs - 2023
This March 2024 paper proposes an approach for personality detection using large language models (LLMs) to generate text augmentations and enhance the performance of a smaller model, even when the LLM itself fails at the task.
The paper addresses the challenge of personality detection from social media posts, which aims to infer an individual's personality traits based on their online content.
A major hurdle is the limited availability of ground-truth personality labels, as they are typically collected through time-consuming self-report questionnaires.
Existing methods often directly fine-tune pre-trained language models using these scarce labels, resulting in suboptimal post feature quality and detection performance.
Additionally, treating personality traits as one-hot classification labels fails to capture the rich semantic information they contain.
To tackle these issues, the authors propose an approach that leverages LLMs to enhance a small model for personality detection.
The key idea is to distil the LLM's knowledge to improve the small model's performance, even when the LLM itself may not excel at the task.
The proposed method consists of two main components:
Data augmentation
The LLM generates post analyses (augmentations) covering the semantic, sentiment, and linguistic aspects of each post, all of which carry cues for personality detection.
Contrastive learning is then used to align the original post and its augmentations in the embedding space, enabling the post encoder to better capture psycho-linguistic information.
Label enrichment
The LLM generates explanations for the complex personality labels, providing additional information that improves detection performance.
Experimental results on benchmark datasets demonstrate that the proposed model outperforms state-of-the-art methods for personality detection.
The research tackles a significant challenge in computational psycho-linguistics by addressing the scarcity of labeled data for personality detection.
The novelty of the approach lies in its clever use of LLMs to augment both the input data and the personality labels, even when the LLM itself may not perform well on the task.
From a linguistic perspective, the choice to generate augmentations from semantic, sentiment, and linguistic aspects is well-motivated, as these factors have been shown to be relevant for personality detection.
By incorporating this information through contrastive learning, the model can learn richer post representations that better capture personality-related cues.
Contrastive Learning Approach
Contrastive learning is a machine learning technique that aims to learn effective representations by comparing and contrasting similar and dissimilar samples.
In the context of the TAE model, contrastive learning is employed to learn better post representations by leveraging the post augmentations generated by the LLM.
The main idea behind contrastive learning in TAE is to pull the representations of the original post and its augmentations closer together in the embedding space, while pushing the representations of dissimilar posts further apart.
This is achieved by using a contrastive loss function, such as the InfoNCE loss, which maximises the similarity between the original post and its augmentations while minimising the similarity between the original post and other posts in the batch.
By training the model with this contrastive objective, the post encoder learns to capture more informative and discriminative features that are relevant to personality detection.
The LLM-generated augmentations provide additional personality-related information from semantic, sentiment, and linguistic perspectives, which helps the model to learn richer post representations.
Consequently, the learned representations are more effective for the downstream task of personality detection, leading to improved performance.
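The contrastive objective described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy two-dimensional embeddings, and the temperature of 0.1 are all assumptions, and a real model would compute this on BERT embeddings with a tensor library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(posts, augmentations, temperature=0.1):
    """Mean InfoNCE loss with in-batch negatives.

    posts[i] and augmentations[i] form the positive pair; the
    augmentations of every other post in the batch act as negatives.
    """
    losses = []
    for i, post in enumerate(posts):
        logits = [cosine(post, aug) / temperature for aug in augmentations]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (made-up numbers, for illustration only).
posts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]     # each augmentation sits near its own post
misaligned = [[0.1, 0.9], [0.9, 0.1]]  # augmentations swapped between posts

loss_aligned = info_nce(posts, aligned)
loss_misaligned = info_nce(posts, misaligned)
print(loss_aligned < loss_misaligned)  # True
```

The loss is small when each post is closest to its own augmentation and large when it is closer to another post's augmentation, which is the pressure that shapes the post encoder during training.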
The idea of using LLMs to enrich personality labels is also compelling, as it helps to mitigate the issue of treating labels as one-hot vectors and leverages the LLM's ability to generate explanations.
This can provide the model with a more nuanced understanding of the labels and their implications.
From a deep learning standpoint, the use of contrastive learning to align the original post and its augmentations is a sound approach, as it has been successfully applied in various representation learning tasks.
The fact that the augmentations are not needed during inference is also a practical advantage, as it keeps the model efficient.
Text Augmentation: The model uses LLMs to generate post analyses (augmentations) covering three aspects that matter for personality detection: semantic, sentiment, and linguistic. Contrastive learning then pulls each original post and its augmentations together in the embedding space, so the post encoder captures more of the psycho-linguistic information in the post.
Label Enrichment: To address the complexity of personality labels and improve detection performance, the model uses LLMs to generate additional explanations for the labels, enriching the label information.
Contrastive Post Encoder: The model employs a contrastive post encoder to learn better post representations by capturing more psycho-linguistic information from the post augmentations. Notably, this method does not introduce extra costs during inference.
Knowledge Distillation: The proposed approach distils useful knowledge from LLMs to enhance the small model's personality detection capabilities, both on the data side (through text augmentation) and the label side (through label enrichment).
Traditional methods rely on manual feature engineering, such as extracting psycho-linguistic features from LIWC or statistical text features from bag-of-words models.
Deep learning approaches, including CNN, LSTM, and GRU, have been employed for personality detection tasks with significant success.
Recent works leverage pre-trained language models, such as BERT and RoBERTa, by concatenating a user's utterances into a document and encoding it.
Some methods improve upon this by incorporating contextual information and external psycho-linguistic knowledge from LIWC, using techniques like Transformer-XL, GAT, and DGCN.
However, these methods still suffer from limited supervision due to insufficient personality labels, leading to inferior post embeddings and affecting model performance.
Contrastive learning has been widely used for self-supervised representation learning in various domains, including NLP.
In self-supervised scenarios, data augmentation strategies have been proposed for contrastive learning using unlabelled data.
In supervised scenarios, SimCSE demonstrated the effectiveness of using NLI datasets for learning sentence embeddings by treating entailment labels as positive samples.
CLAIF uses LLMs to generate similar texts and similarity scores, which serve as positive samples and as weights in the InfoNCE loss, and reports improved performance as a result.
The proposed approach differs by generating task-specific data augmentations from LLMs to enhance personality detection performance.
Knowledge distillation is used to transfer knowledge from larger, more powerful teacher models into smaller student models, enhancing their performance in practical applications.
Generating additional training data from LLMs to improve smaller models has become a new trend in knowledge distillation.
Self-instruct distils instructional data to enhance the instruction-following capabilities of pre-trained language models.
The authors also provide an example of personality detection using LLMs, highlighting that LLMs primarily infer personality traits based on post semantics.
However, previous studies demonstrate that sentiment and linguistic patterns often reveal more about a person's psychological state than the semantics of their communication content. The authors argue that LLMs fail to capture these aspects for personality detection.
To address this issue, the authors propose generating knowledgeable post augmentations from LLMs.
The proposed method, named TAE (Text Augmentation Enhanced Personality Detection), aims to address the challenges in personality detection by leveraging the capabilities of Large Language Models (LLMs) to generate knowledgeable post augmentations and enrich label information.
The method consists of several key components:
The LLM is instructed to generate post analyses (augmentations) from three aspects: semantic, sentiment, and linguistic.
These augmentations serve as additional information to the original post, providing personality-related knowledge.
The model employs a contrastive learning approach to learn effective post representations by pulling semantically similar items together and pushing dissimilar ones apart.
The post augmentations generated by the LLM are used as positive samples for contrastive learning.
A projection head (MLP) is added to mitigate the distribution difference between the original post and the analysis text.
LLM-Enriched Label Information
The LLM is used to generate explanations of each personality trait from semantic, sentiment, and linguistic perspectives, enriching the label information.
The analysis of LLM performance on personality detection reveals that LLMs primarily infer personality traits based on post semantics, failing to capture sentiment and linguistic patterns that are crucial for personality detection.
This observation motivates the proposed method to generate post augmentations from these aspects to enhance the model's ability to capture personality-related features.
In summary, the TAE method leverages LLMs to generate knowledgeable post augmentations and enrich label information, enabling the model to learn more effective representations for personality detection.
The contrastive learning approach helps align semantically similar entities and distance dissimilar ones, while the soft labelling technique addresses the over-confidence issue in binary classification.
The researchers ran their experiments on two widely used personality detection datasets: Kaggle and Pandora. Let's explore these datasets in detail.
Source: The Kaggle dataset is collected from PersonalityCafe, an online platform where users share their personality types and engage in daily communications.
Size: It contains data from 8,675 users, with each user contributing 45-50 posts.
Labels: Personality labels are based on the MBTI (Myers-Briggs Type Indicator) taxonomy, which categorises personality along four dimensions: Introversion vs. Extroversion (I/E), Sensing vs. Intuition (S/N), Thinking vs. Feeling (T/F), and Perception vs. Judging (P/J).
Source: The Pandora dataset is collected from Reddit, where personality labels are extracted from users' self-introductions containing MBTI types.
Size: It contains data from 9,067 users, with each user contributing dozens to hundreds of posts.
Labels: Personality labels are also based on the MBTI taxonomy, similar to the Kaggle dataset.
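Because each MBTI type is a combination of four binary dimensions, detection is usually framed as four binary classification problems. The sketch below shows one way to convert a four-letter type into four binary labels and back. The helper names and the 0/1 convention (0 for the first letter of each pair, 1 for the second) are assumptions made for illustration, not something specified in the paper.

```python
# One (first letter, second letter) pair per MBTI dimension, in the
# order the letters appear in a type string such as "INTJ".
DIMENSIONS = [("I", "E"), ("S", "N"), ("T", "F"), ("P", "J")]

def encode_mbti(mbti_type):
    """Map a four-letter MBTI type to four binary labels."""
    letters = mbti_type.strip().upper()
    if len(letters) != len(DIMENSIONS):
        raise ValueError(f"expected a four-letter type, got {mbti_type!r}")
    labels = []
    for letter, pair in zip(letters, DIMENSIONS):
        if letter not in pair:
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
        labels.append(pair.index(letter))
    return labels

def decode_mbti(labels):
    """Inverse of encode_mbti: four binary labels back to a type string."""
    return "".join(pair[label] for pair, label in zip(DIMENSIONS, labels))

print(encode_mbti("INTJ"))        # [0, 1, 0, 1]
print(decode_mbti([1, 0, 1, 0]))  # ESFP
```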
These datasets were chosen because they are widely used in personality detection research, allowing for comparisons with previous studies.
Additionally, the datasets provide a substantial amount of user-generated content along with self-reported MBTI personality labels, making them suitable for training and evaluating personality detection models.
The use of these datasets in the experiment suggests that the researchers aimed to evaluate their proposed method (TAE) on well-established benchmarks in the field.
By using the same data split as previous studies (60% training, 20% validation, 20% testing), they ensure a fair comparison with existing methods.
However, the authors note that both datasets are severely imbalanced, which can distort evaluation and hurt generalisation.
To account for this, they report the Macro-F1 metric, which averages the per-class F1 scores so that the minority class in each personality dimension counts as much as the majority class.
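To see why Macro-F1 suits imbalanced data, consider a single binary dimension where one class dominates. The sketch below is a plain-Python illustration with made-up labels; the function names are assumptions, and in practice a library routine would normally be used instead.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Imbalanced toy dimension: eight users in class 0, two in class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a degenerate model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(accuracy)         # 0.8
print(round(macro, 4))  # 0.4444
```

The majority-class predictor looks reasonable on accuracy but scores poorly on Macro-F1, because its F1 on the minority class is zero.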
To further improve the datasets, researchers could consider:
Collecting more diverse data from various online platforms to reduce dataset bias and improve generalisation.
Balancing the dataset by gathering more samples for underrepresented personality types or using data augmentation techniques.
Exploring alternative personality taxonomies beyond MBTI to capture a broader range of personality traits.
In the experiment, the datasets were used as follows:
The posts from each user were concatenated or encoded individually, depending on the model architecture.
The proposed method (TAE) utilised the LLM to generate post augmentations based on semantic, sentiment, and linguistic aspects, which were then used to enhance the post representations through contrastive learning.
The models were trained on the training set, with hyperparameters tuned using the validation set.
The final performance was evaluated on the testing set using the Macro-F1 metric, allowing for a comparison with baseline models.
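The 60/20/20 protocol can be sketched as a user-level split, so that all posts from one user land in the same partition. This is an illustration only: the shuffling, the seed, and the function name are assumptions, and the paper reuses the split from previous studies rather than drawing a new one.

```python
import random

def split_users(user_ids, train_pct=60, valid_pct=20, seed=0):
    """Shuffle user ids and cut them into train/validation/test lists."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(ids) * train_pct // 100
    n_valid = len(ids) * valid_pct // 100
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test

# 8,675 is the number of users in the Kaggle dataset.
train_ids, valid_ids, test_ids = split_users(range(8675))
print(len(train_ids), len(valid_ids), len(test_ids))  # 5205 1735 1735
```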
Overall, the choice and use of the Kaggle and Pandora datasets provide a solid foundation for evaluating the proposed TAE method in the context of personality detection, while also highlighting potential areas for improvement in future research.
The overall results show that the proposed TAE model consistently outperforms all the baselines on Macro-F1.
Compared with the best baselines, D-DGCN on the Kaggle dataset and D-DGCN+l0 on the Pandora dataset, TAE improves Macro-F1 by 1.01% and 1.68%, respectively.
This superiority is attributed to two main factors:
The post augmentations generated by the LLM enable the contrastive post encoder to extract information more conducive to personality detection.
The generated explanations of personality labels effectively assist in accomplishing the detection task.
The authors also highlight that TAE achieves a marked improvement over BERTmean, with gains of 5.83% on the Kaggle dataset and 6.53% on the Pandora dataset. This demonstrates the advantage of data augmentations from LLMs in data-scarce situations.
Furthermore, the ablation study conducted on the Kaggle dataset reveals the importance of each component in the TAE model.
Among the three aspects of post augmentations (semantic, sentiment, and linguistic), linguistic augmentation proves to be the most influential, while semantic information is the least important.
This finding aligns with observations in previous works, suggesting that semantic information is relatively less crucial for personality detection compared to sentiment and linguistic aspects.
The ablation study also shows that removing the LLM-based label information enrichment slightly decreases the model's performance. This indicates that both the LLM-generated post augmentations and label information enrichment contribute to the overall effectiveness of the TAE model.
The authors also compare two ways of using the LLM-generated analysis texts: as data augmentations for contrastive post representation learning, and directly as additional input. The contrastive learning paradigm used in TAE proves more effective than simply feeding the analysis texts in as extra input.
The key lessons learned from this study are:
Leveraging LLMs to generate post augmentations and enrich label information can effectively improve personality detection performance, even when LLMs themselves struggle with the task.
Linguistic aspects play a more crucial role in personality detection compared to semantic information.
Using LLM-generated analysis texts for contrastive learning is more effective than directly incorporating them as additional input.
Distilling knowledge from LLMs to enhance small models is a promising approach for personality detection, rather than directly applying LLMs to the task.
These findings provide valuable insights for future research in personality detection and highlight the potential of leveraging large language models to improve the performance of smaller, task-specific models.
In conclusion, the research paper presents a novel approach called TAE (Text Augmentation Enhanced Personality Detection) that leverages large language models (LLMs) to improve personality detection performance.
By generating knowledgeable post augmentations and enriching label information, TAE effectively addresses the challenges of limited labeled data and the complexity of personality traits.
The contrastive learning approach employed in TAE helps to learn more informative post representations, while the LLM-based label enrichment provides a more nuanced understanding of personality labels.
Experimental results on benchmark datasets demonstrate the superiority of TAE over state-of-the-art methods, highlighting the potential of distilling knowledge from LLMs to enhance smaller, task-specific models.
This research opens up new avenues for leveraging the power of LLMs in personality detection and other related fields, paving the way for more accurate and efficient models that can better understand and predict human personality traits from online data.
The augmented data consists of the original post set together with three sets of analysis texts, one each for the semantic, sentiment, and linguistic aspects.
BERT is used to obtain sentence representations for both the original posts and their augmentations.
The contrastive loss is calculated using the post-wise info-NCE loss with in-batch negatives, encouraging the model to learn post embeddings that capture more personality-related features.
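For reference, a standard form of the post-wise InfoNCE loss with in-batch negatives is given below. The notation is assumed for illustration, since the paper's own symbols are not reproduced here: $h_i$ is the projected embedding of post $i$, $h_i^{a}$ is the embedding of its analysis text for aspect $a$, $\mathrm{sim}$ is cosine similarity, $\tau$ is a temperature, and $N$ is the batch size. The paper's exact formulation may differ in detail.

```latex
\mathcal{L}_{\mathrm{cl}}
  = -\frac{1}{3N} \sum_{i=1}^{N} \sum_{a \in \{\mathrm{sem},\, \mathrm{sen},\, \mathrm{lin}\}}
    \log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{a}) / \tau\right)}
              {\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{a}) / \tau\right)}
```

The numerator rewards similarity between a post and its own analysis text, while the denominator penalises similarity to the analysis texts of the other posts in the batch.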
The LLM generates a description for each personality label, and each description covers the semantic, sentiment, and linguistic perspectives.
Label representations are obtained by averaging the BERT embeddings of the label descriptions.
Soft labels are generated based on the similarity between the user embedding and the label embeddings, addressing the over-confidence issue in binary classification.
The original one-hot labels are combined with the soft labels using a controlling parameter α and an additional softmax function to obtain the final labels.
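One plausible reading of this step is sketched below for a single binary dimension. The exact formula is not reproduced in this summary, so the recipe here, softmax(α · one_hot + soft), is an assumption based on the description above, as are the dot-product similarity, the toy embeddings, and the value of α.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combined_label(user_emb, label_embs, gold_index, alpha=4.0):
    """Blend the one-hot gold label with similarity-based soft labels."""
    # Soft labels: similarity of the user embedding to each label embedding.
    soft = softmax([dot(user_emb, emb) for emb in label_embs])
    one_hot = [1.0 if k == gold_index else 0.0 for k in range(len(label_embs))]
    # alpha controls how strongly the gold label dominates the soft labels.
    return softmax([alpha * y + s for y, s in zip(one_hot, soft)])

# Hypothetical embeddings for the two poles of one dimension (e.g. I and E).
label_embs = [[1.0, 0.0], [0.0, 1.0]]
user_emb = [0.2, 0.8]  # this user looks closer to the second label

target = combined_label(user_emb, label_embs, gold_index=0)
print(target)
```

The resulting target still favours the gold label but keeps some probability mass on the other class, which is what counters over-confident binary targets.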
The model employs T softmax-normalised linear transformations, one per personality trait, to predict the traits.
The detection loss is calculated as the KL-divergence between the predicted labels and the combined labels.
The overall training objective is a weighted sum of the detection loss and the contrastive loss, balanced by a trade-off parameter.
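The detection loss and the overall objective can be sketched as follows. The direction of the KL-divergence (target against prediction), the choice to weight the contrastive term rather than the detection term, and all names and values are assumptions made for illustration.

```python
import math

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

def total_loss(targets, predictions, contrastive_loss, trade_off=0.5):
    """Sum the per-trait KL terms, then add the weighted contrastive loss."""
    detection = sum(kl_divergence(t, p) for t, p in zip(targets, predictions))
    return detection + trade_off * contrastive_loss

# Two traits shown for brevity; MBTI would use four. Numbers are made up.
targets = [[0.9, 0.1], [0.2, 0.8]]
good_predictions = [[0.9, 0.1], [0.2, 0.8]]
poor_predictions = [[0.5, 0.5], [0.5, 0.5]]

print(total_loss(targets, good_predictions, contrastive_loss=0.3))
print(total_loss(targets, poor_predictions, contrastive_loss=0.3))
```

When the predictions match the combined labels exactly, the detection term vanishes and only the weighted contrastive term remains.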