On Interpretation and Measurement of Soft Attributes for Recommendation

This May 2021 paper, titled "On Interpretation and Measurement of Soft Attributes for Recommendation", focuses on the challenge of interpreting and measuring soft attributes in the context of recommender systems.

Soft attributes are the qualities people refer to in natural language refinements or critiques when expressing preferences about items, such as the originality of a movie plot, the noisiness of a venue, or the complexity of a recipe.

The authors argue that while binary tagging is widely studied in recommender systems, soft attributes often involve subjective and contextual aspects that cannot be reliably captured or represented as objective binary truth in a knowledge base.

This subjectivity adds important considerations when measuring how well systems rank items by soft attributes.

The paper makes three main contributions:

Development of a reusable test collection

The authors create a set of soft attributes and ground truth item orderings with respect to those attributes (for particular users), along with an evaluation metric.

They use a novel controlled multi-stage crowd labelling mechanism to collect ground truth of personalised partial orderings while keeping workers' cognitive load low.

They also propose a novel weighted extension to an established rank correlation measure, based on agreement with the structured ground-truth ranking.

Quantification of the subjectivity or "softness" of soft attributes

The authors identify ways to differentiate more from less subjective soft attributes and measure how this affects item scoring. This result also has implications for standard tagging, highlighting the role of subjectivity.

Addressing the problem of critiquing based on soft attributes

The authors present empirical evidence to demonstrate the importance of debiased collection of ground truth.

They introduce three families of methods for the task of ranking items relative to a given anchor item with respect to a given soft attribute: unsupervised, weakly supervised, and fully supervised.

They compare these methods on two test collections, one based on existing social tags and another constructed using their proposed approach. The results show a discrepancy between the two test collections, indicating that the tag-based test collection is blind to item ranking improvements, making progress in this area difficult.

The authors also analyse performance with respect to attribute "softness" and find that methods perform significantly better on attributes with higher agreement as opposed to those with low agreement.

In summary, this work formalises the notion of soft attributes, opening up new possibilities for more natural interactions with conversational recommender systems.

The authors' technical contributions include an efficient method for debiased collection of ground truth for comparing items with respect to a given soft attribute, a measure to quantify soft attribute subjectivity, and the introduction and formalisation of the task of critiquing based on soft attributes.

Related Work

Conversational Recommender Systems

The authors highlight that conversation has become a key modality for recommender systems.

Conversational interaction is of particular interest to the research community in the broader context of information seeking and recommendation.

Conversational recommendation is distinct from early work on slot filling and faceted search, as the sequence of exchanges between the user and the system is less rigid in structure and often allows for natural language dialogue.

The authors mention Radlinski and Craswell's work, which postulated specific desirable properties of conversational search and recommendation systems, with critiquing being a core property.

Various aspects of conversation have been addressed in the literature, including selecting preference elicitation questions, deep reinforcement learning models to understand user responses, multi-memory neural architectures to model preferences over attributes, and neural models for recommendation directly based on conversations.

The authors position their work as a continuation of this thread, with a focus on semantic understanding of user utterances at a level of detail not previously addressed.

Critiquing in Recommender Systems

Critiquing is a specific interaction in conversational recommendation where the system seeks user reactions to items or sets of items.

Critiquing-based recommendation systems make recommendations and then elicit feedback in the form of critiques. Users may provide feedback on various facets of importance, such as the airline and cost of a flight, or time and date of travel, with respect to the options presented.

This process is often repeated multiple times before the user makes a final selection. The authors mention previous work on user interfaces facilitating critiquing and conversational recommendation systems that allow users to affect recommendations along standard item attributes, such as movie genres.

They also discuss a product search model that incorporates negative feedback on specific item properties (aspect-value pairs).

However, the authors' work focuses on soliciting unconstrained natural language feedback, not limited to predefined item properties, which more closely resembles human-to-human conversation.

Additionally, the authors discuss previous work on modelling how different users may have different definitions for particular terms, such as what constitutes a "safe car."

Comparative Opinion Mining

Comparative opinion mining deals with identifying and extracting information expressed in comparative form, a task distinct from standard opinion mining.

The authors discuss the early computational approach to comparative sentence extraction by Jindal and Liu, which involves identifying comparative elements such as entities, attributes/aspects, comparative predicates, and comparison polarity.

More recent work employs semantic role labelling techniques for this task. Comparative sentences can be used in various ways, such as determining which of two entities is better overall or obtaining a global ranking of entities on a given aspect.

A typical approach is to build a directed graph of entities, where edge weights encode the degree of belief that one entity is better than the other on a given aspect, and then rank entities by some measure of graph centrality. However, the aspects considered in previous work are often limited and come from a fixed ontology.

The authors also note that comparative opinion statements in natural text are uncommon, with estimates suggesting that only 10% of sentences in typical reviews contain a comparison.

The most important difference in the authors' work is that they aim to interpret arbitrary critiques, allowing direct navigation of the recommendation space, and design a data collection and evaluation specifically for this task without limiting themselves to common terms or reviews.

Concept of soft attributes

In this section, the authors introduce and formally define the concept of soft attributes, which is central to their work on interpreting natural language critiques in recommender systems.

Key points about soft attributes:

Definition: A soft attribute is a property of an item that is not a verifiable fact that can be universally agreed upon, and where it is meaningful to compare two items and say that one item has more of the attribute than another.

Degree: Soft attributes often involve a question of degree. For example, the attribute "violent" applied to movies is not binary but can exist on a spectrum. It is critical to model the degree to which each soft attribute applies to a given item.

Subjectivity: People may disagree in their assessment of soft attributes, even with a real-valued measure. Different people may have different norms, expectations, and thresholds for a given soft attribute.

Distinction from social tags: Unlike social tags, soft attributes do not necessarily apply to all items in the collection. For example, "realistic CGI" is a soft attribute that may not be applicable to all movies.

Additionally, social tagging approaches often bias users towards a consistent vocabulary, while soft attributes allow for more natural and varied language.

Personal partial order: For any given soft attribute, there is a personal partial order over items, where some items have the attribute more or less than others, while others are incomparable.
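
A minimal formalisation of this idea (the notation below is our own, not the paper's): for a user u and soft attribute a, write x ≻ y when u judges item x to have more of a than item y. The relation is a strict partial order, so it is irreflexive and transitive but not total, leaving some pairs incomparable.

```latex
% Personal partial order over items for user u and attribute a
% (notation assumed for illustration; not taken from the paper)
x \succ_{u,a} y \;\Longrightarrow\; \neg\,\bigl(y \succ_{u,a} x\bigr)
\qquad
\bigl(x \succ_{u,a} y \,\wedge\, y \succ_{u,a} z\bigr) \;\Longrightarrow\; x \succ_{u,a} z
```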

The authors highlight that soft attributes are common in natural dialogue and provide examples such as the immersiveness of a movie, the in-depth exploration of a time period, a relatable character, and the level of violence depicted.

These attributes are difficult to attach as definitive labels or tags to a movie, as they involve subjective assessments and degrees of applicability.

The concept of soft attributes is crucial for the authors' goal of interpreting natural language critiques in recommender systems, as it allows for a more nuanced and personalised understanding of user preferences. By recognising and modelling soft attributes, the system can better capture the subtleties and subjectivity inherent in human language and decision-making.

Subjectivity and equality

Quantifying Subjectivity

The authors investigate the subjectivity of soft attributes by measuring inter-judge agreement, i.e., whether different people considering the same soft attribute for the same items agree on which item has more of that attribute.

They argue that past attribute datasets have not analysed personal variations in the meaning of a term or the relative applicability of soft attributes.

To measure soft attribute subjectivity, the authors identify all pairs of movies that have been ranked for the same attribute by different raters. They define preference agreement as the fraction of pairs where either both raters agree on the direction of preference, or at least one rater indicates a lack of preference (i.e., there is no disagreement).
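
A minimal sketch of this agreement computation, assuming each rater's judgments are encoded as bucket labels relative to an anchor item (-1 for less, 0 for about the same, +1 for more, matching the three buckets described later); the data layout and function name are illustrative:

```python
from itertools import combinations

def preference_agreement(ratings):
    """Fraction of shared movie pairs without disagreement between raters.

    `ratings` maps rater -> {movie: bucket}, where bucket is -1 (less),
    0 (about the same), or +1 (more). Two raters disagree on a pair only
    when both express opposite directions of preference; if either
    indicates no preference, the pair counts as agreement.
    """
    agree, total = 0, 0
    for r1, r2 in combinations(list(ratings), 2):
        shared = set(ratings[r1]) & set(ratings[r2])
        for m1, m2 in combinations(sorted(shared), 2):
            d1 = ratings[r1][m1] - ratings[r1][m2]  # rater 1's preference direction
            d2 = ratings[r2][m1] - ratings[r2][m2]  # rater 2's preference direction
            total += 1
            if d1 == 0 or d2 == 0 or (d1 > 0) == (d2 > 0):
                agree += 1
    return agree / total if total else 0.0
```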

Based on the agreement rate, the authors divide the soft attributes into three equal-sized groups: High, Medium, and Low agreement attributes.

They observe that attributes with the highest agreement are reminiscent of typical tags in movie tag corpora, while many of the attributes with the lowest agreement relate to personal preferences. They also note that seemingly opposite attributes (e.g., intense and boring) can have quite different agreement rates.

Quantifying Equality

Since raters were asked to bucket movies into three categories (less, about the same, more), the authors can observe the distribution over the counts in these buckets.

On average, raters put 43.46 movies into X− (less), 3.28 movies into X◦ (about the same), and 3.19 movies into X+ (more).

The authors note that this distribution is likely influenced by the stratified sampling of items in X, but it confirms that there often exist many pairs of movies to which a given attribute applies equally, even when many others can be classified as more or less.

They suggest that, by examining the scores of movies in the X◦ set, it would be possible to identify thresholds for how different two items' scores must be before a recommender can be said to have satisfied a user's critique of more/less, without requiring extreme differences.

The authors also observe that certain attributes, such as long, documentary style, well directed, and original, have the most items in X◦, suggesting that critiques of these attributes are likely to eliminate many movies simply because significant differences are less common.

In contrast, attributes like playful, funny, sappy, and scary are much more likely to have raters provide a more complete order over movies, thus fewer would be eliminated on the grounds of being too similar to satisfy a critique.

This analysis of subjectivity and equality provides valuable insights into the nature of soft attributes and has implications for how they can be applied in recommendation settings, particularly when considering critiquing based on soft attributes.

Approaches for scoring items according to soft attributes

In this section, the authors present three approaches for scoring items according to soft attributes, moving from unsupervised to weakly supervised and fully supervised methods.

The goal is to devise a scoring function score(x, a), where x is an item in the collection and a is a soft attribute, to determine the relative ordering of items with respect to the attribute.

Generating Item Embeddings: The authors discuss the use of matrix factorisation to compute item representations from collaborative filtering datasets. The user-item rating matrix R is factorised into two low-rank matrices containing the user embeddings U and item embeddings X. The objective function is minimised using stochastic gradient descent to learn the embeddings.
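
A minimal sketch of this factorisation, assuming ratings arrive as (user, item, rating) triples; the hyperparameter values are illustrative, not taken from the paper:

```python
import numpy as np

def factorise(ratings, k=32, lr=0.01, reg=0.05, epochs=20, seed=0):
    """SGD matrix factorisation sketch: approximate R as U @ X.T.

    `ratings` is a list of (user_index, item_index, rating) triples.
    Returns user embeddings U and item embeddings X.
    """
    rng = np.random.default_rng(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    U = rng.normal(scale=0.1, size=(n_users, k))
    X = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = U[u].copy(), X[i].copy()
            err = r - pu @ qi                    # prediction error for this rating
            U[u] += lr * (err * qi - reg * pu)   # gradient step on the user factors
            X[i] += lr * (err * pu - reg * qi)   # gradient step on the item factors
    return U, X
```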

Unsupervised Ranking

Two unsupervised ranking approaches are presented as baselines:

Term-based Ranking

This approach operates in the term space and leverages the corpus of item reviews, using soft attributes as search queries.

Items are represented by aggregating reviews following either an item-centric or a review-centric strategy.

In the item-centric method, a term-based representation is built for each item by concatenating all reviews mentioning the item, and then scored using standard text-based retrieval models (e.g., BM25).

In the review-centric method, reviews are ranked using retrieval models, and then the retrieval scores of reviews mentioning each item are aggregated.
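
A sketch of both aggregation strategies, using the open-source rank_bm25 package as the retrieval model; the data layout (a list of (item_id, review_text) pairs) and function names are assumptions for illustration:

```python
from collections import defaultdict
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def item_centric_scores(reviews, query):
    """Item-centric: concatenate each item's reviews into one document,
    then score the documents against the soft-attribute query with BM25."""
    docs = defaultdict(list)
    for item, text in reviews:
        docs[item].extend(text.lower().split())
    items = list(docs)
    bm25 = BM25Okapi([docs[i] for i in items])
    return dict(zip(items, bm25.get_scores(query.lower().split())))

def review_centric_scores(reviews, query):
    """Review-centric: score individual reviews with BM25, then sum the
    scores of the reviews mentioning each item."""
    bm25 = BM25Okapi([text.lower().split() for _, text in reviews])
    scores = bm25.get_scores(query.lower().split())
    agg = defaultdict(float)
    for (item, _), s in zip(reviews, scores):
        agg[item] += s
    return dict(agg)
```

The item-centric variant scores one long pseudo-document per item, while the review-centric variant lets strongly matching individual reviews dominate an item's score.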

Centroid-based Ranking

This approach operates in the embedding space and considers the top-ranked items as representative examples of the soft attribute.

The centroid of the top-ranked items' embeddings is taken as the representation of the soft attribute. Other items are then scored by computing their distance (cosine similarity) to the centroid.
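
A corresponding sketch of centroid-based scoring in the embedding space, assuming `X` maps item ids to embedding vectors (names illustrative):

```python
import numpy as np

def centroid_scores(X, top_item_ids):
    """Represent the soft attribute as the centroid of the top-ranked
    items' embeddings, then score every item by cosine similarity to it."""
    centroid = np.mean([X[i] for i in top_item_ids], axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = {}
    for item, vec in X.items():
        v = np.asarray(vec) / np.linalg.norm(vec)
        scores[item] = float(v @ centroid)  # cosine similarity to the centroid
    return scores
```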

Weakly Supervised Ranking

The weakly supervised method, called Weakly-supervised Weighted Dimensions (WWD), aims to learn which factors in the embedding space encode a particular soft attribute.

In the absence of explicit training labels, term-based models are used to obtain an initial ranking of items.

The top and bottom-ranked items are then taken as positive and negative training examples, respectively, to learn a logistic regression model. The model parameters reflect the importance (weight) of each dimension in the item embeddings in predicting the soft attribute. Items are scored by applying this model and taking the prediction probabilities as scores.
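
A minimal sketch of WWD using scikit-learn's logistic regression; the cutoff `n` and the data layout are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def wwd_scores(X, term_ranking, n=50):
    """Weakly-supervised Weighted Dimensions (WWD) sketch.

    `term_ranking` lists item ids ordered by a term-based model; the
    top-n items become positives and the bottom-n negatives. Logistic
    regression over item embeddings then learns which dimensions encode
    the soft attribute, and its predicted probabilities serve as scores.
    """
    pos, neg = term_ranking[:n], term_ranking[-n:]
    train = np.array([X[i] for i in pos + neg])
    labels = np.array([1] * len(pos) + [0] * len(neg))
    clf = LogisticRegression(max_iter=1000).fit(train, labels)
    items = list(X)
    probs = clf.predict_proba(np.array([X[i] for i in items]))[:, 1]
    return dict(zip(items, probs))
```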

Fully Supervised Ranking

The fully supervised method, called Supervised Weighted Dimensions (SWD), leverages explicit item orderings.

Pairwise preferences are inferred from the ground truth judgments, and a linear ranking support vector machine is trained on these preferences. Each preference is transformed into a constraint, and the model learns a direction in the embedding space that represents the soft attribute. Items are then scored using the learned weights.
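
A sketch of this pairwise construction in the style of a linear ranking SVM (each preference becomes a difference-vector constraint); this is a standard RankSVM-style formulation, not necessarily the paper's exact implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

def swd_scores(X, preferences):
    """Supervised Weighted Dimensions (SWD) sketch.

    `preferences` is a list of (a, b) item-id pairs meaning "a has more
    of the attribute than b". Each pair becomes a difference vector
    X[a] - X[b] labelled +1 (and its negation labelled -1), so the
    learned weight vector is a direction in embedding space along which
    items are scored.
    """
    diffs, labels = [], []
    for a, b in preferences:
        d = np.asarray(X[a]) - np.asarray(X[b])
        diffs.extend([d, -d])
        labels.extend([1, -1])
    svm = LinearSVC(fit_intercept=False, max_iter=10000)
    svm.fit(np.array(diffs), np.array(labels))
    w = svm.coef_.ravel()                       # direction encoding the attribute
    return {i: float(np.asarray(v) @ w) for i, v in X.items()}
```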

The authors compare the performance of these methods on two test collections: the MovieLens Attribute Collection and the Soft Attributes Collection.

Comparing the best-performing approach within each method block, the unsupervised term-based methods perform well on the MovieLens collection but poorly on the Soft Attributes collection. The weakly supervised and fully supervised methods show improved performance on the Soft Attributes collection, highlighting the importance of learning weighted dimensions in the embedding space to represent soft attributes effectively.

Evaluation

In the evaluation section, the authors assess the performance of the proposed scoring algorithms based on how well they order item pairs in agreement with the ground truth data.

They use the original Goodman and Kruskal gamma (G) for the MovieLens Attribute Collection and their modified version (G') for the Soft Attributes Collection.
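
For reference, the original gamma is the difference between concordant and discordant pair counts divided by their sum, ignoring ties. A minimal sketch (the weighted variant G' additionally weights pairs by rater agreement, which this omits):

```python
from itertools import combinations

def goodman_kruskal_gamma(pred_scores, truth):
    """Gamma = (C - D) / (C + D) over item pairs, where C counts pairs
    ordered the same way by the predicted scores and the ground truth,
    and D counts pairs ordered oppositely; tied pairs are ignored.
    """
    concordant = discordant = 0
    for a, b in combinations(sorted(truth), 2):
        dt = truth[a] - truth[b]
        dp = pred_scores[a] - pred_scores[b]
        if dt == 0 or dp == 0:
            continue                      # tied pair: excluded by gamma
        if (dt > 0) == (dp > 0):
            concordant += 1
        else:
            discordant += 1
    denom = concordant + discordant
    return (concordant - discordant) / denom if denom else 0.0
```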

Key findings from the evaluation

  1. The term-based models perform remarkably well on the MovieLens collection, suggesting that the problem of ranking items for a given soft attribute could be substantially solved with a straightforward model. However, the authors argue that this formulation, which focuses on distinguishing items with a given tag from those without, is misleading.

  2. The Soft Attributes Collection proves to be considerably harder, with lower overall scores, indicating that it provides a more accurate abstraction of the attribute ranking problem.

  3. The relative ordering of systems differs between the two collections, highlighting the importance of the task encoded in the data for enabling meaningful progress.

  4. The weakly supervised approach (WWD+TB) outperforms the term-based baselines (TB) and the centroid-based ranking (CB+TB) on the Soft Attributes Collection, likely because it considers both positive and negative evidence.

  5. The fully supervised method (SWD) yields significant improvement in performance on the Soft Attributes Collection, emphasising the value of the new data collection methodology in addressing the soft attribute ranking problem.

The authors further analyse the SWD model in terms of data efficiency and performance across different soft attributes:

Data efficiency: The SWD model is very data-efficient, requiring judgments from approximately 20 raters for any given soft attribute to obtain near-optimal performance. This reinforces the value of pairwise preferences over a controlled sample of known items.

Performance analysis by soft attribute: There is a clear correlation between the subjectiveness of a soft attribute (measured by inter-rater agreement) and ranking performance (measured by weighted gamma rank correlation G').

Soft attributes with less agreement are harder to predict, suggesting room for personalised soft attribute scoring models and the importance of predicting the "softness" of a soft attribute as future research directions.

In summary, the evaluation demonstrates the effectiveness of the proposed supervised learning approach (SWD) for ranking items according to soft attributes, particularly when trained on data collected using the authors' methodology. The analysis also highlights the challenges posed by subjective soft attributes and the potential for personalised models in this domain.

Conclusion

In this paper, the authors have formalised the concept of recommender system critiquing based on soft attributes, which are aspects of items that cannot be universally agreed upon as facts.

They have developed a general methodology for obtaining soft attribute judgments and presented a dataset of pairwise preferences over soft attributes in the domain of movies.

The research on soft attributes in recommender systems has several practical use cases and implications for improving user experience and enhancing the effectiveness of recommendation systems:

More natural and expressive critiquing

By incorporating soft attributes, recommender systems can allow users to provide feedback and refine their preferences using more natural language expressions.

Instead of being limited to predefined tags or categories, users can express their preferences in terms of subjective attributes like "less violent," "more thought-provoking," or "funnier." This enables a more intuitive and user-friendly interaction with the recommender system.

Improved recommendation quality

By understanding and modelling soft attributes, recommender systems can capture more nuanced and personalised user preferences.

This can lead to more accurate and relevant recommendations, as the system can better match users with items that align with their specific tastes and desires, even when these preferences are expressed in subjective terms.

Enhanced explainability and transparency

Soft attributes can be used to provide explanations for recommendations.

By highlighting the soft attributes that contribute to a recommendation (e.g., "recommended because you prefer thought-provoking and less violent movies"), the system can improve transparency and help users understand why certain items are being suggested to them.

Facilitating serendipitous discoveries

Soft attributes can be leveraged to introduce serendipity in recommendations.

By understanding the soft attributes a user appreciates, the system can recommend items that share similar attributes but may be in different categories or genres, leading to unexpected and delightful discoveries for the user.

Enabling more engaging conversations

In the context of conversational recommender systems, soft attributes can facilitate more natural and engaging dialogues. The system can elicit user preferences, respond to critiques, and refine recommendations based on the user's feedback expressed through soft attributes, making the interaction feel more human-like and personalised.

Addressing the "cold start" problem

Soft attributes can be helpful in tackling the "cold start" problem, where the system has limited information about a new user. By asking the user about their preferences in terms of soft attributes (e.g., "Do you prefer thought-provoking or light-hearted movies?"), the system can quickly gather valuable information to provide relevant recommendations from the start.

Enhancing user profiling and segmentation

By analysing user preferences and feedback in terms of soft attributes, recommender systems can build richer and more nuanced user profiles. This can enable better user segmentation and targeting, allowing for more personalised and effective marketing strategies.

Cross-domain recommendations

Soft attributes can potentially bridge the gap between different item domains. For example, if a user expresses a preference for "thought-provoking" movies, the system could recommend "thought-provoking" books or podcasts, even if the user has not explicitly interacted with items in those domains.

In summary, the practical applications of soft attributes in recommender systems are centred around enabling more natural and expressive user interactions, improving recommendation quality, enhancing explainability and serendipity, facilitating engaging conversations, addressing cold-start issues, enriching user profiling, and potentially enabling cross-domain recommendations.

By leveraging soft attributes, recommender systems can provide a more personalised, intuitive, and satisfying user experience.

Paper: On Interpretation and Measurement of Soft Attributes for Recommendation (arXiv.org)