On Interpretation and Measurement of Soft Attributes for Recommendation
This May 2021 paper focuses on the challenge of interpreting and measuring soft attributes in the context of recommender systems.
Soft attributes are natural language refinements or critiques that people use to express their preferences about items, such as the originality of a movie plot, the noisiness of a venue, or the complexity of a recipe.
The authors argue that while binary tagging is widely studied in recommender systems, soft attributes often involve subjective and contextual aspects that cannot be reliably captured or represented as objective binary truth in a knowledge base.
This subjectivity has to be accounted for when measuring how well a system ranks items by a soft attribute.
The paper makes three main contributions:
Development of a reusable test collection
The authors create a set of soft attributes and ground truth item orderings with respect to those attributes (for particular users), along with an evaluation metric.
They use a novel controlled multi-stage crowd labelling mechanism to collect ground truth of personalised partial orderings while keeping workers' cognitive load low.
They also propose a novel weighted extension of an established rank correlation measure (Goodman and Kruskal's gamma), based on agreement with the structured ground truth ranking.
Quantification of the subjectivity or "softness" of soft attributes
The authors identify ways to differentiate more from less subjective soft attributes and measure how this affects item scoring. This result also has implications for standard tagging, highlighting the role of subjectivity.
Addressing the problem of critiquing based on soft attributes
The authors present empirical evidence to demonstrate the importance of debiased collection of ground truth.
They introduce three families of methods for the task of ranking items relative to a given anchor item with respect to a given soft attribute: unsupervised, weakly supervised, and fully supervised.
They compare these methods on two test collections, one based on existing social tags and another constructed using their proposed approach. The results show a discrepancy between the two test collections, indicating that the tag-based test collection is blind to item ranking improvements, making progress in this area difficult.
The authors also analyse performance with respect to attribute "softness" and find that methods perform significantly better on attributes with higher agreement as opposed to those with low agreement.
In summary, this work formalises the notion of soft attributes, opening up new possibilities for more natural interactions with conversational recommender systems.
The authors' technical contributions include an efficient method for debiased collection of ground truth for comparing items with respect to a given soft attribute, a measure to quantify soft attribute subjectivity, and the introduction and formalisation of the task of critiquing based on soft attributes.
Related Work
Conversational Recommender Systems
The authors highlight that conversation has become a key modality for recommender systems.
Conversational interaction is of particular interest to the research community in the broader context of information seeking and recommendation.
Conversational recommendation is distinct from early work on slot filling and faceted search, as the sequence of exchanges between the user and the system is less rigid in structure and often allows for natural language dialogue.
The authors mention Radlinski and Craswell's work, which postulated specific desirable properties of conversational search and recommendation systems, with critiquing being a core property.
Various aspects of conversation have been addressed in the literature, including selecting preference elicitation questions, deep reinforcement learning models to understand user responses, multi-memory neural architectures to model preferences over attributes, and neural models for recommendation directly based on conversations.
The authors position their work as a continuation of this thread, with a focus on semantic understanding of user utterances at a level of detail not previously addressed.
Critiquing in Recommender Systems
Critiquing is a specific interaction in conversational recommendation where the system seeks user reactions to items or sets of items.
Critiquing-based recommendation systems make recommendations and then elicit feedback in the form of critiques. Users may provide feedback on various facets of importance, such as the airline and cost of a flight, or time and date of travel, with respect to the options presented.
This process is often repeated multiple times before the user makes a final selection. The authors mention previous work on user interfaces facilitating critiquing and conversational recommendation systems that allow users to affect recommendations along standard item attributes, such as movie genres.
They also discuss a product search model that incorporates negative feedback on specific item properties (aspect-value pairs).
However, the authors' work focuses on soliciting unconstrained natural language feedback, not limited to predefined item properties, which more closely resembles human-to-human conversation.
Additionally, the authors discuss previous work on modelling how different users may have different definitions for particular terms, such as what constitutes a "safe car."
Comparative Opinion Mining
Comparative opinion mining deals with identifying and extracting information expressed in a comparative form (for example, that one item is better than another on some aspect), which sets it apart from general opinion mining.
The authors discuss the early computational approach to comparative sentence extraction by Jindal and Liu, which involves identifying comparative elements such as entities, attributes/aspects, comparative predicates, and comparison polarity.
More recent work employs semantic role labelling techniques for this task. Comparative sentences can be used in various ways, such as determining which of two entities is better overall or obtaining a global ranking of entities on a given aspect.
A typical approach is to build a directed graph of entities, where edge weights encode the degree of belief that one entity is better than the other on a given aspect, and then rank entities by some measure of graph centrality. However, the aspects considered in previous work are often limited and come from a fixed ontology.
The authors also note that comparative opinion statements in natural text are uncommon, with estimates suggesting that only 10% of sentences in typical reviews contain a comparison.
The most important difference in the authors' work is that they aim to interpret arbitrary critiques, allowing direct navigation of the recommendation space, and design a data collection and evaluation specifically for this task without limiting themselves to common terms or reviews.
Concept of soft attributes
In this section, the authors introduce and formally define the concept of soft attributes, which is central to their work on interpreting natural language critiques in recommender systems.
Key points about soft attributes:
Definition: A soft attribute is a property of an item that is not a verifiable fact that can be universally agreed upon, and where it is meaningful to compare two items and say that one item has more of the attribute than another.
Degree: Soft attributes often involve a question of degree. For example, the attribute "violent" applied to movies is not binary but can exist on a spectrum. It is critical to model the degree to which each soft attribute applies to a given item.
Subjectivity: People may disagree in their assessment of soft attributes, even with a real-valued measure. Different people may have different norms, expectations, and thresholds for a given soft attribute.
Distinction from social tags: Unlike social tags, soft attributes do not necessarily apply to all items in the collection. For example, "realistic CGI" is a soft attribute that may not be applicable to all movies.
Additionally, social tagging approaches often bias users towards a consistent vocabulary, while soft attributes allow for more natural and varied language.
Personal partial order: For any given soft attribute, there is a personal partial order over items, where some items have the attribute more or less than others, while others are incomparable.
The authors highlight that soft attributes are common in natural dialogue and provide examples such as the immersiveness of a movie, the in-depth exploration of a time period, a relatable character, and the level of violence depicted.
These attributes are difficult to attach as definitive labels or tags to a movie, as they involve subjective assessments and degrees of applicability.
The concept of soft attributes is crucial for the authors' goal of interpreting natural language critiques in recommender systems, as it allows for a more nuanced and personalised understanding of user preferences. By recognising and modelling soft attributes, the system can better capture the subtleties and subjectivity inherent in human language and decision-making.
Subjectivity and equality
Quantifying Subjectivity
The authors investigate the subjectivity of soft attributes by measuring inter-judge agreement, i.e., whether different people considering the same soft attribute for the same items agree on which item has more of that attribute.
They argue that past attribute datasets have not analysed personal variations in the meaning of a term or the relative applicability of soft attributes.
To measure soft attribute subjectivity, the authors identify all pairs of movies that have been ranked for the same attribute by different raters. They define preference agreement as the fraction of pairs where either both raters agree on the direction of preference, or at least one rater indicates a lack of preference (i.e., there is no disagreement).
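The preference agreement described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each rater's judgments are stored as a mapping from an ordered movie pair to +1 (first movie has more of the attribute), -1 (less) or 0 (no preference). The rater names, movie titles and the `preference_agreement` function are hypothetical.

```python
from itertools import combinations

def preference_agreement(prefs_by_rater):
    """Fraction of shared movie pairs on which two raters do not disagree.

    prefs_by_rater: dict rater -> dict (movie_a, movie_b) -> preference,
    where preference is +1, -1 or 0 (no preference). Pair keys are assumed
    to use the same movie order for every rater.
    """
    agree = total = 0
    for r1, r2 in combinations(sorted(prefs_by_rater), 2):
        p1, p2 = prefs_by_rater[r1], prefs_by_rater[r2]
        for pair in p1.keys() & p2.keys():
            total += 1
            # Disagreement only when the two preferences have opposite signs.
            if p1[pair] * p2[pair] >= 0:
                agree += 1
    return agree / total if total else float("nan")

# Hypothetical judgments for one soft attribute.
example_prefs = {
    "rater_a": {("Alien", "Up"): 1, ("Alien", "Heat"): 1, ("Heat", "Up"): 0},
    "rater_b": {("Alien", "Up"): 1, ("Alien", "Heat"): -1, ("Heat", "Up"): 1},
}
example_agreement = preference_agreement(example_prefs)
```

In this toy example the raters share three pairs and disagree on one of them, so the agreement is 2/3. A pair where one rater expresses no preference counts as agreement, matching the definition in the text.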
Based on the agreement rate, the authors divide the soft attributes into three equal-sized groups: High, Medium, and Low agreement attributes.
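Splitting attributes into three equal-sized agreement groups could look like the sketch below. The attribute names and agreement values are placeholders invented for illustration; they are not figures from the paper, and `split_by_agreement` is a hypothetical helper.

```python
def split_by_agreement(agreement_by_attribute):
    """Sort attributes by agreement rate and cut the list into thirds."""
    ranked = sorted(agreement_by_attribute,
                    key=agreement_by_attribute.get, reverse=True)
    n = len(ranked)
    first_cut, second_cut = n // 3, (2 * n) // 3
    return {
        "High": ranked[:first_cut],
        "Medium": ranked[first_cut:second_cut],
        "Low": ranked[second_cut:],
    }

# Placeholder agreement rates (illustrative only).
placeholder_agreement = {
    "attr_1": 0.91, "attr_2": 0.88, "attr_3": 0.84,
    "attr_4": 0.80, "attr_5": 0.74, "attr_6": 0.69,
}
groups = split_by_agreement(placeholder_agreement)
```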
They observe that attributes with the highest agreement are reminiscent of typical tags in movie tag corpora, while many of the attributes with the lowest agreement relate to personal preferences. They also note that seemingly opposite attributes (e.g., intense and boring) can have quite different agreement rates.
Quantifying Equality
Since raters were asked to bucket movies into three categories (less, about the same, more), the authors can observe the distribution over the counts in these buckets.
On average, raters put 43.46 movies into X− (less), 3.28 movies into X◦ (about the same), and 3.19 movies into X+ (more).
The authors note that this distribution is likely influenced by the stratified sampling of items in X, but it confirms that there often exist many pairs of movies to which a given attribute applies equally, even when many others can be classified as more or less.
They suggest that the scores of movies in the X◦ set could be used to identify thresholds: how far apart two items' scores must be for a recommender to have satisfied a user's "more" or "less" critique, without requiring extreme differences.
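One possible reading of this threshold idea is sketched below. The paper, as summarised here, does not specify a procedure, so the choices made are assumptions: the threshold is taken as a quantile of the absolute score differences between an anchor and the items a rater placed in X◦, and a critique is satisfied only by items whose score differs from the anchor's by more than that threshold. All function names, scores and titles are hypothetical.

```python
def same_bucket_threshold(score_diffs, quantile=0.9):
    """Pick a score gap below which items count as 'about the same'.

    score_diffs: score(item) - score(anchor) for items judged about the same.
    """
    diffs = sorted(abs(d) for d in score_diffs)
    if not diffs:
        return 0.0
    idx = min(len(diffs) - 1, int(quantile * len(diffs)))
    return diffs[idx]

def satisfies_critique(scores, anchor, direction, threshold):
    """Return items that are clearly 'more' or 'less' than the anchor."""
    sign = 1 if direction == "more" else -1
    return [item for item, s in scores.items()
            if item != anchor and sign * (s - scores[anchor]) > threshold]

# Hypothetical attribute scores and 'about the same' score differences.
toy_scores = {"Alien": 0.9, "Heat": 0.55, "Up": 0.1, "Jaws": 0.5}
toy_threshold = same_bucket_threshold([0.05, -0.1, 0.02, 0.08])
more_than_jaws = satisfies_critique(toy_scores, "Jaws", "more", toy_threshold)
less_than_jaws = satisfies_critique(toy_scores, "Jaws", "less", toy_threshold)
```

Here "Heat" scores slightly higher than the anchor but falls within the threshold, so it would not be offered in response to a "more" critique.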
The authors also observe that certain attributes, such as long, documentary style, well directed, and original, have the most items in X◦, suggesting that critiques of these attributes are likely to eliminate many movies simply because significant differences are less common.
In contrast, attributes like playful, funny, sappy, and scary are much more likely to have raters provide a more complete order over movies, thus fewer would be eliminated on the grounds of being too similar to satisfy a critique.
This analysis of subjectivity and equality provides valuable insights into the nature of soft attributes and has implications for how they can be applied in recommendation settings, particularly when considering critiquing based on soft attributes.
Approaches for scoring items according to soft attributes
In this section, the authors present three approaches for scoring items according to soft attributes, moving from unsupervised to weakly supervised and fully supervised methods.
The goal is to devise a scoring function score(x, a), where x is an item in the collection and a is a soft attribute, to determine the relative ordering of items with respect to the attribute.
Generating Item Embeddings: The authors discuss the use of matrix factorisation to compute item representations from collaborative filtering datasets. The user-item rating matrix R is factorised into two low-rank matrices containing the user embeddings U and item embeddings X. The objective function is minimised using stochastic gradient descent to learn the embeddings.
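The factorisation step can be illustrated with a small NumPy sketch. It is a generic matrix factorisation trained by stochastic gradient descent on squared error with L2 regularisation, not the authors' exact setup: the hyperparameters, the toy ratings and the `factorise` function are assumptions made for the example.

```python
import numpy as np

def factorise(ratings, n_users, n_items, dim=8, lr=0.02, reg=0.02,
              epochs=200, seed=0):
    """Learn user embeddings U and item embeddings X so that U[u] @ X[i] ~ r.

    ratings: list of (user_index, item_index, rating) triples.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, dim))
    X = 0.1 * rng.standard_normal((n_items, dim))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - U[u] @ X[i]
            u_old = U[u].copy()  # use the pre-update user vector for X's step
            U[u] += lr * (err * X[i] - reg * U[u])
            X[i] += lr * (err * u_old - reg * X[i])
    return U, X

# Toy ratings: (user, item, rating on a 1-5 scale).
toy_ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 2, 2),
               (2, 1, 5), (2, 3, 1), (3, 2, 5), (3, 3, 4), (1, 3, 2)]
U, X = factorise(toy_ratings, n_users=4, n_items=4)
train_rmse = float(np.sqrt(np.mean(
    [(r - U[u] @ X[i]) ** 2 for u, i, r in toy_ratings])))
```

The rows of `X` are the item embeddings that the embedding-based scoring methods operate on.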
Unsupervised Ranking
Two unsupervised ranking approaches are presented as baselines:
Term-based Ranking
This approach operates in the term space and leverages the corpus of item reviews, using soft attributes as search queries.
Items are represented by aggregating reviews following either an item-centric or a review-centric strategy.
In the item-centric method, a term-based representation is built for each item by concatenating all reviews mentioning the item, and then scored using standard text-based retrieval models (e.g., BM25).
In the review-centric method, reviews are ranked using retrieval models, and then the retrieval scores of reviews mentioning each item are aggregated.
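The two aggregation strategies can be contrasted with a self-contained BM25 sketch. Several details are assumptions rather than the authors' configuration: whitespace tokenisation, the Lucene-style non-negative IDF variant, the default `k1` and `b` values, and summation as the aggregation function in the review-centric case. The reviews are invented.

```python
import math
from collections import Counter, defaultdict

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score tokenised docs (lists of tokens) against a list of query tokens."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores

def item_centric(attribute, reviews):
    """Concatenate all reviews of an item into one document, then score it."""
    by_item = defaultdict(list)
    for item, text in reviews:
        by_item[item].extend(text.lower().split())
    items = list(by_item)
    scores = bm25_scores(attribute.lower().split(), [by_item[i] for i in items])
    return dict(zip(items, scores))

def review_centric(attribute, reviews):
    """Score each review on its own, then sum the scores per item."""
    docs = [text.lower().split() for _, text in reviews]
    totals = defaultdict(float)
    for (item, _), s in zip(reviews, bm25_scores(attribute.lower().split(), docs)):
        totals[item] += s
    return dict(totals)

# Invented reviews as (item, text) pairs.
toy_reviews = [
    ("Alien", "scary tense and scary again"),
    ("Alien", "dark claustrophobic horror"),
    ("Up", "funny warm and touching"),
    ("Up", "a little scary for small kids but funny"),
    ("Heat", "long tense heist drama"),
]
ic = item_centric("scary", toy_reviews)
rc = review_centric("scary", toy_reviews)
```

Both strategies use the soft attribute itself as the query; they differ only in whether aggregation happens before scoring (item-centric) or after it (review-centric).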
Centroid-based Ranking
This approach operates in the embedding space and considers the top-ranked items as representative examples of the soft attribute.
The centroid of the top-ranked items' embeddings is taken as the representation of the soft attribute. Other items are then scored by their cosine similarity to the centroid.
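A minimal sketch of centroid-based scoring follows. It assumes an initial ranking of item indices is already available (for instance from a term-based model) and that the number of seed items `top_k` is a free parameter; the embeddings and the `centroid_scores` function are made up for illustration.

```python
import numpy as np

def centroid_scores(embeddings, seed_ranking, top_k=2):
    """Cosine similarity of every item to the centroid of the top_k seeds.

    embeddings: array of shape (n_items, dim).
    seed_ranking: item indices, best first, from some initial ranker.
    """
    centroid = embeddings[list(seed_ranking[:top_k])].mean(axis=0)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    return embeddings @ centroid / np.maximum(norms, 1e-12)

# Made-up 2-d item embeddings; items 0 and 1 are the seed examples.
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [-1.0, 0.0]])
toy_centroid_scores = centroid_scores(toy_embeddings, seed_ranking=[0, 1, 2, 3])
```

Items pointing in roughly the same direction as the seeds score close to 1, while item 3, which points the opposite way, scores close to -1.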
Weakly Supervised Ranking
The weakly supervised method, called Weakly-supervised Weighted Dimensions (WWD), aims to learn which factors in the embedding space encode a particular soft attribute.
In the absence of explicit training labels, term-based models are used to obtain an initial ranking of items.
The top and bottom-ranked items are then taken as positive and negative training examples, respectively, to learn a logistic regression model. The model parameters reflect the importance (weight) of each dimension in the item embeddings in predicting the soft attribute. Items are scored by applying this model and taking the prediction probabilities as scores.
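The weakly supervised idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: logistic regression is trained with plain batch gradient descent in NumPy, the number of pseudo-labelled items `k`, the learning rate and the L2 penalty are arbitrary, and the embeddings and seed ranking are invented.

```python
import numpy as np

def wwd_scores(embeddings, seed_ranking, k=2, lr=0.5, epochs=500, l2=0.01):
    """Score items with a logistic model fitted on pseudo-labels.

    The top-k items of seed_ranking act as positives and the bottom-k as
    negatives; the learned weights indicate which embedding dimensions
    matter for the attribute.
    """
    pos, neg = list(seed_ranking[:k]), list(seed_ranking[-k:])
    A = embeddings[pos + neg]
    y = np.array([1.0] * len(pos) + [0.0] * len(neg))
    w, b = np.zeros(embeddings.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(A @ w + b)))
        grad = p - y                      # gradient of the log-loss w.r.t. logits
        w -= lr * (A.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return 1.0 / (1.0 + np.exp(-(embeddings @ w + b)))

# Invented embeddings where the first dimension carries the attribute.
wwd_embeddings = np.array([[2.0, 0.3], [1.5, -0.2], [0.2, 0.5],
                           [-0.1, -0.4], [-1.5, 0.1], [-2.0, -0.3]])
wwd_result = wwd_scores(wwd_embeddings, seed_ranking=[0, 1, 2, 3, 4, 5])
```

Unlike the centroid approach, the negatives let the model learn which directions in the embedding space count against the attribute, not only which count for it.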
Fully Supervised Ranking
The fully supervised method, called Supervised Weighted Dimensions (SWD), leverages explicit item orderings.
Pairwise preferences are inferred from the ground truth judgments, and a linear ranking support vector machine is trained on these preferences. Each preference is transformed into a constraint, and the model learns a direction in the embedding space that represents the soft attribute. Items are then scored using the learned weights.
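A rough sketch of this pipeline is given below, with several assumptions flagged. Pairwise preferences are derived from one rater's three buckets by assuming every "more" item outranks every "about the same" and "less" item, and every "about the same" item outranks every "less" item. The ranking SVM is trained with simple subgradient descent on the hinge loss over embedding differences, which stands in for whatever solver the authors used. The embeddings, buckets and function names are hypothetical.

```python
import numpy as np

def pairs_from_buckets(less, same, more):
    """Turn three buckets of item indices into (higher, lower) pairs."""
    pairs = [(h, l) for h in more for l in list(same) + list(less)]
    pairs += [(h, l) for h in same for l in less]
    return pairs

def train_ranking_svm(embeddings, pairs, c=1.0, lr=0.05, epochs=300):
    """Learn a direction w such that w @ (x_higher - x_lower) >= 1.

    Minimises 0.5 * ||w||^2 + c * sum of hinge losses by subgradient descent.
    """
    diffs = np.array([embeddings[h] - embeddings[l] for h, l in pairs])
    w = np.zeros(embeddings.shape[1])
    for _ in range(epochs):
        violated = diffs @ w < 1.0        # constraints not yet satisfied
        grad = w - c * diffs[violated].sum(axis=0)
        w -= lr * grad
    return w

# Invented embeddings and one rater's buckets for a single attribute.
swd_embeddings = np.array([[2.0, 0.3], [1.5, -0.2], [0.2, 0.5],
                           [-0.1, -0.4], [-1.5, 0.1], [-2.0, -0.3]])
swd_pairs = pairs_from_buckets(less=[4, 5], same=[2, 3], more=[0, 1])
swd_w = train_ranking_svm(swd_embeddings, swd_pairs)
swd_item_scores = swd_embeddings @ swd_w
```

Each preference contributes one constraint on the difference of two embeddings, so the learned `swd_w` is a single direction in the embedding space, and an item's score is its projection onto that direction.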
The authors compare the performance of these methods on two test collections: the MovieLens Attribute Collection and the Soft Attributes Collection.
The results show the best scores for each method block.
The unsupervised term-based methods perform well on the MovieLens collection but poorly on the Soft Attributes collection. The weakly supervised and fully supervised methods show improved performance on the Soft Attributes collection, highlighting the importance of learning weighted dimensions in the embedding space to represent soft attributes effectively.
Evaluation
In the evaluation section, the authors assess the performance of the proposed scoring algorithms based on how well they order item pairs in agreement with the ground truth data.
They use the original Goodman and Kruskal gamma (G) for the MovieLens Attribute Collection and their modified version (G') for the Soft Attributes Collection.
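For reference, the unweighted Goodman and Kruskal gamma is the difference between concordant and discordant pairs divided by their sum. The sketch below computes it for ground truth given as (higher, lower) pairs; the weighted variant G' is not reproduced because this summary does not state its weights. The scores and pairs are invented.

```python
def goodman_kruskal_gamma(scores, preferred_pairs):
    """Gamma = (concordant - discordant) / (concordant + discordant).

    scores: dict item -> predicted score.
    preferred_pairs: list of (higher, lower) ground-truth pairs.
    Pairs with tied predicted scores are ignored, as in the standard measure.
    """
    concordant = discordant = 0
    for higher, lower in preferred_pairs:
        if scores[higher] > scores[lower]:
            concordant += 1
        elif scores[higher] < scores[lower]:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

# Invented predictions and ground-truth preferences.
gamma_scores = {"Alien": 0.9, "Heat": 0.5, "Up": 0.1}
gamma_pairs = [("Alien", "Up"), ("Alien", "Heat"), ("Up", "Heat")]
gamma_value = goodman_kruskal_gamma(gamma_scores, gamma_pairs)
```

Two of the three pairs are ordered correctly and one is reversed, giving a gamma of 1/3. The measure ranges from -1 (every pair reversed) to +1 (every pair correct).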
Key findings from the evaluation
The term-based models perform remarkably well on the MovieLens collection, suggesting that the problem of ranking items for a given soft attribute could be substantially solved with a straightforward model. However, the authors argue that this formulation, which focuses on distinguishing items with a given tag from those without, is misleading.
The Soft Attributes Collection proves to be considerably harder, with lower overall scores, indicating that it provides a more accurate abstraction of the attribute ranking problem.
The relative ordering of systems differs between the two collections, highlighting the importance of the task encoded in the data for enabling meaningful progress.
The weakly supervised approach (WWD+TB) outperforms the term-based baselines (TB) and the centroid-based ranking (CB+TB) on the Soft Attributes Collection, likely because it considers both positive and negative evidence.
The fully supervised method (SWD) yields a significant improvement in performance on the Soft Attributes Collection, emphasising the value of the new data collection methodology in addressing the soft attribute ranking problem.
The authors further analyse the SWD model in terms of data efficiency and performance across different soft attributes:
Data efficiency: The SWD model is very data-efficient, requiring judgments from approximately 20 raters for any given soft attribute to obtain near-optimal performance. This reinforces the value of pairwise preferences over a controlled sample of known items.
Performance analysis by soft attribute: There is a clear correlation between the subjectivity of a soft attribute (measured by inter-rater agreement) and ranking performance (measured by the weighted gamma rank correlation G'): the lower the agreement, the lower the performance.
Soft attributes with less agreement are harder to predict, which suggests two directions for future research: personalised soft attribute scoring models, and predicting the "softness" of a soft attribute.
In summary, the evaluation demonstrates the effectiveness of the proposed supervised learning approach (SWD) for ranking items according to soft attributes, particularly when trained on data collected using the authors' methodology. The analysis also highlights the challenges posed by subjective soft attributes and the potential for personalised models in this domain.
Conclusion
In this paper, the authors have formalised the concept of recommender system critiquing based on soft attributes, which are aspects of items that cannot be universally agreed upon as facts.
They have developed a general methodology for obtaining soft attribute judgments and presented a dataset of pairwise preferences over soft attributes in the domain of movies.
The research on soft attributes in recommender systems has several practical use cases and implications for improving user experience and enhancing the effectiveness of recommendation systems:
More natural and expressive critiquing
By incorporating soft attributes, recommender systems can allow users to provide feedback and refine their preferences using more natural language expressions.
Instead of being limited to predefined tags or categories, users can express their preferences in terms of subjective attributes like "less violent," "more thought-provoking," or "funnier." This enables a more intuitive and user-friendly interaction with the recommender system.
Improved recommendation quality
By understanding and modelling soft attributes, recommender systems can capture more nuanced and personalised user preferences.
This can lead to more accurate and relevant recommendations, as the system can better match users with items that align with their specific tastes and desires, even when these preferences are expressed in subjective terms.
Enhanced explainability and transparency
Soft attributes can be used to provide explanations for recommendations.
By highlighting the soft attributes that contribute to a recommendation (e.g., "recommended because you prefer thought-provoking and less violent movies"), the system can improve transparency and help users understand why certain items are being suggested to them.
Facilitating serendipitous discoveries
Soft attributes can be leveraged to introduce serendipity in recommendations.
By understanding the soft attributes a user appreciates, the system can recommend items that share similar attributes but may be in different categories or genres, leading to unexpected and delightful discoveries for the user.
Enabling more engaging conversations
In the context of conversational recommender systems, soft attributes can facilitate more natural and engaging dialogues. The system can elicit user preferences, respond to critiques, and refine recommendations based on the user's feedback expressed through soft attributes, making the interaction feel more human-like and personalised.
Addressing the "cold start" problem
Soft attributes can be helpful in tackling the "cold start" problem, where the system has limited information about a new user. By asking the user about their preferences in terms of soft attributes (e.g., "Do you prefer thought-provoking or light-hearted movies?"), the system can quickly gather valuable information to provide relevant recommendations from the start.
Enhancing user profiling and segmentation
By analysing user preferences and feedback in terms of soft attributes, recommender systems can build richer and more nuanced user profiles. This can enable better user segmentation and targeting, allowing for more personalised and effective marketing strategies.
Cross-domain recommendations
Soft attributes can potentially bridge the gap between different item domains. For example, if a user expresses a preference for "thought-provoking" movies, the system could recommend "thought-provoking" books or podcasts, even if the user has not explicitly interacted with items in those domains.
In summary, the practical applications of soft attributes in recommender systems are centred around enabling more natural and expressive user interactions, improving recommendation quality, enhancing explainability and serendipity, facilitating engaging conversations, addressing cold-start issues, enriching user profiling, and potentially enabling cross-domain recommendations.
By leveraging soft attributes, recommender systems can provide a more personalised, intuitive, and satisfying user experience.