Instruction Fine-Tuning - AlpaGasus
"ALPAGASUS: Data-Driven Data Selection for Instruction Fine-Tuning"
Neural language models are improved through instruction fine-tuning (IFT), where they are trained on sets of instructions and their corresponding responses.
This paper highlights the importance of a curated, filtered dataset in improving the results of IFT.
Stanford University fine-tuned Meta's open-source LLaMA model on a dataset of 52,000 instructions, creating a model called "Alpaca" (an alpaca is a relative of the llama).
The team that wrote the AlpaGasus paper found that the training data used by Stanford to create Alpaca contained many low-quality entries.
They argued that these entries often contain incorrect or irrelevant responses, which harm the fine-tuning process.
To address this, the team filtered the original 52,000-example Alpaca dataset, using a strong LLM (ChatGPT) as an automatic grader to score the quality of each example.
As a result, AlpaGasus fine-tunes Meta's LLaMA model on only about 9,000 high-quality examples.
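A minimal sketch of this style of LLM-based grading is shown below. The prompt wording, the 0-5 scale, and the `call_llm` callable are illustrative assumptions rather than the paper's exact setup; `call_llm` stands in for whatever chat-model client you use.

```python
# Sketch of LLM-based quality grading, in the spirit of AlpaGasus.
# `call_llm` is a placeholder for your own chat-model client; the prompt
# and the 0-5 scale are illustrative assumptions, not the paper's exact wording.

GRADING_PROMPT = """You are grading the quality of an instruction-tuning example.
Instruction: {instruction}
Input: {input}
Response: {response}
Rate the accuracy and helpfulness of the response on a scale of 0 to 5.
Reply with the numeric score only."""

def grade_example(example, call_llm):
    """Ask the grader model for a 0-5 quality score for one example."""
    prompt = GRADING_PROMPT.format(**example)
    return float(call_llm(prompt))  # assumes the grader replies with a bare number

def grade_dataset(dataset, call_llm):
    """Attach a quality score to every example in the dataset."""
    return [{**ex, "quality_score": grade_example(ex, call_llm)} for ex in dataset]
```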
Despite the much smaller fine-tuning dataset, AlpaGasus significantly outperforms the original Alpaca in instruction-following capability across multiple test sets and human evaluations.
Just as importantly, training on 9,000 examples instead of 52,000 substantially reduces training time, making the approach far more cost-efficient.
This reinforces a broader movement in the field: prioritise data quality over quantity.
Key Takeaways for Data Curation
Threshold-based Filtering
Consider implementing a threshold-based filtering approach. Set a threshold value for a scoring metric that reflects the quality of data. Data points with scores equal to or higher than the threshold are retained, while those below it are filtered out. This approach allows you to select data that meets a certain quality standard.
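As a concrete illustration, here is a minimal Python sketch of threshold-based filtering. It assumes each example already carries a quality score (for instance, from an LLM grader as above); the field name `quality_score` and the 4.5 cut-off are illustrative and should be tuned to your own scoring scale.

```python
# Minimal sketch of threshold-based filtering. Assumes each example has
# already been scored on a 0-5 quality scale; the field name and the
# threshold value are illustrative.

THRESHOLD = 4.5  # keep only examples scoring at or above this value

def filter_by_threshold(dataset, threshold=THRESHOLD):
    """Return the subset of examples that meet the quality threshold."""
    return [ex for ex in dataset if ex["quality_score"] >= threshold]

# Toy scored dataset.
dataset = [
    {"instruction": "Explain photosynthesis.", "response": "...", "quality_score": 4.8},
    {"instruction": "Translate 'hello' to French.", "response": "...", "quality_score": 3.2},
    {"instruction": "Summarise this article.", "response": "...", "quality_score": 4.5},
]

curated = filter_by_threshold(dataset)
print(f"Kept {len(curated)} of {len(dataset)} examples")  # -> Kept 2 of 3 examples
```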
Comparative Filtering
Explore the possibility of comparative filtering where you have multiple datasets or versions of a dataset. In the paper, they compared models trained on different subsets of data to assess the impact of data quality and quantity. You can create variations of your dataset, apply different filtering criteria, and then compare the performance of models trained on these subsets. This can help identify the best-performing subset of data.
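A sketch of that workflow might look like the following. `train_model` and `evaluate_model` are placeholders for your own fine-tuning loop and evaluation harness, not a real API.

```python
# Sketch of comparative filtering: build several candidate subsets under
# different thresholds, train a model on each, and compare their scores.
# `train_model` and `evaluate_model` are placeholders for your own
# training loop and benchmark.

def build_subsets(dataset, thresholds):
    """Create one candidate subset per filtering threshold."""
    return {t: [ex for ex in dataset if ex["quality_score"] >= t] for t in thresholds}

def compare_subsets(dataset, thresholds, train_model, evaluate_model):
    """Train one model per subset and return (best_threshold, all_scores)."""
    scores = {}
    for t, subset in build_subsets(dataset, thresholds).items():
        model = train_model(subset)        # e.g. a LoRA fine-tune on this subset
        scores[t] = evaluate_model(model)  # e.g. win rate on a held-out test set
    best = max(scores, key=scores.get)
    return best, scores

# Usage (with your own implementations):
# best, scores = compare_subsets(scored_data, [3.5, 4.0, 4.5],
#                                train_model, evaluate_model)
```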
Human vs. Machine Filtering
Consider filtering datasets based on whether the data is human-generated or machine-generated. In the paper, they applied their filtering method to both machine-generated and human-written datasets. You can use this approach to assess the impact of data quality on models when using different data sources. It may be beneficial to have a separate filtering process for each type of data source.
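One way to express this is a source-aware filter, sketched below. The `source` tag and the per-source threshold values are illustrative assumptions; the point is simply that each data source gets its own filtering criteria.

```python
# Sketch of source-aware filtering: apply separate quality criteria to
# human-written and machine-generated examples. Tags and threshold values
# are illustrative; tune them per dataset.

DEFAULT_THRESHOLDS = {"human": 4.0, "machine": 4.5}

def filter_by_source(dataset, thresholds=None):
    """Keep examples that meet the threshold for their own data source."""
    thresholds = thresholds or DEFAULT_THRESHOLDS
    return [
        ex for ex in dataset
        if ex["quality_score"] >= thresholds[ex["source"]]
    ]

dataset = [
    {"instruction": "...", "response": "...", "source": "human", "quality_score": 4.1},
    {"instruction": "...", "response": "...", "source": "machine", "quality_score": 4.1},
]
print(len(filter_by_source(dataset)))  # -> 1 (only the human-written example passes)
```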