PEFT (parameter-efficient finetuning) is a family of techniques for finetuning large pre-trained language models like GPT-3. Rather than originating in a single paper, it spans methods proposed by several research groups, most prominently adapters (Houlsby et al., 2019) and LoRA (Hu et al., 2021), all aimed at reducing the computational cost and carbon footprint of finetuning large models.
The key idea behind PEFT is to only update a small subset of the model’s parameters when finetuning on a downstream task. Large pretrained models like GPT-3 have billions of parameters. Finetuning all of these parameters on new data is computationally prohibitive and wasteful. PEFT proposes identifying a small set of task-specific parameters to update during finetuning while the rest of the model’s parameters remain fixed.
One common realization of this idea is the adapter approach: small bottleneck modules are inserted inside the layers of the frozen base model, and only these adapter parameters are updated during finetuning. Because an adapter has far fewer parameters than the base model, training is much faster and cheaper. At inference time, activations pass through both the frozen layers and the trained adapters, producing outputs adapted to the task.
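As a concrete illustration, here is a minimal PyTorch sketch of a bottleneck adapter in the style of Houlsby et al. (2019); the class name, dimensions, and initialization are illustrative assumptions, not any particular library’s API.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back up,
    and add the result to the input via a residual connection."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Initialize the up-projection at zero so the adapter starts out
        # as an identity function and learns a task-specific residual.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# During finetuning, only the adapter's parameters receive gradients.
hidden = torch.randn(2, 10, 768)   # (batch, seq_len, hidden_dim) -- illustrative
adapter = Adapter(hidden_dim=768)
out = adapter(hidden)              # same shape as the input
```

With a hidden size of 768 and a bottleneck of 64, each adapter adds roughly 100K parameters, a tiny fraction of a model with hundreds of millions or billions of weights.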
Parameter-efficient finetuning can be realized through a variety of techniques:
- Selective Layer Tuning: Instead of fine-tuning all the layers of a model, one might fine-tune only a subset of layers. This reduces the number of parameters that need to be updated (see the first sketch after this list).
- Adapters: Adapter layers are small neural networks that are inserted between the layers of a pre-trained model. During fine-tuning, only these adapter layers are trained, keeping the pre-trained parameters frozen. This way, the adapters learn to adapt the features extracted by the pre-trained model for the new task (the adapter sketch earlier in this section illustrates the pattern).
- Sparse Fine-Tuning: Traditional fine-tuning adjusts all parameters slightly, but sparse fine-tuning changes only a subset of the model’s parameters, typically chosen by a criterion that identifies the parameters most relevant to the new task (sketched below).
- Low-Rank Approximations: Another strategy is to constrain the weight update learned during fine-tuning to a low-rank form, so that far fewer new parameters are trained while task performance stays close to full fine-tuning (sketched below).
- Regularization Techniques: Regularization terms can be added to the loss function to discourage large changes to the pretrained parameters, keeping the fine-tuned model close to the original in a more “parameter-efficient” way (sketched below).
- Task-specific Heads: Sometimes, a task-specific layer or “head” is added to the pre-trained model architecture, and only this head is fine-tuned, reducing the number of parameters that need to be learned (sketched below).
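The sketches below illustrate several of these strategies in PyTorch; all model sizes and hyperparameters are illustrative stand-ins, not prescriptions. First, selective layer tuning: freeze everything, then unfreeze only the top layers.

```python
import torch.nn as nn

# A stand-in for a pretrained model: a stack of transformer encoder layers.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

# Freeze all parameters, then unfreeze only the top two layers.
for param in model.parameters():
    param.requires_grad = False
for layer in model.layers[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable / total:.1%} of parameters")
```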
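For sparse fine-tuning, a simple, well-known instance is BitFit (Ben Zaken et al., 2022), which trains only the bias terms. A sketch of that selection rule, using the same kind of stand-in model:

```python
import torch.nn as nn

# Stand-in pretrained model (illustrative sizes).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

# BitFit-style sparsity: freeze all weight matrices, train only the biases.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")
```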
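For low-rank approximations, the best-known method is LoRA (Hu et al., 2021), which keeps the pretrained weight frozen and learns the update as a product of two small matrices. A minimal sketch of the idea (rank and scaling are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False   # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank     # B starts at zero, so the update starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))      # only A and B receive gradients
```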
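For the regularization strategy, one common form is an L2 penalty that anchors the fine-tuned weights to their pretrained values. A sketch, with an illustrative penalty strength:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(768, 768)            # stand-in pretrained model
pretrained = copy.deepcopy(model)      # frozen reference copy of the weights
for p in pretrained.parameters():
    p.requires_grad = False

def anchored_loss(task_loss: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    """Task loss plus an L2 penalty on the drift from the pretrained weights."""
    drift = sum(
        ((p - q) ** 2).sum()
        for p, q in zip(model.parameters(), pretrained.parameters())
    )
    return task_loss + lam * drift
```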
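And for task-specific heads: freeze the backbone and train only a new output layer. A sketch with an illustrative five-class head and mean pooling:

```python
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
for p in backbone.parameters():
    p.requires_grad = False            # keep the pretrained backbone fixed

head = nn.Linear(768, 5)               # new task head: 5 output classes

tokens = torch.randn(2, 10, 768)       # (batch, seq_len, hidden_dim)
features = backbone(tokens).mean(dim=1)  # mean-pool over the sequence
logits = head(features)                  # only the head is trained
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
```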
Published results bear this out: parameter-efficient methods can approach or match the accuracy of full finetuning while updating only a small fraction of the model’s parameters. Houlsby et al. (2019) reported that adapters come within about 0.4% of full finetuning performance on the GLUE benchmark while adding only a few percent of new parameters per task, and Hu et al. (2021) reported that LoRA matches or exceeds full finetuning quality on GPT-3 175B while cutting the number of trainable parameters by roughly a factor of 10,000. Savings of this size translate directly into lower compute, storage, and energy costs.
The common thread across these methods is 1) freezing nearly all of the base model’s parameters during finetuning and 2) training a small set of lightweight, task-specific parameters (adapters, low-rank updates, masks, or heads) inside or alongside the frozen network. By limiting training to that small set, PEFT provides a scalable and eco-friendly approach to adapting large pretrained LMs.
In summary, PEFT makes it practical to finetune massive models by training only a small fraction of their parameters, reducing computational costs and carbon emissions. It is a promising family of techniques for sustainably deploying large language models.