In which scenarios does Prompt Tuning perform better than Fine Tuning?

As is well known, labeled data largely determines the upper limit of an AI algorithm, and labeling is very expensive; both contrastive learning and prompt learning focus on learning from few samples. This article introduces the ideas behind prompt learning and the approaches currently in common use.

Catalog

I. What are the training paradigms of NLP?
II. Why is prompt learning needed?
III. What is prompt learning?
IV. Common prompt learning methods
V. Summary

I. What are the training paradigms of NLP?

The current academic community generally divides the development of NLP tasks into four phases, i.e., the four paradigms of NLP.

First paradigm: based on “traditional machine learning models”, e.g., TF-IDF features plus classical algorithms such as naive Bayes.
Second paradigm: based on “deep learning models”, e.g., word2vec features plus deep learning algorithms such as LSTMs; compared with the first paradigm, model accuracy improves and feature-engineering effort decreases.
Third paradigm: based on “pre-trained model + finetuning”, e.g., BERT + finetuning on the NLP task; compared with the second paradigm, accuracy improves significantly and a good model can be trained on a small dataset, but the models become much larger.
Fourth paradigm: based on “pre-trained model + Prompt + prediction”, e.g., BERT + Prompt; compared with the third paradigm, far less training data is required.

Across the NLP field, development has moved toward higher accuracy with less supervision, even toward unsupervised methods, and prompt learning is the latest and hottest research direction along this path.

II. Why is prompt learning needed?

A good new approach is usually proposed to “fix the defects or shortcomings of an existing one”, so let us start from the previous paradigm: pre-trained language model (PLM) + finetuning, most commonly BERT + finetuning.

In this paradigm, for the pre-trained model to work well on a downstream task, its parameters must be fine-tuned on downstream data. The first problem is that the training objectives used in pre-training (autoregressive or autoencoding) differ greatly from the form of downstream tasks, so the capability of the pre-trained model itself cannot be fully exploited.

This inevitably means more data is needed to adapt to the new task form, so few-shot learning performs poorly and overfitting comes easily.

A gap exists between the forms of the upstream and downstream tasks

Secondly, pre-trained models keep getting larger, and finetuning a full model for each specific task and then deploying it online wastes enormous deployment resources.

Task-specific fine-tuning of the full model leads to high deployment costs

III. What is prompt learning?

The first “consensus” we should have is that there is a large amount of knowledge in the pre-trained model; the pre-trained model itself has the ability to learn with few samples.

In-Context Learning, proposed with GPT-3, effectively proved that the model can achieve good results in zero-shot and few-shot scenarios without updating any parameters; ChatGPT, from the recently popular GPT-3.5 series, is a prominent example.

The Essence of Prompt Learning

Unify all downstream tasks into the pre-training task: “convert the data of downstream tasks into natural language form with a specific template” to fully exploit the capability of the pre-trained model itself.

Essentially, it designs a template that matches the upstream pre-training task more closely; through template design, it “taps into the potential of the upstream pre-trained model”, so that the model can complete the downstream task with as little annotated data as possible. The key includes three steps:

Designing the task of the pre-trained language model
Designing the input template style (Prompt Engineering)
Designing the label style and the way to map the model’s output to the label (Answer Engineering)
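As a minimal sketch of steps 2 and 3 (all names and the template wording here are illustrative, not taken from any prompt-learning library), a sentiment task's template and verbalizer might look like:

```python
# Illustrative sketch of the design steps of prompt learning; all names
# are hypothetical and not from any real library.

# Step 2, Prompt Engineering: wrap the raw input in a cloze-style template.
def apply_template(text: str) -> str:
    return f"{text} This is a [MASK] movie."

# Step 3, Answer Engineering: map the words the masked LM may fill in
# at [MASK] back to task labels (the verbalizer).
VERBALIZER = {"great": "positive", "good": "positive",
              "bad": "negative", "terrible": "negative"}

def map_answer(predicted_word: str) -> str:
    return VERBALIZER[predicted_word]

# Step 1 (the pre-trained masked LM itself) is assumed to exist elsewhere;
# it would predict the word at [MASK] in the prompted text.
print(apply_template("The special effects are very cool, I like it."))
print(map_answer("great"))  # -> positive
```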

Forms of Prompt Learning

Taking the movie-review sentiment classification task as an example, the model needs to do binary classification based on the input sentence.

Original input: The special effects are very cool, I like it.

Prompt input, “Prompt template 1”: The special effects are very cool, I like it. This is a [MASK] movie.
Prompt input, “Prompt template 2”: The special effects are very cool, I like it. This movie is [MASK].

The role of the prompt template is exactly this: to convert the training data into natural language form and place the [MASK] at the right position to stimulate the ability of the pre-trained model.

Prompt learning template framework

Category Mapping/Verbalizer: Select appropriate prediction words and correspond these words to different categories.

Category Mapping

By constructing prompt-learning samples, Prompt Tuning can achieve good results with only a small amount of data and has strong zero-/few-shot learning capability.
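As a toy numeric sketch of the category mapping (the logit values here are invented for illustration), a softmax restricted to the verbalizer words turns the masked-LM logits at the [MASK] position into class probabilities:

```python
import math

# Toy sketch of the verbalizer/category mapping: restrict a softmax to the
# logits of the verbalizer words at the [MASK] position. Logit values are
# invented for illustration.
mask_logits = {"great": 3.1, "terrible": 0.4}
verbalizer = {"great": "positive", "terrible": "negative"}

exp_logits = {w: math.exp(v) for w, v in mask_logits.items()}
total = sum(exp_logits.values())
class_probs = {verbalizer[w]: exp_logits[w] / total for w in exp_logits}
print(class_probs)  # probabilities over the two classes, summing to 1
```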

IV. Common prompt learning methods

1. Hard template methods

1.1 Hard Template – PET (Pattern Exploiting Training)

PET is a classical prompt-learning method. As in the earlier example, it models the problem as a cloze (completion) task and then optimizes the final output words. Although PET also “optimizes the parameters of the whole model”, it requires “less data” than the traditional finetuning approach.

Modeling approach:

Previously, the model simply modeled P(l|x) (where l is the label); now the prompt P and the label mapping v (called the verbalizer by the authors) are added, so the problem becomes:

s_p(l|x) = M(v(l) | P(x))

where M denotes the model and s_p(l|x) is the logit of the verbalizer word v(l) generated at the [MASK] position of the prompted input P(x). A softmax over the labels then gives the probability:

q_p(l|x) = exp(s_p(l|x)) / Σ_{l'∈L} exp(s_p(l'|x))

The authors added “MLM loss” to the training for joint training.

Specific approach:

Train a separate model for each prompt on a small amount of supervised data.
For unsupervised data, ensemble the predictions of multiple prompts for the same sample, by averaging or by weighting (with weights assigned according to accuracy), then normalize to obtain a probability distribution that serves as the soft label for the unsupervised data.
Finetune a final model on the obtained soft labels.
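The ensembling step above can be sketched numerically; the per-prompt predictions and accuracy-based weights below are invented for illustration:

```python
# Numeric sketch of PET's soft-labelling step: several prompt-specific
# models each predict a class distribution for the same unlabelled sample;
# we weight-average and renormalize. All values are made up.
preds = [
    [0.9, 0.1],  # prompt/model 1: [P(positive), P(negative)]
    [0.6, 0.4],  # prompt/model 2
    [0.8, 0.2],  # prompt/model 3
]
weights = [0.5, 0.2, 0.3]  # e.g. proportional to each prompt's dev accuracy

# Weighted average of the distributions, then renormalize.
avg = [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(2)]
total = sum(avg)
soft_label = [v / total for v in avg]
print(soft_label)  # used as the target when finetuning the final model
```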
1.2 Hard template – LM-BFF

LM-BFF is work from Danqi Chen’s group; building on prompt tuning, it proposes prompt tuning with demonstrations and automatic prompt generation.

“Shortcomings of the hard template approach”:

Hard-template generation relies on two approaches: manual design based on experience, and automated search. However, manual design is not necessarily better than automated search, and automatically searched templates lack readability and interpretability.

The experimental results above show that changing even a single word in the prompt can make a huge difference to the results. This also points to a direction for subsequent optimization: abandon the hard template altogether and directly optimize the prompt token embeddings.

2. Soft template methods

2.1 Soft template – P-tuning

Instead of designing or searching for hard templates, a number of optimizable Pseudo Prompt Tokens are inserted directly on the input side to “automate the search for knowledge templates in continuous space”.

No reliance on manual design
Very few parameters to optimize, avoiding overfitting (it can also be fully fine-tuned, degenerating to traditional finetuning)

While traditional discrete prompts map each token of template T directly to its embedding, P-Tuning maps each Pi (Pseudo Prompt) in template T to a “trainable parameter hi”.

The “optimization key points” are: the natural-language hard prompt is replaced by a trainable soft prompt; the pseudo-token sequence in template T is encoded by a bidirectional LSTM; and a small number of natural-language anchor tokens (Anchor), such as “capital” in the figure above, are introduced to improve efficiency. P-tuning is thus a hard+soft hybrid rather than a fully soft form.

Specific approach:

Initialize a template: The capital of [X] is [MASK]
Replace the input: substitute the actual input “Britain” for [X], i.e., predict the capital of Britain
Pick one or more tokens from the template as soft prompts
Feed all soft prompts into the LSTM and obtain the “hidden state vector h” for each soft prompt
Send the initial template to BERT’s embedding layer, “replace the token embeddings of all soft prompts with h”, and then predict the mask
Core conclusion: with full data and a large model, fine-tuning only the prompt-related parameters is comparable in performance to full fine-tuning.
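The embedding-replacement step can be sketched in plain Python (the vectors below are made up; a real implementation would train the hi vectors, e.g. via the bidirectional LSTM, inside a framework such as PyTorch):

```python
# Pure-Python sketch of P-Tuning's embedding replacement. All vectors are
# toy values; only the pseudo-token vectors would be trainable.
EMB_DIM = 4
frozen_embeddings = {           # the pre-trained model's (frozen) table
    "capital": [0.2] * EMB_DIM,
    "Britain": [0.4] * EMB_DIM,
    "[MASK]":  [0.6] * EMB_DIM,
}
trainable_h = {                 # hidden-state vectors for the pseudo tokens
    "[P0]": [0.0] * EMB_DIM,
    "[P1]": [0.0] * EMB_DIM,
    "[P2]": [0.0] * EMB_DIM,
}

def embed(template):
    # Pseudo prompt tokens take their trainable h; real tokens (anchors,
    # input, [MASK]) use the frozen embedding table.
    return [trainable_h[tok] if tok in trainable_h else frozen_embeddings[tok]
            for tok in template]

# Template mixing pseudo tokens with the anchor "capital" and input "Britain".
seq = embed(["[P0]", "capital", "[P1]", "Britain", "[P2]", "[MASK]"])
print(len(seq))  # 6 embedding vectors fed into the frozen language model
```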

Code: https://github.com/THUDM/

2.2 Soft template – Prefix tuning

P-tuning updates prompt token embeddings, so it optimizes relatively few parameters. Prefix tuning aims to optimize more parameters to improve results, but without imposing too great a burden. Although prefix tuning was proposed for generation tasks, it had an enlightening influence on the subsequent development of soft prompts.

Optimizing the Prompt token embedding for each layer, not just the input layer

As seen in the figure above, the model adds a prefix before each transformer layer. The prefix is not made of real tokens but is a “soft prompt”; during prefix-tuning training, the transformer’s parameters are frozen and only the prefix parameters are updated.

Only one copy of the large transformer plus the learned task-specific prefixes needs to be stored, so each additional task incurs only a very small overhead.

Autoregressive model

Approach, using the autoregressive model in the figure as an example:

The input is represented as Z = [PREFIX; x; y]
Prefix-tuning initializes a trainable matrix P that stores the prefix parameters
For prefix tokens, the parameters are taken from this trainable matrix; for all other tokens, the parameters are fixed to those of the pre-trained language model
Core conclusion: with full data and a large model, fine-tuning only the prefix-related parameters performs comparably to full fine-tuning on generation tasks.
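A back-of-the-envelope sketch of the storage saving per task (all sizes below are illustrative, not taken from the paper):

```python
# Sketch of Prefix-Tuning's per-task storage cost: only the trainable
# prefix matrix P is stored per task, while the large transformer is
# shared and frozen. Sizes are illustrative.
n_layers, prefix_len, hidden = 12, 10, 768
frozen_lm_params = 110_000_000          # roughly a BERT-base-sized model

# One prefix vector per position, per layer.
trainable_prefix_params = n_layers * prefix_len * hidden
print(trainable_prefix_params)          # 12 * 10 * 768 = 92160

share = trainable_prefix_params / frozen_lm_params
print(f"{share:.4%} of the full model is stored per extra task")
```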

Code: https://github.com/XiangLi1999/PrefixTuning

2.3 Soft template – Prompt Tuning

Prompt Tuning then systematically validates the effectiveness of the soft-template approach, proposing that fixing the base model and effectively using task-specific soft prompt tokens can significantly reduce resource consumption while preserving the generality of large models.

It is a simplification of prefix-tuning: the pre-trained model is fixed, and only “an extra k learnable tokens” are added to the input of the downstream task. With large-scale pre-trained models, this approach is comparable in performance to traditional fine-tuning.
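A minimal sketch of this input construction, with toy numbers standing in for real embeddings:

```python
# Sketch of Prompt Tuning's input construction: k learnable soft-prompt
# vectors are prepended to the frozen token embeddings of the input.
# All numbers are toy values; a real run would use a DL framework.
k, emb_dim = 3, 4
soft_prompt = [[0.01 * (i + 1)] * emb_dim for i in range(k)]  # trainable
input_embs = [[0.5] * emb_dim, [0.7] * emb_dim]               # frozen lookup

model_input = soft_prompt + input_embs  # shape: [k + seq_len, emb_dim]
print(len(model_input))  # only the first k vectors receive gradients
```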

Code: https://github.com/kipgparker/soft-prompt-tuning

V. Summary
“Components of prompt learning”

Prompt template: depending on the pre-trained model used, two types of templates can be constructed: fill-in-the-blank or prefix-based generation
Category mapping/Verbalizer: select appropriate category-mapping words based on experience
Pre-trained language model
“Summary of typical prompt learning methods”

Hard template methods: manual design or automatic construction of templates based on discrete tokens
1) PET 2) LM-BFF
Soft template methods: instead of pursuing the intuitive interpretability of templates, directly optimize the prompt token embeddings, which are vectors/learnable parameters
1) P-tuning 2) Prefix-tuning
We will try prompt learning on classification and information-extraction tasks later, and will keep updating.
