LLM (7): Parameter-Efficient Fine-Tuning (PEFT)

As models grow ever larger, fine-tuning all of a model's parameters on consumer hardware becomes infeasible. In addition, storing and deploying a separately fine-tuned model for every downstream task is very expensive, because each fine-tuned model is the same size as the original pretrained model. Parameter-efficient fine-tuning (PEFT) methods aim to solve both problems: they achieve performance comparable to full-parameter fine-tuning while training only a small number of parameters.

1. Types of PEFT Methods

Additive methods

The main idea of additive methods is to augment the existing pre-trained model with extra parameters or layers and to train only the newly added parameters. There are two main sub-categories: Adapters and Soft Prompts.

Adapters introduce small trainable fully connected layers into the Transformer architecture, while Soft Prompts aim to control the LLM's behavior by modifying the input prompt, keeping the model's structure fixed and frozen.

Selective methods

Selective methods fine-tune a subset of the model's existing parameters, selected by layer depth, by layer type, or even at the level of individual parameters. One example is tuning only the attention layers. The performance of selective methods is mixed, and there is a clear trade-off between parameter efficiency and compute efficiency.

Reparametrization-based methods

Reparametrization-based PEFT methods exploit low-rank approximation to minimize the number of trainable parameters: a low-rank matrix is meant to capture the underlying low-rank structure of high-dimensional data. These methods freeze the original LLM parameters and introduce a small number of trainable parameters through a new low-rank transformation.

1.1 Representative PEFT Methods

1.1.1 Prompt Tuning

  • Prepend an extra set of pseudo embedding vectors to the input sequence
  • Train only these pseudo embeddings, achieving the effect of fine-tuning with very few parameters (see the sketch below)
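A minimal PyTorch sketch of the idea (the wrapper class and `n_virtual` are illustrative, not a library API; only `soft_prompt` receives gradients):

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepend n_virtual trainable pseudo embeddings to the input embeddings."""
    def __init__(self, model, n_virtual: int = 20):
        super().__init__()
        self.model = model
        for p in self.model.parameters():  # freeze the base model
            p.requires_grad = False
        emb_dim = model.get_input_embeddings().embedding_dim
        # the only trainable parameters: n_virtual pseudo embeddings
        self.soft_prompt = nn.Parameter(torch.randn(n_virtual, emb_dim) * 0.02)

    def forward(self, input_ids, attention_mask):
        tok_emb = self.model.get_input_embeddings()(input_ids)        # (B, L, D)
        batch = tok_emb.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)  # (B, n, D)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        prompt_mask = attention_mask.new_ones(batch, prompt.size(1))
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```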

1.1.2 P-Tuning

  • Use a generator (a prompt encoder) to produce the pseudo embeddings above
  • Only the generator's parameters are trainable

1.1.3 Prefix-Tuning

  • Fabricate the preceding hidden states (a trainable prefix)
  • Train only this fabricated prefix (see the sketch below)
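With the 🤗 peft library, prefix tuning can be set up in a few lines (a sketch; `gpt2` stands in for any causal LM):

```python
from peft import PrefixTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=32,  # length of the fabricated prefix hidden states
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the prefix encoder is trainable
```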

1.1.4 LoRA

  • Add a low-rank update matrix to a Transformer weight matrix ($\Delta W = BA$)
  • Train only $A$ and $B$
  • In principle, this can be applied to any weight matrix in the Transformer, including the embedding matrix
  • It is usually applied to the Query and Value projection matrices

For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the computation of the hidden state is modified from $h = W_0 x$ to

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

The update is constrained by the low-rank decomposition $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ and the rank $r \ll \min(d, k)$; both $W_0$ and $\Delta W = BA$ are multiplied with the same input. During training, $W_0$ is frozen and receives no gradient updates, while $A$ and $B$ hold the trainable parameters.
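A minimal sketch of this computation wrapped around an existing `nn.Linear` (zero-initializing $B$ makes $\Delta W = BA$ zero at the start, so training begins exactly from the pretrained behavior; the `scaling` factor follows the common $\alpha / r$ convention):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W0 plus a trainable low-rank update B @ A (a sketch of the LoRA idea)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False       # W0 is frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))     # r x k
        self.B = nn.Parameter(torch.zeros(d, r))     # d x r, zero init => ΔW = 0 at start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x):
        # W0 x + (B A) x, both applied to the same input
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```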

QLoRA

What is model quantization

Quantization stores model weights in a lower-precision data type (for example, 8-bit or 4-bit integers instead of 16- or 32-bit floats), trading a small amount of precision for a large reduction in memory.

Further reading: https://huggingface.co/blog/hf-bitsandbytes-integration

QLoRA introduces several innovations to save GPU memory without sacrificing performance (a configuration sketch follows the list):

  • 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights
  • Double quantization, which reduces the average memory footprint by quantizing the quantization constants
  • Paged optimizers, which manage memory spikes
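In 🤗 Transformers, the first two items map onto `BitsAndBytesConfig` and the third onto the optimizer name; a sketch (the same configuration reappears in the Llama 3 QLoRA code in section 3):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 weights with double quantization (the first two bullets)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", quantization_config=bnb_config
)
# the paged optimizer (the third bullet) is selected via the optimizer name
training_args = TrainingArguments(output_dir="out", optim="paged_adamw_8bit")
```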

1.1.5 Adapter

Adapter layers are inserted after the two feed-forward sublayers, so the only parameters an adapter method needs to fine-tune are the newly added adapter layers; the number of fine-tuned parameters grows linearly with the number of Transformer layers.

An adapter layer is a bottleneck structure: it first projects the original $d$-dimensional features down to a smaller dimension $m$, then applies a nonlinearity, and finally projects back up to $d$ dimensions. Including biases, the total number of parameters added per layer is $2md + d + m$. Setting $m \ll d$ keeps each adapter layer's parameter count small.

Note that the adapter layer is also a residual structure, so as long as the fully connected layer on its main path is initialized to zero, the adapter starts out as an identity mapping, which is equivalent to having added nothing at all, and the network should still train properly.

Comparing adapters against fine-tuning only the last few Transformer layers, adapters generally outperform the latter at the same parameter budget, and even with two orders of magnitude fewer parameters they come close, which shows that adapters are quite cost-effective for low-parameter fine-tuning.
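A minimal sketch of such a bottleneck adapter (down-projection to $m$, nonlinearity, up-projection back to $d$, residual connection, and near-zero initialization of the up-projection so it starts as an identity mapping):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: d -> m -> d with a residual connection (a sketch)."""
    def __init__(self, d: int, m: int):
        super().__init__()
        self.down = nn.Linear(d, m)   # m*d + m parameters
        self.act = nn.GELU()
        self.up = nn.Linear(m, d)     # m*d + d parameters, total: 2md + d + m
        # near-zero init so the adapter starts as (almost) an identity mapping
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))  # residual: identity at init
```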

1.1.6 IA3

Concretely, the authors modify attention by introducing learned rescaling vectors $l_k$ and $l_v$:

$$\mathrm{softmax}\!\left(\frac{Q\,(l_k \odot K^{\top})}{\sqrt{d_k}}\right)(l_v \odot V)$$

and then add a vector $l_{ff}$ to the position-wise feed-forward networks:

$$(l_{ff} \odot \gamma(W_1 x))\, W_2$$

where $\gamma$ is the network's nonlinearity. The authors apply the same modification to every Transformer layer.

On few-shot training sets, IA3 beats other fine-tuning methods with only a tiny number of trainable parameters, and even outperforms full fine-tuning.
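A minimal sketch of the attention modification (only `l_k` and `l_v` are trainable; initializing them to ones leaves the frozen model's behavior unchanged at the start):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IA3Attention(nn.Module):
    """Scaled dot-product attention with IA3 rescaling vectors l_k and l_v (a sketch)."""
    def __init__(self, d_k: int):
        super().__init__()
        self.l_k = nn.Parameter(torch.ones(d_k))  # only these vectors are trainable
        self.l_v = nn.Parameter(torch.ones(d_k))
        self.d_k = d_k

    def forward(self, q, k, v):            # (B, L, d_k) each, from frozen W_q/W_k/W_v
        k = k * self.l_k                   # element-wise rescale the keys
        v = v * self.l_v                   # element-wise rescale the values
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        return F.softmax(scores, dim=-1) @ v
```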

1.1.7 Ladder Side-Tuning

Backpropagation, i.e., computing the model's gradients, proceeds step by step from the output layer back toward the input layer, so the depth (and cost) of backpropagation is determined by how close the trainable parameters are to the input layer, and has no necessary relationship to the number of trainable parameters. For adapters, a small layer is inserted after every Transformer layer; although all other parameters are frozen and only the inserted layers are trainable, there is a new layer at every depth, so backpropagation must reach all the way back to the input layer. For P-tuning, the only trainable parameters are essentially a small number in the embedding layer, but the embedding layer is the input layer, so backpropagation again has to traverse the entire model. As a result, neither scheme improves training efficiency much.

LST, by contrast, builds a "side branch" (a ladder) alongside the original large model: the outputs of some of the large model's layers are fed into the side network as inputs, and all trainable parameters live in the side network. Since the large model only provides inputs, the cost of backpropagation depends on the size of the side network, and no backpropagation needs to run through the original large model, so training efficiency improves noticeably.
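A minimal sketch of the idea (all names are illustrative): the backbone runs without gradients, its hidden states are detached and projected into a small side network, and backpropagation only ever touches the side network:

```python
import torch
import torch.nn as nn

class LadderSide(nn.Module):
    """Side network fed by detached backbone hidden states (a sketch of the LST idea)."""
    def __init__(self, n_layers: int, d_backbone: int, d_side: int, n_classes: int):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d_backbone, d_side) for _ in range(n_layers)])
        self.blocks = nn.ModuleList([nn.Linear(d_side, d_side) for _ in range(n_layers)])
        self.head = nn.Linear(d_side, n_classes)

    def forward(self, hidden_states):      # list of (B, L, d_backbone) from the backbone
        h = 0
        for proj, block, hs in zip(self.proj, self.blocks, hidden_states):
            # detach: gradients never flow back into the large frozen model
            h = torch.relu(block(h + proj(hs.detach())))
        return self.head(h.mean(dim=1))    # pool over the sequence for classification

# the backbone forward pass can even run under torch.no_grad():
# with torch.no_grad():
#     outs = backbone(input_ids, output_hidden_states=True)
# logits = side(outs.hidden_states)
```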

1.1.8 BitFit

BitFit fine-tunes only the network's bias terms, updating only about 0.05% of the model's parameters.

BitFit is an extremely simple method, but its performance falls short of full fine-tuning.
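A sketch of BitFit, assuming an existing 🤗/PyTorch `model` object:

```python
# BitFit: freeze everything except the bias terms
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.3f}%)")
```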

1.2 Architecture and Modification Positions of PEFT Methods

2. ChatGLM3 Prefix-Tuning

Fine-tune ChatGLM3 into a dialogue bot with both NLU and question-answering abilities.

2.1 Data Source

Hotel booking scenario: https://github.com/thu-coai/CrossWOZ

Hotel database: https://github.com/thu-coai/CrossWOZ/blob/master/data/crosswoz/database/hotel_db.json

2.2 Data Augmentation

  • Extract the hotel-only dialogues from the CrossWOZ dataset
  • Use ChatGPT to make the following edits and additions
    • Make descriptions of facilities more colloquial
      • "找一家有国际长途电话的酒店" -> "找一家能打国际长途的酒店"
    • Add a certain proportion of multi-turn follow-up Q&A and closing turns (p=0.3)
      • Follow-up questions when a single hotel is mentioned: "这个酒店的电话是多少" (What is this hotel's phone number?)
      • Comparison questions when several hotels are recommended: "哪个酒店评分更高" (Which hotel has the higher rating?)
      • Closing turns: "好的,祝您入住愉快" (OK, enjoy your stay)
    • Add dialogues that query by hotel name (or its abbreviation) and by price upper bound (phrasings absent from the original data)

Finally, the data is split 8:1:1 into training, validation, and test sets; a split sketch follows, then two samples: one single-turn, and one with the added follow-up Q&A and closing turn.
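A minimal sketch of the split (the `samples` list and the seed are assumptions; the jsonl file names match the training scripts later in this post):

```python
import json
import random

random.seed(42)
random.shuffle(samples)                  # `samples`: the augmented dialogue list
n = len(samples)
splits = {
    "train": samples[: int(0.8 * n)],
    "dev":   samples[int(0.8 * n): int(0.9 * n)],
    "test":  samples[int(0.9 * n):],
}
for name, part in splits.items():        # one JSON object per line (jsonl)
    with open(f"{name}.chatglm3.jsonl", "w", encoding="utf-8") as f:
        for s in part:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")
```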

[
    {
        "role": "user",
        "content": "你好,我出差想去酒店住宿。请帮我推荐一家公共区域和部分房间提供wifi服务,评分是4分以上的酒店。"
    },
    {
        "role": "search",
        "arguments": {
            "facilities": ["公共区域和部分房间提供wifi"],
            "rating_range_lower": 4.0
        }
    },
    {
        "role": "return",
        "records": [
            {
                "name": "北京龙鼎华鼎云酒店",
                "type": "舒适型",
                "address": "北京朝阳区潘家园东里18号",
                "subway": "劲松地铁站D口",
                "phone": "010-52001188",
                "facilities": [
                    "公共区域和部分房间提供wifi",
                    "宽带上网",
                    "国际长途电话",
                    "吹风机",
                    "24小时热水",
                    "中式餐厅",
                    "会议室",
                    "无烟房",
                    "商务中心",
                    "洗衣服务",
                    "行李寄存",
                    "叫醒服务"
                ],
                "price": -1.0,
                "rating": 4.3,
                "hotel_id": 24
            }
        ]
    },
    {
        "role": "assistant",
        "content": "那您去北京龙鼎华鼎云酒店住宿吧,酒店质量很好。"
    }
]
[
    {
        "role": "user",
        "content": "你好,我出差想去酒店住宿。请帮我推荐一家提供无线网络且评分在4分以上的酒店。"
    },
    {
        "role": "search",
        "arguments": {
            "facilities": ["无线网络"],
            "rating_range_lower": 4.0
        }
    },
    {
        "role": "return",
        "records": [
            {
                "name": "北京龙鼎华鼎云酒店",
                "type": "舒适型",
                "address": "北京朝阳区潘家园东里18号",
                "subway": "劲松地铁站D口",
                "phone": "010-52001188",
                "facilities": [
                    "公共区域和部分房间提供wifi",
                    "宽带上网",
                    "国际长途电话",
                    "吹风机",
                    "24小时热水",
                    "中式餐厅",
                    "会议室",
                    "无烟房",
                    "商务中心",
                    "洗衣服务",
                    "行李寄存",
                    "叫醒服务"
                ],
                "price": -1.0,
                "rating": 4.3,
                "hotel_id": 24
            }
        ]
    },
    {
        "role": "assistant",
        "content": "那您去北京龙鼎华鼎云酒店住宿吧,酒店质量很好。"
    },
    {
        "role": "user",
        "content": "这个酒店的评分是多少?"
    },
    {
        "role": "assistant",
        "content": "这个酒店的评分是4.3分。"
    },
    {
        "role": "user",
        "content": "好的,我决定入住北京龙鼎华鼎云酒店了。"
    },
    {
        "role": "assistant",
        "content": "好的,祝您入住愉快。"
    }
]

2.3 Data Processing

import json
import ast
import astunparse
from transformers import PreTrainedTokenizer
from torch.utils.data import Dataset
from copy import deepcopy
from typing import Dict, List

# text constants
FUNCTION_CALL_NAME = 'tool_call'
FUNCTION_CALL_PREFIX = '```python\n'
FUNCTION_CALL_POSTFIX = '\n```'
TOOL_DEFINITION_PREFIX = 'Answer the following questions as best as you can. You have access to the following tools:\n'
CONVERSATOIN_KEY = 'conversations'
TOOL_DESC_KEY = 'tools'

def format_function_call(function_name: str, parameters: Dict[str, str]):
    function_name = ast.Name(id=function_name)
    keywords = [
        ast.keyword(arg=arg_name, value=ast.Constant(arg_value))
        for arg_name, arg_value in parameters.items()
    ]
    func_call = ast.Call(func=function_name, args=[], keywords=keywords)
    return astunparse.unparse(func_call).strip()

def sanity_check(tokens: List[int], target: List[int], tokenizer: PreTrainedTokenizer):
    print("Sanity Check >>>>>>>>>>>>>")
    for t, m in zip(tokens, target):
        decoded = tokenizer.tokenizer.index_special_tokens[t] \
            if t in tokenizer.tokenizer.index_special_tokens \
            else tokenizer.decode([t])
        print("%20s: %6d -> %6d" % (repr(decoded), t, m))
    print("<<<<<<<<<<<<< Sanity Check")

    assert len(tokens) == len(target), f"length mismatch: {len(tokens)} vs {len(target)}"

2.3.1 Basic Concatenation

2.3.2 Concatenating Multi-Turn Dialogues

The ChatGLM 3 approach

Since a causal LM always predicts from left to right, we can mark out multiple output segments directly within a single multi-turn dialogue. Concretely:

Role special tokens delimit the turns of a dialogue, and also help guard against injection attacks:

  • <|system|>  # system prompt, declaring the tools the model may use, etc.
  • <|user|>  # user input, i.e. the user's instruction
  • <|assistant|>  # the model's reply, or what the model intends to do
  • <|observation|>  # tool-call or code-execution results

Note: each <|role|> marker here is a single token, not a string of text, so its id cannot be obtained via tokenizer.encode('<|role|>').
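For example, the ChatGLM3 tokenizer looks these ids up through its `get_command` helper, the same helper the data-processing code below uses for `[gMASK]` and `sop` (a sketch, assuming `tokenizer` is the loaded ChatGLM3 tokenizer):

```python
# correct: look up the id of the single special token
user_token_id = tokenizer.get_command("<|user|>")

# wrong: this would split the string "<|user|>" into several ordinary text tokens
# tokenizer.encode("<|user|>")
```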

The role token is followed by metadata: for function calling, the metadata is the function to call and its arguments; for the other roles, the metadata is empty.

  • During multi-turn fine-tuning, a loss_mask is assigned per token according to its role
  • Loss for all of the dialogue's replies is computed in a single forward pass

<|system|>Answer the following questions as best as you can.

You have access to the following tools:\n[...]

<|user|> 北京的天气怎么样?

<|assistant|> 我需要调用天气预报工具来获取北京的天气信息。

<|assistant|>get_weather
```python
tool_call(location="北京")
```

<|observation|> {"temperature_c": 12, "description": "haze"}

<|assistant|> 根据天气工具的信息,北京的天气是:温度 12 摄氏度,有雾。

<|user|> 这样的天气适合外出活动吗?

<|assistant|> 北京现在有雾,气温较低,建议您考虑一下是否适合外出进行锻炼。

<|user|>


The highlighted portions (the assistant outputs above) are the tokens on which loss is computed. Note that both the content after <|assistant|> and the role token itself are included in the loss.

Official walkthrough: https://www.bilibili.com/video/BV1uC4y1J7yA

Multi-turn dialogue processing code

def format_conversation(item, tokenizer, conversation_key: str, tool_key: str):
    conversations = deepcopy(item[conversation_key])

    # Note: `loss_mask` here means whether *the prediction* of the token should take loss
    tokens, loss_masks = [tokenizer.get_command("[gMASK]"), tokenizer.get_command("sop")], [0, 0]

    def _update(_tokens: List[int], value: int = 1):
        value = int(value)
        tokens.extend(_tokens)
        loss_masks.extend([value] * len(_tokens))

    # insert system prompt for tools
    if tool_key in item:
        conversations.insert(0,
            {
                "role": "system",
                "content": TOOL_DEFINITION_PREFIX + json.dumps(item[tool_key], indent=4, ensure_ascii=False)
            }
        )

    for idx, conv in enumerate(conversations):
        loss = conv.get("loss", True)
        if conv['role'] in {'system', 'user'}:
            loss = False
        if conv['role'] == 'tool':
            # function call python code
            value = FUNCTION_CALL_PREFIX + format_function_call(FUNCTION_CALL_NAME, conv["parameters"]) + FUNCTION_CALL_POSTFIX
            text = tokenizer.build_single_message("assistant", conv["name"], value)
            _update(text, loss)

            # function call result
            value = conv.get('observation', None)
            if not isinstance(value, str):
                value = json.dumps(value, ensure_ascii=False)
            text = tokenizer.build_single_message("observation", "", value)
            _update(text, False)
        else:
            text = tokenizer.build_single_message(conv['role'], "", conv["content"])
            _update(text, loss)

    _update([tokenizer.eos_token_id], False)

    assert len(tokens) == len(loss_masks), f"length mismatch: {len(tokens)} vs {len(loss_masks)}"
    return tokens, loss_masks

2.3.3 Data Loading in ChatGLM 3

In practice, you only need to align the data above with ChatGLM 3's standard data format; you can then call its native data loader, which performs the concatenation automatically.

ChatGLM 3 recently refactored its official code; the data-loading part lives at: https://github.com/THUDM/ChatGLM3/blob/main/finetune_demo/finetune_hf.py

However, this refactored version does not implement the tool-concatenation part. For an earlier version that does, see: https://github.com/THUDM/ChatGLM3/blob/4568c635e686e8e2053568d041f36c884cab328a/finetune_demo/preprocess_utils.py

{
    "tools": [
        "search_hotels: 根据筛选条件查询酒店的函数\nparameters: {\"name\":\"酒店名称\",\"price_range_lower\":\"价格下限\",\"price_range_upper\":\"价格上限\",\"rating_range_lower\":\"评分下限\",\"rating_range_upper\":\"评分上限\",\"facilities\": \"酒店提供的设施\"}\noutput: 酒店信息dict组成的list"
    ],
    "conversations": [
        {
            "role": "user",
            "content": "请帮我找一家最低价格是300-400元,提供无烟房的经济型酒店。"
        },
        {
            "role": "assistant",
            "content": "我需要使用search_hotels工具来查询酒店"
        },
        {
            "role": "tool",
            "name": "search_hotels",
            "parameters": {
                "facilities": ["无烟房"],
                "price_range_lower": 300,
                "price_range_upper": 400,
                "type": "经济型"
            },
            "observation": [
                {
                    "name": "飘HOME连锁酒店(北京王府井店)",
                    "type": "经济型",
                    "address": "北京东城区东安门大街43号",
                    "subway": "灯市口地铁站A口",
                    "phone": "010-57305888",
                    "facilities": [
                        "酒店各处提供wifi",
                        "宽带上网",
                        "吹风机",
                        "24小时热水",
                        "暖气",
                        "无烟房",
                        "早餐服务",
                        "行李寄存",
                        "叫醒服务"
                    ],
                    "price": 303.0,
                    "rating": 4.3,
                    "hotel_id": 152
                }
            ]
        },
        {
            "role": "assistant",
            "content": "推荐您去飘HOME连锁酒店(北京王府井店)。"
        }
    ]
}

Data loading code

class MultiTurnDataset(Dataset):
    def __init__(self, data: List[dict], tokenizer: PreTrainedTokenizer, max_seq_length: int):
        super(MultiTurnDataset, self).__init__()
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i) -> dict:
        data_item = self.data[i]
        tokens, loss_masks = format_conversation(data_item, self.tokenizer, CONVERSATOIN_KEY, TOOL_DESC_KEY)

        # labels are used inside the model
        target_based_loss_mask = [False] + loss_masks[:-1]
        labels = [(t if m else -100) for t, m in zip(tokens, target_based_loss_mask)]

        tokens = tokens[:self.max_seq_length]
        labels = labels[:self.max_seq_length]
        tokens += [self.tokenizer.pad_token_id] * (self.max_seq_length - len(tokens))
        labels += [-100] * (self.max_seq_length - len(labels))

        assert len(tokens) == len(labels), f"length mismatch: {len(tokens)} vs {len(labels)}"

        return {
            "input_ids": tokens,
            "labels": labels
        }

2.4 ChatGLM 3 Prefix-Tuning Training Code

Model arguments

"""
文件中定义了模型定义和训练过程中的命令行参数
"""
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""

model_name_or_path: str = field(
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
)
checkpoint_path: Optional[str] = field(
default=None, metadata={"help": "Path to pt2 or lora finetuned checkpoint dir"}
)
ptuning_checkpoint: str = field(
default=None, metadata={"help": "Path to p-tuning v2 checkpoints"}
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None,
metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": (
"Will use the token generated when running `huggingface-cli login` (necessary to use this script "
"with private models)."
)
},
)
resize_position_embeddings: Optional[bool] = field(
default=None,
metadata={
"help": (
"Whether to automatically resize the position embeddings if `max_source_length` exceeds "
"the model's position embeddings."
)
},
)
quantization_bit: Optional[int] = field(
default=None
)
pre_seq_len: Optional[int] = field(
default=None
)
prefix_projection: bool = field(
default=False
)

Data arguments

@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """
    train_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
    )
    validation_file: Optional[str] = field(
        default=None, metadata={"help": "The input validation data file (a jsonlines or csv file)."}
    )
    test_file: Optional[str] = field(
        default=None, metadata={"help": "The input test data file (a jsonlines or csv file)."}
    )

    max_seq_length: Optional[int] = field(
        default=2048,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated."
            )
        },
    )

    max_source_length: Optional[int] = field(
        default=1024,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    max_target_length: Optional[int] = field(
        default=128,
        metadata={
            "help": (
                "The maximum total sequence length for target text after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )

    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )

    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )

    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to pad all samples to model maximum sentence length. "
                "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
                "efficient on GPU but very bad for TPU."
            )
        },
    )

    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )

Fine-tuning arguments

@dataclass
class PeftArguments:
    lora_rank: int = field(
        default=None,
        metadata={"help": "LoRA rank number"}
    )
    lora_alpha: int = field(
        default=32,
        metadata={"help": "LoRA alpha weight"}
    )
    lora_dropout: float = field(
        default=0.1,
        metadata={"help": "LoRA dropout probability"}
    )
    lora_checkpoint: str = field(
        default=None,
        metadata={"help": "Path to LoRA checkpoints"}
    )

"""
The Trainer class, to easily train a 🤗 Transformers from scratch or finetune it on a new task.
"""
import os
from typing import Optional
from transformers import Trainer

import torch
from transformers.modeling_utils import PreTrainedModel, unwrap_model
from transformers.utils import logging

logger = logging.get_logger(__name__)

WEIGHTS_NAME = "pytorch_model.bin"
TRAINING_ARGS_NAME = "training_args.bin"

class PrefixTrainer(Trainer):
def __init__(self, *args, save_changed=False, **kwargs):
self.save_changed = save_changed
super().__init__(*args, **kwargs)

def _save(self, output_dir: Optional[str] = None, state_dict=None):
# If we are executing this function, we are the process zero, so we don't check for that.
output_dir = output_dir if output_dir is not None else self.args.output_dir
os.makedirs(output_dir, exist_ok=True)
logger.info(f"Saving model checkpoint to {output_dir}")
# Save a trained model and configuration using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
if not isinstance(self.model, PreTrainedModel):
if isinstance(unwrap_model(self.model), PreTrainedModel):
if state_dict is None:
state_dict = self.model.state_dict()
unwrap_model(self.model).save_pretrained(output_dir, safe_serialization=False, state_dict=state_dict)
else:
logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
if state_dict is None:
state_dict = self.model.state_dict()
torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
else:
if self.save_changed:
print("Saving PrefixEncoder")
state_dict = self.model.state_dict()
filtered_state_dict = {}
for k, v in self.model.named_parameters():
if v.requires_grad:
filtered_state_dict[k] = state_dict[k]
self.model.save_pretrained(output_dir, safe_serialization=False, state_dict=filtered_state_dict)
else:
print("Saving the whole model")
self.model.save_pretrained(output_dir, safe_serialization=False, state_dict=state_dict)
if self.tokenizer is not None:
self.tokenizer.save_pretrained(output_dir, safe_serialization=False)

# Good practice: save your training arguments together with the trained model
torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
import logging
import os
import sys
import torch
import json
import transformers
from transformers import (
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    TrainingArguments,
    set_seed,
)
from trainer import PrefixTrainer
from arguments import ModelArguments, DataTrainingArguments
from data_preprocess import sanity_check, MultiTurnDataset

# initialize logging
logger = logging.getLogger(__name__)

def setup_logger(training_args):
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    # configure Hugging Face logging
    if training_args.should_log:
        transformers.utils.logging.set_verbosity_info()
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Training/evaluation parameters {training_args}")

Loading the model

def load_model(model_args):
    # load the pretrained chatglm3-6b model config
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
    config.pre_seq_len = model_args.pre_seq_len
    config.prefix_projection = model_args.prefix_projection
    # load the pretrained chatglm3-6b tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
    # decide whether to resume training from a P-Tuning v2 checkpoint
    if model_args.ptuning_checkpoint is not None:
        model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
        prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin"))
        new_prefix_state_dict = {}
        for k, v in prefix_state_dict.items():
            if k.startswith("transformer.prefix_encoder."):
                new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
        model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
    else:  # otherwise load the base model directly
        model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
    # if quantization is requested, load the frozen parameters as ints to save GPU memory
    if model_args.quantization_bit is not None:
        print(f"Quantized to {model_args.quantization_bit} bit")
        model = model.quantize(model_args.quantization_bit)
    if model_args.pre_seq_len is not None:
        # P-Tuning v2: keep the trainable prefix_encoder parameters in higher-precision float32
        model = model.half()
        model.transformer.prefix_encoder.float()
    else:
        # full-parameter fine-tuning, which requires much more GPU memory
        model = model.float()

    return tokenizer, model

Prefix-Tuning main training flow

def main():
    # parse the command-line arguments
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    # initialization
    setup_logger(training_args)
    set_seed(training_args.seed)
    tokenizer, model = load_model(model_args)
    # load the training set and convert it into the required format
    if training_args.do_train:
        with open(data_args.train_file, "r", encoding="utf-8") as f:
            train_data = [json.loads(line) for line in f]

        train_dataset = MultiTurnDataset(
            train_data,
            tokenizer,
            data_args.max_seq_length,
        )

        # if training_args.local_rank < 1:
        #     sanity_check(train_dataset[0]['input_ids'], train_dataset[0]['labels'], tokenizer)
    if training_args.do_eval:
        with open(data_args.validation_file, "r", encoding="utf-8") as f:
            eval_data = [json.loads(line) for line in f]

        eval_dataset = MultiTurnDataset(
            eval_data,
            tokenizer,
            data_args.max_seq_length,
        )
    # batch the dataset samples into tensors
    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        model=model,
        label_pad_token_id=-100,
        pad_to_multiple_of=None,
        padding=False
    )
    # configure the trainer; compared to the base Trainer it overrides checkpoint saving
    trainer = PrefixTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        save_changed=model_args.pre_seq_len is not None
    )
    # start training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        model.gradient_checkpointing_enable()
        model.enable_input_require_grads()
        trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()
        trainer.save_state()
    if training_args.do_eval:
        trainer.evaluate()

if __name__ == "__main__":
    main()

#! /usr/bin/env bash

set -ex

LR=2e-2
PRE_SEQ_LEN=256
MAX_SEQ_LEN=3072

DATESTR=`date +%Y%m%d-%H%M%S`
RUN_NAME=hotel_pt2
OUTPUT_DIR=output/${RUN_NAME}-${DATESTR}
mkdir -p $OUTPUT_DIR

BASE_MODEL_PATH=/pathto/chatglm3-6b

CUDA_VISIBLE_DEVICES=0 python Prefix_Tunning.py \
--do_train \
--do_eval \
--train_file ../data/train.chatglm3.jsonl \
--validation_file ../data/dev.chatglm3.jsonl \
--max_seq_length $MAX_SEQ_LEN \
--preprocessing_num_workers 1 \
--model_name_or_path $BASE_MODEL_PATH \
--output_dir $OUTPUT_DIR \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--per_device_eval_batch_size 2 \
--evaluation_strategy steps \
--eval_steps 300 \
--num_train_epochs 6 \
--logging_steps 300 \
--logging_dir $OUTPUT_DIR/logs \
--save_steps 300 \
--learning_rate $LR \
--quantization_bit 4 \
--pre_seq_len $PRE_SEQ_LEN 2>&1 | tee ${OUTPUT_DIR}/train.log

3. Llama 3 LoRA Training

Fine-tune Llama 3 into a dialogue bot with both NLU and question-answering abilities.

3.1 Achieving a Function-Calling-Like Effect in Llama 3

The implementation mirrors ChatGLM 3:

  1. Define four custom roles: user, assistant, search, and return
    • Since there is only one function, we label the function role directly as search
  2. The assistant/search prefix of each turn is also generated by the model; this is how we tell a function call apart from a text reply
  3. As in ChatGLM 3, reserved special tokens mark each turn's role and the end of the turn
    • For example: <|start_header_id|>role<|end_header_id|>content ... ...<|eot_id|>
    • Here <|start_header_id|>, <|end_header_id|> and <|eot_id|> are special tokens reserved by Llama 3

3.1.1 A Function-Call Example

Input

<|start_header_id|>user<|end_header_id|>你好,请帮我推荐一个提供无烟房的舒适型酒店可以吗?<|eot_id|>

Output

<|start_header_id|>search<|end_header_id|>{"facilities": ["无烟房"], "type": "舒适型"}<|eot_id|>

3.1.2 A Text-Reply Example

Input

<|start_header_id|>user<|end_header_id|>你好,请帮我推荐一个提供无烟房的舒适型酒店可以吗?<|eot_id|>

<|start_header_id|>search<|end_header_id|>{"facilities": ["无烟房"], "type": "舒适型"}<|eot_id|>

<|start_header_id|>return<|end_header_id|>[{"name": "北京红驿栈酒店", "type": "舒适型", "address": "北京朝阳区东直门外春秀路太平庄 10 号(主副楼在一幢建筑里)", "subway": "东直门地铁站 E 口", "phone": "010-64171066", "facilities": ["公共区域和部分房间提供 wifi", "宽带上网", "国际长途电话", "吹风机", "24 小时热水", "暖气", "无烟房", "早餐服务", "接待外宾", "行李寄存", "叫醒服务"], "price": 344.0, "rating": 4.7, "hotel_id": 51}, {"name": "维也纳国际酒店(北京广安门店)", "type": "舒适型", "address": "北京西城区白广路 7 号", "subway": "广安门内地铁站 C 口", "phone": "010-83539988", "facilities": ["酒店各处提供 wifi", "宽带上网", "吹风机", "24 小时热水", "中式餐厅", "会议室", "无烟房", "商务中心", "早餐服务", "洗衣服务", "行李寄存", "叫醒服务"], "price": 553.0, "rating": 4.7, "hotel_id": 56}]<|eot_id|>

Output

<|start_header_id|>assistant<|end_header_id|>没问题,推荐你去北京红驿栈酒店和维也纳国际酒店(北京广安门店),都挺好的。<|eot_id|>
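At inference time, the role header of the generated text decides whether to execute a database search or show the text reply to the user. A minimal dispatch sketch (the helper name is hypothetical; `parse_json` is defined in the data-processing code below):

```python
def dispatch(generated: str):
    """Route the model output by its role header: function call vs. text reply."""
    if generated.startswith("<|start_header_id|>search<|end_header_id|>"):
        arguments = parse_json(generated)  # extract the {...} payload
        return ("search", arguments)
    body = generated.split("<|end_header_id|>", 1)[-1]
    return ("assistant", body.replace("<|eot_id|>", "").strip())
```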

3.2 LoRA Code

Model configuration

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )

LoRA configuration
@dataclass
class PeftArguments:
    lora_rank: int = field(
        default=8,
        metadata={"help": "LoRA rank number"}
    )
    lora_alpha: int = field(
        default=32,
        metadata={"help": "LoRA alpha weight"}
    )
    lora_dropout: float = field(
        default=0.1,
        metadata={"help": "LoRA dropout probability"}
    )
    lora_checkpoint: str = field(
        default=None,
        metadata={"help": "Path to LoRA checkpoints"}
    )

Data processing configuration

@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """
    prompt_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
    )
    response_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
    )
    history_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the history of chat."},
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
    )
    validation_file: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "An optional input evaluation data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
            )
        },
    )
    test_file: Optional[str] = field(
        default=None,
        metadata={
            "help": "An optional input test data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_source_length: Optional[int] = field(
        default=1024,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    max_target_length: Optional[int] = field(
        default=256,
        metadata={
            "help": (
                "The maximum total sequence length for target text after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    ignore_pad_token_for_loss: bool = field(
        default=True,
        metadata={
            "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
        },
    )
import json
from torch.utils.data import Dataset

class InputOutputDataset(Dataset):
    def __init__(self, data, tokenizer, args):
        super(InputOutputDataset, self).__init__()
        self.data = data
        self.tokenizer = tokenizer
        self.prompt_column = args.prompt_column
        self.response_column = args.response_column
        self.max_source_length = args.max_source_length
        self.max_target_length = args.max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        item = self.data[i]
        # add_special_tokens=False: do not prepend special tokens to the encoded text
        context = self.tokenizer(
            build_prompt(item[self.prompt_column]),
            max_length=self.max_source_length,
            add_special_tokens=False)
        response = self.tokenizer(
            build_response(item[self.response_column]),
            max_length=self.max_target_length,
            add_special_tokens=False)
        input_ids = context["input_ids"] + response["input_ids"]
        attention_mask = context["attention_mask"] + response["attention_mask"]
        labels = [-100] * len(context["input_ids"]) + response["input_ids"]
        assert len(input_ids) == len(labels), f"length mismatch: {len(input_ids)} vs {len(labels)}"
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }

def build_prompt(context):
    if isinstance(context, str):
        context = json.loads(context)
    prompt = ''
    for turn in context:
        if turn["role"] in ["user", "assistant"]:
            prompt += f'<|start_header_id|>{turn["role"]}<|end_header_id|>\n{turn["content"]}<|eot_id|>\n'
        else:
            if turn["role"] == "search":
                obj = turn["arguments"]
                filtered_obj = {k: v for k, v in obj.items() if v is not None}
                prompt += '<|start_header_id|>search<|end_header_id|>\n'
                prompt += json.dumps(filtered_obj, indent=4, ensure_ascii=False)
            else:
                obj = turn["records"]
                prompt += '<|start_header_id|>return<|end_header_id|>\n'
                prompt += json.dumps(obj, indent=4, ensure_ascii=False)
            prompt += '<|eot_id|>\n'
    return prompt

def build_response(response):
    if isinstance(response, str):
        response = json.loads(response)
    if response["role"] == "assistant":
        return '<|start_header_id|>assistant<|end_header_id|>\n' + response["content"] + '<|eot_id|>'
    else:
        obj = response["arguments"]
        filtered_obj = {k: v for k, v in obj.items() if v is not None}
        return '<|start_header_id|>search<|end_header_id|>\n' + json.dumps(filtered_obj, indent=4, ensure_ascii=False) + '<|eot_id|>'

def parse_json(string):
    search_pos = 0
    # find the first '{'
    start = string.find('{', search_pos)
    if start == -1:
        return None
    # from the found '{', search backward from the end for the last '}'
    end = string.rfind('}', start)
    if end == -1:
        return None
    # extract the span and try to parse it as JSON
    json_string = string[start:end + 1]
    try:
        obj = json.loads(json_string)
        return obj
    except json.JSONDecodeError:
        return None

Model loading

import json
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, TaskType, get_peft_model
from arguments import ModelArguments, DataTrainingArguments, PeftArguments
from data_preprocess import InputOutputDataset

def load_model(model_name):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    n_gpus = torch.cuda.device_count()
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # dispatch the model efficiently across the available resources
        max_memory={i: '24500MB' for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer

LoRA main flow

def main():
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, PeftArguments, TrainingArguments))
    model_args, data_args, peft_args, training_args = parser.parse_args_into_dataclasses()

    lora_config = LoraConfig(
        inference_mode=False,
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "k_proj", "v_proj"],  # fine-tune the Q, K and V projections
        r=peft_args.lora_rank,
        lora_alpha=peft_args.lora_alpha,
        lora_dropout=peft_args.lora_dropout
    )

    model, tokenizer = load_model(model_args.model_name_or_path)
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        padding=True
    )

    if training_args.do_train:
        with open(data_args.train_file, "r", encoding="utf-8") as f:
            train_data = [json.loads(line) for line in f]
        train_dataset = InputOutputDataset(train_data, tokenizer, data_args)
    if training_args.do_eval:
        with open(data_args.validation_file, "r", encoding="utf-8") as f:
            eval_data = [json.loads(line) for line in f]
        eval_dataset = InputOutputDataset(eval_data, tokenizer, data_args)

    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        data_collator=data_collator,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
    )

    if training_args.do_train:
        model.gradient_checkpointing_enable()
        model.enable_input_require_grads()
        trainer.train()
    if training_args.do_eval:
        trainer.evaluate()

if __name__ == "__main__":
    main()
#! /usr/bin/env bash

set -ex

LR=2e-4

DATESTR=`date +%Y%m%d-%H%M%S`
RUN_NAME=hotel_qlora
OUTPUT_DIR=output/${RUN_NAME}-${DATESTR}
mkdir -p $OUTPUT_DIR

MODEL_PATH="/pathto/Meta-Llama-3-8B-Instruct"

CUDA_VISIBLE_DEVICES=0 python main_qlora.py \
--do_train \
--do_eval \
--train_file ../data/train.llama3.jsonl \
--validation_file ../data/dev.llama3.jsonl \
--prompt_column context \
--response_column response \
--model_name_or_path "${MODEL_PATH}" \
--output_dir $OUTPUT_DIR \
--max_source_length 2048 \
--max_target_length 1024 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy steps \
--eval_steps 300 \
--num_train_epochs 2 \
--logging_steps 300 \
--logging_dir $OUTPUT_DIR/logs \
--save_steps 300 \
--learning_rate $LR \
--lora_rank 16 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--optim "paged_adamw_8bit" \
--warmup_ratio 0.1 \
--fp16 2>&1 | tee ${OUTPUT_DIR}/train.log

Tools

Hugging Face provides an official online tool that estimates a model's memory usage.

References

  1. 概覽 Parameter-Efficient Fine-Tuning (PEFT)
  2. Ladder Side-Tuning:预训练模型的“过墙梯”
  3. Parameter-efficient Fine-tuning (PEFT): Overview, benefits, techniques and model training
  4. PEFT:在低资源硬件上对十亿规模模型进行参数高效微调
  5. 参数高效微调方法(Parameter-Efficient Fine-Tuning,PEFT)概述
  6. 大模型参数高效微调技术原理综述
  7. Finetuning LLMs Efficiently with Adapters
  8. LLM微调系列1: Adapter
  9. A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
  10. peft模型微调_IA3
  11. Parameter-Efficient Transfer Learning for NLP
  12. Get Insight from your Business Data - Build LLM application with PEFT (with LoRA) using 🤗 Hugging Face
  13. TOWARDS A UNIFIED VIEW OF PARAMETER-EFFICIENT TRANSFER LEARNING
