探索微调的全景——在 Colab 上用 FFT、SFT 和 QLoRA 改造语言模型

🥳 介绍

欢迎！在本文中，我们将深入探讨几种流行的微调策略：FFT (Full-Fine-Tuning)、SFT (Supervised Fine Tune) 和 Qlora。每种策略都有其独特之处，适用于不同的场景和需求。让我们一起探索这些策略，理解它们的工作原理，以及如何在实际应用中选择和使用它们。让我们开始吧！

自从Llama1 出山后，开源模型社区开始百花齐放，从Alpaca, Vicuna, WizardLM 到百川，千问，ChatGLM，开源在很大程度上能够带给个人和企业更多的想法，带来多样性。随着Llama2的发布，更加优秀的性能（相对）（那个安全限制多少有点整不懂了）和许可证进一步宽松为进一步的创新提供了更多可能。

模型的微调和Langchian等工具的使用，成为了个性化模型和生产环境的部署的必备流程。

预训练只是初步的。要让模型在特定任务上达到最佳性能，微调就显得尤为重要。通过在特定任务的数据上进一步训练预训练模型，使其适应该任务。例如，虽然LLama2在生成文本方面表现出色，但为了使其在特定的问答任务上表现得更好，微调是必不可少的，也就是Llama2的chat版本。（当然也有灾难性遗忘等问题影响。）

⚠️ 注意：本人目前也在闲暇时间研究当中，参考了各种网上的教程和视频（花了不少时间和经费，成果其实并不怎么好🫠），尽可能避免在叙述中出现问题，但是依然有很多知识的的理解或表述存在漏洞，如有发现请在评论中或邮件中提出，非常感谢🙏

⚠️ 注意：如果在笔记本电脑上使用Arc浏览器访问该文章，建议暂时隐藏侧边栏，不然屏幕宽度太小属实有点难受

🤖 LLM的背景信息

根据这篇文章的定义，大型语言模型，或LLM，是一种可以模仿人类智能的人工智能。他们使用统计模型来分析大量数据，学习单词和短语之间的模式和联系。这允许他们生成与特定作者或流派风格相似的新内容，如论文或文章。

首先，语言模型是用来预测文本序列中下一个词出现概率的模型。这种模型能够学习文本数据中的模式，并且可以被训练来适应特定的风格。例如，它能够模仿莎士比亚或者现代流行文化的语言风格，并据此进行文本续写。

（全连接神经网络是一种基础但强大的神经网络结构，通常用于分类、回归等问题。然而，在处理序列数据，特别是长序列数据时，全连接神经网络并不是最有效的。）

这就引出了Transformer结构，它是一种专为处理序列数据，尤其是文本数据设计的架构。相比于全连接神经网络，Transformer在处理长距离依赖和并行计算方面有明显优势。

Simple Explanation
想象一下，你正在举办一场派对，每个人（单词或其他数据点）都想与其他人互动以了解更多信息。但这个派对有个规则：你不能一次与多人对话，只能一个接一个地谈。这样效率很低，对吧？这就是传统的 RNN（循环神经网络）和 LSTM（长短时记忆网络）的问题。

现在，Transformer 走进了派对，并说："大家不用排队了，随便跟任何人交谈！"这样，信息就能快速地在所有人之间流动，每个人都能更全面地了解整个场景。这就是所谓的"注意力机制"，让每个数据点都能快速地考虑到其他所有数据点。

但 Transformer 不止于此，它引入了"多头"注意力。想象每个人不仅可以用他们的"普通话"与别人交流，还可以同时用"俚语"、"科学术语"等多种"语言"来交流。这样，每个人就能从多个角度来理解同一件事，获取更丰富的信息。

简而言之，Transformer 就像是一个超级派对规划者，让所有人都能迅速、高效且全面地交流，从而让整个派对（或者说，数据处理任务）更加成功！

🫠 微调的基本概念

就像之前提到的，微调的目的主要是让模型可以更加擅长做某些任务，例如文本分类、情感分析或问答等。微调就是为了优化模型在这个特定任务上的性能。

Screenshot 2023-08-26 at 12.04.45 PM.webp

Source: Original content from microsoft

下面是一些关于微调的简单基础概念

基本概念	简单解释
预训练模型	一个已经"学过"的模型，知道很多基础的东西
目标任务	你希望模型能做的特定"作业"或"任务"
迁移学习	让模型把在一个任务上学到的知识用在另一个任务上
数据集	一些特定任务的"练习题"，用于让模型更擅长这个任务
学习率	告诉模型应该多快地"学习"新任务
优化器	一个"教练"，帮助模型更好地进行学习
计算资源	"学习"所需要的电脑和时间
评估指标	看模型在新任务上做得好不好的方法，就像考试的分数
正则化	避免模型"死记硬背"，使其能更灵活地应用所学到的知识

本文中将介绍的3种微调方式（怎么简单怎么来）（RLHF目前不包括在文章中）

FFT (Full-Fine-Tuning)

FFT，也就是全微调（Full-Fine-Tuning），是一种广泛应用的模型微调方法。在这种方法中，我们不仅微调模型的最后几层（这通常是在只微调分类器层的策略中完成的），而是微调模型的所有层。这意味着模型的整体架构，从输入层到输出层，都会进行权重的更新。

主要特点

全面性：所有模型层都被微调，这通常能带来更好的模型性能。
灵活性：由于模型的所有部分都在学习，这种方法通常更能适应新任务的特点。
计算量：全微调通常需要更多的计算资源和时间，因为整个模型都需要更新。

适用场景

当目标任务与预训练任务有较大差异时，全微调通常更为有效。
当你拥有相对充足的标注数据和计算资源时。

SFT (Supervised-Fine-Tune)

SFT，也就是有监督微调（Supervised Fine-Tune），是一种在有监督学习环境下进行模型微调的方法。与FFT（Full-Fine-Tuning）不同，SFT通常只会微调模型的某几层而非整个模型。这样可以更专注于任务相关的特定特性，并且通常需要更少的计算资源。

主要特点

目标导向：微调过程专注于模型与特定任务的紧密配合。
计算高效：由于只微调部分模型层，SFT通常比FFT更节省计算资源。
数据依赖性：由于是有监督的，需要相对高质量的标注数据进行微调。

适用场景

当目标任务与预训练模型非常接近时，微调较少的层通常就足够了。
当你有限的计算资源或希望更快地得到结果时。

(Q)lora (Low-Rank Adaptation)

LoRA（Layer-wise Recurrent Mechanism for Adaptation）是一种针对大型预训练模型（如Transformers）微调的新方法。该方法通过在预训练模型的每一层添加少量的可适应参数来提高微调性能。这些额外的参数为各个层添加了一种递归机制，使得模型在微调过程中可以更有效地适应新任务。

主要特点

参数高效：LoRA只添加少量的参数，大大减少了计算复杂度和存储需求。
适应性强：通过在每一层都添加递归机制，LoRA使模型能更好地适应新任务。
兼容性好：LoRA通常可以和现有的预训练模型无缝集成，不需要重新训练模型。

适用场景

当你需要在有限的计算和存储资源下进行微调。
当你希望改善微调任务的性能，而不影响预训练模型的原始结构。

参数高效微调（PEFT）方法允许我们用更少的计算和存储资源来调整已经训练好的大型语言模型（PLM），使其适用于特定任务。通常，调整这些大型模型会很费时间和金钱，但PEFT通过只调整一小部分参数，让整个过程更加便捷和经济。这个LORA就是PEFT的其中一种方法

QLoRA（Quantized Low-Rank Adapters）的Q代表量化（Quantization）。量化是一种减少模型大小和内存占用的技术，它将浮点数转换为低位数的整数。例如，4-bit 量化就是将浮点数映射到 0 到 15 的整数范围内。说白了就是单消费级GPU也可以搞定，但是我目前做出来的效果不是特别好

输出激活原始（冻结）预训练权重（左）由权重矩阵 A 和 B 组成的低秩适配器（右）增强。

Source: Original content from huggingface

💬 微调方法详解

在本节中我们将详细讲解使用Full-Fine-Tuning，SFT (Supervised-Fine-Tune) 和 QLoRA（Quantized Low-Rank Adapters）来微调基于Meta的Llama2的微调模型Vicuna 7b 1.5版，其中除Full-Fine-Tuning很难在**Google Colab (Pro)**中进行微调外，其他方法是可以Colab中微调的，当然SFT的方法还是推荐云服务提供商的价格会便宜一点，大部分内容参考了网上教程和官方文档，原始文章将在Reference中显示，再次感谢

1. Full-Fine-Tuning

这里使用官方的微调方法

克隆仓库

1 2	git clone https://github.com/lm-sys/FastChat.git cd FastChat

安装一些必要的包

1
2
3

pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e ".[model_worker,webui]"
pip3 install -e ".[train]"

⚠️注意：这里需要安装Flash-Attention，根据pip3 install -e “.[train]”是没有问题的，ninja也会安装的，Build的速度可以提上去，没有 Ninja，编译可能会花费很长时间（2小时），因为它不使用多个CPU核心。使用 Ninja，在64核机器上编译只需要3-5分钟。前提是你要有64核机器

FlashAttention-2 目前支持：

Ampere、Ada 或 Hopper 显卡（例如，A100, RTX 3090, RTX 4090, H100）。对 Turing 显卡（T4, RTX 2080）的支持即将到来，请暂时使用 FlashAttention 1.x 版本以支持 Turing 显卡。
数据类型 fp16 和 bf16（bf16 需要 Ampere、Ada 或 Hopper 显卡）。
所有头维度（head dimensions）直至 256。当头维度（head dim）大于 192 时，需要 A100/A800 或 H100/H800 显卡以进行反向传播（backward）。

Screenshot 2023-08-15 at 00.20.56.webp

准备好数据集开始训练

torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5-16k  \
    --data_path [your dataset path] \
    --bf16 False \
    --output_dir output_vicuna \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --lazy_preprocess True

上面的方法使用分布式训练，你可能需要准备多台GPU，例如2台或4台或8台A100 （40G或80G RAM）同时上面的只是一个例子，需要根据实际情况修改

以下是各个参数的解释：

torchrun --nproc_per_node=4 --master_port=20001: 这是torchrun命令的一部分，用于分布式训练。nproc_per_node设置每个节点上的进程数，而master_port设置主节点的端口。
fastchat/train/train_mem.py: 这是你要运行的Python脚本。
--model_name_or_path lmsys/vicuna-7b-v1.5-16k: 预训练模型的名称或路径。
--data_path [your dataset path]: 训练数据集的路径。
--bf16 True: 是否使用bf16（Brain Floating Point 16）进行训练。
--output_dir output_vicuna: 模型和日志输出的目录。
--num_train_epochs 1: 训练轮数。
--per_device_train_batch_size 1: 每个设备上的训练批量大小。
--per_device_eval_batch_size 1: 每个设备上的评估批量大小。
--gradient_accumulation_steps 4: 梯度累积的步骤数。
--evaluation_strategy "no": 评估策略，这里设置为"不进行评估"。
--save_strategy "steps": 保存策略，这里设置为按步数保存。
--save_steps 1200: 每多少步保存一次模型。
--save_total_limit 10: 最多保存的模型数量。
--learning_rate 2e-5: 学习率。
--weight_decay 0.: 权重衰减。
--warmup_ratio 0.03: 学习率预热的比例。
--lr_scheduler_type "cosine": 学习率调度类型。
--logging_steps 1: 每多少步进行一次日志记录。
--fsdp "full_shard auto_wrap": 使用全分片数据并自动包装模型用于FSDP（分布式数据并行性）。
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer': 要用于FSDP包装的Transformer层的类。
--tf32 True: 是否使用TF32数据类型。
--model_max_length 2048: 模型的最大长度。
--gradient_checkpointing True: 是否使用梯度检查点来节省内存。
--lazy_preprocess True: 是否使用延迟预处理。

Screenshot 2023-08-15 at 20.47.37.webp

⚠️注意: 训练之前如果你需要查看训练的状况，可以使用Wandb,你需要用你的token

⚠️注意: 你使用的数据集应该遵循Vicuna的格式：下面是一个官方的例子格式

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
      },
      {
        "from": "human",
        "value": "What can you do?"
      },
      {
        "from": "gpt",
        "value": "I can chat with you."
      }
    ]
  },
  {
    "id": "identity_1",
    "conversations": [
      {
        "from": "human",
        "value": "Who are you?"
      },
      {
        "from": "gpt",
        "value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
      }
    ]
  },
[Continue]
]

示例链接：dummy_conversation.json

这里使用两台A100 80G做测试

每1秒钟自动刷新并显示nvidia-smi（NVIDIA System Management Interface）的输出

1	watch -n 1 nvidia-smi

Screenshot 2023-08-15 at 21.06.18.webp

训练完成后你也许想要把模型上传到huggingface上

使用 huggingface-cli login 登录并上传

具体请参考官方的文档: https://huggingface.co/docs/huggingface_hub/guides/upload

2. Supervised-Fine-Tune

在这里，我们直接使用hugging face官方的autotrain advanced 比较方便

Github仓库如下: https://github.com/huggingface/autotrain-advanced

下面是具体步骤：

⚠️注意：下面教程使用Google Colab，如果系统RAM不够你可能需要使用Google Colab Pro的High RAM或使用更强的GPU,例如A100，通常的报错是CUDA Out of Memory

首先看看GPU配置如何

print("============查看GPU信息================")
# 查看GPU信息
!nvidia-smi
print("==============查看pytorch版本==============")
# 查看pytorch版本
import torch
print(torch.__version__)
print("============查看虚拟机硬盘容量================")
# 查看虚拟机硬盘容量
!df -lh
print("============查看cpu配置================")
# 查看cpu配置
!cat /proc/cpuinfo | grep model\ name
print("=============查看内存容量===============")
# 查看内存容量
!cat /proc/meminfo | grep MemTotal

安装 Autotrain-advanced

1	!pip install autotrain-advanced transformers

更新一下Autotrain-advanced确保一切都是最新的

1	!autotrain setup --update-torch

查看下帮助

1	!autotrain llm --help

开始训练并上传Huggingface

!autotrain llm --train \
    --project_name [Project Name] \
    --model lmsys/vicuna-7b-v1.5-16k \
    --data_path [Dataset id from huggingface] \
    --use_peft \
    --use_int4 \
    --learning_rate 2e-4 \
    --train_batch_size 16 \
    --num_train_epochs 3 \
    --max_seq_length 16000 \
    --trainer sft \
    --text_column text \
    --push_to_hub \
    --token [If you want to push to huggingface, you need to add token here] \
    --repo_id [Your repo id]

当一切完成后，模型文件出现在hugging face,新版有merge选项，可以直接merge

这是官方的 Colab

3. Qlora & Lora

目前看到有两种比较简单的方法
一种是普通方法，一种使用官方的Deepspeed方法

普通方法来自 mlabonne.github.io 非常感谢

普通方法

首先安装并导入一些必要的软件

1	!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

设置config

⚠️注意：如果你使用A100或H100 GPU，你可以开启bf16 = True加速，同时可以稍微增加Batch size加快训练速度

# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

#########################################
# QLoRA parameters
#########################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

#########################################
# bitsandbytes parameters
#########################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

#########################################
# TrainingArguments parameters
#########################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 25

# Log every X updates steps
logging_steps = 25

#########################################
# SFT parameters
#########################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

开始训练

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

用tensorboard显示

1 2	%load_ext tensorboard %tensorboard --logdir results/runs

用正确的格式测试下

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our next model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

合并模型并上传Huggingface

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

这是原作者的Colab文件：Colab

原作者的其他文章和课程也非常有意思 👍

Deepspeed

Source: https://github.com/lm-sys/FastChat/blob/main/docs/training.md

您可以使用以下命令使用ZeRO2使用QLoRA训练Vicuna-7B。请注意，QLoRA目前不支持ZeRO3，但ZeRO3确实支持LoRA，LoRA在playground/deepspeed_config_s3.json下有一个参考配置。要使用QLoRA，您必须安装比特和字节>=0.39.0和变压器>=4.30.0。

deepspeed fastchat/train/train_lora.py \
    --model_name_or_path ~/model_weights/llama-7b \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path ./data/dummy_conversation.json \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json \

4. 转换成ggml格式

由于原始模型可能依然比较大，或者你想在Mac/iPhone上比较轻松的运行微调过的模型，将模型转换成ggml格式是必不可少的一步。这一节中我们将使用llama.cpp转换模型。你可以在Google Colab上转换并上传至Huggingface

克隆Repo

1
2
3

!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp/
%cd models/

登录huggingface

1 2	!pip install -q transformers==4.31.0 !huggingface-cli login

下载模型

1 2	!git clone https://huggingface.co/lmsys/vicuna-13b-v1.5-16k %cd vicuna-13b-v1.5-16k/

Make 并转换成ggml-model-f32.bin

%cd /
%cd content/llama.cpp/
!make
!python3 -m pip install -r requirements.txt
!python3 convert-pth-to-ggml.py models/vicuna-13b-v1.5-16k/ 0

转换成 Original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0

1	!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q4_0.bin q4_0

1	!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q4_1.bin q4_1

1	!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q5_0.bin q5_0

1	!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q5_1.bin q5_1

1	!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q8_0.bin q8_0

上传huggingface

from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="./models/vicuna-13b-v1.5-16k/ggml-model-q4_0.bin",
    path_in_repo="ggml-model-q4_0.bin",
    repo_id="[Your repo_id]",
    repo_type="model",
)

✈️ 总结

在本文中，我们详细探讨了大型语言模型的微调技术，包括FFT（Full-Fine-Tuning）、SFT（Supervised Fine Tune）和Qlora。这些微调方法各有优缺点，适用于不同的场景和需求。

展望未来，随着预训练模型和微调技术的不断进步，我们有理由相信，微调将会变得更加高效、灵活和可靠。

最后，本文旨在为那些对微调大型语言模型感兴趣的读者提供一个全面的指南。通过对比分析和实例应用，我们希望本文能为您在未来的研究或项目中选择适当的微调方法提供有价值的参考。

如果您在文章中注意到了一些问题或有建议的，请您在评论中提出，我们将非常感谢您！

Reference

Meta-Llama 2 Hugging-Face
Perplexity.ai’s online LLAMA-2-7B-CHAT
Open Foundation and Fine-Tuned Chat Models’s Paper
mlabonne.github.io
The EASIEST way to finetune LLAMA-v2 on local machine!