🥳 Introduction

Welcome! In this article, we will delve into several popular fine-tuning strategies: FFT (Full-Fine-Tuning), SFT (Supervised Fine-Tuning), and QLoRA. Each has its own characteristics and suits different scenarios and needs. Let’s explore how these strategies work and how to choose and use them in practice. Let’s get started!

Since Llama1 was introduced, the open-source model community has begun to flourish. From Alpaca, Vicuna, and WizardLM to Baichuan, Qwen, and ChatGLM, open source has brought more ideas and more diversity to individuals and enterprises. With the release of Llama2, even better performance (relatively speaking) and a more permissive license have opened up further room for innovation (though the safety restrictions are somewhat hard to understand).

Fine-tuning models and using tools like LangChain have become essential steps in building personalized models and deploying them to production.

Pre-training is just the beginning. To make the model perform optimally on specific tasks, fine-tuning becomes particularly important. This is achieved by further training the pre-trained model on data specific to the task at hand. For example, although Llama2 performs well in text generation, fine-tuning is indispensable to make it perform better in specific question-answering tasks—hence the chat version of Llama2. (Of course, there are also issues like catastrophic forgetting that affect performance.)

⚠️ Note: I am currently conducting research in my free time, referring to various online tutorials and videos (I’ve spent quite a bit of time and resources, but the results are honestly not that great 🫠). I try to avoid mistakes in my descriptions, but there are still many gaps in my understanding or expression of knowledge. If you find any, please point them out in the comments or via email. Thank you very much 🙏.

⚠️ Note: If you are accessing this article using the Arc browser on a laptop, it’s recommended to temporarily hide the sidebar, otherwise the small screen width can be quite uncomfortable.

🤖 Background Information

According to the definition in this article, Large Language Models (LLMs) are a type of artificial intelligence that can mimic human intelligence. They use statistical models to analyze vast amounts of data, learning the patterns and connections between words and phrases. This allows them to generate new content similar to the style of specific authors or genres, such as academic papers or articles.

Firstly, language models are used to predict the probability of the next word appearing in a text sequence. These models can learn patterns in text data and can be trained to adapt to specific styles. For example, they can mimic the linguistic style of Shakespeare or modern pop culture and continue text based on that.

(Fully connected neural networks are a basic but powerful neural network structure, commonly used for classification, regression, and other problems. However, they are not the most effective for dealing with sequence data, especially long sequences.)

This brings us to the Transformer architecture, which is specifically designed for handling sequence data, particularly text data. Compared to fully connected neural networks, Transformers have distinct advantages in handling long-range dependencies and parallel computing.

Simple Explanation
Imagine you're hosting a party where everyone (words or other data points) wants to interact with each other to gather more information. But there's a rule at this party: you can't talk to multiple people at once, only one-on-one conversations are allowed. That's pretty inefficient, right? This is the problem with traditional RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks).

Now, the Transformer walks into the party and says, "No need to line up, feel free to talk to anyone you want!" This way, information can flow rapidly between everyone, allowing each to get a comprehensive understanding of the whole scene. This is the so-called "attention mechanism," enabling each data point to consider all other data points rapidly.

But the Transformer goes beyond this. It introduces "multi-head" attention. Imagine that everyone can not only communicate with each other in their "normal language" but also simultaneously in "slang," "scientific jargon," and other "languages." In this way, each person can understand the same thing from multiple angles, gathering more enriched information.

In short, the Transformer is like a super party planner, enabling everyone to communicate quickly, efficiently, and comprehensively, thus making the entire party (or data processing task) more successful!
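
To make the party analogy concrete, below is a minimal sketch of single-head scaled dot-product attention in plain PyTorch. The shapes, random weights, and function name are purely illustrative; a real Transformer layer adds multiple heads, masking, and learned projections inside a much larger network.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention over a sequence x of shape (seq_len, d_model).
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # every token produces a query, a key, and a value
    scores = q @ k.T / k.shape[-1] ** 0.5     # each token "talks to" every other token at once
    weights = F.softmax(scores, dim=-1)       # how much attention each token pays to the others
    return weights @ v                        # weighted mix of every token's value

x = torch.randn(5, 16)                                    # 5 tokens, 16-dimensional embeddings
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))   # illustrative projection matrices
print(self_attention(x, w_q, w_k, w_v).shape)             # torch.Size([5, 16])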

🫠 Basic Concepts of Fine-Tuning

As mentioned earlier, the main purpose of fine-tuning is to make the model more proficient at certain tasks, such as text classification, sentiment analysis, or question-answering. Fine-tuning aims to optimize the model’s performance on these specific tasks.

[Image omitted. Source: original content from Microsoft]

Below are some simple basic concepts about fine-tuning:

  • Pre-trained Model: a model that has already “learned” and knows a lot of foundational things
  • Target Task: the specific “job” or “task” you want the model to perform
  • Transfer Learning: applying the knowledge learned on one task to another task
  • Dataset: a set of “exercise questions” for a specific task, used to make the model better at that task
  • Learning Rate: tells the model how quickly to “learn” the new task
  • Optimizer: a “coach” that helps the model learn better
  • Computational Resources: the computer time and hardware needed for “learning”
  • Evaluation Metrics: ways to assess how well the model performs on the new task, like exam scores
  • Regularization: keeps the model from “cramming,” so it applies what it has learned more flexibly

This article introduces three fine-tuning methods, keeping things as simple as possible (RLHF is not covered here).

FFT (Full-Fine-Tuning)

FFT, or Full-Fine-Tuning, is a widely used fine-tuning method. Instead of fine-tuning only the last few layers (as in strategies that update just the classifier head), all layers of the model are fine-tuned. This means the entire architecture, from the input layer to the output layer, receives weight updates.

Key Features

  1. Comprehensiveness: All layers of the model are fine-tuned, which usually leads to better model performance.

  2. Flexibility: Since all parts of the model are learning, this method is generally better at adapting to the characteristics of new tasks.

  3. Computational Load: Full-fine-tuning generally requires more computational resources and time, as the entire model needs to be updated.

Applicable Scenarios

  • When the target task differs significantly from the pre-training task, full-fine-tuning is usually more effective.
  • When you have relatively abundant labeled data and computational resources.

SFT (Supervised-Fine-Tune)

SFT, or Supervised Fine-Tune, is a model fine-tuning method applied in a supervised learning environment. Unlike FFT (Full-Fine-Tuning), SFT usually only fine-tunes a few layers of the model rather than the entire model. This allows for greater focus on task-specific features and typically requires fewer computational resources.

Key Features

  1. Goal-Oriented: The fine-tuning process is focused on closely aligning the model with a specific task.

  2. Computational Efficiency: Since only some model layers are fine-tuned, SFT is generally more computationally efficient than FFT.

  3. Data Dependence: Being supervised, it requires relatively high-quality labeled data for fine-tuning.

Applicable Scenarios

  • When the target task closely aligns with the pre-trained model, fine-tuning fewer layers is usually sufficient.
  • When you have limited computational resources or want to achieve results more quickly.

(Q)LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is a method for fine-tuning large pre-trained models such as Transformers. It freezes the pre-trained weights and adds a small number of trainable low-rank matrices to the targeted layers; during fine-tuning, only these extra parameters are updated, which lets the model adapt effectively to new tasks at a fraction of the cost.

Key Features

  1. Parameter Efficiency: LoRA only adds a small number of parameters, significantly reducing computational complexity and storage requirements.

  2. Strong Adaptability: By injecting low-rank adapters into the targeted layers, LoRA allows the model to adapt well to new tasks.

  3. Good Compatibility: LoRA can usually be seamlessly integrated with existing pre-trained models, without requiring retraining.

Applicable Scenarios

  • When you need to fine-tune with limited computational and storage resources.
  • When you want to improve the performance of fine-tuning tasks without impacting the original structure of the pre-trained model.

Parameter-efficient fine-tuning (PEFT) methods allow us to adapt pre-trained language models (PLMs) to specific tasks using fewer computational and storage resources. Normally, adjusting these large models is time-consuming and costly, but PEFT makes the process more convenient and economical by updating only a small subset of parameters. LoRA is one of the methods under the PEFT umbrella.

QLoRA (Quantized Low-Rank Adapters) includes the “Q” for Quantization. Quantization is a technique that reduces the model size and memory footprint by converting floating-point numbers into low-bit integers. For example, 4-bit quantization maps floating-point numbers to integers within the range of 0 to 15. In simpler terms, it means that even consumer-grade GPUs can handle it, although the results I have achieved so far are not particularly good.
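
To make the “low-rank” part concrete, here is a rough sketch of what a LoRA update does to a single frozen weight matrix. The dimensions, rank, and scaling are illustrative (they echo the lora_r and lora_alpha values used later in this article); this is not the actual PEFT implementation.

import torch

d_model, rank, alpha = 4096, 64, 16
W = torch.randn(d_model, d_model)       # frozen pre-trained weight: never updated during fine-tuning
A = torch.randn(rank, d_model) * 0.01   # trainable low-rank matrix A
B = torch.zeros(d_model, rank)          # trainable low-rank matrix B, initialized to zero

x = torch.randn(d_model)
# The adapted layer adds a low-rank correction B @ A, scaled by alpha / rank, to the frozen output.
y = W @ x + (alpha / rank) * (B @ (A @ x))

# Only A and B are trained: 2 * d_model * rank parameters instead of d_model ** 2.
print(W.numel(), A.numel() + B.numel())   # 16777216 vs 524288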

In the figure from Hugging Face, the output activations of the original (frozen) pre-trained weights (left) are augmented by a low-rank adapter (right) composed of the weight matrices A and B.

Source: Original content from huggingface

For more information, please refer to https://huggingface.co/blog/4bit-transformers-bitsandbytes

💬 Detailed Explanation of Fine-Tuning Methods

In this section, we will walk through using Full-Fine-Tuning, SFT (Supervised-Fine-Tune), and QLoRA (Quantized Low-Rank Adapters) to fine-tune Vicuna 7B v1.5, a model based on Meta’s Llama2. Except for Full-Fine-Tuning, which is difficult to run in Google Colab (Pro), the other methods can be run in Colab; even for SFT, a cloud provider is recommended, as it tends to be more cost-effective. Most of the content is based on online tutorials and official documentation; the original articles are listed in the Reference section. Thanks again.

1. Full-Fine-Tuning

Here we use the official fine-tuning method.

  1. Clone the repository

git clone https://github.com/lm-sys/FastChat.git
cd FastChat
  2. Install the necessary packages

pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e ".[model_worker,webui]"
pip3 install -e ".[train]"

⚠️ Note: You’ll need Flash-Attention, which should be covered by pip3 install -e ".[train]". Ninja is installed as well, which speeds up the build: without Ninja, compilation can take a long time (around 2 hours) because it does not use multiple CPU cores; with Ninja, it takes only 3-5 minutes on a 64-core machine.

FlashAttention-2 currently supports:

  • Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is planned; for now, use FlashAttention 1.x on Turing cards.
  • Data types fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
  • All head dimensions up to 256. When the head dimension is greater than 192, an A100/A800 or H100/H800 GPU is needed for the backward pass.
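
If you are unsure which generation your GPU belongs to, a quick check from a CUDA-enabled PyTorch build can save a wasted installation attempt (Ampere reports compute capability 8.0+, Ada 8.9, Hopper 9.0):

import torch

major, minor = torch.cuda.get_device_capability()    # e.g. (8, 0) for an A100
print("GPU:", torch.cuda.get_device_name(0))
print("Compute capability:", f"{major}.{minor}")
print("bf16 supported:", torch.cuda.is_bf16_supported())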


  3. Prepare the dataset and start training

torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
--model_name_or_path lmsys/vicuna-7b-v1.5-16k \
--data_path [your dataset path] \
--bf16 False \
--output_dir output_vicuna \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 1024 \
--gradient_checkpointing True \
--lazy_preprocess True

The command above uses distributed training, so you may need multiple GPUs, for example 2, 4, or 8 A100s (40 GB or 80 GB). It is only an example; adjust it to your own setup.


Below are explanations of various parameters:

  • torchrun --nproc_per_node=4 --master_port=20001: the torchrun launcher for distributed training; nproc_per_node sets the number of processes (GPUs) per node, and master_port sets the master node's port.
  • fastchat/train/train_mem.py: the training script to run.
  • --model_name_or_path lmsys/vicuna-7b-v1.5-16k: the name or path of the pre-trained model.
  • --data_path [your dataset path]: the path to the training dataset.
  • --bf16: whether to train with bf16 (Brain Floating Point 16).
  • --output_dir output_vicuna: the directory where checkpoints and logs are written.
  • --num_train_epochs: the number of training epochs.
  • --per_device_train_batch_size: the training batch size per device.
  • --per_device_eval_batch_size: the evaluation batch size per device.
  • --gradient_accumulation_steps: the number of steps over which gradients are accumulated.
  • --evaluation_strategy "no": the evaluation strategy; "no" disables evaluation.
  • --save_strategy "steps": the checkpoint-saving strategy; here checkpoints are saved by step count.
  • --save_steps 1200: save a checkpoint every this many steps.
  • --save_total_limit 10: the maximum number of checkpoints to keep.
  • --learning_rate 2e-5: the learning rate.
  • --weight_decay 0.: the weight decay.
  • --warmup_ratio 0.03: the fraction of steps used for learning-rate warmup.
  • --lr_scheduler_type "cosine": the learning-rate schedule.
  • --logging_steps 1: log every this many steps.
  • --fsdp "full_shard auto_wrap": use FSDP (Fully Sharded Data Parallel) with full sharding and automatic model wrapping.
  • --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer': the Transformer layer class to wrap for FSDP.
  • --tf32 True: whether to use the TF32 data type.
  • --model_max_length 1024: the maximum sequence length for the model.
  • --gradient_checkpointing True: whether to use gradient checkpointing to save memory.
  • --lazy_preprocess True: whether to use lazy preprocessing.

With the example values above, the effective global batch size is nproc_per_node × per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 × 4 = 64.


⚠️ Note: If you want to monitor training progress, you can use Weights & Biases (wandb); you’ll need your API token.
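
A minimal way to set that up is to log in once from Python on the training machine before launching torchrun; wandb caches the credentials, and the Hugging Face Trainer used by the training script can then typically report to it.

import wandb

wandb.login()   # prompts for your W&B API token and caches it for future runs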

⚠️ Note: Your dataset should follow Vicuna’s format. Below is an official example format.

[
{
"id": "identity_0",
"conversations": [
{
"from": "human",
"value": "Who are you?"
},
{
"from": "gpt",
"value": "I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS)."
},
{
"from": "human",
"value": "What can you do?"
},
{
"from": "gpt",
"value": "I can chat with you."
}
]
},
{
"id": "identity_1",
"conversations": [
{
"from": "human",
"value": "Who are you?"
},
{
"from": "gpt",
"value": "My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS)."
}
]
},
[Continue]
]

Example Link: dummy_conversation.json
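
If your raw data is not already in this shape, a small script can assemble it. The question/answer pairs and the output filename below are hypothetical; only the id/conversations/from/value structure comes from the example above.

import json

# Hypothetical raw Q&A pairs; replace these with your own data.
qa_pairs = [
    ("Who are you?", "I am a fine-tuned assistant."),
    ("What can you do?", "I can chat with you."),
]

records = []
for i, (question, answer) in enumerate(qa_pairs):
    records.append({
        "id": f"identity_{i}",
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    })

# Write the file you will pass to --data_path.
with open("my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)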


Here, two A100 80 GB GPUs are used for testing.

The following command refreshes the nvidia-smi (NVIDIA System Management Interface) output every second:

watch -n 1 nvidia-smi


After training, you might want to upload the model to Hugging Face.

Use huggingface-cli login to log in and upload.

For details, refer to the official documentation: https://huggingface.co/docs/huggingface_hub/guides/upload.
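
As a rough sketch of what that upload can look like with the huggingface_hub Python API (the repo id is a placeholder, and the folder is assumed to be the --output_dir used above):

from huggingface_hub import HfApi

api = HfApi()
# Create the target repo if it does not exist yet, then upload the whole checkpoint folder.
api.create_repo(repo_id="[Your repo id]", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="output_vicuna",
    repo_id="[Your repo id]",
    repo_type="model",
)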

2. Supervised-Fine-Tune

Here, we directly use the Hugging Face official autotrain advanced, which is more convenient.

GitHub repository: https://github.com/huggingface/autotrain-advanced

Below are the specific steps:

⚠️Note: The tutorial below uses Google Colab. If your system RAM is insufficient, you may need to use Google Colab Pro’s High RAM or a stronger GPU, like the A100. A common error is CUDA Out of Memory.

  1. First, check the GPU configuration.

print("============Check GPU Info================")
# Check GPU info
!nvidia-smi
print("==============Check pytorch version==============")
# Check pytorch version
import torch
print(torch.__version__)
print("============Check Virtual Machine Disk Capacity================")
# Check virtual machine disk capacity
!df -lh
print("============Check CPU Configuration================")
# Check CPU configuration
!cat /proc/cpuinfo | grep model\ name
print("=============Check Memory Capacity===============")
# Check memory capacity
!cat /proc/meminfo | grep MemTotal
  2. Install Autotrain-advanced.

!pip install autotrain-advanced transformers

  3. Update Autotrain-advanced to make sure everything is up to date.

!autotrain setup --update-torch

  4. Check the help menu.

!autotrain llm --help

  5. Start training and push to Hugging Face.

!autotrain llm --train \
--project_name [Project Name] \
--model lmsys/vicuna-7b-v1.5-16k \
--data_path [Dataset id from huggingface] \
--use_peft \
--use_int4 \
--learning_rate 2e-4 \
--train_batch_size 16 \
--num_train_epochs 3 \
--max_seq_length 16000 \
--trainer sft \
--text_column text \
--push_to_hub \
--token [If you want to push to huggingface, you need to add token here] \
--repo_id [Your repo id]
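
Note that --trainer sft with --text_column text expects a dataset that contains a single text column. If you need to build one yourself, here is a hedged sketch using the datasets library; the records, prompt template, and dataset id are assumptions for illustration, not something autotrain requires.

from datasets import Dataset

# Hypothetical instruction/response pairs formatted into one "text" field
# using a Llama-2-style chat template.
records = [
    {"instruction": "Who are you?", "response": "I am a fine-tuned assistant."},
    {"instruction": "What can you do?", "response": "I can chat with you."},
]
rows = [{"text": f"<s>[INST] {r['instruction']} [/INST] {r['response']} </s>"} for r in records]

dataset = Dataset.from_list(rows)
dataset.push_to_hub("[Dataset id from huggingface]")   # requires huggingface-cli login first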

Once everything is done, the model files will appear on Hugging Face. Newer versions of autotrain also have a merge option, so you can merge the adapter into the base model directly.

Here is the official Colab.

3. Qlora & Lora

There are currently two relatively simple methods: one is the common method, and the other uses the official Deepspeed method.

The common method comes from mlabonne.github.io. Many thanks.

Common Method

First, install and import some necessary software.

!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
import os
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments,
pipeline,
logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
  1. Set up config

⚠️ Note: If you are using an A100 or H100 GPU, you can set bf16 = True for faster training, and you can also increase the batch size slightly.

# The model that you want to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model name
new_model = "llama-2-7b-miniguanaco"

#########################################
# QLoRA parameters
#########################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

#########################################
# bitsandbytes parameters
#########################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

#########################################
# TrainingArguments parameters
#########################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 25

# Log every X updates steps
logging_steps = 25

#########################################
# SFT parameters
#########################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
  2. Start training

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
load_in_4bit=use_4bit,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
major, _ = torch.cuda.get_device_capability()
if major >= 8:
print("=" * 80)
print("Your GPU supports bfloat16: accelerate training with bf16=True")
print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
r=lora_r,
bias="none",
task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
optim=optim,
save_steps=save_steps,
logging_steps=logging_steps,
learning_rate=learning_rate,
weight_decay=weight_decay,
fp16=fp16,
bf16=bf16,
max_grad_norm=max_grad_norm,
max_steps=max_steps,
warmup_ratio=warmup_ratio,
group_by_length=group_by_length,
lr_scheduler_type=lr_scheduler_type,
report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=max_seq_length,
tokenizer=tokenizer,
args=training_arguments,
packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Display with TensorBoard

%load_ext tensorboard
%tensorboard --logdir results/runs

Test using the correct format

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our next model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
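
Before the merge step below, the quantized model, the trainer, and the pipeline are still holding GPU memory, so reloading the base model in FP16 can fail on a small GPU. A cleanup sketch (in Colab you may still need to restart the runtime and re-run the imports):

import gc

# Free the VRAM held by the training objects defined above.
del pipe
del trainer
del model
gc.collect()
torch.cuda.empty_cache()
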
  3. Merge the model and upload to Hugging Face

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
model_name,
low_cpu_mem_usage=True,
return_dict=True,
torch_dtype=torch.float16,
device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
!huggingface-cli login

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

The original author’s Colab notebook: Colab

The original author’s other articles and courses are also very interesting 👍.

Deepspeed

Source: https://github.com/lm-sys/FastChat/blob/main/docs/training.md

You can use the following command to train Vicuna-7B with QLoRA under ZeRO-2. Note that ZeRO-3 does not currently support QLoRA, but it does support LoRA; a reference LoRA configuration is provided under playground/deepspeed_config_s3.json. To use QLoRA, you must install bitsandbytes>=0.39.0 and transformers>=4.30.0.

deepspeed fastchat/train/train_lora.py \
--model_name_or_path ~/model_weights/llama-7b \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--data_path ./data/dummy_conversation.json \
--bf16 True \
--output_dir ./checkpoints \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 100 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--q_lora True \
--deepspeed playground/deepspeed_config_s2.json \
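
For reference, a ZeRO-2 DeepSpeed configuration of the kind referenced above generally looks like the sketch below. This is an illustrative config written out from Python, not the exact playground/deepspeed_config_s2.json shipped with FastChat, so check the repository for the authoritative file; the "auto" values are filled in by the Hugging Face Trainer from its own arguments.

import json

# Illustrative ZeRO-2 settings (not the exact FastChat file).
zero2_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,                              # ZeRO stage 2: shard optimizer state and gradients
        "offload_optimizer": {"device": "cpu"},  # move optimizer state to CPU RAM to save VRAM
        "contiguous_gradients": True,
        "overlap_comm": True,
    },
}

with open("my_zero2_config.json", "w") as f:
    json.dump(zero2_config, f, indent=2)   # then pass --deepspeed my_zero2_config.json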

4. Convert to GGML format

Since the original model may still be quite large, or if you want to run the fine-tuned model more easily on a Mac/iPhone, converting the model to ggml format is an essential step. In this section, we will use llama.cpp to convert the model. You can perform the conversion on Google Colab and upload it to Huggingface.

  1. Clone the repository

!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp/
%cd models/
  2. Log in to Hugging Face

!pip install -q transformers==4.31.0
!huggingface-cli login
  3. Download the model

!git clone https://huggingface.co/lmsys/vicuna-13b-v1.5-16k
%cd vicuna-13b-v1.5-16k/
  4. Run make and convert the model to ggml-model-f32.bin

%cd /
%cd content/llama.cpp/
!make
!python3 -m pip install -r requirements.txt
!python3 convert-pth-to-ggml.py models/vicuna-13b-v1.5-16k/ 0
  5. Quantize with the original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0

!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q4_0.bin q4_0
!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q4_1.bin q4_1
!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q5_0.bin q5_0
!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q5_1.bin q5_1
!./quantize ./models/vicuna-13b-v1.5-16k/ggml-model-f32.bin ./models/vicuna-13b-v1.5-16k/ggml-model-q8_0.bin q8_0
  6. Upload to Hugging Face

from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
path_or_fileobj="./models/vicuna-13b-v1.5-16k/ggml-model-q4_0.bin",
path_in_repo="ggml-model-q4_0.bin",
repo_id="[Your repo_id]",
repo_type="model",
)
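
To sanity-check one of the quantized files locally before (or instead of) uploading, you can run it with llama.cpp's example binary from the same llama.cpp directory; the prompt and token count here are just placeholders.

!./main -m ./models/vicuna-13b-v1.5-16k/ggml-model-q4_0.bin -p "Hello, who are you?" -n 128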

✈️ Summary

In this article, we took a detailed look at fine-tuning techniques for large language models, including FFT (Full-Fine-Tuning), SFT (Supervised Fine-Tuning), and QLoRA. Each of these methods has its own strengths and weaknesses and suits different scenarios and needs.

Looking ahead, as pre-trained models and fine-tuning techniques keep improving, there is good reason to believe that fine-tuning will become more efficient, flexible, and reliable.

Finally, this article is meant as a broad guide for readers interested in fine-tuning large language models. Through the comparisons and worked examples, I hope it serves as a useful reference for choosing the right fine-tuning method in your future research or projects.

If you notice any problems in the article or have suggestions, please leave a comment; I would greatly appreciate it!

Reference

  • Meta Llama 2 on Hugging Face
  • Perplexity.ai’s online LLAMA-2-7B-CHAT
  • Llama 2: Open Foundation and Fine-Tuned Chat Models (paper)
  • mlabonne.github.io
  • The EASIEST way to finetune LLAMA-v2 on local machine!