Running LLaMA on an Apple Silicon Mac

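Introduction

Ready to take your AI research to the next level? LLaMA, Meta AI's large language model, may be what you need. LLaMA is designed to help researchers advance their work in subfields of AI. It is released under a noncommercial license focused on research use cases, and access is granted to academic researchers, to people affiliated with government, civil society, and academic organizations, and to industry research laboratories around the world. In this article we look at how to use LLaMA on an Apple Silicon Mac, with a particular focus on running LLaMA 7B and 13B with llama.cpp on an M1/M2 MacBook Pro. Get ready to put a large language model to work in your own research!

So how do you run it on your MacBook Pro?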

Running LLaMA

Thanks to Georgi Gerganov and his llama.cpp project, Meta's LLaMA can run on a single computer without a dedicated GPU. That is amazing!

Step 1: Install some dependencies

You need the Xcode Command Line Tools to build the C++ project. If you don't have them yet:

xcode-select --install

Also use Homebrew to install the tools needed to build the C++ project (pkgconfig and cmake), along with Python 3.11.

brew install pkgconfig cmake python@3.11

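(Optional) Install a Python virtual environment tool

pip3 install virtualenv

Create a Python virtual environment

virtualenv <your-project-name>

Activate your virtual environment

source <your-project-name>/bin/activate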
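To confirm that the activation took effect, you can ask Python itself. This is a minimal sketch; the helper name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones (and venv)
    # make sys.prefix differ from sys.base_prefix.
    return hasattr(sys, "real_prefix") or sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If the interpreter path does not point inside your project directory, the environment is probably not active in this shell.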

Next, install PyTorch (the Nightly build is recommended) and a few other packages

pip install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu

pip install numpy sentencepiece

(Optional) Try the Metal Performance Shaders (MPS) backend for GPU

Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
>>>
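In a script, the same check can be used to pick a device with a CPU fallback. This is a minimal sketch that also works when PyTorch is not installed; the helper name `pick_device` is hypothetical.

```python
def pick_device() -> str:
    """Return "mps" when PyTorch reports the Metal backend is usable, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The rest of this guide does not depend on the result, since llama.cpp runs the model without a dedicated GPU.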

Step 2: Download the project

Get the llama.cpp 🫡 repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Run make to compile the C++ code:

make

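Step 3: Download the LLaMA models

Two ways to get the models

Official form: https://forms.gle/jk851eBVbX1m5TAv5
BitTorrent via GitHub: https://github.com/facebookresearch/llama/pull/73

Note
If you download the models via GitHub, use BitTorrent rather than the ipfs link, otherwise the model conversion may fail later

After downloading the models, the structure will look like this: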

.
├── 7B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  └── params.json
├── 13B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  ├── consolidated.01.pth
│  └── params.json
├── 30B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  ├── consolidated.01.pth
│  ├── consolidated.02.pth
│  ├── consolidated.03.pth
│  └── params.json
├── 65B
│  ├── checklist.chk
│  ├── consolidated.00.pth
│  ├── consolidated.01.pth
│  ├── consolidated.02.pth
│  ├── consolidated.03.pth
│  ├── consolidated.04.pth
│  ├── consolidated.05.pth
│  ├── consolidated.06.pth
│  ├── consolidated.07.pth
│  └── params.json
├── llama.sh
├── tokenizer.model
└── tokenizer_checklist.chk
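Before converting, it can help to confirm that a download is complete. The sketch below checks a model directory against the layout above. The part counts are read off the tree, while the function name `missing_files` and the `models` root are assumptions for illustration.

```python
from pathlib import Path

# Number of consolidated.NN.pth parts per model, read off the tree above.
EXPECTED_PARTS = {"7B": 1, "13B": 2, "30B": 4, "65B": 8}


def missing_files(root: Path, model: str) -> list[str]:
    """List the files expected for `model` under `root` that are not present."""
    expected = [
        Path("tokenizer.model"),
        Path(model) / "params.json",
        Path(model) / "checklist.chk",
    ]
    expected += [
        Path(model) / f"consolidated.{i:02d}.pth"
        for i in range(EXPECTED_PARTS[model])
    ]
    return [p.as_posix() for p in expected if not (root / p).is_file()]


if __name__ == "__main__":
    for name in EXPECTED_PARTS:
        print(name, missing_files(Path("models"), name) or "complete")
```

This only checks that the files exist. It does not verify their contents, so a corrupted download can still pass.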

Step 4: Convert the LLaMA 7B model

Place the models in the models/ directory of the llama.cpp repository.

python convert-pth-to-ggml.py models/7B 1

Your output will look like this:

{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts =  1
Processing part  0
Processing variable: tok_embeddings.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
Processing variable: norm.weight with shape:  torch.Size([4096])  and type:  torch.float16
  Converting to float32
Processing variable: output.weight with shape:  torch.Size([32000, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wq.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wk.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wv.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.attention.wo.weight with shape:  torch.Size([4096, 4096])  and type:  torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape:  torch.Size([11008, 4096])  and type:  torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape:  torch.Size([4096, 11008])  and type:  torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape:  torch.Size([11008, 4096])  and type:  torch.float16
Processing variable: layers.0.attention_norm.weight with shape:  torch.Size([4096])  and type:  torch.float16
...
Done. Output file: models/7B/ggml-model-f16.bin, (part  0 )

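If you hit the error RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive, check your model files against the note in step 3 (use the BitTorrent download, not the ipfs link).

This should produce models/7B/ggml-model-f16.bin, another 13GB file.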
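The 13GB figure can be sanity-checked from the shapes printed in the log above. This is a rough sketch under stated assumptions: every layer has the same tensors as layer 0, each layer also has an `ffn_norm.weight` of the same shape as `attention_norm.weight` (the truncated log does not show it), and every weight takes 2 bytes as float16. It ignores the file header and the few tensors converted to float32.

```python
# Hyperparameters from the params.json line printed by the conversion script.
dim, n_layers, vocab_size = 4096, 32, 32000
ffn_dim = 11008  # first dimension of feed_forward.w1.weight in the log

per_layer = (
    4 * dim * dim        # attention wq, wk, wv, wo
    + 3 * dim * ffn_dim  # feed_forward w1, w2, w3
    + 2 * dim            # attention_norm and (assumed) ffn_norm
)
total_params = (
    2 * vocab_size * dim    # tok_embeddings and output
    + n_layers * per_layer
    + dim                   # final norm
)
f16_bytes = total_params * 2  # float16 uses 2 bytes per weight

print(f"{total_params:,} parameters")
print(f"{f16_bytes / 1e9:.1f} GB as float16")
```

This comes to about 6.7 billion parameters and roughly 13.5 GB, in line with the file size mentioned above.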

This command quantizes the model to 4 bits:

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
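To give a feel for what 4-bit quantization does, here is a toy sketch of block-wise quantization: a block of weights shares one scale, and each weight is stored as a small integer that fits in 4 bits. It illustrates the general idea only. It is not the file format or the exact rounding used by the quantize tool, and the function names are made up for this example.

```python
def quantize_block(block: list[float]) -> tuple[float, list[int]]:
    """Map floats to integers in [-7, 7] that share one scale per block."""
    amax = max(abs(x) for x in block)
    scale = amax / 7 if amax > 0 else 1.0
    return scale, [round(x / scale) for x in block]


def dequantize_block(scale: float, q: list[int]) -> list[float]:
    """Recover approximate floats from the shared scale and the integers."""
    return [scale * v for v in q]


if __name__ == "__main__":
    weights = [0.7, -0.3, 0.1, 0.0]
    scale, q = quantize_block(weights)
    print("integers:", q)
    print("restored:", dequantize_block(scale, q))
```

Each restored value is within half a scale step of the original. That rounding error is the price paid for the much smaller file.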

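Step 5: Run the LLaMA 7B model

./main -m ./models/7B/ggml-model-q4_0.bin \
        -t 8 \
        -n 128 \
        -p 'The first man on the moon was '

Output:
The first man on the moon was 38 years old in July 1969. Neil Armstrong was born only a month or two after my mother's family settled on a small farm between Saginaw and Flint/Bay City; they came from Pennsylvania (like most of us).
In his eulogy for him later this year - he praised Armstrong's "greatness", reflected in how much more people talked about it than usual despite the worst political disaster of 2018)… Obama noted: "I don't think…"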

Step 6: Run the LLaMA 13B model

To convert the 13B model to ggml:

python convert-pth-to-ggml.py models/13B/ 1

The quantize command has to be run on each file in turn:

./quantize ./models/13B/ggml-model-f16.bin   ./models/13B/ggml-model-q4_0.bin 2

./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
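With more parts, typing one quantize command per file gets tedious. The sketch below builds the same commands for every converted part in a directory. The helper name `quantize_commands` is hypothetical, and the file naming simply follows the two commands above.

```python
from pathlib import Path


def quantize_commands(model_dir: Path) -> list[list[str]]:
    """Build one ./quantize command per ggml-model-f16.bin* part in model_dir."""
    commands = []
    for src in sorted(model_dir.glob("ggml-model-f16.bin*")):
        dst = src.with_name(src.name.replace("f16", "q4_0"))
        # The trailing "2" is the same final argument used in the commands above.
        commands.append(["./quantize", str(src), str(dst), "2"])
    return commands


if __name__ == "__main__":
    import subprocess

    # Run from the llama.cpp repository root, after `make` has built ./quantize.
    for cmd in quantize_commands(Path("models/13B")):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)
```

Building the command list separately from running it makes it easy to print and inspect the commands first.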

./main \
  -m ./models/13B/ggml-model-q4_0.bin \
  -t 8 \
  -n 128 \
  -p 'Some fun pun names for a coffee shop run by beavers:
-'

Enjoy!

References 🙏🏻
1. https://dev.l1x.be/posts/2023/03/12/using-llama-with-m1-mac/
2. https://til.simonwillison.net/llms/llama-7b-m2