Introduction

Are you ready to take your AI research to the next level? Look no further than LLaMA - the Large Language Model Meta AI. Designed to help researchers advance their work in this subfield of AI, LLaMA has been released under a noncommercial license focused on research use cases, granting access to academic researchers, those affiliated with organizations in government, civil society, and academia, and industry research laboratories around the world. In this article, we will dive into the exciting world of LLaMA and explore how to use it with M1 Macs, specifically focusing on running LLaMA 7B and 13B on an M1/M2 MacBook Pro with llama.cpp. Get ready to unlock the full potential of large language models and revolutionize your research!


So how do you run it on your MacBook Pro?

Running LLaMA

Thanks to Georgi Gerganov and his llama.cpp project, it is possible to run Meta’s LLaMA on a single computer without a dedicated GPU. That’s amazing!

Step 1 Install some dependencies

You need the Xcode Command Line Tools to build the C++ project. If you don’t have them yet:

xcode-select --install

Also use Homebrew to install the tools needed to build the C++ project (pkgconfig and cmake), along with Python 3.11:

brew install pkgconfig cmake python@3.11

(Optional) Install the virtualenv package so you can work in a Python virtual environment:

pip3 install virtualenv

Create a Python virtual environment:

virtualenv your-project-name

Activate your virtual environment:

source your-project-name/bin/activate

Next, install PyTorch (the Nightly build is recommended) and a couple of other packages:

pip install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu
pip install numpy sentencepiece

(Optional) Check that the Metal Performance Shaders (MPS) backend is available to PyTorch for GPU acceleration:

Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
>>>

Step 2 Download Project

Clone the llama.cpp 🫡 repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Run make to compile the C++ code:

make

Step 3 Download LLaMA Model

There are two ways to get the model:

  1. Official form: https://forms.gle/jk851eBVbX1m5TAv5
  2. BitTorrent from GitHub: https://github.com/facebookresearch/llama/pull/73
Note
If you download the model from GitHub, do not use the IPFS links; use BitTorrent, otherwise you may end up with files that cannot be converted later.

After you download your model, the structure will look like this:

.
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── 13B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ └── params.json
├── 30B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ ├── consolidated.02.pth
│ ├── consolidated.03.pth
│ └── params.json
├── 65B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ ├── consolidated.02.pth
│ ├── consolidated.03.pth
│ ├── consolidated.04.pth
│ ├── consolidated.05.pth
│ ├── consolidated.06.pth
│ ├── consolidated.07.pth
│ └── params.json
├── llama.sh
├── tokenizer.model
└── tokenizer_checklist.chk

Step 4 Convert LLaMA model 7B

Place the downloaded model folders and the tokenizer files under models/ in the llama.cpp repo.
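
A minimal sketch of that step, assuming the weights were downloaded to ~/Downloads/LLaMA (adjust the paths to wherever your copy actually lives):

# copy the 7B weights plus the tokenizer into the repo's models/ directory
cp -r ~/Downloads/LLaMA/7B models/
cp ~/Downloads/LLaMA/tokenizer.model ~/Downloads/LLaMA/tokenizer_checklist.chk models/

The convert script expects tokenizer.model to sit in the parent directory of the model folder (here, models/), so keep that layout. Then convert the 7B model to ggml format: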

python convert-pth-to-ggml.py models/7B 1

Your output will look like this:

{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts = 1
Processing part 0
Processing variable: tok_embeddings.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: norm.weight with shape: torch.Size([4096]) and type: torch.float16
Converting to float32
Processing variable: output.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape: torch.Size([4096, 11008]) and type: torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.attention_norm.weight with shape: torch.Size([4096]) and type: torch.float16
...
Done. Output file: models/7B/ggml-model-f16.bin, (part 0 )

If you get the error RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive, your download is most likely incomplete or corrupt; check your model files and the note above about using BitTorrent rather than IPFS.
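
A quick sanity check: these .pth checkpoints are ZIP archives under the hood (which is exactly what the PytorchStreamReader error is complaining about), so a truncated or corrupt download will usually fail a simple ZIP test. A sketch for the 7B part; adjust the path for other models:

python3 -c 'import zipfile; print(zipfile.is_zipfile("models/7B/consolidated.00.pth"))'

If this prints False, re-download that file before running the conversion again.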

This should produce models/7B/ggml-model-f16.bin - another 13GB file.

This script "quantizes the model to 4-bits:

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

Step 5 Running LLaMA model 7B

./main -m ./models/7B/ggml-model-q4_0.bin \
-t 8 \
-n 128 \
-p 'The first man on the moon was '

Output:
The first man on the moon was 38 years old in July, ’69. Neil Armstrong had been born only a month or two after my mother’s family settled at their little homestead between Saginaw and Flint/Bay City; they came from Pennsylvania (like most of us).
In his eulogy to him later this year – in which he paid tribute to Armstrong for the “greatness” that was reflected by how much more people were talking about it than usual, despite living through 2018’s worst political disaster yet) … Obama noted: ‘I don’t think…

Step 6 Running LLaMA model 13B

To convert the 13B model to ggml:

python convert-pth-to-ggml.py models/13B/ 1

Because the 13B weights are split across two parts, the conversion produces two output files (ggml-model-f16.bin and ggml-model-f16.bin.1), and the quantize command needs to be run on each of them in turn:

./quantize ./models/13B/ggml-model-f16.bin   ./models/13B/ggml-model-q4_0.bin 2
./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2

Then run the 13B model:
./main \
-m ./models/13B/ggml-model-q4_0.bin \
-t 8 \
-n 128 \
-p 'Some good pun names for a coffee shop run by beavers:
-'

Enjoy!

References 🙏🏻
1. https://dev.l1x.be/posts/2023/03/12/using-llama-with-m1-mac/
2. https://til.simonwillison.net/llms/llama-7b-m2