Run Llama 2 locally on a MacBook
Last week, Meta released Llama 2, an “open source” large language model that is free for research and commercial use. Within a few hours, the community had ported Llama 2 to llama.cpp, which makes it easier and more efficient to run Llama 2 locally.
Download and compile llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && LLAMA_METAL=1 make
Note that LLAMA_METAL is set to 1 to enable GPU acceleration on Apple Silicon. On my M1 Pro MacBook Pro, compilation took only a few seconds.
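If the build succeeds, a main binary should appear at the root of the repository. As a quick sanity check (the exact help text varies between llama.cpp versions), you can print its usage:
./main --help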
Download model weights
We will be using the 7B chat model that has been converted and quantized on Hugging Face:
wget "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin"
export MODEL=llama-2-7b-chat.ggmlv3.q4_0.bin
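Before running inference, you can inspect the download; the exact size depends on the quantization, but the 4-bit 7B file should be a few gigabytes:
ls -lh ${MODEL}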
Run model inference
Run the compiled main binary with the prompt read from the terminal, and specify the model path with the -m flag:
echo "Prompt: " \
&& read PROMPT \
&& ./main \
-t 8 \
-ngl 1 \
-m ${MODEL} \
--color \
-c 2048 \
--temp 0.7 \
--repeat_penalty 1.1 \
-n -1 \
-p "### Instruction: ${PROMPT} \n### Response:"
Output:
### Instruction: hello \n### Response: Hello! How can I help you today? [end of text]
llama_print_timings: load time = 4777.73 ms
llama_print_timings: sample time = 6.97 ms / 10 runs ( 0.70 ms per token, 1434.10 tokens per second)
llama_print_timings: prompt eval time = 1305.32 ms / 12 tokens ( 108.78 ms per token, 9.19 tokens per second)
llama_print_timings: eval time = 462.38 ms / 9 runs ( 51.38 ms per token, 19.46 tokens per second)
llama_print_timings: total time = 1775.44 ms
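For back-and-forth conversations, main also supports an interactive mode. Below is a minimal sketch using the -i and -r (reverse prompt) flags; the prompt template and stop string are illustrative, and flag behavior may differ between llama.cpp versions, so check ./main --help:
# Interactive chat: -i keeps the session open, -r "User:" hands control back
# to you whenever the model generates that string
./main \
  -t 8 \
  -ngl 1 \
  -m ${MODEL} \
  --color \
  -c 2048 \
  --temp 0.7 \
  --repeat_penalty 1.1 \
  -i \
  -r "User:" \
  -p "A chat between a curious user and a helpful assistant.
User: Hello
Assistant:"
With a reverse prompt set, generation pauses each time the model emits the stop string, so you can type the next user turn instead of restarting the process for every prompt.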