Run Llama 2 locally on MacBook

July 23, 2023


Last week, Meta released Llama 2, an “open source” large language model that is free for research and commercial use. Within a few hours, the community had ported Llama 2 to llama.cpp, which makes it easier and more efficient to run Llama 2 locally.

Download and compile llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && LLAMA_METAL=1 make

Note that LLAMA_METAL is set to 1 to enable GPU acceleration through Metal on Apple Silicon. On my M1 Pro MacBook Pro, compilation took only a few seconds.
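If the same commands need to work on machines without Apple Silicon, the Metal flag can be chosen at build time. The snippet below is a minimal sketch, not part of llama.cpp: `metal_flag_for` is a hypothetical helper name, and it assumes that `uname -s` / `uname -m` report `Darwin` / `arm64` on an Apple Silicon Mac running a native shell.

```shell
# Hypothetical helper: print the build variable for a given OS/arch pair.
# Prints LLAMA_METAL=1 only for macOS on arm64, nothing otherwise.
metal_flag_for() {
  if [ "$1" = "Darwin" ] && [ "$2" = "arm64" ]; then
    echo "LLAMA_METAL=1"
  fi
}

# Detect the current machine and show which flag would be used
FLAG="$(metal_flag_for "$(uname -s)" "$(uname -m)")"
echo "build flag: ${FLAG:-none}"
```

Inside the llama.cpp directory, the build can then be started with `env $FLAG make`. The variable is left unquoted on purpose so that an empty value simply disappears and a plain `make` runs.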

Download model weights

We will be using the 7B chat model that has already been converted to the GGML format and quantized to 4 bits (q4_0), hosted on Hugging Face:

wget "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin"
export MODEL=llama-2-7b-chat.ggmlv3.q4_0.bin
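The weights file is several gigabytes, so it can help to make the download resumable and to derive the file name from the URL instead of typing it twice. This is a sketch under assumptions: `MODEL_URL` and `fetch_model` are names introduced here for illustration, and the `curl` fallback is only there because `wget` is not installed on macOS by default.

```shell
MODEL_URL="https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin"

# Derive the local file name from the URL so the two cannot drift apart
MODEL="$(basename "$MODEL_URL")"
export MODEL

# Hypothetical helper: resumable download with whichever tool is available
fetch_model() {
  if command -v wget >/dev/null 2>&1; then
    wget -c "$MODEL_URL"                     # -c continues a partial download
  else
    curl -L -C - -o "$MODEL" "$MODEL_URL"    # -C - resumes, -L follows redirects
  fi
}

echo "model file: $MODEL"
```

Calling `fetch_model` starts the download, or continues it if an earlier attempt was interrupted.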

Run model inference

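Run the compiled main binary with the prompt read from the terminal, and specify the model path with the -m flag. The other flags set the number of CPU threads (-t), the number of layers offloaded to the GPU (-ngl), the context size (-c), and the sampling parameters; -n -1 lets generation continue until the model emits an end-of-text token: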

echo "Prompt: " \
    && read -r PROMPT \
    && ./main \
        -t 8 \
        -ngl 1 \
        -m "${MODEL}" \
        --color \
        -c 2048 \
        --temp 0.7 \
        --repeat_penalty 1.1 \
        -n -1 \
        -p "### Instruction: ${PROMPT} \n### Response:"
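One detail of the command above: inside double quotes the shell does not turn `\n` into a newline, so the model receives a literal backslash followed by `n`. As a variant (not the command that produced the output shown in this post), the prompt can be built with `printf`, which does expand `\n` in its format string. `build_prompt` is a hypothetical helper name used only for this sketch.

```shell
# Hypothetical helper: wrap user text in the instruction template,
# with a real newline between the two sections.
build_prompt() {
  printf '### Instruction: %s\n### Response:' "$1"
}

# Show the resulting two-line prompt
PROMPT_TEXT="$(build_prompt "hello")"
printf '%s\n' "$PROMPT_TEXT"
```

To use it, replace the last argument of the command with `-p "$(build_prompt "${PROMPT}")"`.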

Output:

### Instruction: hello \n### Response: Hello! How can I help you today? [end of text]

llama_print_timings:        load time =  4777.73 ms
llama_print_timings:      sample time =     6.97 ms /    10 runs   (    0.70 ms per token,  1434.10 tokens per second)
llama_print_timings: prompt eval time =  1305.32 ms /    12 tokens (  108.78 ms per token,     9.19 tokens per second)
llama_print_timings:        eval time =   462.38 ms /     9 runs   (   51.38 ms per token,    19.46 tokens per second)
llama_print_timings:       total time =  1775.44 ms
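The throughput figures in these timings are simply token counts divided by elapsed time. As a quick check, the sketch below recomputes them from the numbers above; `tokens_per_second` is a helper name made up for this example.

```shell
# Hypothetical helper: tokens per second from elapsed milliseconds and a token count
tokens_per_second() {
  # $1 = elapsed time in ms, $2 = number of tokens
  awk -v ms="$1" -v n="$2" 'BEGIN { printf "%.2f\n", n * 1000 / ms }'
}

tokens_per_second 462.38 9     # eval (generation) phase: 19.46
tokens_per_second 1305.32 12   # prompt eval phase: 9.19
```

The eval line is the one to watch for interactive use: about 19 tokens per second on this machine, once the prompt has been processed.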