TIL of Xin Fu

Run Llama 2 locally on MacBook

Sun, 23 Jul 2023 00:00:00 +0000

Last week, Meta released Llama 2 , an “open source” large language model that is free for research and commercial use. Within a few hours, the community has ported Llama 2 to llama.cpp which makes it eaiser and more efficient to run Llama 2 locally.

Download and compile llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && LLAMA_METAL=1 make

Note that LLAMA_METAL is set to 1 to enable using GPU on Apple Silicone. On my M1 Pro MacBook Pro, the compliation took about a few seconds.

Download model weights

We will be using the 7B chat model that has been converted and quantified on HuggingFace :

wget "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin"
export MODEL=llama-2-7b-chat.ggmlv3.q4_0.bin

Run model inference

Run compiled main with the prompt read from tty, and specify the model path with -m flag:

echo "Prompt: " \
    && read PROMPT \
    && ./main \
        -t 8 \
        -ngl 1 \
        -m ${MODEL} \
        --color \
        -c 2048 \
        --temp 0.7 \
        --repeat_penalty 1.1 \
        -n -1 \
        -p "### Instruction: ${PROMPT} \n### Response:"

Output:

### Instruction: hello \n### Response: Hello! How can I help you today? [end of text]

llama_print_timings:        load time =  4777.73 ms
llama_print_timings:      sample time =     6.97 ms /    10 runs   (    0.70 ms per token,  1434.10 tokens per second)
llama_print_timings: prompt eval time =  1305.32 ms /    12 tokens (  108.78 ms per token,     9.19 tokens per second)
llama_print_timings:        eval time =   462.38 ms /     9 runs   (   51.38 ms per token,    19.46 tokens per second)
llama_print_timings:       total time =  1775.44 ms

Create interactive utility app with Streamlit

Mon, 10 Jul 2023 00:00:00 +0000

Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science.

Besides data science and machine learning, I found Streamlit can also be used for creating very simple utility apps.

Postboy - a simple Postman-like app

For example, we can create a simple Postman -like app for testing REST APIs with Streamlit in just a few lines of code.

Install Streamlit:

pip install streamlit

Create a file app.py:

import json
import streamlit as st
import requests

st.title("Postboy")

url = st.text_input("URL", "https://jsonplaceholder.typicode.com/posts/1")
method = st.selectbox("Method", ["GET", "POST", "PUT", "DELETE"])

if st.button("Send Request"):
    with st.spinner("Sending..."):
        try:
            response = requests.request(method=method, url=url)
            st.code(json.dumps(response.json(), indent=2), language="json")
        except Exception as e:
            st.write(e)

Run the app:

❯ streamlit run app.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://192.168.0.11:8501

The app will be opened in a browser:

We can further add more features to the app, but the above code demonstrates how easy it is to create a simple utility app with Streamlit.

Handling JWT in Python

Sun, 09 Jul 2023 00:00:00 +0000

JSON Web Tokens (JWT) is an open standard (RFC 7519 ) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object.

JWT Structure

JWTs consist of three parts separated by dots:

Header
Payload
Signature

The header contains the algorithm used to sign the token and the type of token. The header is base64 encoded and looks like this:

{
  "alg": "HS256",
  "typ": "JWT"
}

Payload

Contains a set of claims.

{
  "sub": "1234567890",
  "name": "John Doe",
  "iat": 1516239022
}

Signature

The signature is used to verify the integrity of the token. It is created by signing the header and payload with a secret key. The signature is base64 encoded and looks like this:

HMACSHA256(
  base64UrlEncode(header) + "." +
  base64UrlEncode(payload),
  secret
)

JWT in Python

In Python, we can use pyjwt . Another option is python-jose but it seems not actively maintained.

$ pip install pyjwt[crypto]

Note that it’s better to include the [crypto] to install the cryptography module for working with RSA .

Decode JWT Token

>>> decoded = jwt.decode(encoded, public_key, algorithms=["RS256"])
{'some': 'payload'}

Sometimes, we may just want to decode the key without validation of the signature by setting the verify_signature option to False.

>>> jwt.decode(encoded, options={"verify_signature": False})
{'some': 'payload'}

Similarly, we can read the headers without validation:

>>> jwt.get_unverified_header(encoded)
{'alg': 'RS256', 'typ': 'JWT', 'kid': 'key-id-12345...'}

For more examples, see here from pyjwt documentation.

Question answering over documents with LLM

Sat, 01 Jul 2023 00:00:00 +0000

One of the most popular applications for large language model (LLM) is question answering over various types of documents, such a plain text, web pages, and PDFs. Usually, we want to make the model answer the question which it hasn’t been trained on.

Overview

There are mainly two steps involved:

Data ingestion: load source documents and convert them into vector embeddings which will be stored in a vector database
Question answering: when given input question, convert to vector embedding first, then perform similarity search within the vector database, and top k results will be used as context for the LLM to generate answer to the question.

Diagrams:

    ┌──────────────────┐
    │ Source Documents │
    └────────┬─────────┘
            │ Load & Split
            ▼
    ┌──────────────────┐
    │    Text Chunks   │
    └────────┬─────────┘
            │ Embedding Model
            ▼
    ┌──────────────────┐
    │ Vector Embeddings│
    └──────────────────┘

                  ┌──────────────┐
                  │Question Query│
                  └───────┬──────┘
                          │ Embedding Model
                          ▼
Similarity Search ┌──────────────┐
      ┌───────────┤ Query Vector │
      │           └──────────────┘
      ▼
┌───────────┐     ┌───────────────┐
│ Vector DB ├───► │ Most K Similar│
└───────────┘     │ Source Chunks │
                  └───────┬───────┘
                          │ as context
                          │ plus question
                          ▼
                  ┌───────────────┐
                  │      LLM      │
                  └───────┬───────┘
                          │
                          ▼
                    Generated Answer

Langchain is an emerging framework for quickly prototyping and building LLM applications. In this post, I’ll use it to make an example of how to do question answering over documents using LLM.

Data ingestion

For demo purpose, we only process Markdown documents. I used MDN Web Docs HTTP section files/en-us/web/http for the documents.

import glob

def get_markdown_files(directory):
    markdown_files = []
    pattern = f"{directory}/**/*.md"
    markdown_files = glob.glob(pattern, recursive=True)
    return markdown_files

files = get_markdown_files('/path/to/directory')

Load documents, see Document Loaders for other loaders for different kinds of documents.

from langchain.document_loaders import UnstructuredMarkdownLoader

documents = []
for file_path in files:
  loader = UnstructuredMarkdownLoader(file_path)
  docs = loader.load()
  documents.extend(docs)

Split the documents with RecursiveCharacterTextSplitter , which is the recommended one for generic text.

text_splitter = RecursiveCharacterTextSplitter(
  	chunk_size=1000,
  	chunk_overlap=100
)
texts = text_splitter.split_documents(documents)
print(f"Number of chunks: {len(texts)}")

Create vector embeddings using Hugging Face Embeddings with all-MiniLM-L6-v2 model.

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

Chroma is a AI-native open-source vector database. Langchain provides integration with Chroma vector store .

Create embeddings from splitted texts and persist embeddings into Chroma vector DB:

from chromadb.config import Settings
from langchain.vectorstores import Chroma

chroma_settings = Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory='./db',
    anonymized_telemetry=False,
)

db = Chroma.from_documents(
    texts,
    embeddings,
    persist_directory='./db',
    client_settings=chroma_settings,
)
db.persist()

After persisting, the db directory structure shall look like this:

❯ exa ./db --tree -L 2
./db
├── chroma-collections.parquet
├── chroma-embeddings.parquet
└── index
   ├── id_to_uuid_257a38bd-b642-48ca-b23e-4182417aef0d.pkl
   ├── index_257a38bd-b642-48ca-b23e-4182417aef0d.bin
   ├── index_metadata_257a38bd-b642-48ca-b23e-4182417aef0d.pkl
   └── uuid_to_id_257a38bd-b642-48ca-b23e-4182417aef0d.pkl

Question answering

Langchain provides Retrieval QA to allow us conveniently do question answering over an index.

In this example, OpenAI model will be used as the LLM to generate the answer.

retriever = db.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(
  	llm=OpenAI(temperature=0.1),
  	chain_type="stuff",
  	retriever=retriever,
  	return_source_documents=True
)

query = "What is http?"
result = qa(query)
result['result']

HTTP (Hypertext Transfer Protocol) is an application-layer protocol for transmitting hypermedia documents, such as HTML. It was designed for communication between web browsers and web servers, but it can also be used for other purposes. HTTP follows a classical client-server model, with a client opening a connection to make a request, then waiting until it receives a response. HTTP is a stateless protocol, meaning that the server does not keep any data (state) between two requests.

Conclusion

This post depicts a typical flow for addressing questions over documents using LLM. It demonstrated that LLM can efficiently extract information and synthesize answers according on the context and corpus on which it has been trained.
There are many factors that can affect the output quality, what I can think of are:
- In the ingestion step: chunk_size, chunk_overlap as well as the embedding model, the dimension of the embedding
- In the question answering step: which LLM is used, the parameters of it (such as temperature), how the prompt is constructed, etc.

Deploy Ghost instance to Fly.io

Wed, 28 Jun 2023 00:00:00 +0000

Disclaimer: Not affiliated with Ghost or Fly.io, this is only for evaluation purpose.

Fly.io comes with generous free tier allowance, which can be used to run some random side projects for free. Ghost is a WordPress like publishing platform. Unfortunately, there’s no free tier provided by their company. Similar to WordPress, it provides an open-source version that can be self-hosted.

Get Started

To get started, first create an account on Fly.io , and install flyctl :

brew install flyctl

Open your favorite terminal app and sign into the account:

fly auth login

We can view a list of available regions:

fly platform regions

CODE	NAME                        	GATEWAY	PAID PLAN ONLY
ams 	Amsterdam, Netherlands      	✓
arn 	Stockholm, Sweden
...

Deploy Ghost

This article did a great job explaining every steps for deploying to Fly.io . Remember to choose the same region in each step.

Create storage

Free tier offers 3GB volume, so here we create a volume with 3GB:

flyctl volumes create data --size 3

Initialize app

In next step, we initialize the ghost application on Fly.io. See all the latest available Docker images for Ghost .

mkdir my-blog
cd my-blog
flyctl launch --image=ghost:5 --no-deploy

After this step, a fly.toml file will be generated. It contains the configuration for the app we will deploy to Fly.io.

# fly.toml app configuration file generated for fing-ghost on 2023-06-25T19:02:40+01:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = ""
primary_region = "lhr"

[build]
  image = "ghost:5-alpine"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

Update config

We need to update the fly.toml configuration file to include some environment variables for Ghost instance. And we mount the volume we created earlier to the application. Make sure to change the port to 2368 which the Ghost instance listens to by default.

app = ""
primary_region = "lhr"

[build]
  image = "ghost:5-alpine"

[http_service]
  internal_port = 2368
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0

[env]
  url = "https://.fly.dev"
  database__client = "sqlite3"
  database__connection__filename = "content/data/ghost.db"
  database__useNullAsDefault = "true"
  database__debug = "false"

[mounts]
  source="data"
  destination="/var/lib/ghost/content"

[[services]]
  internal_port = 2368

Deploy

Simply do:

fly deploy

After a while, the instance should be up and running on https://.fly.dev. And we can go to https://.fly.dev/ghost to do the initial setup.

Conclusion and thoughts

Fly.io dashboard is a great place to view the status and the logs for the deployed application.

After I played a while, I found that the instance was constantly restarted due to out of memory issue. Their free plan says Up to 3 shared-cpu-1x 256mb VMs.

My thoughts is that: it’s not practical to run a full Ghost instance on Fly.io only on their free plan. However, if you are just interested and want to try it out, then it might be a good place to deploy an application very conveniently. In most case, if you only host a simple blog, then static website hosting (GitHub Pages , Cloudflare Pages , Netlify ) might be better choice.

Custom middleware for FastAPI application

Sun, 25 Jun 2023 00:00:00 +0000

FastAPI is a Python web framework for building APIs. By using a middleware, we are able to process request and response before/after they get handled by the application.

Implement custom middleware

Starlette is a lightweight ASGI framework/toolkit on which FastAPI is based. It provides BaseHTTPMiddleware class for us to implement custom middleware. It’s required to override the async def dispatch(request, call_next) method.

from starlette.middleware.base import BaseHTTPMiddleware

class CustomHeaderMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, header_value='Example'):
        super().__init__(app)
        self.header_value = header_value

    async def dispatch(self, request, call_next):
        response = await call_next(request)
        response.headers['Custom'] = self.header_value
        return response

The example middleware above simply adds a Custom header to the response.

FastAPI also supports using decorator to create a middleware :

@app.middleware("http")
async def custom_middleware(request: Request, call_next):
	response = await call_next(request)
	return response

It should have the same effect as overriding the dispatch method. I personally prefer the below way to programmatically add the middleware to the FastAPI application.

Add middleware to FastAPI

from fastapi import FastAPI

app = FastAPI()

app.add_middleware(CustomHeaderMiddleware, header_value='Hello')

Notes on "Langchain for LLM Application Development"

Sun, 25 Jun 2023 00:00:00 +0000

Course Link: LangChain for LLM Application Development - DeepLearning.AI

Previous notes:

Notes on “Building Systems with the ChatGPT API”

Introduction

What is Langchain

Open-source development framework for LLM applications
Provide both Python and JavaScript/Typescript packages

Modular components which can be combined to build end-to-end applications.

Models, Prompts and Output Parsers

OpenAI API

Example:

import openai

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message["content"]

LangChain

from langchain.chat_models import ChatOpenAI
chat = ChatOpenAI(temperature=0.0)

Prompt template

A prompt template refers to a reproducible way to generate a prompt.

from langchain import PromptTemplate
template = """/
You are a naming consultant for new companies.
What is a good name for a company that makes {product}?
"""
prompt = PromptTemplate.from_template(template)
prompt.format(product="colorful socks")

Output parsers

Language models output text. Output parsers allows us to get structrued information out of the LLM response.

parser = PydanticOutputParser(pydantic_object=Joke)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)
parser.parse(output)

Memory

LLMs are “stateless”.

ConversationBufferMemory
- allows for storing of messages and then extracts the messages in a variable
ConversationBufferWindowMemory
- keeps a list of the interactions of the conversation over time. It only uses the last K interactions
ConversationTokenBufferMemory
- keeps a buffer of recent interactions in memory, and uses token length rather than number of interactions to determine when to flush interactions
ConversationSummaryMemory
- creates a summary of the conversation over time

Example usage:

llm = ChatOpenAI(temperature=0.0)
memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory = memory,
    verbose=True
)

Additional Memory Types

Vector data memory

store text in a vector database and retrieve the most relevant blocks of text

Entity memories

using an LLM, it remembers details about specific entities

Conversation can also be stored in conventional database (key-value store or SQL).

Chains

LangChain provides the Chain interface for such “chained” applications. We define a Chain very generically as a sequence of calls to components, which can include other chains.

chain = SimpleSequentialChain(chains=[chain_one, chain_two])
chain.run("input")

Sequential

SimpleSequentialChain: The simplest form of sequential chains, where each step has a singular input/output, and the output of one step is the input to the next.
SequentialChain: A more general form of sequential chains, allowing for multiple inputs/outputs.

Router

RouterChain: dynamically selects the next chain to use for a given input.

For example, use MultiPromptChain to create a question-answering chain that selects the prompt which is most relevant for a given question, and then answers the question using that prompt.

chain = MultiPromptChain(
    router_chain=router_chain,
    destination_chains=destination_chains,
    default_chain=default_chain, verbose=True)

See Langchain Router for full example.

Question and Answer

Use LLM to answer questions over documents.

Embeddings:

Embedding vector captures content/meaning
Text with similar content will have similar vectors

Split document to small chunks
For each chunk, create embeddings and store into vector database
When query came in, first create an embedding for that query
Then compare all vectors in the vector database, and pick the n most similar
These then get passed to LLM to get back the final answer

Use Langchain’s OpenAIEmbeddings to create embedding for query:

from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
embed = embeddings.embed_query("Hi my name is Harrison")

db = DocArrayInMemorySearch.from_documents(docs, embeddings)
query = "Please suggest a shirt with sunblocking"
docs = db.similarity_search(query)

Use RetrievalQA chain:

retriever = db.as_retriever()
llm = ChatOpenAI(temperature = 0.0)

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."
response = qa_stuff.run(query)
display(Markdown(response))

Stuff method: simply stuff all data into the prompt context to pass to the language model

Pros: it makes a single call to the LLM, which has access to all the data at once.
Cons: LLMs have a context length, the prompt may exceed the limit.

Additional methods:

Map reduce: call LLM for each chunk plus the query, then aggregate the answers and call LLM again for final answer.
Refine: builds upon the answer from the previous document
Map rerank: let LLM give each chunk a score, then select the highest score as final answer

Evaluation

Turn on debug to view the output of each step.

import langchain
langchain.debug = True

Use QAEvalChain :

from langchain.evaluation.qa import QAEvalChain

llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)

Agents

An agent has access to a suite of tools, and determines which ones to use depending on the user input. Agents can use multiple tools, and use the output of one tool as the input to the next. See more on its doc .

Example:

llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

agent.run("Who is Leo DiCaprio's girlfriend? What is her current age raised to the 0.43 power?")

Create custom tool:

from langchain.agents import tool
from datetime import date

@tool
def time(text: str) -> str:
    """Returns todays date, use this for any \
    questions related to knowing todays date. \
    The input should always be an empty string, \
    and this function will always return todays \
    date - any date mathmatics should occur \
    outside this function."""
    return str(date.today())

For more, see Define Custom Tools .

Conclusion

My thoughts:

Langchain is very powerful and handy tool for developing LLM based applications
It is still evolving, new functionalities are introduced and APIs may change

Set up Hugo with Tailwind CSS in 2023

Sat, 24 Jun 2023 00:00:00 +0000

With the release of v0.112.0 , Hugo added the native support for TailwindCSS v3.x . The author of Hugo provided an example repository setting up TailwindCSS v3: bep/hugo-starter-tailwind-basic .

Note that it uses PostCSS so make sure to use Hugo extended version greater than v0.112.0.

How does it work?

According to the release note:

The basic concept is to add hugo_stats.json to the server watcher list in Hugo and trigger a new TailwindCSS build only whenever either this file or the main CSS file changes.

Add the following sections to the config.toml or hugo.toml configuration file:

[module]
  [[module.mounts]]
    source = "assets"
    target = "assets"
  [[module.mounts]]
    source = "hugo_stats.json"
    target = "assets/watching/hugo_stats.json"

[build]
  writeStats = true
  [[build.cachebusters]]
    source = "assets/watching/hugo_stats\\.json"
    target = "styles\\.css"
  [[build.cachebusters]]
    source = "(postcss|tailwind)\\.config\\.js"
    target = "css"
  [[build.cachebusters]]
    source = "assets/.*\\.(js|ts|jsx|tsx)"
    target = "js"
  [[build.cachebusters]]
    source = "assets/.*\\.(.*)$"
    target = "$1"

Also update the tailwind.config.js file to

module.exports = {
  content: [
    "./hugo_stats.json"
  ],
}

Migration

Previously, my package.json file looks like this:

{
  "scripts": {
    "dev": "NODE_ENV=development ./node_modules/tailwindcss/lib/cli.js -i ./static/tailwind.css -o ./static/main.css -w",
    "build": "NODE_ENV=production ./node_modules/tailwindcss/lib/cli.js -i ./static/tailwind.css -o ./static/main.css --minify"
  },
  "dependencies": {
    "tailwindcss": "^3.2.7",
    "@tailwindcss/typography": "^0.5.9"
  }
}

It includes separate steps:

Compile Tailwind CSS file to static/main.css
Build Hugo site

Sometimes, when making changes to styles, I had to manually restart the Hugo dev server to make it take effect.

After:

{
  "devDependencies": {
    "@tailwindcss/typography": "^0.5.9",
    "autoprefixer": "^10.4.14",
    "postcss": "^8.4.23",
    "postcss-cli": "^10.1.0",
    "prettier": "^2.8.8",
    "prettier-plugin-go-template": "^0.0.13",
    "tailwindcss": "^3.3.2"
  }
}

We don’t need the above dev and build command since all can be done via hugo. Win!

Create postcss.config.js:

const tailwindConfig = process.env.HUGO_FILE_TAILWIND_CONFIG_JS || "./tailwind.config.js";
const tailwind = require("tailwindcss")(tailwindConfig);
const autoprefixer = require("autoprefixer");

module.exports = {
  // eslint-disable-next-line no-process-env
  plugins: [tailwind, ...(process.env.HUGO_ENVIRONMENT === "production" ? [autoprefixer] : [])],
};

I put styles.css file under ./assets/styles. Then include it properly inside the Hugo html template file.

-- Styles -->
{{ $options := dict "inlineImports" true }}
{{ $styles := resources.Get "styles.css" }}
{{ $styles = $styles | resources.PostCSS }}
{{ if hugo.IsProduction }}
	{{ $styles = $styles | minify | fingerprint | resources.PostProcess }}
{{ end }}
<link href="{{ $styles.RelPermalink }}" rel="stylesheet" />

After all these steps, the dev and build command will simply:

# dev
hugo server

# build
hugo --gc --minify

Enjoy!

Write pytest tests for argparse

Thu, 22 Jun 2023 00:00:00 +0000

Writing unit tests for Python code that uses argparse can be non-trivial.

Call `main` with optional arguments

One way suggested by Simon Willison was to make main() function take optional arguments:

parser = argparse.ArgumentParser()
parser.add_argument(...)

def main(args=None):
    parsed_args = parser.parse_args(args)

This makes it easy to just test main() function by calling it with different arguments:

@pytest.mark.parametrize("option", ("-h", "--help"))
def test_help(capsys, option):
    try:
        main([option])
    ...

Patch `sys.argv`

It’s also possible to patch the sys.argv with the mock arguments for testing:

def command():
    parser = argparse.ArgumentParser()
    args = parser.parse_args()
    ...

def test_command():
    with unittest.mock.patch('sys.argv'. ['arg1', 'arg2'])
        command()
        ...

Patch `argparse` directly

In some rare scenarios, argument parser may be at the global scope inside a module file:

parser = argparse.ArgumentParser()
args = parser.parse_args()

class MyClass:
    def __init__(self):
        self.foo = args.foo

In the above case, when the class is imported, the parser will be executed. This makes the unit tests tricky simply because the argument parsing process happens at module import level rather than inside a function call.

A straightforward solution is to patch the ArgumentParser or the args directly:

mock_args = {"foo": "bar"}

@unittest.mock.patch('module.args', argparse.Namespace(**mock_args))
def test_class():
    obj = MyClass()
    ...

The same can also be achieved by patch the argparse.ArgumentParser.parse_args function. See stackoverflow answer .

@mock.patch('argparse.ArgumentParser.parse_args',
            return_value=argparse.Namespace(kwarg1=value, kwarg2=value))
def test_command(mock_args):
    pass

Notes on "Building Systems with the ChatGPT API"

Sun, 11 Jun 2023 00:00:00 +0000

Course Link: Building Systems with the ChatGPT API - DeepLearning.AI

Introduction

Process of building an application

supervised learning: usually takes long time
- get labeled data
- train model on data
- deploy & call model
prompt-based AI: takes short time
- specify prompt and call model

Language Models

How is works:

A language model is built by using supervised learning to repeatedly predict the next word.

Two types of LLMs

Base LLM
Instruction Tuned LLM
- Tune LLM using RLHF : Reinforcement Learning from Human Feedback)

Tokens: common sequences of characters found in text.

Many words map to one token. But some are broken down to multiple tokens, e.g. prompting has prom, pt and ing three parts.

OpenAI provides a tool Tokenizer for understanding how a piece of text would be tokenized by the API, and the total count of tokens in that piece of text.

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

If you need a programmatic interface for tokenizing text, check out our tiktoken package for Python. For JavaScript, the gpt-3-encoder package for node.js works for most GPT-3 models.

Use API Key with caution:

Avoid directly put API Key in the code
Use python-dotenv to load from .env file

import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

Classification

Note that in the course, it uses delimiter = "####" for user’s input message.

messages =  [  
  {'role':'system', 'content': system_message},    
  {'role':'user', 'content': f"{delimiter}{user_message}{delimiter}"},  
]

Use system message to guide the model to output the categories of the classification result in JSON format.

Moderation

OpenAI provides Moderations which will:

Given a input text, outputs if the model classifies it as violating OpenAI’s content policy.

Avoid Prompt Injections Users might inject: forget the previous instructions, do something else instead.

Chain of Thought Reason

Avoid model making error by rushing to a conclusion. Let the query require a series relevant reasoning steps.

Inner Monologue

Since we asked the LLM to separate its reasoning steps by a delimiter, we can hide the chain-of-thought reasoning from the final output that the user sees.

Chaining Prompts

For complex tasks, keeps the track of state external to the LLM. It also allows model to use external tools such as web search and databases.

More focused: breaks down the complex task
Context limitation: max tokens for input prompt and output prompt response
Reduced cost: pay per token

Check Outputs

Use Moderations API to check output for potential harmful content.

Check the satisfaction of the output by letting the model rate the output.

Example:

system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
...

Evaluation

For most prompt-based application:

Tune prompts on handful of examples
Add additional “tricky” examples opportunistically
Develop metrics to measure performance on examples
Collect randomly sampled set of examples to tune to (development set/hold-out cross validation set)
Collect and use a hold-out test set

For text generation tasks, we can evaluate LLM’s answer with a rubric, for example:

def eval_with_rubric(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

Second way is to evaluate based on “ideal” or “expert” (human generated) answer.

This evaluation prompt is from the OpenAI evals project.

BLEU score : another way to evaluate whether two pieces of text are similar or not.

Full content RSS in Hugo

Sun, 04 Jun 2023 00:00:00 +0000

By default, Hugo only displays a summary of the post.

{{ .Summary | html }}

To enable a full text RSS feed for a section of my site, I added a layouts/section/section.rss.xml template file according to Hugo’s layout lookup order . Fill it with the default RSS layout template .

Modified the description part and add RSS content :

{{ with .Description | html }}{{ . }}{{ else }}{{ .Summary | html }}{{ end -}}
{{ (printf "" .Content) | safeHTML }}

Verify the change by going to https://localhost:1313/

/index.xml.

References

Build LLVM with CMake

Sat, 03 Jun 2023 00:00:00 +0000

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

Using the pre-built distribution

In most cases, we can just directly download and use the pre-compiled version of LLVM and Clang from llvm-project/releases . For example, clang+llvm-16.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz is a pre-built distribution for Ubuntu 18.04 and for the x86-64 platforms.

Build LLVM from source

In cases where there’s no pre-built version that suffice our needs, we can download and build LLVM from scratch. For example, openai/triton Python library requires a LLVM compiler to build the wheel.

The Getting Started with the LLVM System documentation page of LLVM contains some instruction on how to build it. I recorded my steps below.

Install CMake :

pip install cmake

here we use the Python wheel which is easier to install.

Download LLVM and Clang source

curl -LO https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.1/llvm-11.0.1.src.tar.xz
tar -xvf llvm-11.0.1.src.tar.xz

Create a new directory for the build files and enter it:

mkdir build
cd build

Configure the build with CMake

cmake -G "Unix Makefiles" -DLLVM_TARGETS_TO_BUILD="X86;NVPTX;AMDGPU" -DCMAKE_BUILD_TYPE="Release" ../

We tell CMake to build Release version which performs best optimization and disables debug information. For more detailed information see CMAKE_BUILD_TYPE . List of Targets available. We may also include -DLLVM_INCLUDE_TESTS=Off -DLLVM_INCLUDE_EXAMPLES=Off -DLLVM_INCLUDE_BENCHMARKS=Off to skip the unnecessary build.

Compile and install

make -j4 install

By default, the built bin, lib and include will be installed to /usr/local/. We can now use the compiled LLVM libraries in other project, e.g. Embedding LLVM in your project .

Customize GitHub Codespace using Dev Container

Thu, 01 Jun 2023 00:00:00 +0000

A Development Container (or Dev Container for short) allows you to use a container as a full-featured development environment.

GitHub Codespace is a development environment that’s hosted in the cloud, which allows you to use VS Code directly from the browser to edit, commit and run projects. Whenever you work in a codespace, you are using a dev container on a virtual machine.

The dev container configuration is located under .devcontainer/devcontainer.json. There are a number of pre-built dev container images published under mcr.microsoft.com/devcontainers on devcontainers/images .

By default, the codespace launches a universal container which includes a decent amount of tools and platforms. One friction I encountered was that in the universal image, the Hugo version built in is a few minor version behind its latest upstream version, see its devcontainer.json . I try to add a Hugo features on top and specify the version, it didn’t work simply because the Hugo feature only installs it if it’s missing. So I created a dev container configuration to tackle it:

{
  "image": "mcr.microsoft.com/devcontainers/go:1",
  "features": {
    "ghcr.io/devcontainers/features/hugo:1": {
      "extended": true,
      "version": "0.112.5"
    },
    "ghcr.io/devcontainers/features/node:1": {}
  },
  "customizations": {
    "vscode": {
      "extensions": [
        "streetsidesoftware.code-spell-checker"
      ]
    }
  },
  "postCreateCommand": "npm install",
  "forwardPorts": [1313]
}

In the above configuration, we first set the base image, then add in features (language, tools or frameworks) on the top. And we install some VS Code extensions, and after the image is created, run npm install. Since I’m developing a Hugo site, port 1313 is forwarded. In this way, I installed Hugo on top of go base image, and also enabled the features I need specificly for this repository.

If we ever want to further customize one feature, devcontainers/feature-starter provides a good template for authoring our own feature.

Build multi-arch Docker images in GitHub Actions

Wed, 10 May 2023 00:00:00 +0000

Publishing images to GitHub Packages

Let’s say we have a repository that includes a Dockerfile. We can utilize a GitHub workflow to:

Check out the repository
Log in to the GitHub Container Registry (ghcr.io)
Extract metadata
Build and push the Docker image to our specified registry

name: Create and publish a Docker image

on:
  push:
    tags:
      - 'v*'

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Log in to the Container registry
        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push Docker image
        uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

Build and publish multi-arch images

Depends on the type of the host machines, images can be built for multiple platforms, e.g. linux/amd64, linux/arm64/v8, etc.

Set up QEMU and Buildx

The next thing is to enable QEMU. Basically it allows GitHub action host machine to emulate different architectures.

QEMU is a generic and open source machine & userspace emulator and virtualizer. QEMU is capable of emulating a complete machine in software without any need for hardware virtualization support.

buildx is a Docker CLI plugin for extended build capabilities with BuildKit .

--- a/.github/workflows/build-publish-image.yml
+++ b/.github/workflows/build-publish-image.yml

@@ -20,6 +22,12 @@ jobs:
       - name: Checkout repository
         uses: actions/checkout@v3

+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v2
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v2

Specify the platforms in docker/build-push-action

In docker/build-push-action , we can set platforms to a list of target platforms to build, separated by comma. In our example, we only needs to build linux/amd64 for all amd64 machine, and linux/arm64/v8 for Apple Silicone MacBooks.

--- a/.github/workflows/build-publish-image.yml
+++ b/.github/workflows/build-publish-image.yml
@@ -37,6 +45,7 @@ jobs:
         uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
         with:
           context: .
-          push: true
+          push: ${{ github.event_name != 'pull_request' }}
+          platforms: linux/amd64,linux/arm64/v8
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}

View them on GitHub Packages

Once the GitHub workflow has been triggered and the images have been built successfully, we can see them from the OS/Arch tab.

Final thoughts

As I just discovered, linux/arm64/v8 is normalized as just linux/arm64.

docker image inspect ghcr.io/imfing/keras-flask-deploy-webapp --format '{{.Os}}/{{.Architecture}}'
linux/arm64

References

Caddy log filter

Tue, 07 Feb 2023 00:00:00 +0000

Caddy is a powerful, enterprise-ready, open source web server with automatic HTTPS written in Go.

It can be used as a reverse proxy server, for example:

:10001 {
    respond /healthz "OK" 200
    reverse_proxy /* http://localhost:8000 {
        header_up x-account "user"
        header_up x-secret-key "secret"
    }
    log
}

The above Caddyfile will launch a http server listening on port 10001 and will redirect requests to localhost:8000 and add two headers which contain sensitive information. When the debug mode is turned on, caddy’s internal logger will log down the requests from the reverse proxy to the localhost server, which will expose the sensitive info.

The log directive can be used to remove certain header from the request and the reverse proxy server itself. To do so, simply add the log override config to the global level:

{
    log {
        format filter {
            wrap console
            fields {
                request>headers>X-Secret-Key delete
            }
        }
    }
}

Note that we don’t need to explicitly remove the Authorization header since:

Since Caddy v2.5, by default, headers with potentially sensitive information (Cookie, Set-Cookie, Authorization and Proxy-Authorization) will be logged with empty values. This behaviour can be disabled with the log_credentials global server option.

Shortcut for creating daily note

Sat, 28 Jan 2023 00:00:00 +0000

I like to jog down things on my Apple Note as it’s easier and fast to access whenever my iPhone is around. With iOS built-in Shortcuts app, creating a daily note entry can be just one click.

Here’s the shortcut I made:

Format current date to be yyyy-MM-dd
Find in all notes where the name contains such date string
If exists, open the note
If not, create a note with the formatted date as title

Credit:

Shortcut for creating a daily Apple Note

Decode Flask session cookie

Wed, 04 Jan 2023 00:00:00 +0000

Snippet to decode Flask session cookie.

To find the cookie in Chrome, open Inspect -> Application -> Cookies, then find the session cookie.

Here’s a short Python snippet that decodes the session cookie using zlib and base64.urlsafe_b64decode.

import zlib
import base64

def decode(cookie):
    """Decode a Flask cookie."""
    try:
        compressed = False
        payload = cookie

        if payload.startswith('.'):
            compressed = True
            payload = payload[1:]

        data = payload.split(".")[0]

        data = base64.urlsafe_b64decode(data)
        if compressed:
            data = zlib.decompress(data)

        return data.decode("utf-8")
    except Exception as e:
        return "[Decoding error: are you sure this was a Flask session cookie? {}]".format(e)

Reference:

https://www.kirsle.net/wizards/flask-session.cgi

Shortcut to copy url as markdown

Tue, 03 Jan 2023 00:00:00 +0000

This shortcut can copy a link from Safari share sheet and copy as Markdown format [title](url] to the clipboard.

Markdown reference-style link

Fri, 30 Dec 2022 00:00:00 +0000

Reference-style links are a special kind of link that make URLs easier to display and read in Markdown.

Instead of inline link: [text](link), the reference-style:

[text][label]

[label]: https://google.com "Google"

First part of the link is formatted with two sets of brackets.

The second part of a reference-style link is formatted with the following attributes:

The label, in brackets, followed immediately by a colon and at least one space (e.g., [label]: ).
The URL for the link, which you can optionally enclose in angle brackets.
The optional title for the link, which you can enclose in double quotes, single quotes, or parentheses.

See it in action:

Basic Syntax | Markdown Guide

Fn Option Delete

Fri, 16 Dec 2022 00:00:00 +0000

As long time Mac user, I wasn’t aware that “forward delete” and “backward delete” exist 😂

Backward delete option + delete

hello world▐

hello▐

Forward delete fn + option + delete

hello ▐world

hello▐

Alternatively, I was using vim editing mode for quick cursor navigation and editing.

Lightroom backup workflow

Tue, 13 Dec 2022 00:00:00 +0000

Midway through 2022, I began using Lightroom Classic to manage and edit camera-captured images. Now that the end of the year is approaching, I’ve decided to organize and back up the photos from this year on my Lacie portable hard drive so I can free up space on my MacBook Pro.

Setup

All of my photos are stored in three distinct locations:

MacBook Pro (500GB): ad-hoc editing
Sandisk Extreme Portable SSD (1TB): bulk editing
LaCie Portable Hard Drive (4TB): backup

I usually bring my laptop with me when I travel. After each day I would import my photos directly into my laptop and edit them right away. Sometimes, I import the photos to the SSD drive. LaCie hard drive is mostly used for making backups.

Backup workflow

I use Lightroom Classic to help me with this. I decide to store the backup photos in hard drive in a following structure:

photos
└─ 2022
   └──2022-01-23 
      ├── Raw
      ├── JPEG
      └── Export

Add external hard drive to Folders

On the left side bar, go to “Folders” and click “Add Folder…”.

Add the location in external hard drive where we want to store our backups.

Select photos

Use “Smart Collection” to create collections that distinguish between Raw and JPEG images. Obviously, we can also sort by date/location/…

Once we finished filtering, just select all by Cmd + A.

Move photos

In the sidebar on the left, right-click to create a subfolder for our photos.

Select “Include selected photos” to transfer the chosen images to the folder.

Thoughts

Initially, I separated each trip into its own Lightroom catalog, but it turns out I could manage them all with a single catalog! Lightroom’s collections feature is versatile and powerful, allowing us to filter and separate photos. Using a single catalog also simplifies backup procedures.

Merge changes from GitHub template repository

Tue, 06 Dec 2022 00:00:00 +0000

It’s very convenient to use GitHub template repository feature to bootstrap a repository. I thought it would also have functionality like Syncing a fork , but unfortunately it doesn’t.

So to sync changes from upstream template repository, we need to use git command line:

Add remote
```
git remote add template 
```
Fetch template changes
```
git fetch --all
```
Use --allow-unrelated-histories to merge, we may also need to manually resolve all the merge conflicts.
```
git merge --allow-unrelated-histories --squash template/
```

--allow-unrelated-histories: By default, git merge command refuses to merge histories that do not share a common ancestor. This option can be used to override this safety when merging histories of two projects that started their lives independently. As that is a very rare occasion, no configuration variable to enable this by default exists and will not be added.

There is also GitHub action actions-template-sync which can be configured to automatically sync changes from template.

References

Manage GitHub project in Linear

Sun, 04 Dec 2022 00:00:00 +0000

While I was working on my simple personal project: issues-blog , I experimented with Linear to manage this small project.

Linear surprised me by being a powerful and elegant tool for managing issues and projects. It has nice and modern UI/UX, and more responsive than JIRA.

Connect Linear with GitHub

Linear comes with integration with GitHub of course: GitHub - Linear Guide . To enable it, go to Settings, and under Integrations -> GitHub we can connect Linear with GitHub pull requests.

A very basic way is to link Linear issue by branch name, e.g. lauren/ENG-123-fixing-loading-issue.

If a PR is created using this branch name, Linear will automatically link to the issue, and move it to In Progress, and after the PR is closed, it will automatically be marked as Done.

TIL of Xin Fu

Run Llama 2 locally on MacBook

Download and compile llama.cpp

Download model weights

Run model inference

Links

Create interactive utility app with Streamlit

Postboy - a simple Postman-like app

Handling JWT in Python

JWT Structure

Header

Payload

Signature

JWT in Python

Decode JWT Token

Links

Question answering over documents with LLM

Overview

Data ingestion

Question answering

Conclusion

Deploy Ghost instance to Fly.io

Get Started

Deploy Ghost

Create storage

Initialize app

Update config

Deploy

Conclusion and thoughts

Custom middleware for FastAPI application

Implement custom middleware

Add middleware to FastAPI

Notes on "Langchain for LLM Application Development"

Introduction

Models, Prompts and Output Parsers

OpenAI API

LangChain

Prompt template

Output parsers

Memory

Additional Memory Types

Chains

Sequential

Router

Question and Answer

Evaluation

Agents

Conclusion

Set up Hugo with Tailwind CSS in 2023

How does it work?

Migration

Write pytest tests for argparse

Call main with optional arguments

Patch sys.argv

Patch argparse directly

Notes on "Building Systems with the ChatGPT API"

Introduction

Language Models

Classification

Moderation

Chain of Thought Reason

Chaining Prompts

Check Outputs

Evaluation

Full content RSS in Hugo

References

Build LLVM with CMake

Using the pre-built distribution

Build LLVM from source

Links

Customize GitHub Codespace using Dev Container

Links

Build multi-arch Docker images in GitHub Actions

Publishing images to GitHub Packages

Build and publish multi-arch images

Set up QEMU and Buildx

Specify the platforms in docker/build-push-action

View them on GitHub Packages

Final thoughts

References

Call `main` with optional arguments

Patch `sys.argv`

Patch `argparse` directly