Notes on "Building Systems with the ChatGPT API"
Course Link: Building Systems with the ChatGPT API - DeepLearning.AI
Introduction
Process of building an application
- supervised learning: usually takes long time
- get labeled data
- train model on data
- deploy & call model
- prompt-based AI: takes short time
- specify prompt and call model
Language Models
How is works:
- A language model is built by using supervised learning to repeatedly predict the next word.
Two types of LLMs
- Base LLM
- Instruction Tuned LLM
- Tune LLM using RLHF : Reinforcement Learning from Human Feedback)
Tokens: common sequences of characters found in text.
Many words map to one token. But some are broken down to multiple tokens, e.g. prompting
has prom
, pt
and ing
three parts.
OpenAI provides a tool Tokenizer for understanding how a piece of text would be tokenized by the API, and the total count of tokens in that piece of text.
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
If you need a programmatic interface for tokenizing text, check out our tiktoken package for Python. For JavaScript, the gpt-3-encoder package for node.js works for most GPT-3 models.
See also: Understanding GPT tokenizers .
Use API Key with caution:
- Avoid directly put API Key in the code
- Use python-dotenv
to load from
.env
file
import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
Classification
Note that in the course, it uses delimiter = "####"
for user’s input message.
messages = [
{'role':'system', 'content': system_message},
{'role':'user', 'content': f"{delimiter}{user_message}{delimiter}"},
]
Use system message to guide the model to output the categories of the classification result in JSON format.
Moderation
OpenAI provides Moderations which will:
Given a input text, outputs if the model classifies it as violating OpenAI’s content policy.
Avoid Prompt Injections
Users might inject: forget the previous instructions, do something else instead
.
Chain of Thought Reason
Avoid model making error by rushing to a conclusion. Let the query require a series relevant reasoning steps.
Inner Monologue
- Since we asked the LLM to separate its reasoning steps by a delimiter, we can hide the chain-of-thought reasoning from the final output that the user sees.
Chaining Prompts
For complex tasks, keeps the track of state external to the LLM. It also allows model to use external tools such as web search and databases.
- More focused: breaks down the complex task
- Context limitation: max tokens for input prompt and output prompt response
- Reduced cost: pay per token
Check Outputs
Use Moderations API to check output for potential harmful content.
Check the satisfaction of the output by letting the model rate the output.
Example:
system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
...
Evaluation
For most prompt-based application:
- Tune prompts on handful of examples
- Add additional “tricky” examples opportunistically
- Develop metrics to measure performance on examples
- Collect randomly sampled set of examples to tune to (development set/hold-out cross validation set)
- Collect and use a hold-out test set
For text generation tasks, we can evaluate LLM’s answer with a rubric, for example:
def eval_with_rubric(test_set, assistant_answer):
cust_msg = test_set['customer_msg']
context = test_set['context']
completion = assistant_answer
system_message = """\
You are an assistant that evaluates how well the customer service agent \
answers a user question by looking at the context that the customer service \
agent is using to generate its response.
"""
user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
[BEGIN DATA]
************
[Question]: {cust_msg}
************
[Context]: {context}
************
[Submission]: {completion}
************
[END DATA]
Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
- Is the Assistant response based only on the context provided? (Y or N)
- Does the answer include information that is not provided in the context? (Y or N)
- Is there any disagreement between the response and the context? (Y or N)
- Count how many questions the user asked. (output a number)
- For each question that the user asked, is there a corresponding answer to it?
Question 1: (Y or N)
Question 2: (Y or N)
...
Question N: (Y or N)
- Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""
messages = [
{'role': 'system', 'content': system_message},
{'role': 'user', 'content': user_message}
]
response = get_completion_from_messages(messages)
return response
Second way is to evaluate based on “ideal” or “expert” (human generated) answer.
This evaluation prompt is from the OpenAI evals project.
BLEU score : another way to evaluate whether two pieces of text are similar or not.