Notes on "Building Systems with the ChatGPT API"

Course Link: Building Systems with the ChatGPT API - DeepLearning.AI

Introduction

Process of building an application

supervised learning: usually takes long time
- get labeled data
- train model on data
- deploy & call model
prompt-based AI: takes short time
- specify prompt and call model

Language Models

How is works:

A language model is built by using supervised learning to repeatedly predict the next word.

Two types of LLMs

Base LLM
Instruction Tuned LLM
- Tune LLM using RLHF : Reinforcement Learning from Human Feedback)

Tokens: common sequences of characters found in text.

Many words map to one token. But some are broken down to multiple tokens, e.g. prompting has prom, pt and ing three parts.

OpenAI provides a tool Tokenizer for understanding how a piece of text would be tokenized by the API, and the total count of tokens in that piece of text.

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

If you need a programmatic interface for tokenizing text, check out our tiktoken package for Python. For JavaScript, the gpt-3-encoder package for node.js works for most GPT-3 models.

Use API Key with caution:

Avoid directly put API Key in the code
Use python-dotenv to load from .env file

import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

Classification

Note that in the course, it uses delimiter = "####" for user’s input message.

messages =  [  
  {'role':'system', 'content': system_message},    
  {'role':'user', 'content': f"{delimiter}{user_message}{delimiter}"},  
]

Use system message to guide the model to output the categories of the classification result in JSON format.

Moderation

OpenAI provides Moderations which will:

Given a input text, outputs if the model classifies it as violating OpenAI’s content policy.

Avoid Prompt Injections Users might inject: forget the previous instructions, do something else instead.

Chain of Thought Reason

Avoid model making error by rushing to a conclusion. Let the query require a series relevant reasoning steps.

Inner Monologue

Since we asked the LLM to separate its reasoning steps by a delimiter, we can hide the chain-of-thought reasoning from the final output that the user sees.

Chaining Prompts

For complex tasks, keeps the track of state external to the LLM. It also allows model to use external tools such as web search and databases.

More focused: breaks down the complex task
Context limitation: max tokens for input prompt and output prompt response
Reduced cost: pay per token

Check Outputs

Use Moderations API to check output for potential harmful content.

Check the satisfaction of the output by letting the model rate the output.

Example:

system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
...

Evaluation

For most prompt-based application:

Tune prompts on handful of examples
Add additional “tricky” examples opportunistically
Develop metrics to measure performance on examples
Collect randomly sampled set of examples to tune to (development set/hold-out cross validation set)
Collect and use a hold-out test set

For text generation tasks, we can evaluate LLM’s answer with a rubric, for example:

def eval_with_rubric(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

Second way is to evaluate based on “ideal” or “expert” (human generated) answer.

This evaluation prompt is from the OpenAI evals project.

BLEU score : another way to evaluate whether two pieces of text are similar or not.