Question answering over documents with an LLM

One of the most popular applications of large language models (LLMs) is question answering over various types of documents, such as plain text, web pages, and PDFs. Usually, we want the model to answer questions about content it hasn’t been trained on.


There are two main steps involved:

  1. Data ingestion: load the source documents, split them into chunks, and convert the chunks into vector embeddings, which are stored in a vector database.
  2. Question answering: given an input question, first convert it into a vector embedding, then run a similarity search against the vector database; the top k results are used as context for the LLM to generate an answer to the question.


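┌──────────────────┐
│ Source Documents │
└────────┬─────────┘
         │ Load & Split
┌────────▼─────────┐
│   Text Chunks    │
└────────┬─────────┘
         │ Embedding Model
┌────────▼─────────┐      ┌────────────────┐
│ Vector Embeddings│      │ Question Query │
└────────┬─────────┘      └───────┬────────┘
         │ Store                  │ Embedding Model
┌────────▼─────────┐      ┌───────▼────────┐
│    Vector DB     │◄─────┤  Query Vector  │
└────────┬─────────┘      └────────────────┘
         │ Similarity Search
┌────────▼─────────┐
│  Top K Similar   │
│  Source Chunks   │
└────────┬─────────┘
         │ as context, plus question
┌────────▼─────────┐
│       LLM        │
└────────┬─────────┘
         │
         ▼
  Generated Answer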
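
To make the retrieval half of this flow concrete, here is a minimal sketch in plain Python. It is not how a real embedding model or vector database works internally: the `embed` function is a toy word-count stand-in, and the names `embed`, `cosine_similarity`, and `top_k` are my own illustrations, not part of any library.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question, best first."""
    query_vector = embed(question)
    scored = [(cosine_similarity(query_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

# Hypothetical chunks standing in for the ingested documents
chunks = [
    "http is a protocol for fetching resources such as html documents",
    "cookies are small pieces of data a server sends to a browser",
    "caching stores a response so it can be reused for later requests",
]
context = top_k("what is the http protocol", chunks, k=1)
```

Here `context` holds the single chunk that shares the most words with the question. A real setup replaces `embed` with a learned model and the brute-force loop with a vector index, but the shape of the computation is the same.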

LangChain is an emerging framework for quickly prototyping and building LLM applications. In this post, I’ll use it to build an example of question answering over documents with an LLM.

Data ingestion

For demo purposes, we only process Markdown documents. I used the HTTP section of MDN Web Docs (files/en-us/web/http) as the source documents.

import glob

def get_markdown_files(directory):
    # Recursively collect every .md file under the directory
    pattern = f"{directory}/**/*.md"
    return glob.glob(pattern, recursive=True)

files = get_markdown_files('/path/to/directory')

Load the documents. See Document Loaders for loaders that handle other kinds of documents.

from langchain.document_loaders import UnstructuredMarkdownLoader

documents = []
for file_path in files:
    loader = UnstructuredMarkdownLoader(file_path)
    # load() returns a list of Document objects; collect them all
    documents.extend(loader.load())

Split the documents with RecursiveCharacterTextSplitter, which is the recommended splitter for generic text.

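from langchain.text_splitter import RecursiveCharacterTextSplitter

# Arguments were cut off in the original; these values are illustrative assumptions.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
print(f"Number of chunks: {len(texts)}")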
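
To show what chunk size and chunk overlap mean, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not the algorithm RecursiveCharacterTextSplitter uses (that one prefers to break on separators such as paragraphs and sentences first); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

pieces = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
# pieces == ["abcd", "defg", "ghij"]
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary is less likely to be lost to both neighbours.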

Create vector embeddings using Hugging Face Embeddings with the all-MiniLM-L6-v2 model.

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

Chroma is an AI-native open-source vector database. LangChain provides an integration with the Chroma vector store.

Create embeddings from the split texts and persist them into the Chroma vector DB:

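from chromadb.config import Settings
from langchain.vectorstores import Chroma

# Arguments were cut off in the original; these values are assumptions that
# match the ./db layout shown below (legacy duckdb+parquet backend).
chroma_settings = Settings(chroma_db_impl="duckdb+parquet", persist_directory="db")

db = Chroma.from_documents(
    texts, embeddings, persist_directory="db", client_settings=chroma_settings
)
db.persist()  # write the collection to disk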

After persisting, the db directory structure should look like this:

❯ exa ./db --tree -L 2
├── chroma-collections.parquet
├── chroma-embeddings.parquet
└── index
   ├── id_to_uuid_257a38bd-b642-48ca-b23e-4182417aef0d.pkl
   ├── index_257a38bd-b642-48ca-b23e-4182417aef0d.bin
   ├── index_metadata_257a38bd-b642-48ca-b23e-4182417aef0d.pkl
   └── uuid_to_id_257a38bd-b642-48ca-b23e-4182417aef0d.pkl

Question answering

LangChain provides Retrieval QA, which lets us conveniently do question answering over an index.

In this example, an OpenAI model will be used as the LLM to generate the answer.

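from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

retriever = db.as_retriever(search_kwargs={"k": 5})
# Arguments were cut off in the original; llm and chain_type here are assumptions.
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
query = "What is http?"
result = qa(query)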
HTTP (Hypertext Transfer Protocol) is an application-layer protocol for transmitting hypermedia documents, such as HTML. It was designed for communication between web browsers and web servers, but it can also be used for other purposes. HTTP follows a classical client-server model, with a client opening a connection to make a request, then waiting until it receives a response. HTTP is a stateless protocol, meaning that the server does not keep any data (state) between two requests.
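
The last step of the flow, passing the retrieved chunks to the LLM "as context, plus question", comes down to building one prompt string. The sketch below shows the idea; the template wording and the `build_prompt` name are my own assumptions, not the prompt LangChain actually uses.

```python
def build_prompt(question, context_chunks):
    """Join the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, standing in for the top k search results
prompt = build_prompt(
    "What is http?",
    ["HTTP is an application-layer protocol.", "HTTP is stateless."],
)
```

Since all k chunks are pasted into a single prompt, k and the chunk size together decide how much of the model's context window the retrieved text takes up.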


  • This post walks through a typical flow for answering questions over documents using an LLM. It demonstrates that an LLM can efficiently extract information and synthesize answers from the supplied context together with the corpus on which it was trained.
  • Many factors can affect the output quality. The ones I can think of are:
    • In the ingestion step: chunk_size, chunk_overlap, the embedding model, and the dimension of the embeddings.
    • In the question answering step: which LLM is used, its parameters (such as temperature), how the prompt is constructed, etc.