Vector Database

Basic Concepts

Imagine you are building a library for 1 million books…

In a traditional library, you find a book by:

Book number (ISBN): exact, but you have to know the number
Title keywords: can miss the book if the title is misspelled

With a vector database, you can:

Ask: “Books about how AI is changing society”
Find: “Artificial Intelligence: A Modern Approach” (even though no word matches!)

Vector DB = a smart library that understands meaning, not just keywords.

Why Do You Need a Vector DB?

Traditional SQL is good at exact-match search:


SELECT * FROM books WHERE title LIKE '%AI%'

But it cannot search by similarity of meaning:


-- ❌ No such semantic operator exists: SIMILAR TO in SQL matches patterns, not meaning
SELECT * FROM books WHERE meaning SIMILAR TO 'artificial intelligence'

A vector DB is designed to:

Store vectors (arrays of hundreds to thousands of floats, e.g. 1536 for OpenAI's text-embedding-3-small)
Find the “nearest” vectors (Approximate Nearest Neighbor, ANN)
Scale to billions of vectors
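
To make “nearest” concrete, here is a minimal sketch of exact nearest-neighbor search using cosine similarity. It is a brute-force illustration only, not how any particular vector DB is implemented: ANN indexes exist precisely to avoid scanning every vector like this. The helper names and the 3-dimensional toy vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query.

    Exact brute-force search: every stored vector is compared to the query.
    """
    ranked = sorted(
        vectors,
        key=lambda vid: cosine_similarity(query, vectors[vid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings" (made-up numbers; real ones have
# hundreds or thousands of dimensions).
vectors = {
    "pho": [0.9, 0.1, 0.0],
    "bun_cha": [0.8, 0.2, 0.1],
    "iphone": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

print(nearest(query, vectors, k=2))  # ['pho', 'bun_cha']
```

The cost of this scan grows linearly with the number of stored vectors, which is why large collections rely on approximate indexes instead.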

Comparing Vector Databases

Database	Hosting	Price	Use case	Characteristics
Chroma	Local/Cloud	Free	Dev/Prototype	Python-first, 5-minute setup
Pinecone	Cloud	Freemium	Production	No infrastructure to manage, scales well
Weaviate	Both	Free/Paid	Enterprise	Strong hybrid search
Qdrant	Both	Free	Performance	Written in Rust, very fast
pgvector	Self-host	Free	Postgres users	No extra database needed
Milvus	Self-host	Free	Big Data	Billions of vectors, GPU support

Recommendations by Stage


[Prototype]        → Chroma (local, free)
[MVP/Startup]      → Pinecone (managed, cheap)
[Scale]            → Qdrant/Weaviate (self-host, control)
[Already Postgres] → pgvector (no new infra)

Quick Start: Chroma (5 minutes)


# pip install chromadb

import chromadb

# 1. Create a client (in-memory: data is lost when the process exits)
client = chromadb.Client()

# 2. Create a collection (like a "table" in SQL); safe to re-run
collection = client.get_or_create_collection(name="my_documents")

# 3. Add documents (embedded automatically with the default model)
collection.add(
    documents=[
        "Pho is a signature dish of Hanoi",
        "Bun cha Obama has been famous since 2016",
        "The iPhone 15 Pro Max has a 48MP camera"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# 4. Query: return the 2 documents closest in meaning to the query text
results = collection.query(
    query_texts=["Delicious food in Vietnam"],
    n_results=2
)

print(results["documents"])
# Likely output (exact ranking depends on the embedding model):
# [['Pho is a signature dish of Hanoi', 'Bun cha Obama...']]
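
The query result is a dict of nested lists: one inner list per query text, which is why the output above has double brackets. The sketch below shows one way to flatten a single query's hits into per-hit records. `flatten_results` is a hypothetical helper, not part of Chroma, and the `sample` dict is a hand-written stand-in with made-up distances so the snippet runs without a database.

```python
def flatten_results(results, query_index=0):
    """Turn a list-per-query result dict into a list of per-hit dicts."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    distances = results["distances"][query_index]
    return [
        {"id": doc_id, "document": doc, "distance": dist}
        for doc_id, doc, dist in zip(ids, docs, distances)
    ]


# Stand-in shaped like a query result for one query text (values are made up).
sample = {
    "ids": [["doc1", "doc2"]],
    "documents": [["Pho is a signature dish of Hanoi", "Bun cha Obama..."]],
    "distances": [[0.42, 0.57]],
}

for hit in flatten_results(sample):
    print(hit["id"], hit["distance"], hit["document"])
```

A smaller distance means a closer match, so the hits arrive already sorted from most to least similar.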

Chroma with Custom Embeddings (OpenAI)


# pip install chromadb openai

import os

import chromadb
from chromadb.utils import embedding_functions

# Use OpenAI embeddings instead of the default model.
# The key is read from the environment rather than hardcoded.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

# Persist the data to disk so it survives restarts
client = chromadb.PersistentClient(path="./chroma_db")

collection = client.get_or_create_collection(
    name="company_docs",
    embedding_function=openai_ef
)

# Add a document together with its metadata
collection.add(
    documents=["Leave policy: 12 days/year"],
    metadatas=[{"department": "HR", "year": 2024}],
    ids=["policy_001"]
)

# Query with a metadata filter
results = collection.query(
    query_texts=["How many days of leave do I get?"],
    where={"department": "HR"},  # Only search documents tagged HR
    n_results=1
)

Metadata Filtering

A vector DB does not only find what is “near”; it can also filter by conditions:


# Find documents about "contracts" in the "Legal" department, from 2024 onward
results = collection.query(
    query_texts=["contract terms"],
    where={
        "$and": [
            {"department": {"$eq": "Legal"}},
            {"year": {"$gte": 2024}}
        ]
    },
    n_results=5
)
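
To show what such a filter means, here is a minimal pure-Python sketch that evaluates a where-style dict against one metadata record. It only illustrates the semantics of the operators used above; `matches` is a hypothetical helper and not how Chroma evaluates filters internally.

```python
import operator

# Comparison operators covered by this sketch (a subset of the filter syntax).
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2024}}
            for op, expected in condition.items():
                if key not in metadata or not _OPS[op](metadata[key], expected):
                    return False
        else:
            # Shorthand form, e.g. {"department": "Legal"}, means equality
            if metadata.get(key) != condition:
                return False
    return True


where = {
    "$and": [
        {"department": {"$eq": "Legal"}},
        {"year": {"$gte": 2024}},
    ]
}

print(matches({"department": "Legal", "year": 2024}, where))  # True
print(matches({"department": "Legal", "year": 2023}, where))  # False
```

In a real vector DB the engine applies the filter together with the similarity search, so only records that pass it can appear among the results.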

Hands-on Exercise: Set Up Chroma + Query 100 Docs

Goal

Build a simple semantic search engine.

Step 1: Prepare the data

Create the file sample_docs.py:


DOCUMENTS = [
    {"text": "Python is the most popular programming language for Data Science", "category": "tech"},
    {"text": "JavaScript dominates web development", "category": "tech"},
    {"text": "Bun bo Hue has a distinctive spicy flavor", "category": "food"},
    {"text": "Saigon com tam is delicious with grilled pork ribs", "category": "food"},
    # ... add 96 more docs
]

Step 2: Index into Chroma


import chromadb

from sample_docs import DOCUMENTS  # the list defined in Step 1

client = chromadb.PersistentClient(path="./my_search_db")
collection = client.get_or_create_collection("search_engine")

# Batch insert: one call with parallel lists of texts, metadata, and ids
collection.add(
    documents=[d["text"] for d in DOCUMENTS],
    metadatas=[{"category": d["category"]} for d in DOCUMENTS],
    ids=[f"doc_{i}" for i in range(len(DOCUMENTS))]
)

print(f"Indexed {collection.count()} documents!")
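
A single add call is fine for 100 documents. For much larger collections, one call may become too large, so a common pattern is to insert in slices. The sketch below shows only the slicing step; `batched` is a hypothetical helper and the batch size of 40 is an arbitrary value chosen for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


docs = [f"document {i}" for i in range(100)]
sizes = [len(batch) for batch in batched(docs, 40)]
print(sizes)  # [40, 40, 20]
```

Each slice would then go into its own add call, with the metadata and id lists sliced the same way so the three lists stay aligned.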

Step 3: Build a search UI (Bonus)


# Save as app.py and run with: streamlit run app.py
import streamlit as st
import chromadb

client = chromadb.PersistentClient(path="./my_search_db")
collection = client.get_collection("search_engine")

query = st.text_input("Search:")
category = st.selectbox("Category:", ["all", "tech", "food"])

if query:
    # Only filter by category when a specific one is selected
    where_filter = {"category": category} if category != "all" else None
    results = collection.query(query_texts=[query], where=where_filter, n_results=5)

    # results["documents"] holds one list per query text; we sent one query
    for doc in results["documents"][0]:
        st.write(f"- {doc}")

Summary

Concept	Meaning
Vector DB	A database optimized for similarity search
ANN	Approximate Nearest Neighbor: finds close matches without guaranteeing 100% exact results
Metadata	Extra information used for filtering (category, date, etc.)
Collection	Like a “table” in SQL

Next lesson: Chunking Strategies - how to split documents effectively.