Tutorial: Converting Exam PDFs to MDX

A hands-on guide to building an Agent Skill that converts PDFs of competition exams for gifted students (học sinh giỏi) into MDX pages that render well on the web.

The problem

Scenario: You have many competition exam PDFs of varying quality (scanned, text-based, or mixed) and need to convert them to MDX so that they:

Render well on a website
Are easy to search and index
Support code syntax highlighting, tables, and math formulas

Challenges:

Poorly scanned PDFs need OCR
Exams contain code (Python, C++, Pascal…)
Complex data tables
Math formulas (LaTeX/KaTeX)
Illustrations

Designing the Skill

Directory structure


pdf-to-mdx-converter/
├── SKILL.md                 # Main file
├── reference.md             # Detailed guide
├── templates/
│   └── exam-template.mdx    # Exam template
└── scripts/
    ├── extract_text.py      # Extract text from the PDF
    ├── detect_content.py    # Detect content types
    └── convert_to_mdx.py    # Convert to MDX

Creating SKILL.md


---
name: pdf-to-mdx-converter
description: Convert PDFs of competition exams for gifted students to MDX, handling code blocks, tables, and math formulas
dependencies: python>=3.8, pypdf>=3.0, pdfplumber>=0.9, pytesseract>=0.3
---
 
# PDF to MDX Converter
 
This skill converts PDFs of competition exams for gifted students into MDX pages
that render well on the web, with:
- Code syntax highlighting
- Data tables
- Math formulas (KaTeX)
- Images

## Processing pipeline

1. **Read the PDF** - Extract text and images from the file
2. **Detect content** - Classify it as text, code, table, or formula
3. **Convert** - Convert each part to MDX
4. **Output** - Produce a complete .mdx file

## When to use

- Converting exam PDFs to a web format
- Extracting content from academic documents
- Digitizing educational materials

## See reference.md for implementation details
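
As a quick sanity check before uploading, the frontmatter of a SKILL.md file can be parsed and validated. The sketch below is only an illustration: it assumes the frontmatter holds simple `key: value` lines and that `name` and `description` are the fields a skill needs. `parse_frontmatter` and `missing_fields` are hypothetical helper names, not part of any Skill tooling.

```python
def parse_frontmatter(markdown_text):
    """Return the key/value pairs found between the leading '---' lines."""
    lines = markdown_text.strip().splitlines()
    if not lines or lines[0].strip() != '---':
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == '---':
            return fields
        # Split on the first colon only, so values may contain colons
        key, sep, value = line.partition(':')
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # No closing '---' was found


def missing_fields(fields, required=('name', 'description')):
    """List the required keys that are absent or empty (assumed requirement)."""
    return [key for key in required if not fields.get(key)]


skill_md = """---
name: pdf-to-mdx-converter
description: Convert exam PDFs to MDX
dependencies: python>=3.8, pypdf>=3.0
---

# PDF to MDX Converter
"""

fields = parse_frontmatter(skill_md)
print(missing_fields(fields))
```

An empty list means both assumed fields are present. A full YAML parser would be the safer choice if the frontmatter ever grows beyond flat `key: value` pairs.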

Handling each component

1. Extracting text from the PDF


# scripts/extract_text.py
import pdfplumber
import pytesseract

def extract_text(pdf_path):
    """Extract text from a PDF, handling both text-based and scanned pages."""

    results = []

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            # Try direct text extraction first
            text = page.extract_text()

            # Little or no text usually means a scanned page: fall back to OCR
            if not text or len(text.strip()) < 50:
                # Render the page; .original is the underlying PIL image
                img = page.to_image(resolution=300)
                pil_image = img.original

                # OCR with pytesseract (requires the Tesseract binary and
                # the 'vie' and 'eng' language data to be installed)
                text = pytesseract.image_to_string(
                    pil_image,
                    lang='vie+eng',   # Vietnamese and English
                    config='--psm 6'  # Assume a single uniform block of text
                )

            results.append({
                'page': i + 1,
                'text': text,
                'tables': page.extract_tables(),
                'images': extract_images(page)
            })

    return results

def extract_images(page):
    """Collect the position and size of each image on the page (no pixel data)."""
    images = []
    for img in page.images:
        images.append({
            'x0': img['x0'],
            'top': img['top'],  # Distance from the top edge of the page
            'width': img['width'],
            'height': img['height']
        })
    return images
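
The OCR fallback in this step rests on one heuristic: a page whose extracted text is empty or shorter than 50 characters is treated as scanned. The sketch below isolates that decision so the threshold can be tuned and tested without a PDF at hand. `needs_ocr` and `min_chars` are hypothetical names introduced for illustration.

```python
def needs_ocr(text, min_chars=50):
    """Return True when a page's extracted text is too short to trust."""
    return not text or len(text.strip()) < min_chars


# A few representative cases: nothing extracted, whitespace only,
# a short stray header, and a normal page of text.
cases = {
    'nothing extracted': None,
    'whitespace only': '   \n  ',
    'stray header': 'Page 1',
    'normal page': 'Given a sequence of integers. ' * 5,
}

for label, text in cases.items():
    print(f'{label}: {needs_ocr(text)}')
```

A fixed character count is a coarse signal. A page that is mostly a diagram with a short caption would also trigger OCR, so the threshold is worth checking against a sample of real exams.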

2. Detecting content types


# scripts/detect_content.py
import re

def detect_content_type(text):
    """Split text into consecutive blocks typed as code, math, or text.

    Tables are not detected here; pdfplumber extracts them separately.
    """

    content_blocks = []
    lines = text.split('\n')

    current_block = {'type': 'text', 'content': []}

    for line in lines:
        detected_type = classify_line(line)

        # Start a new block whenever the line type changes
        if detected_type != current_block['type']:
            if current_block['content']:
                content_blocks.append(current_block)
            current_block = {'type': detected_type, 'content': []}

        current_block['content'].append(line)

    if current_block['content']:
        content_blocks.append(current_block)

    return content_blocks

def classify_line(line):
    """Classify a single line of text.

    Classification is per line, so lines inside a snippet that match
    no pattern (blank lines, closing braces) are classified as 'text'.
    """

    # Code patterns
    code_patterns = [
        r'^\s*(def |class |import |from |if |for |while |return |print\()',  # Python
        r'^\s*(#include|int main|void |printf|scanf)',  # C/C++
        r'^\s*(program |var |begin |end\.|procedure |function )',  # Pascal
        r'^\s*```',  # Markdown code block
    ]

    for pattern in code_patterns:
        if re.search(pattern, line, re.IGNORECASE):
            return 'code'

    # Math patterns
    math_patterns = [
        r'\$.*\$',           # Inline math
        r'\\\[.*\\\]',       # Display math
        r'\\frac|\\sum|\\int|\\sqrt|\\alpha|\\beta',  # LaTeX commands
        r'[∑∫∏√∞≤≥≠±×÷]',    # Math symbols
    ]

    for pattern in math_patterns:
        if re.search(pattern, line):
            return 'math'

    return 'text'

def detect_programming_language(code_block):
    """Guess the programming language of a code block."""

    code = '\n'.join(code_block['content'])

    # Check Java first: its "import" and "void" would otherwise be
    # mistaken for Python or C++ by the checks below
    if re.search(r'public class|System\.out|public static void main', code):
        return 'java'
    elif re.search(r'def |import |print\(|class \w+:', code):
        return 'python'
    elif re.search(r'#include|int main|printf|scanf|void ', code):
        return 'cpp'
    elif re.search(r'program |begin |end\.|writeln|readln', code, re.IGNORECASE):
        return 'pascal'

    return 'text'
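
To make the grouping behaviour concrete, here is a small worked example. It is a simplified stand-in, not the full pattern list from this step: `CODE_RE`, `MATH_RE`, `classify`, and `group_blocks` are hypothetical names, and `itertools.groupby` replaces the explicit loop.

```python
import re
from itertools import groupby

# Reduced pattern sets, for illustration only
CODE_RE = re.compile(r'^\s*(#include|int main|def |import |return )')
MATH_RE = re.compile(r'\$.*\$|[∑∫√≤≥≠]')


def classify(line):
    """Label one line as 'code', 'math', or 'text'."""
    if CODE_RE.search(line):
        return 'code'
    if MATH_RE.search(line):
        return 'math'
    return 'text'


def group_blocks(text):
    """Group consecutive lines of the same type into (type, lines) pairs."""
    return [
        (kind, list(lines))
        for kind, lines in groupby(text.split('\n'), key=classify)
    ]


sample = "Given n ≤ 100\n#include <stdio.h>\nint main() {\n    return 0;\n}"

for kind, lines in group_blocks(sample):
    print(kind, lines)
```

The closing brace ends up in its own `text` block because it matches no code pattern. That is the main weakness of line-by-line classification, and a reason to review detected code blocks by hand.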

3. Handling code blocks


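import re

def format_code_block(block):
    """Format a code block for MDX."""

    language = detect_programming_language(block)
    code = '\n'.join(block['content'])

    # Clean up the code
    code = code.strip()

    # Strip leading line numbers such as "12: " or "12 | ", if present
    code = re.sub(r'^\d+\s*[:\|]\s*', '', code, flags=re.MULTILINE)

    # Build the fence from single backticks so that this listing does not
    # itself contain a literal fence line
    fence = '`' * 3
    return f"\n{fence}{language}\n{code}\n{fence}\n"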
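
One detail of line-number stripping deserves a closer look: a pattern that ends in `\s*` removes all whitespace after the number marker, including the indentation of the code itself. The sketch below compares that behaviour with a variant that removes at most one space. `KEEP_INDENT` is a hypothetical alternative, not something the original script uses.

```python
import re

# Ends in \s*: swallows every space after the marker
GREEDY = re.compile(r'^\d+\s*[:\|]\s*', re.MULTILINE)

# Hypothetical variant: removes at most one space after the marker
KEEP_INDENT = re.compile(r'^\d+[ \t]*[:|] ?', re.MULTILINE)

numbered = "1: def f(x):\n2:     return x + 1"

print(GREEDY.sub('', numbered))
print(KEEP_INDENT.sub('', numbered))
```

With the first pattern the `return` line loses its indentation, which makes extracted Python invalid. The second keeps the four spaces. For C++ or Pascal the difference is only cosmetic.
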
### 4. Handling math formulas

```python
def format_math(block):
    """Convert math text to KaTeX-compatible LaTeX for MDX."""

    text = '\n'.join(block['content'])

    # Already in LaTeX format
    if re.search(r'\$.*\$|\\\[|\\\(', text):
        return text

    # Convert common math symbols
    conversions = {
        '√': r'\sqrt',
        '∑': r'\sum',
        '∫': r'\int',
        '∞': r'\infty',
        '≤': r'\leq',
        '≥': r'\geq',
        '≠': r'\neq',
        '×': r'\times',
        '÷': r'\div',
        '±': r'\pm',
        'α': r'\alpha',
        'β': r'\beta',
        'γ': r'\gamma',
        'π': r'\pi',
    }

    for symbol, latex in conversions.items():
        # The trailing space keeps a command from fusing with a following
        # letter ("αx" must not become "\alphax")
        text = text.replace(symbol, latex + ' ')
    text = text.strip()

    # Wrap in $ delimiters if not already wrapped
    if not text.startswith('$'):
        if len(text) < 50:
            # Inline math
            text = f'${text}$'
        else:
            # Display math
            text = f'$$\n{text}\n$$'

    return text
```
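
A common pitfall when mapping Unicode symbols to LaTeX commands is that a command followed directly by a letter fuses into an unknown command: `αx` becomes `\alphax`. The sketch below shows one way around it, padding each command with a space and then collapsing repeated whitespace. `SYMBOLS` and `to_latex` are hypothetical names, and the table is a reduced subset.

```python
# Reduced symbol table, for illustration only
SYMBOLS = {
    '≤': r'\leq',
    '≥': r'\geq',
    '≠': r'\neq',
    'α': r'\alpha',
    '∞': r'\infty',
}


def to_latex(text):
    """Replace Unicode math symbols with space-terminated LaTeX commands."""
    for symbol, command in SYMBOLS.items():
        text = text.replace(symbol, command + ' ')
    # Collapse the doubled spaces the padding can introduce
    return ' '.join(text.split())


print(to_latex('αx ≤ 10'))
print(f"${to_latex('n ≠ ∞')}$")
```

Collapsing whitespace also removes any trailing space before the closing `$`, which some Markdown math parsers would otherwise refuse to treat as a delimiter.
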
5. Handling tables


def format_table(table_data):
    """Convert extracted table data to a Markdown table."""

    if not table_data or not table_data[0]:
        return ""

    # Clean the data: newlines and pipes inside a cell would break the row
    cleaned = []
    for row in table_data:
        cleaned_row = [
            str(cell).strip().replace('\n', ' ').replace('|', '\\|') if cell else ''
            for cell in row
        ]
        cleaned.append(cleaned_row)

    # Determine the number of columns
    max_cols = max(len(row) for row in cleaned)

    # Pad short rows so every row has the same width
    for row in cleaned:
        while len(row) < max_cols:
            row.append('')

    # Build the Markdown table
    lines = []

    # Header
    header = '| ' + ' | '.join(cleaned[0]) + ' |'
    lines.append(header)

    # Separator
    separator = '| ' + ' | '.join(['---'] * max_cols) + ' |'
    lines.append(separator)

    # Data rows
    for row in cleaned[1:]:
        line = '| ' + ' | '.join(row) + ' |'
        lines.append(line)

    return '\n'.join(lines)
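
Cells taken from a PDF can contain newlines or pipe characters, and either one breaks a Markdown table row. The sketch below shows one way to neutralize them, using rows in the shape that pdfplumber's `extract_tables()` returns (a list of rows, with `None` for empty cells). `clean_cell` and `to_markdown_table` are hypothetical names for a compact variant of this step.

```python
def clean_cell(cell):
    """Make one cell safe for a single-line Markdown table row."""
    if cell is None:
        return ''
    return str(cell).strip().replace('\n', ' ').replace('|', '\\|')


def to_markdown_table(rows):
    """Render rows as a Markdown table, using the first row as the header."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    width = max(len(row) for row in cleaned)
    # Pad short rows so every row has the same number of columns
    cleaned = [row + [''] * (width - len(row)) for row in cleaned]

    def render(row):
        return '| ' + ' | '.join(row) + ' |'

    lines = [render(cleaned[0]), render(['---'] * width)]
    lines += [render(row) for row in cleaned[1:]]
    return '\n'.join(lines)


rows = [
    ['DAYCON.INP', 'DAYCON.OUT'],
    ['6\n1 3 2 4 3 5', '4'],   # Multi-line cell, as often found in exam samples
    ['a|b', None],             # Pipe inside a cell, and an empty cell
]

print(to_markdown_table(rows))
```

Joining a multi-line cell with spaces loses the line structure of sample input. Where that matters, a fenced code block per sample may serve readers better than a table.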

6. Building the complete MDX file


# scripts/convert_to_mdx.py
# format_code_block, format_math, and format_table (steps 3 to 5) are
# assumed to be defined in this module.
from extract_text import extract_text
from detect_content import detect_content_type, detect_programming_language

def convert_to_mdx(pdf_path, output_path):
    """Main function: convert a PDF to MDX."""

    # Extract content
    pages = extract_text(pdf_path)

    # Build the MDX content
    mdx_content = []

    # Frontmatter
    mdx_content.append('''---
title: "Competition Exam for Gifted Students"
description: "Exam converted from PDF"
---
''')

    for page_data in pages:
        mdx_content.append(f"\n## Page {page_data['page']}\n")

        # Process text content
        blocks = detect_content_type(page_data['text'])

        for block in blocks:
            if block['type'] == 'code':
                mdx_content.append(format_code_block(block))
            elif block['type'] == 'math':
                mdx_content.append(format_math(block))
            else:
                mdx_content.append('\n'.join(block['content']))

        # Process tables
        for table in page_data['tables']:
            mdx_content.append('\n' + format_table(table) + '\n')

    # Write the output
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(mdx_content))

    return output_path
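
One thing this step does not handle: MDX parses `{` as the start of a JavaScript expression and `<` as the start of a JSX tag, so raw exam text such as `S = {1, 2, 3}` or `a < b` can fail to compile. The sketch below shows one possible guard that backslash-escapes both characters. `escape_mdx_text` is a hypothetical helper, meant only for plain text blocks and not for code or math blocks.

```python
import re


def escape_mdx_text(text):
    """Backslash-escape '{' and '<' so MDX treats them as literal text.

    The lookbehind skips characters that are already escaped, so applying
    the function twice gives the same result as applying it once.
    """
    return re.sub(r'(?<!\\)([{<])', r'\\\1', text)


line = 'Let S = {1, 2, 3} and a < b'
print(escape_mdx_text(line))
```

In `convert_to_mdx`, such a helper would wrap only the final `else` branch, where untyped text is appended unchanged.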

MDX output template

Below is an example of the MDX output the skill produces:


---
title: "Informatics Competition Exam - ABC Province 2024"
description: "Provincial selection exam for gifted students in Informatics"
subject: "Informatics"
level: "High school"
year: 2024
---

# PROVINCIAL COMPETITION EXAM FOR GIFTED STUDENTS

**Subject:** Informatics  
**Time:** 180 minutes  
**Year:** 2024

---

## Problem 1: Increasing subsequence (4 points)

Given an integer sequence $a_1, a_2, \ldots, a_n$ with $1 \leq n \leq 10^5$.

**Task:** Find the longest increasing subsequence.

**Input:** File `DAYCON.INP`
- Line 1: An integer $n$
- Line 2: $n$ integers $a_i$ ($|a_i| \leq 10^9$)

**Output:** File `DAYCON.OUT`
- Line 1: The length of the longest increasing subsequence

**Example:**

| DAYCON.INP  | DAYCON.OUT |
|-------------|------------|
| 6           | 4          |
| 1 3 2 4 3 5 |            |

---

## Problem 2: Shortest path (5 points)

Given a directed graph $G = (V, E)$ with $n$ vertices and $m$ edges.

Formula: $$d(u, v) = \min_{p \in P(u,v)} \sum_{e \in p} w(e)$$

### Sample code (illustrating syntax highlighting)

```cpp
#include <bits/stdc++.h>
using namespace std;

void dijkstra(int s, int n) {
    // ... implementation
}
```

End of exam



> **Note:** This is an illustrative example of how the skill formats its output. The actual content is extracted from your own exam PDFs.

## Using the Skill

After creating and uploading the skill, you can ask Claude:

“Convert the file de-thi-tin-hoc-2024.pdf to MDX, making sure the C++ code, math formulas, and data tables are formatted correctly”



Claude will:
1. Load the `pdf-to-mdx-converter` skill
2. Run the extraction and conversion scripts
3. Produce a complete .mdx file
4. Return the file for you to download

## Tips for low-quality PDFs

1. **Increase the OCR resolution** when the scan is blurry:
   ```python
   img = page.to_image(resolution=600)  # Instead of 300
   ```

2. **Pre-process the image before OCR:**

       from PIL import ImageFilter, ImageEnhance

       # pil_image is the PIL image obtained from page.to_image(...).original
       pil_image = pil_image.filter(ImageFilter.SHARPEN)
       enhancer = ImageEnhance.Contrast(pil_image)
       pil_image = enhancer.enhance(2.0)

3. **Manually review complex formulas:** OCR rarely recovers LaTeX accurately.
4. **Use GPT-4 Vision for complex scanned PDFs** instead of traditional OCR.

Extensions

This Skill can be extended to:

Support more programming languages
Auto-generate solutions from exams
Generate SEO metadata automatically
Export to more formats (HTML, LaTeX, Word)

Conclusion

With this Agent Skill, you can automate the batch conversion of exam PDFs to MDX, saving considerable time compared with doing it by hand.

Next: Try applying this pattern to build skills for other workflows in your own work!