A
AgentSkill
SkillsAbout
A

AgentSkill

ยฉ 2025

Not affiliated with Anthropic

Back to Skills

Skill_Seekers

Convert documentation websites, GitHub repositories, and PDFs into Claude AI skills with automatic conflict detection

Automation
yusufkaraaslan
5004
530
Updated Dec 15, 2025
ai-toolsast-parserautomationclaude-aiclaude-skillscode-analysisconflict-detectiondocumentationdocumentation-generatorgithubgithub-scrapermcpmcp-servermulti-sourceocrpdfpythonweb-scraping
View on GitHub

Installation

git clone https://github.com/yusufkaraaslan/Skill_Seekers ~/.claude/skills/skill-seekers

SKILL.md

[![MseeP.ai Security Assessment Badge](https://mseep.net/pr/yusufkaraaslan-skill-seekers-badge.png)](https://mseep.ai/app/yusufkaraaslan-skill-seekers)

# Skill Seeker

[![Version](https://img.shields.io/badge/version-2.1.1-blue.svg)](https://github.com/yusufkaraaslan/Skill_Seekers/releases/tag/v2.1.1)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![MCP Integration](https://img.shields.io/badge/MCP-Integrated-blue.svg)](https://modelcontextprotocol.io)
[![Tested](https://img.shields.io/badge/Tests-427%20Passing-brightgreen.svg)](tests/)
[![Project Board](https://img.shields.io/badge/Project-Board-purple.svg)](https://github.com/users/yusufkaraaslan/projects/2)
[![PyPI version](https://badge.fury.io/py/skill-seekers.svg)](https://pypi.org/project/skill-seekers/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/skill-seekers.svg)](https://pypi.org/project/skill-seekers/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/skill-seekers.svg)](https://pypi.org/project/skill-seekers/)

**Automatically convert documentation websites, GitHub repositories, and PDFs into Claude AI skills in minutes.**

> ๐Ÿ“‹ **[View Development Roadmap & Tasks](https://github.com/users/yusufkaraaslan/projects/2)** - 134 tasks across 10 categories, pick any to contribute!

## What is Skill Seeker?

Skill Seeker is an automated tool that transforms documentation websites, GitHub repositories, and PDF files into production-ready [Claude AI skills](https://www.anthropic.com/news/skills). Instead of manually reading and summarizing documentation, Skill Seeker:

1. **Scrapes** multiple sources (docs, GitHub repos, PDFs) automatically
2. **Analyzes** code repositories with deep AST parsing
3. **Detects** conflicts between documentation and code implementation
4. **Organizes** content into categorized reference files
5. **Enhances** with AI to extract best examples and key concepts
6. **Packages** everything into an uploadable `.zip` file for Claude

**Result:** Get comprehensive Claude skills for any framework, API, or tool in 20-40 minutes instead of hours of manual work.

## Why Use This?

- ๐ŸŽฏ **For Developers**: Create skills from documentation + GitHub repos with conflict detection
- ๐ŸŽฎ **For Game Devs**: Generate skills for game engines (Godot docs + GitHub, Unity, etc.)
- ๐Ÿ”ง **For Teams**: Combine internal docs + code repositories into single source of truth
- ๐Ÿ“š **For Learners**: Build comprehensive skills from docs, code examples, and PDFs
- ๐Ÿ” **For Open Source**: Analyze repos to find documentation gaps and outdated examples

## Key Features

### ๐ŸŒ Documentation Scraping
- โœ… **llms.txt Support** - Automatically detects and uses LLM-ready documentation files (10x faster)
- โœ… **Universal Scraper** - Works with ANY documentation website
- โœ… **Smart Categorization** - Automatically organizes content by topic
- โœ… **Code Language Detection** - Recognizes Python, JavaScript, C++, GDScript, etc.
- โœ… **8 Ready-to-Use Presets** - Godot, React, Vue, Django, FastAPI, and more

### ๐Ÿ“„ PDF Support (**v1.2.0**)
- โœ… **Basic PDF Extraction** - Extract text, code, and images from PDF files
- โœ… **OCR for Scanned PDFs** - Extract text from scanned documents
- โœ… **Password-Protected PDFs** - Handle encrypted PDFs
- โœ… **Table Extraction** - Extract complex tables from PDFs
- โœ… **Parallel Processing** - 3x faster for large PDFs
- โœ… **Intelligent Caching** - 50% faster on re-runs

### ๐Ÿ™ GitHub Repository Scraping (**v2.0.0**)
- โœ… **Deep Code Analysis** - AST parsing for Python, JavaScript, TypeScript, Java, C++, Go
- โœ… **API Extraction** - Functions, classes, methods with parameters and types
- โœ… **Repository Metadata** - README, file tree, language breakdown, stars/forks
- โœ… **GitHub Issues & PRs** - Fetch open/closed issues with labels and milestones
- โœ… **CHANGELOG & Releases** - Automatically extract version history
- โœ… **Conflict Detection** - Compare documented APIs vs actual code implementation
- โœ… **MCP Integration** - Natural language: "Scrape GitHub repo facebook/react"

### ๐Ÿ”„ Unified Multi-Source Scraping (**NEW - v2.0.0**)
- โœ… **Combine Multiple Sources** - Mix documentation + GitHub + PDF in one skill
- โœ… **Conflict Detection** - Automatically finds discrepancies between docs and code
- โœ… **Intelligent Merging** - Rule-based or AI-powered conflict resolution
- โœ… **Transparent Reporting** - Side-by-side comparison with โš ๏ธ warnings
- โœ… **Documentation Gap Analysis** - Identifies outdated docs and undocumented features
- โœ… **Single Source of Truth** - One skill showing both intent (docs) and reality (code)
- โœ… **Backward Compatible** - Legacy single-source configs still work

### ๐Ÿค– AI & Enhancement
- โœ… **AI-Powered Enhancement** - Transforms basic templates into comprehensive guides
- โœ… **No API Costs** - FREE local enhancement using Claude Code Max
- โœ… **MCP Server for Claude Code** - Use directly from Claude Code with natural language

### โšก Performance & Scale
- โœ… **Async Mode** - 2-3x faster scraping with async/await (use `--async` flag)
- โœ… **Large Documentation Support** - Handle 10K-40K+ page docs with intelligent splitting
- โœ… **Router/Hub Skills** - Intelligent routing to specialized sub-skills
- โœ… **Parallel Scraping** - Process multiple skills simultaneously
- โœ… **Checkpoint/Resume** - Never lose progress on long scrapes
- โœ… **Caching System** - Scrape once, rebuild instantly

### โœ… Quality Assurance
- โœ… **Fully Tested** - 391 tests with comprehensive coverage

---

## ๐Ÿ“ฆ Now Available on PyPI!

**Skill Seekers is now published on the Python Package Index!** Install with a single command:

```bash
pip install skill-seekers
```

Get started in seconds. No cloning, no setup - just install and run. See installation options below.

---

## Quick Start

### Option 1: Install from PyPI (Recommended)

```bash
# Install from PyPI (easiest method!)
pip install skill-seekers

# Use the unified CLI
skill-seekers scrape --config configs/react.json
skill-seekers github --repo facebook/react
skill-seekers enhance output/react/
skill-seekers package output/react/
```

**Time:** ~25 minutes | **Quality:** Production-ready | **Cost:** Free

๐Ÿ“– **New to Skill Seekers?** Check out our [Quick Start Guide](QUICKSTART.md) or [Bulletproof Guide](BULLETPROOF_QUICKSTART.md)

### Option 2: Install via uv (Modern Python Tool)

```bash
# Install with uv (fast, modern alternative)
uv tool install skill-seekers

# Or run directly without installing
uv tool run --from skill-seekers skill-seekers scrape --config https://raw.githubusercontent.com/yusufkaraaslan/Skill_Seekers/main/configs/react.json

# Unified CLI - simple commands
skill-seekers scrape --config configs/react.json
skill-seekers github --repo facebook/react
skill-seekers package output/react/
```

**Time:** ~25 minutes | **Quality:** Production-ready | **Cost:** Free

### Option 3: Development Install (From Source)

```bash
# Clone and install in editable mode
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers
pip install -e .

# Use the unified CLI
skill-seekers scrape --config configs/react.json
```

### Option 4: Use from Claude Code (MCP Integration)

```bash
# One-time setup (5 minutes)
./setup_mcp.sh

# Then in Claude Code, just ask:
"Generate a React skill from https://react.dev/"
"Scrape PDF at docs/manual.pdf and create skill"
```

**Time:** Automated | **Quality:** Production-ready | **Cost:** Free

### Option 5: Legacy CLI (Backwards Compatible)

```bash
# Install dependencies
pip3 install requests beautifulsoup4

# Run scripts directly (old method)
python3 src/skill_seekers/cli/doc_scraper.py --config configs/react.json

# Upload output/react.zip to Claude - Done!
```

**Time:** ~25 minutes | **Quality:** Production-ready | **Cost:** Free

## Usage Examples

### Documentation Scraping

```bash
# Scrape documentation website
skill-seekers scrape --config configs/react.json

# Quick scrape without config
skill-seekers scrape --url https://react.dev --name react

# With async mode (3x faster)
skill-seekers scrape --config configs/godot.json --async --workers 8
```

### PDF Extraction

```bash
# Basic PDF extraction
skill-seekers pdf --pdf docs/manual.pdf --name myskill

# Advanced features
skill-seekers pdf --pdf docs/manual.pdf --name myskill \
    --extract-tables \        # Extract tables
    --parallel \              # Fast parallel processing
    --workers 8               # Use 8 CPU cores

# Scanned PDFs (requires: pip install pytesseract Pillow)
skill-seekers pdf --pdf docs/scanned.pdf --name myskill --ocr

# Password-protected PDFs
skill-seekers pdf --pdf docs/encrypted.pdf --name myskill --password mypassword
```

**Time:** ~5-15 minutes (or 2-5 minutes with parallel) | **Quality:** Production-ready | **Cost:** Free

### GitHub Repository Scraping

```bash
# Basic repository scraping
skill-seekers github --repo facebook/react

# Using a config file
skill-seekers github --config configs/react_github.json

# With authentication (higher rate limits)
export GITHUB_TOKEN=ghp_your_token_here
skill-seekers github --repo facebook/react

# Customize what to include
skill-seekers github --repo django/django \
    --include-issues \        # Extract GitHub Issues
    --max-issues 100 \        # Limit issue count
    --include-changelog \     # Extract CHANGELOG.md
    --include-releases        # Extract GitHub Releases
```

**Time:** ~5-10 minutes | **Quality:** Production-ready | **Cost:** Free

### Unified Multi-Source Scraping (**NEW - v2.0.0**)

**The Problem:** Documentation and code often drift apart. Docs might be outdated, missing features that exist in code, or documenting features that were removed.

**The Solution:** Combine documentation + GitHub + PDF into one unified skill that shows BOTH what's documented AND what actually exists, with clear warnings about discrepancies.

```bash
# Use existing unified configs
skill-seekers unified --config configs/react_unified.json
skill-seekers unified --config configs/django_unified.json

# Or create unified config (mix documentation + GitHub)
cat > configs/myframework_unified.json << 'EOF'
{
  "name": "myframework",
  "description": "Complete framework knowledge from docs + code",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.myframework.com/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "owner/myframework",
      "include_code": true,
      "code_analysis_depth": "surface"
    }
  ]
}
EOF

# Run unified scraper
skill-seekers unified --config configs/myframework_unified.json

# Package and upload
skill-seekers package output/myframework/
# Upload output/myframework.zip to Claude - Done!
```

**Time:** ~30-45 minutes | **Quality:** Production-ready with conflict detection | **Cost:** Free

**What Makes It Special:**

โœ… **Conflict Detection** - Automatically finds 4 types of discrepancies:
- ๐Ÿ”ด **Missing in code** (high): Documented but not implemented
- ๐ŸŸก **Missing in docs** (medium): Implemented but not documented
- โš ๏ธ **Signature mismatch**: Different parameters/types
- โ„น๏ธ **Description mismatch**: Different explanations

โœ… **Transparent Reporting** - Shows both versions side-by-side:
```markdown
#### `move_local_x(delta: float)`

โš ๏ธ **Conflict**: Documentation signature differs from implementation

**Documentation says:**
```
def move_local_x(delta: float)
```

**Code implementation:**
```python
def move_local_x(delta: float, snap: bool = False) -> None
```
```

โœ… **Advantages:**
- **Identifies documentation gaps** - Find outdated or missing docs automatically
- **Catches code changes** - Know when APIs change without docs being updated
- **Single source of truth** - One skill showing intent (docs) AND reality (code)
- **Actionable insights** - Get suggestions for fixing each conflict
- **Development aid** - See what's actually in the codebase vs what's documented

**Example Unified Configs:**
- `configs/react_unified.json` - React docs + GitHub repo
- `configs/django_unified.json` - Django docs + GitHub repo
- `configs/fastapi_unified.json` - FastAPI docs + GitHub repo

**Full Guide:** See [docs/UNIFIED_SCRAPING.md](docs/UNIFIED_SCRAPING.md) for complete documentation.

## How It Works

```mermaid
graph LR
    A[Documentation Website] --> B[Skill Seeker]
    B --> C[Scraper]
    B --> D[AI Enhancement]
    B --> E[Packager]
    C --> F[Organized References]
    D --> F
    F --> E
    E --> G[Claude Skill .zip]
    G --> H[Upload to Claude AI]
```

0. **Detect llms.txt** - Checks for llms-full.txt, llms.txt, llms-small.txt first
1. **Scrape**: Extracts all pages from documentation
2. **Categorize**: Organizes content into topics (API, guides, tutorials, etc.)
3. **Enhance**: AI analyzes docs and creates comprehensive SKILL.md with examples
4. **Package**: Bundles everything into a Claude-ready `.zip` file

## ๐Ÿ“‹ Prerequisites

**Before you start, make sure you have:**

1. **Python 3.10 or higher** - [Download](https://www.python.org/downloads/) | Check: `python3 --version`
2. **Git** - [Download](https://git-scm.com/) | Check: `git --version`
3. **15-30 minutes** for first-time setup

**First time user?** โ†’ **[Start Here: Bulletproof Quick Start Guide](BULLETPROOF_QUICKSTART.md)** ๐ŸŽฏ

This guide walks you through EVERYTHING step-by-step (Python install, git clone, first skill creation).

---

## ๐Ÿš€ Quick Start

### Method 1: MCP Server for Claude Code (Easiest)

Use Skill Seeker directly from Claude Code with natural language!

```bash
# Clone repository
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers

# One-time setup (5 minutes)
./setup_mcp.sh

# Restart Claude Code, then just ask:
```

**In Claude Code:**
```
List all available configs
Generate config for Tailwind at https://tailwindcss.com/docs
Scrape docs using configs/react.json
Package skill at output/react/
```

**Benefits:**
- โœ… No manual CLI commands
- โœ… Natural language interface
- โœ… Integrated with your workflow
- โœ… 9 tools available instantly (includes automatic upload!)
- โœ… **Tested and working** in production

**Full guides:**
- ๐Ÿ“˜ [MCP Setup Guide](docs/MCP_SETUP.md) - Complete installation instructions
- ๐Ÿงช [MCP Testing Guide](docs/TEST_MCP_IN_CLAUDE_CODE.md) - Test all 9 tools
- ๐Ÿ“ฆ [Large Documentation Guide](docs/LARGE_DOCUMENTATION.md) - Handle 10K-40K+ pages
- ๐Ÿ“ค [Upload Guide](docs/UPLOAD_GUIDE.md) - How to upload skills to Claude

### Method 2: CLI (Traditional)

#### One-Time Setup: Create Virtual Environment

```bash
# Clone repository
git clone https://github.com/yusufkaraaslan/Skill_Seekers.git
cd Skill_Seekers

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # macOS/Linux
# OR on Windows: venv\Scripts\activate

# Install dependencies
pip install requests beautifulsoup4 pytest

# Save dependencies
pip freeze > requirements.txt

# Optional: Install anthropic for API-based enhancement (not needed for LOCAL enhancement)
# pip install anthropic
```

**Always activate the virtual environment before using Skill Seeker:**
```bash
source venv/bin/activate  # Run this each time you start a new terminal session
```

#### Easiest: Use a Preset

```bash
# Make sure venv is activated (you should see (venv) in your prompt)
source venv/bin/activate

# Optional: Estimate pages first (fast, 1-2 minutes)
skill-seekers estimate configs/godot.json

# Use Godot preset
skill-seekers scrape --config configs/godot.json

# Use React preset
skill-seekers scrape --config configs/react.json

# See all presets
ls configs/
```

### Interactive Mode

```bash
skill-seekers scrape --interactive
```

### Quick Mode

```bash
skill-seekers scrape \
  --name react \
  --url https://react.dev/ \
  --description "React framework for UIs"
```

## ๐Ÿ“ค Uploading Skills to Claude

Once your skill is packaged, you need to upload it to Claude:

### Option 1: Automatic Upload (API-based)

```bash
# Set your API key (one-time)
export ANTHROPIC_API_KEY=sk-ant-...

# Package and upload automatically
skill-seekers package output/react/ --upload

# OR upload existing .zip
skill-seekers upload output/react.zip
```

**Benefits:**
- โœ… Fully automatic
- โœ… No manual steps
- โœ… Works from command line

**Requirements:**
- Anthropic API key (get from https://console.anthropic.com/)

### Option 2: Manual Upload (No API Key)

```bash
# Package skill
skill-seekers package output/react/

# This will:
# 1. Create output/react.zip
# 2. Open the output/ folder automatically
# 3. Show upload instructions

# Then manually upload:
# - Go to https://claude.ai/skills
# - Click "Upload Skill"
# - Select output/react.zip
# - Done!
```

**Benefits:**
- โœ… No API key needed
- โœ… Works for everyone
- โœ… Folder opens automatically

### Option 3: Claude Code (MCP) - Smart & Automatic

```
In Claude Code, just ask:
"Package and upload the React skill"

# With API key set:
# - Packages the skill
# - Uploads to Claude automatically
# - Done! โœ…

# Without API key:
# - Packages the skill
# - Shows where to find the .zip
# - Provides manual upload instructions
```

**Benefits:**
- โœ… Natural language
- โœ… Smart auto-detection (uploads if API key available)
- โœ… Works with or without API key
- โœ… No errors or failures

---

## ๐Ÿ“ Simple Structure

```
doc-to-skill/
โ”œโ”€โ”€ cli/
โ”‚   โ”œโ”€โ”€ doc_scraper.py      # Main scraping tool
โ”‚   โ”œโ”€โ”€ package_skill.py    # Package to .zip
โ”‚   โ”œโ”€โ”€ upload_skill.py     # Auto-upload (API)
โ”‚   โ””โ”€โ”€ enhance_skill.py    # AI enhancement
โ”œโ”€โ”€ mcp/                    # MCP server for Claude Code
โ”‚   โ””โ”€โ”€ server.py           # 9 MCP tools
โ”œโ”€โ”€ configs/                # Preset configurations
โ”‚   โ”œโ”€โ”€ godot.json         # Godot Engine
โ”‚   โ”œโ”€โ”€ react.json         # React
โ”‚   โ”œโ”€โ”€ vue.json           # Vue.js
โ”‚   โ”œโ”€โ”€ django.json        # Django
โ”‚   โ””โ”€โ”€ fastapi.json       # FastAPI
โ””โ”€โ”€ output/                 # All output (auto-created)
    โ”œโ”€โ”€ godot_data/        # Scraped data
    โ”œโ”€โ”€ godot/             # Built skill
    โ””โ”€โ”€ godot.zip          # Packaged skill
```

## โœจ Features

### 1. Fast Page Estimation (NEW!)

```bash
skill-seekers estimate configs/react.json

# Output:
๐Ÿ“Š ESTIMATION RESULTS
โœ… Pages Discovered: 180
๐Ÿ“ˆ Estimated Total: 230
โฑ๏ธ  Time Elapsed: 1.2 minutes
๐Ÿ’ก Recommended max_pages: 280
```

**Benefits:**
- Know page count BEFORE scraping (saves time)
- Validates URL patterns work correctly
- Estimates total scraping time
- Recommends optimal `max_pages` setting
- Fast (1-2 minutes vs 20-40 minutes full scrape)

### 2. Auto-Detect Existing Data

```bash
skill-seekers scrape --config configs/godot.json

# If data exists:
โœ“ Found existing data: 245 pages
Use existing data? (y/n): y
โญ๏ธ  Skipping scrape, using existing data
```

### 3. Knowledge Generation

**Automatic pattern extraction:**
- Extracts common code patterns from docs
- Detects programming language
- Creates quick reference with real examples
- Smarter categorization with scoring

**Enhanced SKILL.md:**
- Real code examples from documentation
- Language-annotated code blocks
- Common patterns section
- Quick reference from actual usage examples

### 4. Smart Categorization

Automatically infers categories from:
- URL structure
- Page titles
- Content keywords
- With scoring for better accuracy

### 5. Code Language Detection

```python
# Automatically detects:
- Python (def, import, from)
- JavaScript (const, let, =>)
- GDScript (func, var, extends)
- C++ (#include, int main)
- And more...
```

### 5. Skip Scraping

```bash
# Scrape once
skill-seekers scrape --config configs/react.json

# Later, just rebuild (instant)
skill-seekers scrape --config configs/react.json --skip-scrape
```

### 6. Async Mode for Faster Scraping (2-3x Speed!)

```bash
# Enable async mode with 8 workers (recommended for large docs)
skill-seekers scrape --config configs/react.json --async --workers 8

# Small docs (~100-500 pages)
skill-seekers scrape --config configs/mydocs.json --async --workers 4

# Large docs (2000+ pages) with no rate limiting
skill-seekers scrape --config configs/largedocs.json --async --workers 8 --no-rate-limit
```

**Performance Comparison:**
- **Sync mode (threads):** ~18 pages/sec, 120 MB memory
- **Async mode:** ~55 pages/sec, 40 MB memory
- **Result:** 3x faster, 66% less memory!

**When to use:**
- โœ… Large documentation (500+ pages)
- โœ… Network latency is high
- โœ… Memory is constrained
- โŒ Small docs (< 100 pages) - overhead not worth it

**See full guide:** [ASYNC_SUPPORT.md](ASYNC_SUPPORT.md)

### 7. AI-Powered SKILL.md Enhancement

```bash
# Option 1: During scraping (API-based, requires API key)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
skill-seekers scrape --config configs/react.json --enhance

# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
skill-seekers scrape --config configs/react.json --enhance-local

# Option 3: After scraping (API-based, standalone)
skill-seekers enhance output/react/

# Option 4: After scraping (LOCAL, no API key, standalone)
skill-seekers enhance output/react/
```

**What it does:**
- Reads your reference documentation
- Uses Claude to generate an excellent SKILL.md
- Extracts best code examples (5-10 practical examples)
- Creates comprehensive quick reference
- Adds domain-specific key concepts
- Provides navigation guidance for different skill levels
- Automatically backs up original
- **Quality:** Transforms 75-line templates into 500+ line comprehensive guides

**LOCAL Enhancement (Recommended):**
- Uses your Claude Code Max plan (no API costs)
- Opens new terminal with Claude Code
- Analyzes reference files automatically
- Takes 30-60 seconds
- Quality: 9/10 (comparable to API version)

### 7. Large Documentation Support (10K-40K+ Pages)

**For massive documentation sites like Godot (40K pages), AWS, or Microsoft Docs:**

```bash
# 1. Estimate first (discover page count)
skill-seekers estimate configs/godot.json

# 2. Auto-split into focused sub-skills
python3 -m skill_seekers.cli.split_config configs/godot.json --strategy router

# Creates:
# - godot-scripting.json (5K pages)
# - godot-2d.json (8K pages)
# - godot-3d.json (10K pages)
# - godot-physics.json (6K pages)
# - godot-shaders.json (11K pages)

# 3. Scrape all in parallel (4-8 hours instead of 20-40!)
for config in configs/godot-*.json; do
  skill-seekers scrape --config $config &
done
wait

# 4. Generate intelligent router/hub skill
python3 -m skill_seekers.cli.generate_router configs/godot-*.json

# 5. Package all skills
python3 -m skill_seekers.cli.package_multi output/godot*/

# 6. Upload all .zip files to Claude
# Users just ask questions naturally!
# Router automatically directs to the right sub-skill!
```

**Split Strategies:**
- **auto** - Intelligently detects best strategy based on page count
- **category** - Split by documentation categories (scripting, 2d, 3d, etc.)
- **router** - Create hub skill + specialized sub-skills (RECOMMENDED)
- **size** - Split every N pages (for docs without clear categories)

**Benefits:**
- โœ… Faster scraping (parallel execution)
- โœ… More focused skills (better Claude performance)
- โœ… Easier maintenance (update one topic at a time)
- โœ… Natural user experience (router handles routing)
- โœ… Avoids context window limits

**Configuration:**
```json
{
  "name": "godot",
  "max_pages": 40000,
  "split_strategy": "router",
  "split_config": {
    "target_pages_per_skill": 5000,
    "create_router": true,
    "split_by_categories": ["scripting", "2d", "3d", "physics"]
  }
}
```

**Full Guide:** [Large Documentation Guide](docs/LARGE_DOCUMENTATION.md)

### 8. Checkpoint/Resume for Long Scrapes

**Never lose progress on long-running scrapes:**

```bash
# Enable in config
{
  "checkpoint": {
    "enabled": true,
    "interval": 1000  // Save every 1000 pages
  }
}

# If scrape is interrupted (Ctrl+C or crash)
skill-seekers scrape --config configs/godot.json --resume

# Resume from last checkpoint
โœ… Resuming from checkpoint (12,450 pages scraped)
โญ๏ธ  Skipping 12,450 already-scraped pages
๐Ÿ”„ Continuing from where we left off...

# Start fresh (clear checkpoint)
skill-seekers scrape --config configs/godot.json --fresh
```

**Benefits:**
- โœ… Auto-saves every 1000 pages (configurable)
- โœ… Saves on interruption (Ctrl+C)
- โœ… Resume with `--resume` flag
- โœ… Never lose hours of scraping progress

## ๐ŸŽฏ Complete Workflows

### First Time (With Scraping + Enhancement)

```bash
# 1. Scrape + Build + AI Enhancement (LOCAL, no API key)
skill-seekers scrape --config configs/godot.json --enhance-local

# 2. Wait for new terminal to close (enhancement completes)
# Check the enhanced SKILL.md:
cat output/godot/SKILL.md

# 3. Package
skill-seekers package output/godot/

# 4. Done! You have godot.zip with excellent SKILL.md
```

**Time:** 20-40 minutes (scraping) + 60 seconds (enhancement) = ~21-41 minutes

### Using Existing Data (Fast!)

```bash
# 1. Use cached data + Local Enhancement
skill-seekers scrape --config configs/godot.json --skip-scrape
skill-seekers enhance output/godot/

# 2. Package
skill-seekers package output/godot/

# 3. Done!
```

**Time:** 1-3 minutes (build) + 60 seconds (enhancement) = ~2-4 minutes total

### Without Enhancement (Basic)

```bash
# 1. Scrape + Build (no enhancement)
skill-seekers scrape --config configs/godot.json

# 2. Package
skill-seekers package output/godot/

# 3. Done! (SKILL.md will be basic template)
```

**Time:** 20-40 minutes
**Note:** SKILL.md will be generic - enhancement strongly recommended!

## ๐Ÿ“‹ Available Presets

| Config | Framework | Description |
|--------|-----------|-------------|
| `godot.json` | Godot Engine | Game development |
| `react.json` | React | UI framework |
| `vue.json` | Vue.js | Progressive framework |
| `django.json` | Django | Python web framework |
| `fastapi.json` | FastAPI | Modern Python API |
| `ansible-core.json` | Ansible Core 2.19 | Automation & configuration |

### Using Presets

```bash
# Godot
skill-seekers scrape --config configs/godot.json

# React
skill-seekers scrape --config configs/react.json

# Vue
skill-seekers scrape --config configs/vue.json

# Django
skill-seekers scrape --config configs/django.json

# FastAPI
skill-seekers scrape --config configs/fastapi.json

# Ansible
skill-seekers scrape --config configs/ansible-core.json
```

## ๐ŸŽจ Creating Your Own Config

### Option 1: Interactive

```bash
skill-seekers scrape --interactive
# Follow prompts, it will create the config for you
```

### Option 2: Copy and Edit

```bash
# Copy a preset
cp configs/react.json configs/myframework.json

# Edit it
nano configs/myframework.json

# Use it
skill-seekers scrape --config configs/myframework.json
```

### Config Structure

```json
{
  "name": "myframework",
  "description": "When to use this skill",
  "base_url": "https://docs.myframework.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs", "/guide"],
    "exclude": ["/blog", "/about"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart"],
    "api": ["api", "reference"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
```

## ๐Ÿ“Š What Gets Created

```
output/
โ”œโ”€โ”€ godot_data/              # Scraped raw data
โ”‚   โ”œโ”€โ”€ pages/              # JSON files (one per page)
โ”‚   โ””โ”€โ”€ summary.json        # Overview
โ”‚
โ””โ”€โ”€ godot/                   # The skill
    โ”œโ”€โ”€ SKILL.md            # Enhanced with real examples
    โ”œโ”€โ”€ references/         # Categorized docs
    โ”‚   โ”œโ”€โ”€ index.md
    โ”‚   โ”œโ”€โ”€ getting_started.md
    โ”‚   โ”œโ”€โ”€ scripting.md
    โ”‚   โ””โ”€โ”€ ...
    โ”œโ”€โ”€ scripts/            # Empty (add your own)
    โ””โ”€โ”€ assets/             # Empty (add your own)
```

## ๐ŸŽฏ Command Line Options

```bash
# Interactive mode
skill-seekers scrape --interactive

# Use config file
skill-seekers scrape --config configs/godot.json

# Quick mode
skill-seekers scrape --name react --url https://react.dev/

# Skip scraping (use existing data)
skill-seekers scrape --config configs/godot.json --skip-scrape

# With description
skill-seekers scrape \
  --name react \
  --url https://react.dev/ \
  --description "React framework for building UIs"
```

## ๐Ÿ’ก Tips

### 1. Test Small First

Edit `max_pages` in config to test:
```json
{
  "max_pages": 20  // Test with just 20 pages
}
```

### 2. Reuse Scraped Data

```bash
# Scrape once
skill-seekers scrape --config configs/react.json

# Rebuild multiple times (instant)
skill-seekers scrape --config configs/react.json --skip-scrape
skill-seekers scrape --config configs/react.json --skip-scrape
```

### 3. Finding Selectors

```python
# Test in Python
from bs4 import BeautifulSoup
import requests

url = "https://docs.example.com/page"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Try different selectors
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
```

### 4. Check Output Quality

```bash
# After building, check:
cat output/godot/SKILL.md  # Should have real examples
cat output/godot/references/index.md  # Categories
```

## ๐Ÿ› Troubleshooting

### No Content Extracted?
- Check your `main_content` selector
- Try: `article`, `main`, `div[role="main"]`

### Data Exists But Won't Use It?
```bash
# Force re-scrape
rm -rf output/myframework_data/
skill-seekers scrape --config configs/myframework.json
```

### Categories Not Good?
Edit the config `categories` section with better keywords.

### Want to Update Docs?
```bash
# Delete old data
rm -rf output/godot_data/

# Re-scrape
skill-seekers scrape --config configs/godot.json
```

## ๐Ÿ“ˆ Performance

| Task | Time | Notes |
|------|------|-------|
| Scraping (sync) | 15-45 min | First time only, thread-based |
| Scraping (async) | 5-15 min | 2-3x faster with --async flag |
| Building | 1-3 min | Fast! |
| Re-building | <1 min | With --skip-scrape |
| Packaging | 5-10 sec | Final zip |

## โœ… Summary

**One tool does everything:**
1. โœ… Scrapes documentation
2. โœ… Auto-detects existing data
3. โœ… Generates better knowledge
4. โœ… Creates enhanced skills
5. โœ… Works with presets or custom configs
6. โœ… Supports skip-scraping for fast iteration

**Simple structure:**
- `doc_scraper.py` - The tool
- `configs/` - Presets
- `output/` - Everything else

**Better output:**
- Real code examples with language detection
- Common patterns extracted from docs
- Smart categorization
- Enhanced SKILL.md with actual examples

## ๐Ÿ“š Documentation

### Getting Started
- **[BULLETPROOF_QUICKSTART.md](BULLETPROOF_QUICKSTART.md)** - ๐ŸŽฏ **START HERE** if you're new!
- **[QUICKSTART.md](QUICKSTART.md)** - Quick start for experienced users
- **[TROUBLESHOOTING.md](TROUBLESHOOTING.md)** - Common issues and solutions

### Guides
- **[docs/LARGE_DOCUMENTATION.md](docs/LARGE_DOCUMENTATION.md)** - Handle 10K-40K+ page docs
- **[ASYNC_SUPPORT.md](ASYNC_SUPPORT.md)** - Async mode guide (2-3x faster scraping)
- **[docs/ENHANCEMENT.md](docs/ENHANCEMENT.md)** - AI enhancement guide
- **[docs/TERMINAL_SELECTION.md](docs/TERMINAL_SELECTION.md)** - Configure terminal app for local enhancement
- **[docs/UPLOAD_GUIDE.md](docs/UPLOAD_GUIDE.md)** - How to upload skills to Claude
- **[docs/MCP_SETUP.md](docs/MCP_SETUP.md)** - MCP integration setup

### Technical
- **[docs/CLAUDE.md](docs/CLAUDE.md)** - Technical architecture
- **[STRUCTURE.md](STRUCTURE.md)** - Repository structure

## ๐ŸŽฎ Ready?

```bash
# Try Godot
skill-seekers scrape --config configs/godot.json

# Try React
skill-seekers scrape --config configs/react.json

# Or go interactive
skill-seekers scrape --interactive
```

## ๐Ÿ“ License

MIT License - see [LICENSE](LICENSE) file for details

---

Happy skill building! ๐Ÿš€