Firecrawl Scraper

Name: Firecrawl Scraper
Author: jackspace
Complete knowledge domain for Firecrawl v2 API - web scraping and crawling that converts websites into LLM-ready markdown or structured data. Use when: scraping websites, crawling entire sites, extracting web content, converting HTML to markdown, building web scrapers, handling dynamic JavaScript content, bypassing anti-bot protection, extracting structured data from web pages, or when encountering "content not loading", "JavaScript rendering issues", or "blocked by bot detection". Keywords...
2 stars
0 votes
0 copies
0 views
Added 12/19/2025
developmentjavascripttypescriptpythongojavabashnodegitapidocumentation
Works with

cliapi
Download Zip
Files
SKILL.md
---
name: firecrawl-scraper
description: |
  Complete knowledge domain for Firecrawl v2 API - web scraping and crawling that converts websites into LLM-ready markdown or structured data.

  Use when: scraping websites, crawling entire sites, extracting web content, converting HTML to markdown, building web scrapers, handling dynamic JavaScript content, bypassing anti-bot protection, extracting structured data from web pages, or when encountering "content not loading", "JavaScript rendering issues", or "blocked by bot detection".

  Keywords: firecrawl, firecrawl api, web scraping, web crawler, scrape website, crawl website, extract content, html to markdown, site crawler, content extraction, web automation, firecrawl-py, firecrawl-js, llm ready data, structured data extraction, bot bypass, javascript rendering, scraping api, crawling api, map urls, batch scraping
license: MIT
---

# Firecrawl Web Scraper Skill

**Status**: Production Ready ✅
**Last Updated**: 2025-10-24
**Official Docs**: https://docs.firecrawl.dev
**API Version**: v2

---

## What is Firecrawl?

Firecrawl is a **Web Data API for AI** that turns entire websites into LLM-ready markdown or structured data. It handles:

- **JavaScript rendering** - Executes client-side JavaScript to capture dynamic content
- **Anti-bot bypass** - Gets past CAPTCHA and bot detection systems
- **Format conversion** - Outputs as markdown, JSON, or structured data
- **Screenshot capture** - Saves visual representations of pages
- **Browser automation** - Full headless browser capabilities

---

## API Endpoints

### 1. `/v2/scrape` - Single Page Scraping
Scrapes a single webpage and returns clean, structured content.

**Use Cases**:
- Extract article content
- Get product details
- Scrape specific pages
- Convert HTML to markdown

**Key Options**:
- `formats`: ["markdown", "html", "screenshot"]
- `onlyMainContent`: true/false (removes nav, footer, ads)
- `waitFor`: milliseconds to wait before scraping
- `actions`: browser automation actions (click, scroll, etc.)

### 2. `/v2/crawl` - Full Site Crawling
Crawls all accessible pages from a starting URL.

**Use Cases**:
- Index entire documentation sites
- Archive website content
- Build knowledge bases
- Scrape multi-page content

**Key Options**:
- `limit`: max pages to crawl
- `maxDepth`: how many links deep to follow
- `allowedDomains`: restrict to specific domains
- `excludePaths`: skip certain URL patterns

### 3. `/v2/map` - URL Discovery
Maps all URLs on a website without scraping content.

**Use Cases**:
- Find sitemap
- Discover all pages
- Plan crawling strategy
- Audit website structure

### 4. `/v2/extract` - Structured Data Extraction
Uses AI to extract specific data fields from pages.

**Use Cases**:
- Extract product prices and names
- Parse contact information
- Build structured datasets
- Custom data schemas

**Key Options**:
- `schema`: Zod or JSON schema defining desired structure
- `systemPrompt`: guide AI extraction behavior

---

## Authentication

Firecrawl requires an API key for all requests.

### Get API Key
1. Sign up at https://www.firecrawl.dev
2. Go to dashboard → API Keys
3. Copy your API key (starts with `fc-`)

### Store Securely
**NEVER hardcode API keys in code!**

```bash
# .env file
FIRECRAWL_API_KEY=fc-your-api-key-here
```

```bash
# .env.local (for local development)
FIRECRAWL_API_KEY=fc-your-api-key-here
```

---

## Python SDK Usage

### Installation

```bash
pip install firecrawl-py
```

**Latest Version**: `firecrawl-py v4.5.0+`

### Basic Scrape

```python
import os
from firecrawl import FirecrawlApp

# Initialize client
app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Scrape a single page
result = app.scrape_url(
    url="https://example.com/article",
    params={
        "formats": ["markdown", "html"],
        "onlyMainContent": True
    }
)

# Access markdown content
markdown = result.get("markdown")
print(markdown)
```

### Crawl Multiple Pages

```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Start crawl
crawl_result = app.crawl_url(
    url="https://docs.example.com",
    params={
        "limit": 100,
        "scrapeOptions": {
            "formats": ["markdown"]
        }
    },
    poll_interval=5  # Check status every 5 seconds
)

# Process results
for page in crawl_result.get("data", []):
    url = page.get("url")
    markdown = page.get("markdown")
    print(f"Scraped: {url}")
```

### Extract Structured Data

```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Define schema
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "product_price": {"type": "number"},
        "availability": {"type": "string"}
    },
    "required": ["company_name", "product_price"]
}

# Extract data
result = app.extract(
    urls=["https://example.com/product"],
    params={
        "schema": schema,
        "systemPrompt": "Extract product information from the page"
    }
)

print(result)
```

---

## TypeScript/Node.js SDK Usage

### Installation

```bash
npm install @mendable/firecrawl-js
# or
pnpm add @mendable/firecrawl-js
# or use the unscoped package:
npm install firecrawl
```

**Latest Version**: `@mendable/firecrawl-js v4.4.1+` (or `firecrawl v4.4.1+`)

### Basic Scrape

```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize client
const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

// Scrape a single page
const result = await app.scrapeUrl('https://example.com/article', {
  formats: ['markdown', 'html'],
  onlyMainContent: true
});

// Access markdown content
const markdown = result.markdown;
console.log(markdown);
```

### Crawl Multiple Pages

```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

// Start crawl
const crawlResult = await app.crawlUrl('https://docs.example.com', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown']
  }
});

// Process results
for (const page of crawlResult.data) {
  console.log(`Scraped: ${page.url}`);
  console.log(page.markdown);
}
```

### Extract Structured Data with Zod

```typescript
import FirecrawlApp from '@mendable/firecrawl-js';
import { z } from 'zod';

const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

// Define schema with Zod
const schema = z.object({
  company_name: z.string(),
  product_price: z.number(),
  availability: z.string()
});

// Extract data
const result = await app.extract({
  urls: ['https://example.com/product'],
  schema: schema,
  systemPrompt: 'Extract product information from the page'
});

console.log(result);
```

---

## Common Use Cases

### 1. Documentation Scraping

**Scenario**: Convert entire documentation site to markdown for RAG/chatbot

```python
app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

docs = app.crawl_url(
    url="https://docs.myapi.com",
    params={
        "limit": 500,
        "scrapeOptions": {
            "formats": ["markdown"],
            "onlyMainContent": True
        },
        "allowedDomains": ["docs.myapi.com"]
    }
)

# Save to files
for page in docs.get("data", []):
    filename = page["url"].replace("https://", "").replace("/", "_") + ".md"
    with open(f"docs/{filename}", "w") as f:
        f.write(page["markdown"])
```

### 2. Product Data Extraction

**Scenario**: Extract structured product data for e-commerce

```typescript
const schema = z.object({
  title: z.string(),
  price: z.number(),
  description: z.string(),
  images: z.array(z.string()),
  in_stock: z.boolean()
});

const products = await app.extract({
  urls: productUrls,
  schema: schema,
  systemPrompt: 'Extract all product details including price and availability'
});
```

### 3. News Article Scraping

**Scenario**: Extract clean article content without ads/navigation

```python
article = app.scrape_url(
    url="https://news.com/article",
    params={
        "formats": ["markdown"],
        "onlyMainContent": True,
        "removeBase64Images": True
    }
)

# Get clean markdown
content = article.get("markdown")
```

---

## Error Handling

### Python

```python
from firecrawl import FirecrawlApp
from firecrawl.exceptions import FirecrawlException

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

try:
    result = app.scrape_url("https://example.com")
except FirecrawlException as e:
    print(f"Firecrawl error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

### TypeScript

```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

try {
  const result = await app.scrapeUrl('https://example.com');
} catch (error) {
  if (error.response) {
    // API error
    console.error('API Error:', error.response.data);
  } else {
    // Network or other error
    console.error('Error:', error.message);
  }
}
```

---

## Rate Limits & Best Practices

### Rate Limits
- **Free tier**: 500 credits/month
- **Paid tiers**: Higher limits based on plan
- Credits consumed vary by endpoint and options

### Best Practices

1. **Use `onlyMainContent: true`** to reduce credits and get cleaner data
2. **Set reasonable limits** on crawls to avoid excessive costs
3. **Handle retries** with exponential backoff for transient errors
4. **Cache results** locally to avoid re-scraping same content
5. **Use `map` endpoint first** to plan crawling strategy
6. **Batch extract calls** when processing multiple URLs
7. **Monitor credit usage** in dashboard

---

## Cloudflare Workers Integration

### ⚠️ Important: SDK Compatibility

**The Firecrawl SDK cannot run in Cloudflare Workers** due to Node.js dependencies (specifically `axios` which uses Node.js `http` module). Workers require Web Standard APIs.

**✅ Use the direct REST API with `fetch` instead** (see example below).

**Alternative**: Self-host with [workers-firecrawl](https://github.com/G4brym/workers-firecrawl) - a Workers-native implementation (requires Workers Paid Plan, only implements `/search` endpoint).

---

### Workers Example: Direct REST API

This example uses the `fetch` API to call Firecrawl directly - works perfectly in Cloudflare Workers:

```typescript
interface Env {
  FIRECRAWL_API_KEY: string;
  SCRAPED_CACHE?: KVNamespace; // Optional: for caching results
}

interface FirecrawlScrapeResponse {
  success: boolean;
  data: {
    markdown?: string;
    html?: string;
    metadata: {
      title?: string;
      description?: string;
      language?: string;
      sourceURL: string;
    };
  };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== 'POST') {
      return Response.json({ error: 'Method not allowed' }, { status: 405 });
    }

    try {
      const { url } = await request.json<{ url: string }>();

      if (!url) {
        return Response.json({ error: 'URL is required' }, { status: 400 });
      }

      // Check cache (optional)
      if (env.SCRAPED_CACHE) {
        const cached = await env.SCRAPED_CACHE.get(url, 'json');
        if (cached) {
          return Response.json({ cached: true, data: cached });
        }
      }

      // Call Firecrawl API directly using fetch
      const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          url: url,
          formats: ['markdown'],
          onlyMainContent: true,
          removeBase64Images: true
        })
      });

      if (!response.ok) {
        const errorText = await response.text();
        throw new Error(`Firecrawl API error (${response.status}): ${errorText}`);
      }

      const result = await response.json<FirecrawlScrapeResponse>();

      // Cache for 1 hour (optional)
      if (env.SCRAPED_CACHE && result.success) {
        await env.SCRAPED_CACHE.put(
          url,
          JSON.stringify(result.data),
          { expirationTtl: 3600 }
        );
      }

      return Response.json({
        cached: false,
        data: result.data
      });

    } catch (error) {
      console.error('Scraping error:', error);
      return Response.json(
        { error: error instanceof Error ? error.message : 'Unknown error' },
        { status: 500 }
      );
    }
  }
};
```

**Environment Setup**: Add `FIRECRAWL_API_KEY` in Wrangler secrets:

```bash
npx wrangler secret put FIRECRAWL_API_KEY
```

**Optional KV Binding** (for caching - add to `wrangler.jsonc`):

```jsonc
{
  "kv_namespaces": [
    {
      "binding": "SCRAPED_CACHE",
      "id": "your-kv-namespace-id"
    }
  ]
}
```

See `templates/firecrawl-worker-fetch.ts` for a complete production-ready example.

---

## When to Use This Skill

✅ **Use Firecrawl when:**
- Scraping modern websites with JavaScript
- Need clean markdown output for LLMs
- Building RAG systems from web content
- Extracting structured data at scale
- Dealing with bot protection
- Need reliable, production-ready scraping

❌ **Don't use Firecrawl when:**
- Scraping simple static HTML (use cheerio/beautifulsoup)
- Have existing Puppeteer/Playwright setup working well
- Working with APIs (use direct API calls instead)
- Budget constraints (free tier has limits)

---

## Common Issues & Solutions

### Issue: "Invalid API Key"
**Cause**: API key not set or incorrect
**Fix**:
```bash
# Check env variable is set
echo $FIRECRAWL_API_KEY

# Verify key format (should start with fc-)
```

### Issue: "Rate limit exceeded"
**Cause**: Exceeded monthly credits
**Fix**:
- Check usage in dashboard
- Upgrade plan or wait for reset
- Use `onlyMainContent: true` to reduce credits

### Issue: "Timeout error"
**Cause**: Page takes too long to load
**Fix**:
```python
result = app.scrape_url(url, params={"waitFor": 10000})  # Wait 10s
```

### Issue: "Content is empty"
**Cause**: Content loaded via JavaScript after initial render
**Fix**:
```python
result = app.scrape_url(url, params={
    "waitFor": 5000,
    "actions": [{"type": "wait", "milliseconds": 3000}]
})
```

---

## Advanced Features

### Browser Actions

Perform interactions before scraping:

```python
result = app.scrape_url(
    url="https://example.com",
    params={
        "actions": [
            {"type": "click", "selector": "button.load-more"},
            {"type": "wait", "milliseconds": 2000},
            {"type": "scroll", "direction": "down"}
        ]
    }
)
```

### Custom Headers

```python
result = app.scrape_url(
    url="https://example.com",
    params={
        "headers": {
            "User-Agent": "Custom Bot 1.0",
            "Accept-Language": "en-US"
        }
    }
)
```

### Webhooks for Long Crawls

Instead of polling, receive results via webhook:

```python
crawl = app.crawl_url(
    url="https://docs.example.com",
    params={
        "limit": 1000,
        "webhook": "https://your-domain.com/webhook"
    }
)
```

---

## Package Versions

| Package | Version | Last Checked |
|---------|---------|--------------|
| firecrawl-py | 4.5.0+ | 2025-10-20 |
| @mendable/firecrawl-js (or firecrawl) | 4.4.1+ | 2025-10-24 |
| API Version | v2 | Current |

**Note**: The Node.js SDK requires Node.js >=22.0.0 and cannot run in Cloudflare Workers. Use direct REST API calls in Workers (see Cloudflare Workers Integration section).

---

## Official Documentation

- **Docs**: https://docs.firecrawl.dev
- **Python SDK**: https://docs.firecrawl.dev/sdks/python
- **Node.js SDK**: https://docs.firecrawl.dev/sdks/node
- **API Reference**: https://docs.firecrawl.dev/api-reference
- **GitHub**: https://github.com/mendableai/firecrawl
- **Dashboard**: https://www.firecrawl.dev/app

---

## Next Steps After Using This Skill

1. **Store scraped data**: Use Cloudflare D1, R2, or KV to persist results
2. **Build RAG system**: Combine with Vectorize for semantic search
3. **Add scheduling**: Use Cloudflare Queues for recurring scrapes
4. **Process content**: Use Workers AI to analyze scraped data

---

**Token Savings**: ~60% vs manual integration
**Error Prevention**: API authentication, rate limiting, format handling
**Production Ready**: ✅
Attribution

jackspace
View source
Comments (0)

No comments yet. Be the first to comment!