---
name: firecrawl-scraper
description: |
  Complete knowledge domain for Firecrawl v2 API - web scraping and crawling that converts websites into LLM-ready markdown or structured data.
  Use when: scraping websites, crawling entire sites, extracting web content, converting HTML to markdown, building web scrapers, handling dynamic JavaScript content, bypassing anti-bot protection, extracting structured data from web pages, or when encountering "content not loading", "JavaScript rendering issues", or "blocked by bot detection".
  Keywords: firecrawl, firecrawl api, web scraping, web crawler, scrape website, crawl website, extract content, html to markdown, site crawler, content extraction, web automation, firecrawl-py, firecrawl-js, llm ready data, structured data extraction, bot bypass, javascript rendering, scraping api, crawling api, map urls, batch scraping
license: MIT
---
# Firecrawl Web Scraper Skill
**Status**: Production Ready ✅
**Last Updated**: 2025-10-24
**Official Docs**: https://docs.firecrawl.dev
**API Version**: v2
---
## What is Firecrawl?
Firecrawl is a **Web Data API for AI** that turns entire websites into LLM-ready markdown or structured data. It handles:
- **JavaScript rendering** - Executes client-side JavaScript to capture dynamic content
- **Anti-bot bypass** - Gets past CAPTCHA and bot detection systems
- **Format conversion** - Outputs as markdown, JSON, or structured data
- **Screenshot capture** - Saves visual representations of pages
- **Browser automation** - Full headless browser capabilities
---
## API Endpoints
### 1. `/v2/scrape` - Single Page Scraping
Scrapes a single webpage and returns clean, structured content.
**Use Cases**:
- Extract article content
- Get product details
- Scrape specific pages
- Convert HTML to markdown
**Key Options**:
- `formats`: ["markdown", "html", "screenshot"]
- `onlyMainContent`: true/false (removes nav, footer, ads)
- `waitFor`: milliseconds to wait before scraping
- `actions`: browser automation actions (click, scroll, etc.)
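For quick tests without an SDK, the endpoint can be called directly. A minimal Python sketch using `requests` (the endpoint, headers, and payload mirror the Workers example later in this document; the option names shown are not exhaustive):
```python
import os
import requests

# Direct v2 scrape call (sketch): same endpoint and payload shape as the
# Cloudflare Workers example later in this document.
resp = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={
        "Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://example.com/article",
        "formats": ["markdown"],
        "onlyMainContent": True,
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json().get("data", {})
print(data.get("markdown", "")[:500])  # preview the first 500 characters
```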
### 2. `/v2/crawl` - Full Site Crawling
Crawls all accessible pages from a starting URL.
**Use Cases**:
- Index entire documentation sites
- Archive website content
- Build knowledge bases
- Scrape multi-page content
**Key Options**:
- `limit`: max pages to crawl
- `maxDepth`: how many links deep to follow
- `allowedDomains`: restrict to specific domains
- `excludePaths`: skip certain URL patterns
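Crawls run as asynchronous jobs: you start a job, then poll its status. A hedged REST sketch of that flow (the `id`, `status`, and `data` fields follow Firecrawl's documented async-job pattern, but verify the exact response shape against the API reference):
```python
import os
import time
import requests

API = "https://api.firecrawl.dev/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# Start the crawl job; option names mirror the list above.
start = requests.post(
    f"{API}/crawl",
    headers=HEADERS,
    json={
        "url": "https://docs.example.com",
        "limit": 50,
        "scrapeOptions": {"formats": ["markdown"]},
    },
    timeout=60,
)
start.raise_for_status()
job_id = start.json()["id"]  # assumed job-id field in the async job response

# Poll until the job reaches a terminal state.
while True:
    status = requests.get(f"{API}/crawl/{job_id}", headers=HEADERS, timeout=60).json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

for page in status.get("data", []):
    print(page.get("url"))
```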
### 3. `/v2/map` - URL Discovery
Maps all URLs on a website without scraping content.
**Use Cases**:
- Find sitemap
- Discover all pages
- Plan crawling strategy
- Audit website structure
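A one-call sketch over REST (the `links` array in the response is an assumption carried over from the v1 map response shape; check the API reference):
```python
import os
import requests

# Map a site's URLs without scraping page content (sketch).
resp = requests.post(
    "https://api.firecrawl.dev/v2/map",
    headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
    json={"url": "https://docs.example.com"},
    timeout=60,
)
resp.raise_for_status()
for link in resp.json().get("links", []):  # "links" field is an assumption
    print(link)
```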
### 4. `/v2/extract` - Structured Data Extraction
Uses AI to extract specific data fields from pages.
**Use Cases**:
- Extract product prices and names
- Parse contact information
- Build structured datasets
- Custom data schemas
**Key Options**:
- `schema`: Zod or JSON schema defining desired structure
- `systemPrompt`: guide AI extraction behavior
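A hedged REST sketch of an extract request (field names mirror the SDK examples below; depending on the API version, extraction may run as an asynchronous job that returns an id to poll rather than inline results):
```python
import os
import requests

# JSON schema describing the fields to extract.
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "product_price": {"type": "number"},
    },
    "required": ["company_name"],
}

# Submit the extraction request (sketch; may return a job id to poll
# rather than inline results, depending on the API version).
resp = requests.post(
    "https://api.firecrawl.dev/v2/extract",
    headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
    json={"urls": ["https://example.com/product"], "schema": schema},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```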
---
## Authentication
Firecrawl requires an API key for all requests.
### Get API Key
1. Sign up at https://www.firecrawl.dev
2. Go to dashboard → API Keys
3. Copy your API key (starts with `fc-`)
### Store Securely
**NEVER hardcode API keys in code!**
```bash
# .env file
FIRECRAWL_API_KEY=fc-your-api-key-here
```
```bash
# .env.local (for local development)
FIRECRAWL_API_KEY=fc-your-api-key-here
```
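At runtime, load the key from the environment rather than from source. A minimal sketch using `python-dotenv` (an assumed optional dependency):
```python
import os

from dotenv import load_dotenv  # pip install python-dotenv (assumed dependency)

load_dotenv()  # reads .env from the current working directory

api_key = os.environ.get("FIRECRAWL_API_KEY")
if not api_key or not api_key.startswith("fc-"):
    raise RuntimeError("FIRECRAWL_API_KEY is missing or malformed")
```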
---
## Python SDK Usage
### Installation
```bash
pip install firecrawl-py
```
**Latest Version**: `firecrawl-py v4.5.0+`
### Basic Scrape
```python
import os
from firecrawl import FirecrawlApp

# Initialize client
app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Scrape a single page
result = app.scrape_url(
    url="https://example.com/article",
    params={
        "formats": ["markdown", "html"],
        "onlyMainContent": True
    }
)

# Access markdown content
markdown = result.get("markdown")
print(markdown)
```
### Crawl Multiple Pages
```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Start crawl
crawl_result = app.crawl_url(
    url="https://docs.example.com",
    params={
        "limit": 100,
        "scrapeOptions": {
            "formats": ["markdown"]
        }
    },
    poll_interval=5  # Check status every 5 seconds
)

# Process results
for page in crawl_result.get("data", []):
    url = page.get("url")
    markdown = page.get("markdown")
    print(f"Scraped: {url}")
```
### Extract Structured Data
```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Define schema
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "product_price": {"type": "number"},
        "availability": {"type": "string"}
    },
    "required": ["company_name", "product_price"]
}

# Extract data
result = app.extract(
    urls=["https://example.com/product"],
    params={
        "schema": schema,
        "systemPrompt": "Extract product information from the page"
    }
)
print(result)
```
---
## TypeScript/Node.js SDK Usage
### Installation
```bash
npm install @mendable/firecrawl-js
# or
pnpm add @mendable/firecrawl-js
# or use the unscoped package:
npm install firecrawl
```
**Latest Version**: `@mendable/firecrawl-js v4.4.1+` (or `firecrawl v4.4.1+`)
### Basic Scrape
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

// Initialize client
const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

// Scrape a single page
const result = await app.scrapeUrl('https://example.com/article', {
  formats: ['markdown', 'html'],
  onlyMainContent: true
});

// Access markdown content
const markdown = result.markdown;
console.log(markdown);
```
### Crawl Multiple Pages
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

// Start crawl
const crawlResult = await app.crawlUrl('https://docs.example.com', {
  limit: 100,
  scrapeOptions: {
    formats: ['markdown']
  }
});

// Process results
for (const page of crawlResult.data) {
  console.log(`Scraped: ${page.url}`);
  console.log(page.markdown);
}
```
### Extract Structured Data with Zod
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';
import { z } from 'zod';

const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

// Define schema with Zod
const schema = z.object({
  company_name: z.string(),
  product_price: z.number(),
  availability: z.string()
});

// Extract data
const result = await app.extract({
  urls: ['https://example.com/product'],
  schema: schema,
  systemPrompt: 'Extract product information from the page'
});
console.log(result);
```
---
## Common Use Cases
### 1. Documentation Scraping
**Scenario**: Convert entire documentation site to markdown for RAG/chatbot
```python
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

docs = app.crawl_url(
    url="https://docs.myapi.com",
    params={
        "limit": 500,
        "scrapeOptions": {
            "formats": ["markdown"],
            "onlyMainContent": True
        },
        "allowedDomains": ["docs.myapi.com"]
    }
)

# Save to files
os.makedirs("docs", exist_ok=True)  # avoid FileNotFoundError on first run
for page in docs.get("data", []):
    filename = page["url"].replace("https://", "").replace("/", "_") + ".md"
    with open(f"docs/{filename}", "w") as f:
        f.write(page["markdown"])
```
### 2. Product Data Extraction
**Scenario**: Extract structured product data for e-commerce
```typescript
const schema = z.object({
  title: z.string(),
  price: z.number(),
  description: z.string(),
  images: z.array(z.string()),
  in_stock: z.boolean()
});

const products = await app.extract({
  urls: productUrls,
  schema: schema,
  systemPrompt: 'Extract all product details including price and availability'
});
```
### 3. News Article Scraping
**Scenario**: Extract clean article content without ads/navigation
```python
article = app.scrape_url(
    url="https://news.com/article",
    params={
        "formats": ["markdown"],
        "onlyMainContent": True,
        "removeBase64Images": True
    }
)

# Get clean markdown
content = article.get("markdown")
```
---
## Error Handling
### Python
```python
import os
from firecrawl import FirecrawlApp
from firecrawl.exceptions import FirecrawlException

app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

try:
    result = app.scrape_url("https://example.com")
except FirecrawlException as e:
    print(f"Firecrawl error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
### TypeScript
```typescript
import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY
});

try {
  const result = await app.scrapeUrl('https://example.com');
} catch (error: any) {
  if (error.response) {
    // API error
    console.error('API Error:', error.response.data);
  } else {
    // Network or other error
    console.error('Error:', error.message);
  }
}
```
---
## Rate Limits & Best Practices
### Rate Limits
- **Free tier**: 500 credits/month
- **Paid tiers**: Higher limits based on plan
- Credits consumed vary by endpoint and options
### Best Practices
1. **Use `onlyMainContent: true`** to reduce credits and get cleaner data
2. **Set reasonable limits** on crawls to avoid excessive costs
3. **Handle retries** with exponential backoff for transient errors (see the sketch after this list)
4. **Cache results** locally to avoid re-scraping same content
5. **Use `map` endpoint first** to plan crawling strategy
6. **Batch extract calls** when processing multiple URLs
7. **Monitor credit usage** in dashboard
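For item 3, a minimal retry sketch with exponential backoff around a direct REST call (the retryable status codes and delays here are illustrative choices, not official Firecrawl guidance):
```python
import os
import time
import requests

def scrape_with_retries(url: str, max_attempts: int = 4) -> dict:
    """Retry transient failures (429/5xx) with exponential backoff."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        resp = requests.post(
            "https://api.firecrawl.dev/v2/scrape",
            headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
            json={"url": url, "formats": ["markdown"], "onlyMainContent": True},
            timeout=60,
        )
        if resp.status_code not in (429, 500, 502, 503, 504):
            resp.raise_for_status()  # surface non-retryable 4xx errors
            return resp.json()
        if attempt == max_attempts:
            resp.raise_for_status()  # out of attempts: raise the last error
        time.sleep(delay)
        delay *= 2  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```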
---
## Cloudflare Workers Integration
### ⚠️ Important: SDK Compatibility
**The Firecrawl SDK cannot run in Cloudflare Workers** due to Node.js dependencies (specifically `axios`, which uses the Node.js `http` module); Workers require Web Standard APIs.
**✅ Use the direct REST API with `fetch` instead** (see example below).
**Alternative**: Self-host with [workers-firecrawl](https://github.com/G4brym/workers-firecrawl), a Workers-native implementation (requires the Workers Paid Plan and implements only the `/search` endpoint).
---
### Workers Example: Direct REST API
This example calls Firecrawl directly with the `fetch` API, which runs natively in Cloudflare Workers:
```typescript
interface Env {
  FIRECRAWL_API_KEY: string;
  SCRAPED_CACHE?: KVNamespace; // Optional: for caching results
}

interface FirecrawlScrapeResponse {
  success: boolean;
  data: {
    markdown?: string;
    html?: string;
    metadata: {
      title?: string;
      description?: string;
      language?: string;
      sourceURL: string;
    };
  };
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== 'POST') {
      return Response.json({ error: 'Method not allowed' }, { status: 405 });
    }

    try {
      const { url } = await request.json<{ url: string }>();
      if (!url) {
        return Response.json({ error: 'URL is required' }, { status: 400 });
      }

      // Check cache (optional)
      if (env.SCRAPED_CACHE) {
        const cached = await env.SCRAPED_CACHE.get(url, 'json');
        if (cached) {
          return Response.json({ cached: true, data: cached });
        }
      }

      // Call Firecrawl API directly using fetch
      const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          url: url,
          formats: ['markdown'],
          onlyMainContent: true,
          removeBase64Images: true
        })
      });

      if (!response.ok) {
        const errorText = await response.text();
        throw new Error(`Firecrawl API error (${response.status}): ${errorText}`);
      }

      const result = await response.json<FirecrawlScrapeResponse>();

      // Cache for 1 hour (optional)
      if (env.SCRAPED_CACHE && result.success) {
        await env.SCRAPED_CACHE.put(
          url,
          JSON.stringify(result.data),
          { expirationTtl: 3600 }
        );
      }

      return Response.json({
        cached: false,
        data: result.data
      });
    } catch (error) {
      console.error('Scraping error:', error);
      return Response.json(
        { error: error instanceof Error ? error.message : 'Unknown error' },
        { status: 500 }
      );
    }
  }
};
```
**Environment Setup**: Add `FIRECRAWL_API_KEY` in Wrangler secrets:
```bash
npx wrangler secret put FIRECRAWL_API_KEY
```
**Optional KV Binding** (for caching - add to `wrangler.jsonc`):
```jsonc
{
  "kv_namespaces": [
    {
      "binding": "SCRAPED_CACHE",
      "id": "your-kv-namespace-id"
    }
  ]
}
```
See `templates/firecrawl-worker-fetch.ts` for a complete production-ready example.
---
## When to Use This Skill
✅ **Use Firecrawl when:**
- Scraping modern websites with JavaScript
- Need clean markdown output for LLMs
- Building RAG systems from web content
- Extracting structured data at scale
- Dealing with bot protection
- Need reliable, production-ready scraping
❌ **Don't use Firecrawl when:**
- Scraping simple static HTML (use cheerio/beautifulsoup)
- Have existing Puppeteer/Playwright setup working well
- Working with APIs (use direct API calls instead)
- Budget constraints (free tier has limits)
---
## Common Issues & Solutions
### Issue: "Invalid API Key"
**Cause**: API key not set or incorrect
**Fix**:
```bash
# Check env variable is set
echo $FIRECRAWL_API_KEY
# Verify key format (should start with fc-)
```
### Issue: "Rate limit exceeded"
**Cause**: Exceeded monthly credits
**Fix**:
- Check usage in dashboard
- Upgrade plan or wait for reset
- Use `onlyMainContent: true` to reduce credits
### Issue: "Timeout error"
**Cause**: Page takes too long to load
**Fix**:
```python
result = app.scrape_url(url, params={"waitFor": 10000}) # Wait 10s
```
### Issue: "Content is empty"
**Cause**: Content loaded via JavaScript after initial render
**Fix**:
```python
result = app.scrape_url(url, params={
    "waitFor": 5000,
    "actions": [{"type": "wait", "milliseconds": 3000}]
})
```
---
## Advanced Features
### Browser Actions
Perform interactions before scraping:
```python
result = app.scrape_url(
    url="https://example.com",
    params={
        "actions": [
            {"type": "click", "selector": "button.load-more"},
            {"type": "wait", "milliseconds": 2000},
            {"type": "scroll", "direction": "down"}
        ]
    }
)
```
### Custom Headers
```python
result = app.scrape_url(
    url="https://example.com",
    params={
        "headers": {
            "User-Agent": "Custom Bot 1.0",
            "Accept-Language": "en-US"
        }
    }
)
```
### Webhooks for Long Crawls
Instead of polling, receive results via webhook:
```python
crawl = app.crawl_url(
    url="https://docs.example.com",
    params={
        "limit": 1000,
        "webhook": "https://your-domain.com/webhook"
    }
)
```
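On the receiving end, a hedged FastAPI sketch of a webhook handler (the payload field names are assumptions; inspect a real delivery or the API reference for the exact schema):
```python
from fastapi import FastAPI, Request  # pip install fastapi uvicorn (assumed)

app = FastAPI()

@app.post("/webhook")
async def firecrawl_webhook(request: Request):
    payload = await request.json()
    # Field names below are assumptions; inspect a real delivery first.
    print("Firecrawl event:", payload.get("type"))
    return {"ok": True}
```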
---
## Package Versions
| Package | Version | Last Checked |
|---------|---------|--------------|
| firecrawl-py | 4.5.0+ | 2025-10-20 |
| @mendable/firecrawl-js (or firecrawl) | 4.4.1+ | 2025-10-24 |
| API Version | v2 | Current |
**Note**: The Node.js SDK requires Node.js >=22.0.0 and cannot run in Cloudflare Workers. Use direct REST API calls in Workers (see Cloudflare Workers Integration section).
---
## Official Documentation
- **Docs**: https://docs.firecrawl.dev
- **Python SDK**: https://docs.firecrawl.dev/sdks/python
- **Node.js SDK**: https://docs.firecrawl.dev/sdks/node
- **API Reference**: https://docs.firecrawl.dev/api-reference
- **GitHub**: https://github.com/mendableai/firecrawl
- **Dashboard**: https://www.firecrawl.dev/app
---
## Next Steps After Using This Skill
1. **Store scraped data**: Use Cloudflare D1, R2, or KV to persist results
2. **Build RAG system**: Combine with Vectorize for semantic search
3. **Add scheduling**: Use Cloudflare Queues for recurring scrapes
4. **Process content**: Use Workers AI to analyze scraped data
---
**Token Savings**: ~60% vs manual integration
**Error Prevention**: API authentication, rate limiting, format handling
**Production Ready**: ✅