Process multiple files or URLs efficiently with concurrent requests, proper error handling, and progress tracking. This example shows you how to build robust batch processing workflows.
Overview
Batch processing extracts text from many URLs or files in one run. Most of the examples below issue requests concurrently, and all of them record a per-item success or error so that a single failed request does not abort the rest of the batch.
JavaScript Example
Basic Batch Processing
```javascript
async function batchExtractUrls(urls, apiKey) {
  // allSettled waits for every request, so one failure does not reject the batch
  const results = await Promise.allSettled(
    urls.map(url =>
      fetch('https://api.crawler.dev/v1/extract/url', {
        method: 'POST',
        headers: {
          'x-api-key': apiKey,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ url: url, cleanText: true })
      }).then(res => {
        // Treat non-2xx responses as failures instead of parsing them as data
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
      })
    )
  );

  return results.map((result, index) => ({
    url: urls[index],
    success: result.status === 'fulfilled',
    data: result.status === 'fulfilled' ? result.value : null,
    error: result.status === 'rejected' ? result.reason.message : null
  }));
}

// Usage
const urls = [
  'https://example.com/article1',
  'https://example.com/article2',
  'https://example.com/article3'
];

batchExtractUrls(urls, 'YOUR_API_KEY')
  .then(results => {
    results.forEach(result => {
      if (result.success) {
        console.log(`✓ ${result.url}: ${result.data.text.length} characters`);
      } else {
        console.error(`✗ ${result.url}: ${result.error}`);
      }
    });
  });
```
With Rate Limiting
```javascript
async function batchExtractWithRateLimit(urls, apiKey, concurrency = 5) {
  const results = [];

  // Process the URLs in slices of `concurrency` items
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);

    const batchResults = await Promise.allSettled(
      batch.map(url =>
        fetch('https://api.crawler.dev/v1/extract/url', {
          method: 'POST',
          headers: {
            'x-api-key': apiKey,
            'Content-Type': 'application/json'
          },
          body: JSON.stringify({ url, cleanText: true })
        }).then(res => {
          if (!res.ok) throw new Error(`HTTP ${res.status}`);
          return res.json();
        })
      )
    );

    results.push(...batchResults.map((result, index) => ({
      url: batch[index],
      success: result.status === 'fulfilled',
      data: result.status === 'fulfilled' ? result.value : null,
      error: result.status === 'rejected' ? result.reason.message : null
    })));

    // Small delay between batches to respect rate limits
    if (i + concurrency < urls.length) {
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }

  return results;
}
```
With Progress Tracking
```javascript
async function batchExtractWithProgress(urls, apiKey, onProgress) {
  const results = [];
  const total = urls.length;

  // Requests run one at a time so progress can be reported after each URL
  for (let i = 0; i < urls.length; i++) {
    try {
      const response = await fetch('https://api.crawler.dev/v1/extract/url', {
        method: 'POST',
        headers: {
          'x-api-key': apiKey,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ url: urls[i], cleanText: true })
      });

      // Treat non-2xx responses as failures instead of recording them as data
      if (!response.ok) throw new Error(`HTTP ${response.status}`);

      const data = await response.json();
      results.push({ url: urls[i], success: true, data: data });
    } catch (error) {
      results.push({ url: urls[i], success: false, error: error.message });
    }

    // Report progress
    if (onProgress) {
      onProgress({
        completed: i + 1,
        total: total,
        percentage: Math.round(((i + 1) / total) * 100)
      });
    }
  }

  return results;
}

// Usage with progress callback (urls as defined in the basic example)
batchExtractWithProgress(urls, 'YOUR_API_KEY', (progress) => {
  console.log(`Progress: ${progress.completed}/${progress.total} (${progress.percentage}%)`);
})
  .then(results => {
    const successful = results.filter(r => r.success).length;
    console.log(`Completed: ${successful}/${results.length} successful`);
  });
```
Python Example
Basic Batch Processing
```python
import asyncio
import aiohttp
import os
from typing import List, Dict


async def batch_extract_urls(urls: List[str], api_key: str) -> List[Dict]:
    """
    Extract text from multiple URLs concurrently.

    Args:
        urls: List of URLs to extract text from
        api_key: Your crawler.dev API key

    Returns:
        List of results with success status and data/error
    """
    async def extract_one(session, url):
        try:
            async with session.post(
                'https://api.crawler.dev/v1/extract/url',
                headers={
                    'x-api-key': api_key,
                    'Content-Type': 'application/json'
                },
                json={'url': url, 'cleanText': True}
            ) as response:
                if response.status == 200:
                    data = await response.json()
                    return {'url': url, 'success': True, 'data': data}
                else:
                    error_data = await response.json()
                    return {
                        'url': url,
                        'success': False,
                        'error': error_data.get('error', {}).get('message', 'Unknown error')
                    }
        except Exception as e:
            return {'url': url, 'success': False, 'error': str(e)}

    # One shared session for all requests; gather preserves input order
    async with aiohttp.ClientSession() as session:
        tasks = [extract_one(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    return results


# Usage
async def main():
    api_key = os.getenv('CRAWLER_API_KEY')
    urls = [
        'https://example.com/article1',
        'https://example.com/article2',
        'https://example.com/article3'
    ]

    results = await batch_extract_urls(urls, api_key)

    for result in results:
        if result['success']:
            text_length = len(result['data']['text'])
            print(f"✓ {result['url']}: {text_length} characters")
        else:
            print(f"✗ {result['url']}: {result['error']}")


asyncio.run(main())
```
With Rate Limiting and Retries
```python
import asyncio
import aiohttp
from typing import List, Dict


async def batch_extract_with_retry(
    urls: List[str],
    api_key: str,
    max_concurrent: int = 5,
    max_retries: int = 3
) -> List[Dict]:
    """
    Batch extract with rate limiting and automatic retries.

    Args:
        urls: List of URLs to process
        api_key: API key
        max_concurrent: Maximum concurrent requests
        max_retries: Maximum retry attempts per URL

    Returns:
        List of results
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def extract_with_retry(session, url):
        async with semaphore:
            for attempt in range(max_retries):
                try:
                    async with session.post(
                        'https://api.crawler.dev/v1/extract/url',
                        headers={
                            'x-api-key': api_key,
                            'Content-Type': 'application/json'
                        },
                        json={'url': url, 'cleanText': True},
                        timeout=aiohttp.ClientTimeout(total=30)
                    ) as response:
                        if response.status == 200:
                            data = await response.json()
                            return {'url': url, 'success': True, 'data': data}
                        elif response.status == 429:
                            # Rate limited - wait before retry
                            if attempt < max_retries - 1:
                                await asyncio.sleep(2 ** attempt)
                                continue
                        else:
                            error_data = await response.json()
                            return {
                                'url': url,
                                'success': False,
                                'error': error_data.get('error', {}).get('message', 'Unknown error')
                            }
                except asyncio.TimeoutError:
                    if attempt < max_retries - 1:
                        await asyncio.sleep(2 ** attempt)
                        continue
                    return {'url': url, 'success': False, 'error': 'Timeout'}
                except Exception as e:
                    if attempt < max_retries - 1:
                        await asyncio.sleep(2 ** attempt)
                        continue
                    return {'url': url, 'success': False, 'error': str(e)}

            # Reached when the final attempt was still rate limited (429)
            return {'url': url, 'success': False, 'error': 'Max retries exceeded'}

    async with aiohttp.ClientSession() as session:
        tasks = [extract_with_retry(session, url) for url in urls]
        results = await asyncio.gather(*tasks)

    return results
```
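The Python functions above return only once every URL has finished. For long batches, a progress callback similar to the JavaScript one can be layered on with `asyncio.as_completed`. The following is a minimal sketch rather than part of the crawler.dev API: `run_with_progress`, `worker`, and `on_progress` are hypothetical names, and `worker` stands in for any coroutine function that takes a URL and returns a result dict, such as a wrapper around the extraction request.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_with_progress(
    urls: List[str],
    worker: Callable[[str], Awaitable[Dict]],  # hypothetical: your extraction coroutine
    on_progress: Callable[[int, int], None],   # hypothetical: called as (completed, total)
    max_concurrent: int = 5,
) -> List[Dict]:
    """Run worker(url) for every URL, reporting progress as each one finishes."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> Dict:
        # Cap the number of workers in flight at max_concurrent.
        async with semaphore:
            try:
                return await worker(url)
            except Exception as e:
                # Record the failure instead of aborting the batch.
                return {'url': url, 'success': False, 'error': str(e)}

    tasks = [asyncio.create_task(guarded(url)) for url in urls]
    results = []
    # as_completed yields results in completion order, not input order.
    for completed, future in enumerate(asyncio.as_completed(tasks), start=1):
        results.append(await future)
        on_progress(completed, len(urls))
    return results
```

Because results arrive in completion order, each result dict should carry its own `url` so it can be matched back to the input.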
Processing Files in Batch
```javascript
async function batchExtractFiles(files, apiKey) {
  const results = await Promise.allSettled(
    files.map(file => {
      const formData = new FormData();
      formData.append('file', file);

      // No Content-Type header here: fetch sets the multipart boundary itself
      return fetch('https://api.crawler.dev/v1/extract/file', {
        method: 'POST',
        headers: {
          'x-api-key': apiKey
        },
        body: formData
      }).then(res => {
        // Treat non-2xx responses as failures instead of parsing them as data
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
      });
    })
  );

  return results.map((result, index) => ({
    filename: files[index].name,
    success: result.status === 'fulfilled',
    data: result.status === 'fulfilled' ? result.value : null,
    error: result.status === 'rejected' ? result.reason.message : null
  }));
}
```
Best Practices
- Respect Rate Limits: Use concurrency limits and delays between batches
- Handle Errors Gracefully: Use `Promise.allSettled` or try-catch blocks
- Track Progress: Provide feedback for long-running batches
- Retry Logic: Implement exponential backoff for transient errors
- Save Results: Persist results to avoid re-processing
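The "Save Results" practice can be sketched as follows. This is one possible approach under stated assumptions, not a crawler.dev feature: results are appended to a JSON Lines file, and the helper names (`load_done`, `append_result`, `pending_urls`) are hypothetical. It assumes result dicts shaped like the ones in the examples above, with `url` and `success` keys.

```python
import json
import os
from typing import Dict, List, Set


def load_done(path: str) -> Set[str]:
    """Return the URLs that already have a successful result saved in path."""
    done: Set[str] = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Failed results are not counted, so those URLs get retried.
            if record.get('success'):
                done.add(record['url'])
    return done


def append_result(path: str, result: Dict) -> None:
    """Append one result dict to path as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(result, ensure_ascii=False) + '\n')


def pending_urls(urls: List[str], path: str) -> List[str]:
    """Keep only URLs without a saved success, preserving input order."""
    done = load_done(path)
    return [url for url in urls if url not in done]
```

Calling `pending_urls` before a run and `append_result` after each item means an interrupted batch can resume without repeating requests that already succeeded.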
Performance Tips
- Concurrent Requests: Start with 5-10 concurrent requests and lower the number if you see rate-limit (HTTP 429) responses
- Batch Size: Process in batches of 50-100 items
- Error Handling: Don't let one failure stop the entire batch
- Memory Management: Stream large batches instead of loading all at once
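The batch-size and memory tips can be combined by reading URLs lazily and handling one chunk at a time. A minimal sketch, assuming the URLs come from any iterable (for example a file read line by line); `chunked` is a hypothetical helper, and the default chunk size of 50 is simply the lower end of the range above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(items: Iterable[str], size: int = 50) -> Iterator[List[str]]:
    """Yield successive lists of up to size items without materializing the input."""
    if size < 1:
        raise ValueError('size must be at least 1')
    iterator = iter(items)
    while True:
        # islice pulls at most size items, so only one chunk is held in memory.
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each chunk can then be passed to a batch function such as `batch_extract_with_retry`, with its results saved before the next chunk is read.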
Next Steps
- Learn about extracting from files
- Explore extracting from web pages
- Check out the API Reference for complete documentation
