Text Extraction with Python

Learn how to extract text from any kind of file or URL with the crawler.dev Python SDK.

Prerequisites

To get the most out of this guide, you'll need to:

Installation

Install the crawler.dev Python SDK using pip:

pip install crawler.dev

Quick Start

Here's how to get started with text extraction using Python:

import os
from api.crawler.dev_sdks import APICrawlerDevSDKs as CrawlerDev

client = CrawlerDev(
    api_key=os.environ.get("API_CRAWLER_DEV_SDKS_API_KEY"),  # This is the default and can be omitted
)

# Extract text from a file
response = client.extract.from_file(
    file=b"file content here",
)
print(response.content_type)

# Extract text from a URL
response = client.extract.from_url(
    url="https://example.com"
)
print(response.text)
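
Because the client reads its key from the environment, a missing variable only shows up once a request fails. A minimal sketch of failing fast instead is shown here; `require_env` is a hypothetical helper written for this guide, not part of the SDK:

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the named environment variable, or raise a clear error if unset or empty."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before creating the client"
        )
    return value
```

The returned string could then be passed as `api_key=require_env("API_CRAWLER_DEV_SDKS_API_KEY")` when constructing the client.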

Features

Full type hints support
Automatic retries with exponential backoff
Comprehensive error handling
File upload support
Async client support
Built-in request validation
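
This guide does not describe how the SDK schedules its retries. As a general illustration of what "exponential backoff" means, the sketch below doubles the wait after each failed attempt, caps it, and adds random jitter. The names `backoff_delays` and `retry_with_backoff`, and all default values, are assumptions for illustration and are not the SDK's internals:

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Return the un-jittered delay before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def retry_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_retries: int = 3,
    base: float = 0.5,
    cap: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponentially growing waits."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Full jitter: wait a random amount up to the computed delay.
            sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the waits instead of actually pausing.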

Repository

GitHub: https://github.com/crawler-dot-dev/api-sdk-python
PyPI: https://pypi.org/project/crawler.dev/

Examples

Extract Text from a File

import os
from pathlib import Path
from api.crawler.dev_sdks import APICrawlerDevSDKs as CrawlerDev

client = CrawlerDev(
    api_key=os.environ.get("API_CRAWLER_DEV_SDKS_API_KEY"),
)

# Extract text from a PDF file
# You can pass a PathLike instance, bytes, or a tuple of (filename, contents, media type)
result = client.extract.from_file(
    file=Path("document.pdf"),
)
print(result.text)
print(f"Content type: {result.content_type}")
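
The comment in the example above mentions the tuple form `(filename, contents, media type)`. One way to build such a tuple from a file on disk is sketched here using only the standard library; `as_file_tuple` and its fallback media type are assumptions made for illustration:

```python
import mimetypes
from pathlib import Path
from typing import Tuple


def as_file_tuple(
    path: Path, default_type: str = "application/octet-stream"
) -> Tuple[str, bytes, str]:
    """Build a (filename, contents, media type) tuple from a file on disk."""
    # guess_type looks only at the file name, not at the file's bytes.
    media_type, _encoding = mimetypes.guess_type(path.name)
    return (path.name, path.read_bytes(), media_type or default_type)
```

The result could be passed as `file=as_file_tuple(Path("document.pdf"))`, which may help when the file name alone does not tell the service what kind of content it is receiving.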

Extract Text from Multiple URLs

import os
from api.crawler.dev_sdks import APICrawlerDevSDKs as CrawlerDev

client = CrawlerDev(
    api_key=os.environ.get("API_CRAWLER_DEV_SDKS_API_KEY"),
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

# Extract text from multiple URLs
results = []
for url in urls:
    result = client.extract.from_url(url=url)
    results.append(result)

for url, result in zip(urls, results):
    print(f"Text from {url}: {result.text}")
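
In the loop above, a single failing URL raises and stops the whole batch. A minimal sketch of isolating failures per URL follows; `extract_all` is a hypothetical helper, and it takes the extraction call as a parameter so it makes no assumptions about the SDK beyond what this guide shows:

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def extract_all(
    urls: Iterable[str],
    extract: Callable[[str], T],
) -> Tuple[Dict[str, T], List[Tuple[str, Exception]]]:
    """Run extract(url) for each URL, collecting successes and failures separately."""
    results: Dict[str, T] = {}
    failures: List[Tuple[str, Exception]] = []
    for url in urls:
        try:
            results[url] = extract(url)
        except Exception as exc:  # narrow this to the SDK's error types in real code
            failures.append((url, exc))
    return results, failures
```

With the client from the example, this could be called as `extract_all(urls, lambda u: client.extract.from_url(url=u))`.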

Using the Async Client

import os
import asyncio
from api.crawler.dev_sdks import AsyncAPICrawlerDevSDKs as AsyncCrawlerDev

async def main():
    client = AsyncCrawlerDev(
        api_key=os.environ.get("API_CRAWLER_DEV_SDKS_API_KEY"),
    )
    
    result = await client.extract.from_url(
        url="https://example.com"
    )
    
    print(result.text)

# Run the async function
asyncio.run(main())
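
The async client is most useful when several requests run concurrently. The sketch below bounds the number of in-flight calls with a semaphore; `gather_limited` and its default `limit` are assumptions for illustration, and the fetch coroutine is passed in rather than tied to the SDK:

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def gather_limited(
    urls: Sequence[str],
    fetch: Callable[[str], Awaitable[T]],
    limit: int = 5,
) -> List[T]:
    """Await fetch(url) for every URL, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> T:
        async with semaphore:
            return await fetch(url)

    # gather preserves input order, so results[i] corresponds to urls[i].
    return list(await asyncio.gather(*(bounded(url) for url in urls)))
```

Inside `main()`, this could be used as `await gather_limited(urls, lambda u: client.extract.from_url(url=u))`. Keeping the limit small is a reasonable precaution against 429 responses, though the service's actual rate limits are not stated here.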

Error Handling

The SDK raises a distinct exception type for connection failures and for each class of non-2xx response, so you can handle each case separately:

import os
import api.crawler.dev_sdks
from api.crawler.dev_sdks import APICrawlerDevSDKs as CrawlerDev

client = CrawlerDev(
    api_key=os.environ.get("API_CRAWLER_DEV_SDKS_API_KEY"),
)

try:
    result = client.extract.from_url(url="https://example.com")
    print(result.text)
except api.crawler.dev_sdks.APIConnectionError as e:
    print("The server could not be reached")
    print(e.__cause__)  # an underlying Exception, likely raised within httpx
except api.crawler.dev_sdks.RateLimitError as e:
    print("A 429 status code was received; we should back off a bit.")
except api.crawler.dev_sdks.APIStatusError as e:
    print("Another non-200-range status code was received")
    print(e.status_code)
    print(e.response)
    # 429 is already handled by the RateLimitError branch above,
    # so only other status codes can reach this point.
    if e.status_code == 401:
        print("Invalid API key")

Each HTTP status code maps to the following error type:

Status Code	Error Type
400	`BadRequestError`
401	`AuthenticationError`
403	`PermissionDeniedError`
404	`NotFoundError`
422	`UnprocessableEntityError`
429	`RateLimitError`
>=500	`InternalServerError`
N/A	`APIConnectionError`
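
The table above can also be expressed as a small lookup, which is handy for logging or metrics. This is a sketch only: `error_type_for` is a hypothetical helper, and the fallback to `APIStatusError` for codes not listed in the table is an assumption based on the catch-all branch in the example above.

```python
from typing import Optional

# Status code -> error type name, copied from the table above.
ERROR_TYPES = {
    400: "BadRequestError",
    401: "AuthenticationError",
    403: "PermissionDeniedError",
    404: "NotFoundError",
    422: "UnprocessableEntityError",
    429: "RateLimitError",
}


def error_type_for(status_code: Optional[int]) -> str:
    """Return the error type name for a status code (None means no response)."""
    if status_code is None:
        return "APIConnectionError"  # the request never got a response
    if status_code >= 500:
        return "InternalServerError"
    # Assumption: unlisted codes surface as the generic APIStatusError.
    return ERROR_TYPES.get(status_code, "APIStatusError")
```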
