Content Extraction Guide

Transform any web page into clean, structured data with Valyu’s Contents API. Whether you’re building content aggregators, research tools, or AI agents that need to process web content, Valyu provides powerful extraction capabilities with batch processing and AI-powered structuring.

Why Content Extraction Matters

Content extraction with Valyu provides clean, structured web content that enables:

🤖 AI Agent Development - Feed clean data to LLMs and AI systems without noise
📊 Data Aggregation - Collect structured content from multiple sources efficiently
📝 Content Management - Transform web content into usable formats for analysis
🔍 Research Automation - Extract key information from articles, papers, and reports

Key Content Extraction Features

Batch Processing

Process Multiple URLs: Submit up to 10 URLs per request for efficient bulk content extraction.

AI-Powered Structuring

Structured Data Extraction: Use JSON schemas to extract specific data points with AI precision.

Smart Summarization

Custom AI Summaries: Generate tailored summaries with custom instructions for any content.

Pay-per-Success

Fair Pricing Model: Pay only for URLs that are successfully processed; failed extractions cost nothing.

Getting Started

Basic Content Extraction

Start with simple content extraction from web pages:

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": [
      "https://techcrunch.com/category/artificial-intelligence/"
    ],
    "response_length": "medium"
  }'

This returns clean markdown content for each URL, perfect for feeding into LLMs or content management systems.
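For Python applications, the same request can be sketched with only the standard library. The endpoint, headers, and body fields are taken from the curl example above; the function names `build_contents_request` and `fetch_contents` are illustrative and not part of any Valyu SDK.

```python
import json
import urllib.request

# Endpoint as shown in the curl example above.
API_URL = "https://api.valyu.network/contents"


def build_contents_request(urls, api_key, response_length="medium"):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {"urls": list(urls), "response_length": response_length}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def fetch_contents(request, timeout=30):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Building the request separately from sending it keeps the payload easy to inspect and test. Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so callers would need to catch that exception to read an error body.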

Response Length Options

Control the amount of content extracted:

Length	Characters	Best For
`short`	25,000	Summaries, key points
`medium`	50,000	Articles, blog posts
`large`	100,000	Academic papers, long-form content
`max`	Unlimited	Complete document extraction
Custom integer	1,000-1,000,000	Specific requirements
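The table above can be turned into a small client-side check. This is a sketch only: the helper names are hypothetical, and treating the custom integer range as inclusive at both ends is an assumption.

```python
# Presets and character limits copied from the table above.
NAMED_LENGTHS = {"short": 25_000, "medium": 50_000, "large": 100_000, "max": None}
CUSTOM_MIN, CUSTOM_MAX = 1_000, 1_000_000  # assumed inclusive


def normalize_response_length(value):
    """Return a value usable as `response_length`, or raise ValueError."""
    if isinstance(value, bool):  # bool is a subclass of int, so reject it first
        raise ValueError("response_length must be a preset name or an integer")
    if isinstance(value, str):
        if value not in NAMED_LENGTHS:
            raise ValueError(f"unknown preset: {value!r}")
        return value
    if isinstance(value, int):
        if not CUSTOM_MIN <= value <= CUSTOM_MAX:
            raise ValueError("custom length must be between 1,000 and 1,000,000")
        return value
    raise ValueError("response_length must be a preset name or an integer")


def smallest_preset_for(expected_chars):
    """Pick the smallest named preset that covers the expected character count."""
    for name in ("short", "medium", "large"):
        if expected_chars <= NAMED_LENGTHS[name]:
            return name
    return "max"
```

For example, `smallest_preset_for(30_000)` selects `"medium"`, since 30,000 characters exceed the `short` limit but fit within 50,000.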

Advanced Features

Summary Field Examples

The summary field accepts four different types of values. Here are examples for each:

1. No AI Processing (`false`)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/article"],
    "summary": false
  }'

2. Basic Summarization (`true`)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/article"],
    "summary": true
  }'

3. Custom Instructions (`string`)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/research-paper"],
    "summary": "Summarize the methodology, key findings, and practical applications in 2-3 paragraphs"
  }'

4. Structured Extraction (`object`)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/product-page"],
    "summary": {
      "type": "object",
      "properties": {
        "product_name": {
          "type": "string",
          "description": "Name of the product"
        },
        "price": {
          "type": "number",
          "description": "Product price in USD"
        },
        "features": {
          "type": "array",
          "items": {"type": "string"},
          "maxItems": 5,
          "description": "Key product features"
        },
        "availability": {
          "type": "string",
          "enum": ["in_stock", "out_of_stock", "preorder"],
          "description": "Product availability status"
        }
      },
      "required": ["product_name", "price"]
    }
  }'
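The four value types above map directly onto Python types. The following sketch shows one way a client might classify a `summary` value before sending it; `summary_mode` and `build_payload` are hypothetical helper names, and the mode labels are for illustration only.

```python
def summary_mode(summary):
    """Classify a `summary` value into one of the four documented modes."""
    if summary is False:
        return "none"        # no AI processing
    if summary is True:
        return "basic"       # basic summarization
    if isinstance(summary, str):
        return "custom"      # custom instructions
    if isinstance(summary, dict):
        return "structured"  # JSON Schema extraction
    raise TypeError("summary must be a bool, a string, or a JSON Schema dict")


def build_payload(urls, summary=False):
    """Assemble a request body after checking the summary type."""
    summary_mode(summary)  # raises TypeError for unsupported types
    return {"urls": list(urls), "summary": summary}
```

When the payload is serialized with `json.dumps`, Python's `True` and `False` become the JSON literals `true` and `false` used in the curl examples.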

JSON Schema Reference

When using the summary field with an object (structured extraction), you can use any valid JSON Schema specification. For detailed information about available types, formats, and validation rules, see the JSON Schema Type Reference. Key limitations:

- Maximum 5,000 characters total
- Maximum 3 levels deep
- Maximum 20 properties per object

Commonly used types:

- `string`: Text data with optional format validation
- `number` / `integer`: Numeric values with optional min/max
- `boolean`: True/false values
- `array`: Lists of items with optional size limits
- `object`: Nested structures with properties
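The three schema limits listed in this section can be checked locally before a request is sent. The sketch below rests on assumptions about how the API counts: it measures size as the length of the serialized JSON, and it counts depth as the number of nested `properties` blocks (a flat object is level 1). The function names are hypothetical.

```python
import json

MAX_CHARS = 5_000
MAX_DEPTH = 3
MAX_PROPERTIES = 20


def iter_nodes(node):
    """Yield a schema node and every sub-schema under `properties` or `items`."""
    if not isinstance(node, dict):
        return
    yield node
    for child in node.get("properties", {}).values():
        yield from iter_nodes(child)
    yield from iter_nodes(node.get("items"))


def object_depth(node):
    """Count nested `properties` levels; array `items` do not add a level."""
    if not isinstance(node, dict):
        return 0
    children = list(node.get("properties", {}).values()) + [node.get("items")]
    deepest = max(object_depth(child) for child in children)
    return deepest + (1 if "properties" in node else 0)


def check_schema_limits(schema):
    """Return a list of problems; an empty list means no limit was exceeded."""
    problems = []
    size = len(json.dumps(schema))
    if size > MAX_CHARS:
        problems.append(f"schema is {size} characters (limit {MAX_CHARS})")
    depth = object_depth(schema)
    if depth > MAX_DEPTH:
        problems.append(f"schema is {depth} levels deep (limit {MAX_DEPTH})")
    for node in iter_nodes(schema):
        count = len(node.get("properties", {}))
        if count > MAX_PROPERTIES:
            problems.append(f"object has {count} properties (limit {MAX_PROPERTIES})")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every violation at once.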

High-Quality Extraction

For challenging content, use enhanced extraction:

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://complex-website.com/article"],
    "extract_effort": "high",
    "response_length": "large"
  }'

Setting extract_effort to "high" provides better content quality for complex websites but takes longer to process.

Common Use Cases

Content Aggregation Platform

Build a news aggregator that extracts structured article data:

{
  "urls": [
    "https://techcrunch.com/category/artificial-intelligence/",
    "https://venturebeat.com/category/entrepreneur/",
    "https://www.bbc.co.uk/news/technology"
  ],
  "summary": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "summary_text": { "type": "string" },
      "category": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["headline"]
  }
}

Research Paper Analysis

Extract structured data from academic papers:

{
  "urls": ["https://arxiv.org/paper/example"],
  "response_length": "max",
  "summary": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "abstract": { "type": "string" },
      "methodology": { "type": "string" },
      "key_findings": {
        "type": "array",
        "items": { "type": "string" }
      },
      "limitations": { "type": "string" }
    },
    "required": ["title"]
  }
}

Product Documentation Extraction

Extract structured product information:

{
  "urls": ["https://company.com/product-A", "https://company.com/product-B"],
  "summary": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      },
      "pricing": { "type": "string" },
      "target_audience": { "type": "string" }
    },
    "required": ["product_name"]
  }
}
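Each schema in these use cases marks some fields as `required`. This guide does not state what the API returns when a required field cannot be found on a page, so a defensive client-side check may be worthwhile. The helper below is a hypothetical sketch that inspects only top-level fields.

```python
def missing_required(summary, schema):
    """List required top-level fields that are absent or empty in a summary."""
    required = schema.get("required", [])
    if not isinstance(summary, dict):
        # A text summary or a missing summary satisfies none of the fields.
        return list(required)
    # None, empty strings, and empty lists count as missing; 0 and False do not.
    return [name for name in required if summary.get(name) in (None, "", [])]


# Example using the product schema's required field.
product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "pricing": {"type": "string"},
    },
    "required": ["product_name"],
}
print(missing_required({"pricing": "Contact sales"}, product_schema))
```

Records with missing required fields can then be retried or routed to manual review instead of flowing silently into downstream storage.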

Response Format

No AI Processing (summary: false)

{
  "success": true,
  "error": null,
  "tx_id": "tx_12345678-1234-1234-1234-123456789abc",
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "url": "https://example.com/article?utm_source=valyu",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "description": "Latest AI developments",
      "source": "web",
      "price": 0.001,
      "length": 12840,
      "data_type": "unstructured",
      "source_type": "news_article",
      "publication_date": "2024-01-15",
      "id": "https://example.com/article"
    }
  ],
  "urls_requested": 1,
  "urls_processed": 1,
  "urls_failed": 0,
  "total_cost_dollars": 0.001,
  "total_characters": 12840
}

Text Summarization (summary: true or string)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "summary": "This article discusses a breakthrough in AI that enables more natural interactions...",
      "summary_success": true,
      "price": 0.002,
      "data_type": "unstructured"
    }
  ]
}

Structured Extraction (JSON Schema)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "summary": {
        "title": "AI Breakthrough in Natural Language Processing",
        "author": "John Doe",
        "category": "technology",
        "key_points": [
          "New AI model achieves 95% accuracy",
          "Reduces computational requirements by 40%"
        ]
      },
      "summary_success": true,
      "price": 0.002,
      "data_type": "structured"
    }
  ]
}

Key Response Fields

Field	Description
`content`	Extracted content in markdown format
`summary`	AI processing result (text or structured data)
`data_type`	`"unstructured"` (raw content or text summary) or `"structured"` (JSON Schema extraction), as in the examples above
`summary_success`	Whether AI processing succeeded
`price`	Cost for processing this specific URL
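The fields in this table are enough to sort results by how they were processed. The sketch below assumes the response shapes shown in the examples above and uses a hypothetical function name; it inspects `summary_success` first, then the Python type of `summary`.

```python
def split_results(data):
    """Group results into raw, text-summary, structured, and failed-AI buckets."""
    groups = {"raw": [], "text": [], "structured": [], "ai_failed": []}
    for result in data.get("results", []):
        summary = result.get("summary")
        if result.get("summary_success") is False:
            groups["ai_failed"].append(result)
        elif isinstance(summary, dict):
            groups["structured"].append(result)  # JSON Schema extraction
        elif isinstance(summary, str):
            groups["text"].append(result)        # summary: true or a string
        else:
            groups["raw"].append(result)         # summary: false
    return groups


def total_price(data):
    """Sum the per-URL `price` fields, e.g. to compare with total_cost_dollars."""
    return sum(result.get("price", 0.0) for result in data.get("results", []))
```

Because prices are floating-point values, compare `total_price(data)` with `total_cost_dollars` using a tolerance rather than strict equality.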

Best Practices

Optimize Your Summary Field

Choose the Right Type:
- false: No AI processing (fastest, cheapest)
- true: Basic summarization for general overviews
- "string": Custom instructions for specific summary needs
- {object}: Structured extraction for data processing

For JSON Schemas:
- Be Specific: Use clear descriptions to guide AI extraction
- Use Enums: Specify allowed values for consistent categorization
- Limit Complexity: Keep schemas at most 3 levels deep, the documented maximum
- Required Fields: Mark essential fields as required

Efficient Batch Processing

- Group Similar Content: Process similar content types together
- Optimize Response Length: Use appropriate length for your use case
- Handle Failures Gracefully: Check summary_success for AI processing status
- Monitor Costs: Track total_cost_dollars for budget management
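The batching and cost-tracking advice above can be sketched as follows. The batch size of 10 comes from the per-request URL limit stated earlier in this guide. The `send` callable is a hypothetical stand-in for whatever function performs the request and returns the decoded response, and the budget logic is illustrative only.

```python
def batch_urls(urls, batch_size=10):
    """Split URLs into request-sized batches, dropping duplicates in order."""
    unique = list(dict.fromkeys(urls))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


def run_batches(urls, send, budget_dollars):
    """Send batches via `send(batch) -> dict` until the budget is used up.

    The budget is checked before each request, so the final request may
    push total spending past the budget.
    """
    spent = 0.0
    responses = []
    for batch in batch_urls(urls):
        if spent >= budget_dollars:
            break
        data = send(batch)
        spent += data.get("total_cost_dollars", 0.0)
        responses.append(data)
    return responses, spent
```

Deduplicating before batching avoids paying twice for the same page, and passing `send` as a parameter makes the loop easy to exercise with a stub in tests.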

Error Handling

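import requests

response = requests.post(
    "https://api.valyu.network/contents",
    headers={"Content-Type": "application/json", "x-api-key": "YOUR_API_KEY"},
    json={"urls": ["https://example.com/article"], "summary": True},
    timeout=60,
)
data = response.json()

# Handle complete failures (HTTP 422)
if response.status_code == 422:
    error_message = data["error"]
    raise SystemExit(f"Extraction failed: {error_message}")

# Check for partial failures (HTTP 206)
if response.status_code == 206:
    successful_results = data["results"]
    failed_count = data["urls_failed"]
    print(f"{failed_count} of {data['urls_requested']} URLs failed")

# Check AI processing success for each returned result
for result in data.get("results", []):
    if result.get("summary_success") is False:
        print(f"AI processing failed for {result.get('url')}")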

Try the Contents API

Explore the complete API reference with interactive examples and detailed parameter documentation.

Next Steps

API Reference

Complete parameter documentation and examples

Python SDK

Easy integration with Python applications

TypeScript SDK

Type-safe integration for JavaScript/TypeScript

Integration Guides

Connect with LangChain, LlamaIndex, and more
