Transform any web page into clean, structured data with Valyu’s Contents API. Whether you’re building content aggregators, research tools, or AI agents that need to process web content, Valyu provides powerful extraction capabilities with batch processing and AI-powered structuring.

Why Content Extraction Matters

Content extraction with Valyu provides clean, structured web content that enables:
  • 🤖 AI Agent Development - Feed clean data to LLMs and AI systems without noise
  • 📊 Data Aggregation - Collect structured content from multiple sources efficiently
  • 📝 Content Management - Transform web content into usable formats for analysis
  • 🔍 Research Automation - Extract key information from articles, papers, and reports

Key Content Extraction Features

Batch Processing

Process Multiple URLs: Submit up to 10 URLs per request for efficient bulk content extraction.

AI-Powered Structuring

Structured Data Extraction: Use JSON schemas to extract specific data points with AI precision.

Smart Summarization

Custom AI Summaries: Generate tailored summaries with custom instructions for any content.

Pay-per-Success

Fair Pricing Model: Pay only for URLs that are successfully processed; failed extractions cost nothing.

Getting Started

Basic Content Extraction

Start with simple content extraction from web pages:
curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": [
      "https://techcrunch.com/category/artificial-intelligence/"
    ],
    "response_length": "medium"
  }'
This returns clean markdown content for each URL, perfect for feeding into LLMs or content management systems.
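
For Python clients, the same request can be sketched with the standard library alone. This is a minimal illustration that mirrors the curl call above; the function names build_contents_request and fetch_contents are hypothetical helpers, not part of any Valyu SDK.

```python
import json
import urllib.request

# Endpoint and header names are taken from the curl example above.
API_URL = "https://api.valyu.network/contents"


def build_contents_request(api_key, urls, response_length="medium"):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {"urls": list(urls), "response_length": response_length}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def fetch_contents(api_key, urls, response_length="medium"):
    """Send the request and return the decoded JSON body (network I/O)."""
    request = build_contents_request(api_key, urls, response_length)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect or log before any billable call is made.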

Response Length Options

Control the amount of content extracted:
Length          Characters         Best For
short           25,000             Summaries, key points
medium          50,000             Articles, blog posts
large           100,000            Academic papers, long-form content
max             Unlimited          Complete document extraction
Custom integer  1,000-1,000,000    Specific requirements
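
A client can reject an invalid response_length before sending anything. The sketch below encodes the options listed above; validate_response_length is a hypothetical helper, and treating the 1,000 and 1,000,000 bounds as inclusive is an assumption.

```python
NAMED_LENGTHS = {"short", "medium", "large", "max"}
# Assumption: both ends of the custom integer range are inclusive.
MIN_CUSTOM, MAX_CUSTOM = 1_000, 1_000_000


def validate_response_length(value):
    """Return value unchanged if it is a documented option, else raise ValueError."""
    if isinstance(value, str):
        if value in NAMED_LENGTHS:
            return value
        raise ValueError(f"unknown named length: {value!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        if MIN_CUSTOM <= value <= MAX_CUSTOM:
            return value
        raise ValueError(f"custom length {value} is outside {MIN_CUSTOM}-{MAX_CUSTOM}")
    raise ValueError(f"unsupported response_length: {value!r}")
```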

Advanced Features

Summary Field Examples

The summary field accepts four different types of values. Here are examples for each:

1. No AI Processing (false)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/article"],
    "summary": false
  }'

2. Basic Summarization (true)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/article"],
    "summary": true
  }'

3. Custom Instructions (string)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/research-paper"],
    "summary": "Summarize the methodology, key findings, and practical applications in 2-3 paragraphs"
  }'

4. Structured Extraction (object)

curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/product-page"],
    "summary": {
      "type": "object",
      "properties": {
        "product_name": {
          "type": "string",
          "description": "Name of the product"
        },
        "price": {
          "type": "number",
          "description": "Product price in USD"
        },
        "features": {
          "type": "array",
          "items": {"type": "string"},
          "maxItems": 5,
          "description": "Key product features"
        },
        "availability": {
          "type": "string",
          "enum": ["in_stock", "out_of_stock", "preorder"],
          "description": "Product availability status"
        }
      },
      "required": ["product_name", "price"]
    }
  }'

JSON Schema Reference

When using the summary field with an object (structured extraction), you can use any valid JSON Schema, subject to the limits below. For detailed information about available types, formats, and validation rules, see the JSON Schema Type Reference.
Key limitations:
  • Maximum 5,000 characters total
  • Maximum 3 levels deep
  • Maximum 20 properties per object
Commonly used types:
  • string - Text data with optional format validation
  • number / integer - Numeric values with optional min/max
  • boolean - True/false values
  • array - Lists of items with optional size limits
  • object - Nested structures with properties
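
The limits above can be checked locally before a schema is sent. The following is a rough sketch, not the API's own validation: check_schema_limits is a hypothetical helper, and it assumes that characters are counted on the serialized JSON and that each object with a "properties" key counts as one nesting level.

```python
import json

MAX_SCHEMA_CHARS = 5_000
MAX_DEPTH = 3
MAX_PROPERTIES = 20


def check_schema_limits(schema):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Assumption: the character limit applies to the serialized schema.
    size = len(json.dumps(schema))
    if size > MAX_SCHEMA_CHARS:
        problems.append(f"schema is {size} characters (limit {MAX_SCHEMA_CHARS})")

    def walk(node, depth, path):
        if not isinstance(node, dict):
            return
        properties = node.get("properties")
        if isinstance(properties, dict):
            depth += 1  # assumption: every object with properties is one level
            if depth > MAX_DEPTH:
                problems.append(f"{path} is {depth} levels deep (limit {MAX_DEPTH})")
            if len(properties) > MAX_PROPERTIES:
                problems.append(
                    f"{path} has {len(properties)} properties (limit {MAX_PROPERTIES})"
                )
            for name, child in properties.items():
                walk(child, depth, f"{path}.{name}")
        # Array items are traversed without adding a level of their own.
        walk(node.get("items"), depth, f"{path}[]")

    walk(schema, 0, "$")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every violation at once.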

High-Quality Extraction

For challenging content, use enhanced extraction:
curl -X POST https://api.valyu.network/contents \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "urls": ["https://complex-website.com/article"],
    "extract_effort": "high",
    "response_length": "large"
  }'
extract_effort: "high" provides better content quality for complex websites but takes longer to process.

Common Use Cases

Content Aggregation Platform

Build a news aggregator that extracts structured article data:
{
  "urls": [
    "https://techcrunch.com/category/artificial-intelligence/",
    "https://venturebeat.com/category/entrepreneur/",
    "https://www.bbc.co.uk/news/technology"
  ],
  "summary": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "summary_text": { "type": "string" },
      "category": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["headline"]
  }
}

Research Paper Analysis

Extract structured data from academic papers:
{
  "urls": ["https://arxiv.org/paper/example"],
  "response_length": "max",
  "summary": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "abstract": { "type": "string" },
      "methodology": { "type": "string" },
      "key_findings": {
        "type": "array",
        "items": { "type": "string" }
      },
      "limitations": { "type": "string" }
    },
    "required": ["title"]
  }
}

Product Documentation Extraction

Extract structured product information:
{
  "urls": ["https://company.com/product-A", "https://company.com/product-B"],
  "summary": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      },
      "pricing": { "type": "string" },
      "target_audience": { "type": "string" }
    },
    "required": ["product_name"]
  }
}

Response Format

No AI Processing (summary: false)

{
  "success": true,
  "error": null,
  "tx_id": "tx_12345678-1234-1234-1234-123456789abc",
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "url": "https://example.com/article?utm_source=valyu",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "description": "Latest AI developments",
      "source": "web",
      "price": 0.001,
      "length": 12840,
      "data_type": "unstructured",
      "source_type": "news_article",
      "publication_date": "2024-01-15",
      "id": "https://example.com/article"
    }
  ],
  "urls_requested": 1,
  "urls_processed": 1,
  "urls_failed": 0,
  "total_cost_dollars": 0.001,
  "total_characters": 12840
}

Text Summarization (summary: true or string)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "summary": "This article discusses a breakthrough in AI that enables more natural interactions...",
      "summary_success": true,
      "price": 0.002,
      "data_type": "unstructured"
    }
  ]
}

Structured Extraction (JSON Schema)

{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "summary": {
        "title": "AI Breakthrough in Natural Language Processing",
        "author": "John Doe",
        "category": "technology",
        "key_points": [
          "New AI model achieves 95% accuracy",
          "Reduces computational requirements by 40%"
        ]
      },
      "summary_success": true,
      "price": 0.002,
      "data_type": "structured"
    }
  ]
}

Key Response Fields

Field            Description
content          Extracted content in markdown format
summary          AI processing result (text or structured data)
data_type        "unstructured" (no AI) or "structured" (with AI)
summary_success  Whether AI processing succeeded
price            Cost for processing this specific URL
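
Client code will typically branch on these fields. As an illustrative sketch, the hypothetical helper split_results below groups the results of a decoded response body by data_type and summary_success; the sample body is abbreviated from the structured extraction response shown earlier.

```python
def split_results(response_body):
    """Group results into (structured, unstructured, ai_failed) lists."""
    structured, unstructured, ai_failed = [], [], []
    for result in response_body.get("results", []):
        if result.get("summary_success") is False:
            # AI processing was requested but did not succeed for this URL.
            ai_failed.append(result)
        elif result.get("data_type") == "structured":
            structured.append(result)
        else:
            unstructured.append(result)
    return structured, unstructured, ai_failed


# Abbreviated copy of the structured extraction response example.
sample_body = {
    "success": True,
    "results": [
        {
            "title": "AI Breakthrough in Natural Language Processing",
            "summary": {"author": "John Doe", "category": "technology"},
            "summary_success": True,
            "price": 0.002,
            "data_type": "structured",
        }
    ],
}

structured, unstructured, ai_failed = split_results(sample_body)
for result in structured:
    print(result["summary"]["author"], result["price"])
```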

Best Practices

Optimize Your Summary Field

  1. Choose the Right Type:
    • false: No AI processing (fastest, cheapest)
    • true: Basic summarization for general overviews
    • "string": Custom instructions for specific summary needs
    • {object}: Structured extraction for data processing
  2. For JSON Schemas:
    • Be Specific: Use clear descriptions to guide AI extraction
    • Use Enums: Specify allowed values for consistent categorization
    • Limit Complexity: Keep schemas within the 3-level nesting limit for best results
    • Required Fields: Mark essential fields as required
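
The four value types listed above can be told apart in client code with ordinary type checks. This is a small sketch; summary_mode and the mode names it returns are hypothetical labels chosen for illustration, not values the API defines.

```python
def summary_mode(summary):
    """Map a summary field value to a descriptive (hypothetical) mode name."""
    if summary is False:
        return "none"  # no AI processing
    if summary is True:
        return "basic"  # default summarization
    if isinstance(summary, str):
        return "custom_instructions"
    if isinstance(summary, dict):
        return "structured"  # JSON Schema extraction
    raise TypeError(f"unsupported summary value: {summary!r}")
```

Checking for True and False by identity first prevents other values, such as the integers 0 and 1, from being mistaken for the boolean modes.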

Efficient Batch Processing

  1. Group Similar Content: Process similar content types together
  2. Optimize Response Length: Use appropriate length for your use case
  3. Handle Failures Gracefully: Check summary_success for AI processing status
  4. Monitor Costs: Track total_cost_dollars for budget management
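
Because each request accepts up to 10 URLs, a longer list has to be split into batches, and costs then need to be summed across the responses. A minimal sketch follows; chunk_urls and total_cost are hypothetical helpers that operate on plain lists and decoded response bodies.

```python
def chunk_urls(urls, batch_size=10):
    """Split urls into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    urls = list(urls)
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def total_cost(response_bodies):
    """Sum total_cost_dollars across bodies, treating a missing field as 0."""
    return sum(body.get("total_cost_dollars", 0.0) for body in response_bodies)
```

For example, 23 URLs become three requests of 10, 10, and 3 URLs, and the per-request totals can be added up for budget tracking.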

Error Handling

import requests

response = requests.post(
    "https://api.valyu.network/contents",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={"urls": ["https://example.com/article"], "summary": True},
)
body = response.json()

# Handle complete failures (HTTP 422)
if response.status_code == 422:
    raise RuntimeError(f"Request failed: {body['error']}")

# Check for partial failures (HTTP 206)
if response.status_code == 206:
    print(f"{body['urls_failed']} of {body['urls_requested']} URLs failed")

# Check AI processing success for each returned result
for result in body.get("results", []):
    if result.get("summary_success") is False:
        print(f"AI processing failed for {result['url']}")

Try the Contents API

Explore the complete API reference with interactive examples and detailed parameter documentation.

Next Steps