Why Content Extraction Matters
Content extraction with Valyu provides clean, structured web content that enables:
- 🤖 AI Agent Development - Feed clean data to LLMs and AI systems without noise
- 📊 Data Aggregation - Collect structured content from multiple sources efficiently
- 📝 Content Management - Transform web content into usable formats for analysis
- 🔍 Research Automation - Extract key information from articles, papers, and reports
Key Content Extraction Features
Batch Processing
Process Multiple URLs: Submit up to 10 URLs per request for efficient bulk content extraction.
AI-Powered Structuring
Structured Data Extraction: Use JSON schemas to extract specific data points with AI precision.
Smart Summarization
Custom AI Summaries: Generate tailored summaries with custom instructions for any content.
Pay-per-Success
Fair Pricing Model: Only pay for URLs that are successfully processed; failed extractions cost nothing.
Getting Started
Basic Content Extraction
Start with simple content extraction from web pages.
Response Length Options
Control the amount of content extracted:
Length | Characters | Best For |
---|---|---|
short | 25,000 | Summaries, key points |
medium | 50,000 | Articles, blog posts |
large | 100,000 | Academic papers, long-form content |
max | Unlimited | Complete document extraction |
Custom integer | 1,000-1,000,000 | Specific requirements |
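As a minimal sketch of a basic extraction request, the helper below builds a request body and checks it against the limits stated in this guide (up to 10 URLs, the four named lengths, or a custom integer from 1,000 to 1,000,000). The key names `urls` and `response_length`, and the default of `"short"`, are assumptions made for illustration; confirm the exact parameter names in the API reference before relying on them.

```python
import json

# Named presets from the table above; "max" means no character cap.
LENGTH_PRESETS = {"short", "medium", "large", "max"}
MAX_URLS_PER_REQUEST = 10  # batch limit stated in this guide


def build_contents_payload(urls, response_length="short"):
    """Build a request body for a content-extraction call.

    The keys "urls" and "response_length" are assumed names, and the
    "short" default is a choice made for this sketch only.
    """
    if not 1 <= len(urls) <= MAX_URLS_PER_REQUEST:
        raise ValueError("expected between 1 and 10 URLs per request")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(response_length, bool):
        raise ValueError("response_length must be a preset name or an integer")
    if isinstance(response_length, int):
        if not 1_000 <= response_length <= 1_000_000:
            raise ValueError("custom length must be 1,000 to 1,000,000 characters")
    elif response_length not in LENGTH_PRESETS:
        raise ValueError(f"unknown length preset: {response_length!r}")
    return {"urls": list(urls), "response_length": response_length}


payload = build_contents_payload(["https://example.com/article"], "medium")
print(json.dumps(payload, indent=2))
```

Validating on the client side keeps malformed requests from being sent at all, which matters most when URLs are submitted in bulk.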
Advanced Features
Summary Field Examples
The summary field accepts four different types of values. Here are examples for each:
1. No AI Processing (false)
2. Basic Summarization (true)
3. Custom Instructions (string)
4. Structured Extraction (object)
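The four value types can be illustrated with one sample value each. The helper `summary_mode` below is a hypothetical name used only in this sketch; it maps a `summary` value to the mode it selects, and the instruction text and schema are made-up examples.

```python
def summary_mode(summary):
    """Return which of the four processing modes a summary value selects."""
    # Check booleans by identity so that 0 and 1 are not mistaken for them.
    if summary is False:
        return "no_ai_processing"
    if summary is True:
        return "basic_summarization"
    if isinstance(summary, str):
        return "custom_instructions"
    if isinstance(summary, dict):
        return "structured_extraction"
    raise TypeError("summary must be a bool, a string, or a JSON Schema object")


# One illustrative value per mode, in the order listed above.
SUMMARY_EXAMPLES = [
    False,
    True,
    "Summarize the main findings in two sentences.",
    {
        "type": "object",
        "properties": {"title": {"type": "string"}},
        "required": ["title"],
    },
]

for value in SUMMARY_EXAMPLES:
    print(summary_mode(value), "<-", repr(value)[:50])
```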
JSON Schema Reference
When using the summary field with an object (structured extraction), you can use any valid JSON Schema specification. For detailed information about available types, formats, and validation rules, see the JSON Schema Type Reference.
Key limitations:
- Maximum 5,000 characters total
- Maximum 3 levels deep
- Maximum 20 properties per object
Available types:
- string - Text data with optional format validation
- number / integer - Numeric values with optional min/max
- boolean - True/false values
- array - Lists of items with optional size limits
- object - Nested structures with properties
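The three limits above can be checked locally before a schema is sent. The following is a rough sketch, not the service's own validation: it assumes the character limit applies to the serialized JSON, that depth counts nested object schemas with the top-level object as level 1, and that arrays add no level of their own. How the API actually measures these is not stated here.

```python
import json

MAX_SCHEMA_CHARS = 5_000
MAX_DEPTH = 3
MAX_PROPERTIES = 20


def _children(schema):
    """Return the sub-schemas under 'properties' and 'items'."""
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    return [c for c in children if isinstance(c, dict)]


def schema_depth(schema):
    """Count nested object levels; a flat top-level object has depth 1."""
    deepest = max((schema_depth(c) for c in _children(schema)), default=0)
    own_level = 1 if schema.get("type") == "object" else 0
    return own_level + deepest


def max_properties(schema):
    """Return the largest property count found on any single object."""
    counts = [len(schema.get("properties", {}))]
    counts.extend(max_properties(c) for c in _children(schema))
    return max(counts)


def check_schema(schema):
    """Return a list of limit violations; an empty list means none found."""
    problems = []
    if len(json.dumps(schema)) > MAX_SCHEMA_CHARS:
        problems.append("schema exceeds 5,000 characters")
    if schema_depth(schema) > MAX_DEPTH:
        problems.append("schema is nested more than 3 levels deep")
    if max_properties(schema) > MAX_PROPERTIES:
        problems.append("an object has more than 20 properties")
    return problems


flat = {"type": "object", "properties": {"title": {"type": "string"}}}
print(check_schema(flat))
```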
High-Quality Extraction
For challenging content, use enhanced extraction. Setting extract_effort: "high" provides better content quality for complex websites but takes longer to process.
Common Use Cases
Content Aggregation Platform
Build a news aggregator that extracts structured article data.
Research Paper Analysis
Extract structured data from academic papers.
Product Documentation Extraction
Extract structured product information.
Response Format
No AI Processing (summary: false)
Text Summarization (summary: true or string)
Structured Extraction (JSON Schema)
Key Response Fields
Field | Description |
---|---|
content | Extracted content in markdown format |
summary | AI processing result (text or structured data) |
data_type | "unstructured" (no AI) or "structured" (with AI) |
summary_success | Whether AI processing succeeded |
price | Cost for processing this specific URL |
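To show how these fields might be used together, the sketch below picks the AI output when processing succeeded and falls back to the raw markdown otherwise. `pick_output` is a hypothetical helper, and the sample results (including the prices) are invented for illustration rather than real API output.

```python
def pick_output(result):
    """Return (source, value) for one per-URL result.

    Prefers the AI output when data_type is "structured" and
    summary_success is true; otherwise falls back to the markdown content.
    """
    used_ai = result.get("data_type") == "structured"
    if used_ai and result.get("summary_success"):
        return "summary", result["summary"]
    return "content", result.get("content", "")


# Invented sample results covering the three cases; prices are placeholders.
samples = [
    {"content": "# Title\nBody", "data_type": "unstructured", "price": 0.001},
    {
        "content": "# Title\nBody",
        "summary": {"title": "Title"},
        "data_type": "structured",
        "summary_success": True,
        "price": 0.002,
    },
    {
        "content": "# Title\nBody",
        "data_type": "structured",
        "summary_success": False,
        "price": 0.002,
    },
]

for sample in samples:
    source, value = pick_output(sample)
    print(source, "->", value)
```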
Best Practices
Optimize Your Summary Field
- Choose the Right Type:
  - false: No AI processing (fastest, cheapest)
  - true: Basic summarization for general overviews
  - "string": Custom instructions for specific summary needs
  - {object}: Structured extraction for data processing
- For JSON Schemas:
  - Be Specific: Use clear descriptions to guide AI extraction
  - Use Enums: Specify allowed values for consistent categorization
  - Limit Complexity: Keep schemas under 3 levels deep for best results
  - Required Fields: Mark essential fields as required
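As one way to apply the schema guidelines above, here is a hypothetical schema for a news-article extractor. The field names, category values, and the `urls` key in the request body are illustrative assumptions; only the `summary` field and the JSON Schema keywords (`description`, `enum`, `required`, `maxItems`) come from this guide or the JSON Schema standard.

```python
import json

# Hypothetical article schema: one flat object, so it stays well within
# the nesting limit.
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "headline": {
            "type": "string",
            "description": "Main headline of the article",
        },
        "category": {
            "type": "string",
            # An enum keeps categorization consistent across sources.
            "enum": ["politics", "business", "technology", "sports", "other"],
            "description": "Best-fitting topic category",
        },
        "published_date": {
            "type": "string",
            "format": "date",
            "description": "Publication date in YYYY-MM-DD form, if stated",
        },
        "key_points": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5,
            "description": "Up to five one-sentence takeaways",
        },
    },
    # Only the fields every article should have are marked required.
    "required": ["headline", "category"],
}

# "urls" is an assumed parameter name; the schema goes in the summary field.
request_body = {"urls": ["https://example.com/news/1"], "summary": ARTICLE_SCHEMA}
print(json.dumps(request_body, indent=2))
```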
Efficient Batch Processing
- Group Similar Content: Process similar content types together
- Optimize Response Length: Use appropriate length for your use case
- Handle Failures Gracefully: Check summary_success for AI processing status
- Monitor Costs: Track total_cost_dollars for budget management
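These practices can be combined in a small bookkeeping sketch: split a long URL list into requests of at most 10, then tally cost and AI failures across the parsed responses. The response layout assumed here (a top-level `results` list, a `url` key on each result, and `total_cost_dollars` at the top level) is a guess for illustration; only the field names `summary_success` and `total_cost_dollars` come from this guide.

```python
def chunk_urls(urls, size=10):
    """Split a URL list into batches of at most `size` (10 per request)."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def tally(responses):
    """Return (total_cost, failed_urls) across parsed response dicts.

    Assumes each response has "total_cost_dollars" and a "results" list
    whose items carry "url" and "summary_success"; this layout is a guess.
    """
    total_cost = 0.0
    failed_urls = []
    for response in responses:
        total_cost += response.get("total_cost_dollars", 0.0)
        for result in response.get("results", []):
            # Only an explicit False counts as a failed AI step; the field
            # may be absent when no AI processing was requested.
            if result.get("summary_success") is False:
                failed_urls.append(result.get("url"))
    return total_cost, failed_urls


batches = chunk_urls([f"https://example.com/page/{n}" for n in range(23)])
print([len(batch) for batch in batches])

# Invented responses standing in for two completed requests.
fake_responses = [
    {"total_cost_dollars": 0.5, "results": [
        {"url": "https://example.com/a", "summary_success": True},
        {"url": "https://example.com/b", "summary_success": False},
    ]},
    {"total_cost_dollars": 0.25, "results": [
        {"url": "https://example.com/c"},
    ]},
]
print(tally(fake_responses))
```

Failed URLs collected this way can be retried in a later batch or logged for review.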
Error Handling
Try the Contents API
Explore the complete API reference with interactive examples and detailed parameter documentation.