PDF Extraction
Our PDF Extraction Pipeline extracts text and figures from PDF files and converts the content into structured LaTeX and Markdown (in beta) formats. This lets you turn your PDFs into ML- and agent-ready formats.
Key Features
Zero Data Retention: The entire pipeline is stateless and ephemeral; the data and the pipeline are deleted once the job is completed.
OCR-Enhanced: Uses our proprietary OCR model for text and figure extraction.
Multi-Format Conversion: Outputs both LaTeX and Markdown (in beta) formats.
Figure Handling: Detects and extracts embedded figures and generates summaries.
Quick Start Guide
Prerequisites
Install the SDK following the instructions:
Steps
Place your PDFs in the input folder.
Run the pipeline (progress bar UI not optimized for .ipynb):
Retrieve the output files from the specified output folder.
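Before running the pipeline, it can help to confirm that the input folder contains what you expect. The following is a minimal sketch and not part of the SDK: find_pdfs is a hypothetical helper, and the folder name input is taken from the steps above. It lists files with a .pdf extension and checks for the %PDF- header that PDF files start with.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def find_pdfs(input_dir):
    """Return sorted paths of files in input_dir that look like PDFs.

    Hypothetical helper for illustration; not part of the SDK.
    """
    pdfs = []
    for path in sorted(Path(input_dir).iterdir()):
        # Skip sub-folders and files without a .pdf extension.
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        # Keep only files whose first bytes match the PDF header.
        with path.open("rb") as fh:
            if fh.read(len(PDF_MAGIC)) == PDF_MAGIC:
                pdfs.append(path)
    return pdfs


if __name__ == "__main__":
    folder = Path("input")  # folder name taken from the steps above
    if folder.is_dir():
        for pdf in find_pdfs(folder):
            print(pdf.name)
```

Files that carry a .pdf extension but lack the header are skipped, which catches renamed or truncated files before they reach the pipeline.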
Example Output
LaTeX file
Markdown file (Experimental)
Parquet file
image_number: The order of appearance of the figure in the document.*
image_base64: Figure encoded in base64.
page: Page the figure is on.
caption: Extracted caption of the figure.
summary: Generated summary of the figure.
*This number lets you find the position of the image in the Markdown and LaTeX documents by locating the special <Fiqure_x> tag.
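The relationship between the figure rows and the tags in the text can be sketched as follows. This is an illustration under stated assumptions, not the SDK's own code: figure_positions and decode_figure are hypothetical helpers, the tag is assumed to carry the figure's image_number in place of x, and the pattern accepts both the spelling shown above and "Figure" because the exact spelling in your output should be checked. The sample text and row are made up.

```python
import base64
import re

# Assumption: the tag embeds image_number in place of "x". Both "Fiqure"
# (as written in this documentation) and "Figure" are accepted.
TAG_PATTERN = re.compile(r"<Fi[gq]ure_(\d+)>")


def figure_positions(document_text):
    """Map each image_number to the character offset of its first tag."""
    positions = {}
    for match in TAG_PATTERN.finditer(document_text):
        positions.setdefault(int(match.group(1)), match.start())
    return positions


def decode_figure(row):
    """Decode the image_base64 field of one figure row into raw bytes."""
    return base64.b64decode(row["image_base64"])


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    text = "Intro text.\n<Fiqure_1>\nMore text.\n"
    rows = [{
        "image_number": 1,
        "image_base64": base64.b64encode(b"raw image bytes").decode("ascii"),
        "page": 1,
        "caption": "Sample caption",
        "summary": "Sample summary",
    }]
    positions = figure_positions(text)
    for row in rows:
        offset = positions.get(row["image_number"])
        print(row["image_number"], offset, len(decode_figure(row)))
```

One way to obtain rows in this shape is pandas.read_parquet(path).to_dict("records"), assuming pandas and a Parquet engine such as pyarrow are installed.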
How It Works
Input PDFs:
Place your PDF files into the specified input folder.
The pipeline automatically scans the folder for new files to process.
OCR Model Processing:
Text is extracted using our OCR model.
Embedded figures are identified and processed.
Content Structuring:
Text is structured hierarchically based on headings, paragraphs, and lists.
Figures are labeled and referenced.
Format Conversion:
Text and figures are converted into:
LaTeX: Preserves formatting for scientific and academic use.
Markdown (Experimental): Suitable for blogs, documentation, and AI use.
Output Files:
Generated files are saved in the folder you specify, or in the results folder, with a structured directory:
{file_name}.md: File containing the Markdown formatted content.
{file_name}.txt: File containing the LaTeX formatted content.
{file_name}.parquet: File containing the figure content.
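The naming scheme above can be turned into a small check for which outputs exist for a given PDF. This is a sketch under assumptions, not SDK code: expected_outputs and missing_outputs are hypothetical helpers, {file_name} is assumed to be the PDF's name without its extension, and the files are assumed to sit directly inside the results folder.

```python
from pathlib import Path

# Extensions as listed above: Markdown, LaTeX (saved as .txt), and figures.
OUTPUT_SUFFIXES = {"markdown": ".md", "latex": ".txt", "figures": ".parquet"}


def expected_outputs(pdf_path, results_dir="results"):
    """Return the output paths expected for one PDF, keyed by content type.

    Assumes {file_name} is the PDF name without its extension.
    """
    stem = Path(pdf_path).stem
    return {
        kind: Path(results_dir) / f"{stem}{suffix}"
        for kind, suffix in OUTPUT_SUFFIXES.items()
    }


def missing_outputs(pdf_path, results_dir="results"):
    """List the content types whose output file does not exist yet."""
    outputs = expected_outputs(pdf_path, results_dir)
    return [kind for kind, path in outputs.items() if not path.exists()]


if __name__ == "__main__":
    for kind, path in expected_outputs("input/paper.pdf").items():
        print(f"{kind}: {path}")
```

An empty list from missing_outputs means all three files for that PDF are present in the folder.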
Troubleshooting
Known limitations (We are constantly improving this)
Old typesets
Logos
Some special characters (phonetics)
Small text after extremely big text
Throughput
If you have a demo API key, you will be provisioned a smaller compute unit. For production API keys, you will be provisioned an elastic, large-scale compute cluster.
If you experience slow processing times, you are likely using a demo key. Please reach out for a production API key.
Future Enhancements
Table summarisation (Currently in Beta).
Improved OCR performance.
Support for additional input and output formats (e.g., XML, HTML, JSON).
Integration with cloud storage for input/output management.
We are constantly working on this. Give us a shout if you need help or have a feature request.
For further assistance, refer to the FAQs or contact us at support@valyu.network.