PDF Extraction

Our PDF Extraction Pipeline extracts text and figures from PDF files and converts the content into structured LaTeX and Markdown (in beta) formats, so your PDFs are ready for ML and agent workflows.

Key Features

  • Zero Data Retention: The entire pipeline is stateless and ephemeral. The data and the pipeline are deleted once the job is completed.

  • OCR-Enhanced: Uses our proprietary OCR model for text and figure extraction.

  • Multi-Format Conversion: Outputs both LaTeX and Markdown (in beta) formats.

  • Figure Handling: Detects and extracts embedded figures and generates a summary for each one.


Quick Start Guide

Prerequisites

  • Install the SDK following the instructions:

pip install valyu-pipeline

Steps

  1. Place your PDFs in the input folder.

  2. Run the pipeline (the progress bar UI is not optimized for .ipynb notebooks):

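    import os
    # Python module names cannot contain hyphens. The underscore form below
    # is assumed from the package name "valyu-pipeline".
    from valyu_pipeline.processor import PDFProcessor

    # Set 'YOUR-VALYU-API-KEY' as an environment variable before running.
    # Contact support to obtain a key.

    processor = PDFProcessor()
    processor.start_job('<folder path with .pdf files>', '<output folder path (optional)>')
    # If no output folder path is specified, the outputs are saved to a Results folder.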

  3. Retrieve the output files from the specified output folder.
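
Before starting a job, it can help to confirm that the input folder actually contains PDFs. The sketch below is not part of the SDK. `list_input_pdfs` is a hypothetical helper built only on the standard library, and it assumes the PDFs sit directly in the input folder (whether the pipeline also reads subfolders is not stated here).

```python
from pathlib import Path


def list_input_pdfs(folder):
    """Return the .pdf files directly inside `folder`, sorted by path.

    Hypothetical pre-flight helper; it does not call the pipeline itself.
    """
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"Input folder not found: {root}")
    # Match the extension case-insensitively so "report.PDF" is included.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

If the returned list is empty, there is nothing for the job to process, so you can stop before calling `start_job`.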


Example Output

LaTeX file

\title{
Enhancing Multimodal Models with Expert Integration
}
\author{
Anonymous Authors
}
\begin{abstract}
Multimodal models linking text and images have achieved notable success...
\end{abstract}
\section*{1 INTRODUCTION}
Recent advancements in multimodal learning have enabled robust models...

Markdown file (Experimental)

# Enhancing Multimodal Models with Expert Integration

## Authors
Anonymous Authors

## Abstract
Multimodal models linking text and images have achieved notable success...

## 1 Introduction
Recent advancements in multimodal learning have enabled robust models...

Parquet file

  • image_number: The order of appearance of the figure in the document.*

  • image_base64: Figure encoded in base64.

  • page: Page the figure is on.

  • caption: Extracted caption of the figure.

  • summary: Generated summary of the figure.

*This number lets you find the position of the image in the Markdown and LaTeX documents by locating the special <Fiqure_x> tag.
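
The columns above can be read back with standard tooling. The following is a minimal sketch, not part of the SDK: the helper names are hypothetical, `load_figure_records` assumes pandas and a Parquet engine such as pyarrow are installed, and the tag template copies the spelling shown in the note above, with `x` taken to be the `image_number`. Adjust the template if your output files spell the tag differently.

```python
import base64


def load_figure_records(parquet_path):
    """Read the figure table into a list of dicts, one per figure."""
    import pandas as pd  # assumption: pandas and a Parquet engine are installed
    return pd.read_parquet(parquet_path).to_dict("records")


def decode_figure(record):
    """Return the raw image bytes stored in the image_base64 column."""
    return base64.b64decode(record["image_base64"])


def find_figure_position(document_text, image_number,
                         tag_template="<Fiqure_{n}>"):
    """Return the character offset of the figure's tag, or -1 if absent.

    The default template mirrors the tag as written in this page.
    """
    return document_text.find(tag_template.format(n=image_number))
```

The image format of the decoded bytes is not specified here, so inspect the bytes before choosing a file extension when saving them.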


How It Works

  1. Input PDFs:

    • Place your PDF files into the specified input folder.

    • The pipeline automatically scans the folder for new files to process.

  2. OCR Model Processing:

    • Text is extracted using our OCR model.

    • Embedded figures are identified and processed.

  3. Content Structuring:

    • Text is structured hierarchically based on headings, paragraphs, and lists.

    • Figures are labeled and referenced.

  4. Format Conversion:

    • Text and figures are converted into:

      • LaTeX: Preserves formatting for scientific and academic use.

      • Markdown (Experimental): Suitable for blogs, documentation, and AI use.

  5. Output Files:

    • Generated files are saved in the specified output folder, or in the default Results folder if none is given, with the following structure:

      • {file_name}.md: File containing the Markdown formatted content.

      • {file_name}.txt: File containing the LaTeX formatted content.

      • {file_name}.parquet: File containing the figure content.
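
Once a job finishes, you can check that all three files were produced for a given PDF. This sketch makes two assumptions that are not confirmed above: that the files sit directly inside the output folder, and that `{file_name}` is the PDF's name without its extension. Both helpers are hypothetical and use only the standard library.

```python
from pathlib import Path


def expected_outputs(output_folder, file_name):
    """Map each output kind to the path where it is expected to be."""
    base = Path(output_folder)
    return {
        "markdown": base / f"{file_name}.md",
        "latex": base / f"{file_name}.txt",  # LaTeX content is saved as .txt
        "figures": base / f"{file_name}.parquet",
    }


def missing_outputs(output_folder, file_name):
    """Return the output kinds whose files are not present on disk."""
    return [
        kind
        for kind, path in expected_outputs(output_folder, file_name).items()
        if not path.is_file()
    ]
```

An empty list from `missing_outputs` means the Markdown, LaTeX, and figure files were all found under the assumed layout.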


Troubleshooting

Known limitations (we are constantly improving these)

  • Old typesets

  • Logos

  • Some special characters (phonetics)

  • Small text after extremely big text

Throughput

If you have a demo API key, you will be provisioned a smaller compute unit. With a production API key, you will be provisioned an elastic, large-scale compute cluster.

If processing is slow, you are probably using a demo key. Please reach out for a production API key.


Future Enhancements

  • Table summarisation (Currently in Beta).

  • Improved OCR performance.

  • Support for additional input and output formats (e.g., XML, HTML, JSON).

  • Integration with cloud storage for input/output management.

We are constantly working on this. Give us a shout if you need help or have a feature request.


For further assistance, refer to the FAQs or contact us at support@valyu.network.
