
How Unstructured.io Accelerated Document Processing Pipelines with Codeflash

Saurabh Misra
November 10, 2025

Crag Wolfe, Chief Architect, Unstructured.io

"Codeflash is the team of performance engineers that we don't have. It augments our team with performance engineering expertise that we then don't have to pay full-time engineers for while giving us data-driven confidence in every optimization."

In the world of enterprise AI, document processing speed isn't just a nice-to-have. It's a business imperative. When Unstructured.io's customers transform millions of complex documents into structured data for GenAI applications, every millisecond of latency matters. But for a fast-moving startup serving 82% of the Fortune 1000, dedicating engineering resources to performance optimization meant choosing between shipping more features and improving speed.

"We know certain code paths are critical to optimize," explains Crag Wolfe, Chief Architect at Unstructured.io. "But we didn't have the resourcing to focus on optimizations. We don't have a team of engineers that we can dedicate to this."

As a leading ETL platform for GenAI data, Unstructured.io transforms complex, unstructured documents into clean, structured data that powers AI applications across industries. Supporting over 64 file types and 1,250+ pipelines, their platform handles everything from initial document transformation to chunking, embedding, and enrichment, all while maintaining enterprise-grade security and reliability. For their customers processing documents at scale, performance bottlenecks directly impact both time-to-insight and infrastructure costs.

The Performance Challenge

For Unstructured.io's engineering team, the challenge was clear but daunting. Their document processing pipelines across dozens of document types involve enormous computational complexity. For example, a single complex PDF page can generate hundreds of thousands or even millions of Python objects during processing. Each document requires intensive coordinate math for bounding box creation, object merging, and element type classification.
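
To make the "coordinate math" concrete, here is a minimal sketch of two bounding-box operations of the kind described: merging boxes and measuring overlap. The `Box` type and both function names are hypothetical illustrations, not code from the Unstructured.io codebase. Run across hundreds of thousands of objects per page, even small helpers like these can become hot paths.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box; (x1, y1) is top-left, (x2, y2) is bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def merge_boxes(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses every input box."""
    items = list(boxes)
    if not items:
        raise ValueError("merge_boxes needs at least one box")
    return Box(
        min(b.x1 for b in items),
        min(b.y1 for b in items),
        max(b.x2 for b in items),
        max(b.y2 for b in items),
    )


def overlap_area(a: Box, b: Box) -> float:
    """Return the intersection area of two boxes, or 0.0 if they do not overlap."""
    width = min(a.x2, b.x2) - max(a.x1, b.x1)
    height = min(a.y2, b.y2) - max(a.y1, b.y1)
    # A negative width or height means the boxes are disjoint on that axis.
    return max(width, 0.0) * max(height, 0.0)
```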

"There's so much surface area in the product that needs to be functionally complete. It just needs to work," Crag notes. "As an organization, we didn't necessarily have the resourcing to focus on optimizations."

The performance challenge extended across their entire ETL pipeline:

  • Initial transformation: Processing raw documents into structured output (their "partition step")
  • String processing: Classic NLP operations throughout the pipeline
  • Concurrent operations: Indexing documents and preparing outputs for storage
  • Chunking: Breaking documents into appropriate segments for RAG applications
  • Enrichment: Transforming document element metadata with VLMs

For enterprise customers running on-premises or in dedicated instances, slow processing meant higher compute costs. For users of their hosted API, it meant longer wait times. And for Unstructured.io itself, inefficient code directly impacted their cost of goods sold.

"Performance matters to us because it matters to our users," Crag explains. "They'll get their results sooner. For our enterprise users, the cost of compute is a real concern. They need confidence that we're being smart with CPU utilization."

Finding Automated Performance Engineering

Rather than hiring dedicated performance engineers or pulling developers away from feature development, Unstructured.io sought a solution that could:

  • Automatically identify optimization opportunities across their extensive codebase
  • Provide data-driven evidence of improvements
  • Integrate seamlessly into their existing GitHub workflow
  • Scale across both open-source and private repositories

That's when they discovered Codeflash. By integrating Codeflash as a GitHub Action, Unstructured.io gained an automated performance optimization system that could analyze code, generate comprehensive tests, and suggest improvements, all without disrupting developer workflows.

Transforming Performance Across the Pipeline

The impact was immediate and measurable. Codeflash identified and optimized critical bottlenecks throughout Unstructured.io's codebase:

Hot Path Optimizations: For requests on their serverless API that typically took 3 seconds, Codeflash shaved off 150-200 milliseconds in specific areas, a significant 5-7% improvement. "That's kind of a big deal," Crag notes.

Aggregate Performance Gains: Across multiple improvements to hot path requests, the aggregate performance boost reached approximately 300 milliseconds per request, a 10% improvement in overall latency.
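
The percentages quoted above follow directly from the millisecond savings if the baseline is taken to be the typical 3-second request. The short sketch below redoes that arithmetic; the function name is illustrative and the 3,000 ms baseline is the only input taken from the figures above.

```python
def latency_reduction_pct(baseline_ms: float, saved_ms: float) -> float:
    """Return the latency saved as a percentage of the baseline latency."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return 100.0 * saved_ms / baseline_ms


BASELINE_MS = 3000.0  # the "typically 3 seconds" request quoted above

for saved_ms in (150, 200, 300):
    pct = latency_reduction_pct(BASELINE_MS, saved_ms)
    print(f"{saved_ms} ms saved -> {pct:.1f}%")
```

This prints 5.0%, 6.7%, and 10.0%, so the quoted "5-7%" is the 150-200 ms range rounded to whole percentages.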

String Processing Improvements: In areas involving classic natural language processing and string evaluation, Codeflash found optimization opportunities that their team hadn’t noticed.

Partition Step Enhancement: For their critical partition step (transforming raw documents to initial structured output), Codeflash merged several significant optimizations, with additional improvements still in review.

What impressed the team most was how Codeflash provided verifiable, data-driven results. "I really like the ability for Codeflash to show the metric, show the speed improvements by automatically creating the tests," Crag explains. "You get a data-driven understanding of what the improvement is rather than some abstract handwave-y like, 'well, this is probably a better way to structure a loop.'"
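
The idea of a measured rather than "handwave-y" improvement can be sketched with the standard library alone. The harness below first checks that a candidate function returns the same results as the baseline, then times both. It is an illustration of the general approach under simple assumptions, not a description of how Codeflash works internally; the two string-processing functions and the `compare` helper are hypothetical.

```python
import timeit
from typing import Callable, Sequence


def capitalized_ratio_baseline(text: str) -> float:
    """Fraction of whitespace-separated words that start with an uppercase letter."""
    words = text.split()
    caps = []
    for word in words:
        if word[0].isupper():
            caps.append(word)
    return len(caps) / len(words) if words else 0.0


def capitalized_ratio_candidate(text: str) -> float:
    """Same result as the baseline, without building an intermediate list."""
    words = text.split()
    if not words:
        return 0.0
    return sum(word[0].isupper() for word in words) / len(words)


def compare(
    baseline: Callable[[str], float],
    candidate: Callable[[str], float],
    inputs: Sequence[str],
    repeats: int = 5,
    number: int = 200,
) -> float:
    """Check both functions agree on every input, then return baseline/candidate time."""
    for item in inputs:
        assert baseline(item) == candidate(item), f"outputs differ for {item!r}"

    def best_time(fn: Callable[[str], float]) -> float:
        # Take the minimum over several repeats to reduce timing noise.
        return min(
            timeit.repeat(lambda: [fn(i) for i in inputs], repeat=repeats, number=number)
        )

    return best_time(baseline) / best_time(candidate)


samples = ["Hello World foo", "", "a b C", "Invoice Total Due on receipt " * 50]
speedup = compare(capitalized_ratio_baseline, capitalized_ratio_candidate, samples)
print(f"candidate is {speedup:.2f}x the speed of the baseline")
```

The measured ratio depends on the machine and Python version, which is the point: a change is kept or rejected on the number it produces, not on how the code looks.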

Expert-Level Optimization Without the Overhead

For Unstructured.io, Codeflash fundamentally changed their approach to performance. Instead of treating optimization as a separate initiative requiring dedicated resources, it's now embedded in their development process.

"The nice thing about it is it's not interfering with developers' existing workflows," Crag notes. "They're getting these like, 'hey, by the way, you could optimize the heck out of this loop. Maybe you should do that.' And the developer can be like, 'yeah, sure. I'll go ahead and include that.'"

The team implemented a two-stage approach:

  1. Initial optimization: Running Codeflash against existing repositories to capture immediate improvements and speed up the whole project
  2. Continuous optimization: Installing Codeflash as a GitHub workflow action to automatically optimize new code before it's even merged

"We view it as having this integrated basically across all of our code bases will just allow for more performant code to be shipped," Crag explains. "The developer can get the benefit with not much cognitive overhead on their part."

This transformation has direct business impact. By improving processing speed across their platform, Unstructured.io reduces latency for customers, lowers compute costs for enterprise deployments, and decreases their own hosting expenses, all while their engineering team stays focused on building the features that differentiate their product.

Looking Ahead

For Unstructured.io, automated optimization through Codeflash represents a fundamental shift in how they balance feature velocity with performance excellence. By embedding optimization directly into their development workflow, they've eliminated the traditional trade-off between shipping new capabilities and improving efficiency.

"We're still in early innings," Crag notes. "We have a lot of code. But having Codeflash integrated across all of our code bases will just allow for more performant code to be shipped."

As they continue expanding their platform to support more file types, destinations, and transformation capabilities for enterprise GenAI applications, Codeflash helps ensure that new code is optimized before it ships, so their customers can turn unstructured data into usable output faster and more cost-effectively.


Technical Details:

  • Environment: Python-based document processing pipelines
  • Performance Improvements:
    • 150-200ms reduction in specific hot path operations (5-7% improvement on 3-second requests)
    • ~300ms aggregate improvement across hot path requests (~10% total latency reduction)
    • Multiple optimizations across partition step, string processing, and concurrent operations
  • Implementation: GitHub Actions integration with workflow automation
  • Stack: Python (NumPy for numerical operations), TypeScript frontend, GitHub for version control

Example PR: Unstructured-IO/unstructured#4080


About Unstructured.io

Unstructured.io is the leading ETL+ platform for GenAI data, transforming complex, unstructured documents into clean, structured data. Trusted by 82% of the Fortune 1000, Unstructured.io supports over 64 file types and 1,250+ pipelines with enterprise-grade security and seamless integrations. The platform enables organizations to focus on AI innovation while Unstructured.io handles the complexity of document processing, transformation, and loading at scale.
