How a 13.7× Speedup Landed in vLLM - and What It Teaches Us About Writing Fast Python

Saurabh Misra
October 2, 2025

When you think of vLLM, one of the most widely used LLM inference servers, you expect the code to already be highly tuned for performance. And it is. But even in such carefully engineered projects, hidden inefficiencies lurk.

Recently, in Pull Request #20413, Codeflash discovered and fixed one such hotspot, achieving a 13.7× speedup in the detokenization path.

Let’s unpack how this optimization worked, what lessons you can apply to your own code, and how Codeflash can automatically surface optimizations like this in any Python project.

The Problem: Hot Loops Doing Too Much Work

The function _convert_tokens_to_string_with_added_encoders takes a sequence of tokens (including special and “added vocab” tokens) and converts them back into a string, respecting options like skip_special_tokens.

In the original implementation:

  • Every token iteration called tokenizer.get_added_vocab(), an expensive call, just to answer a single membership test.
  • Membership tests against tokenizer.all_special_tokens sometimes meant converting the list into a set inside the loop.
  • Method calls like tokenizer.convert_tokens_to_string(...) were made repeatedly inside the loop.
  • Lists like current_sub_text were reallocated over and over instead of reused.

Each of these things might sound small. But in a loop running thousands of times, they compound.

The result? ~0.28 s for 1,000 iterations in benchmarks.
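
For reference, the pre-optimization code roughly followed this shape (an illustrative sketch of the pattern the bullets above describe, not the verbatim vLLM source):

def _convert_tokens_to_string_with_added_encoders(
    tokenizer: AnyTokenizer,
    output_tokens: list[str],
    skip_special_tokens: bool,
    spaces_between_special_tokens: bool,
) -> str:
    sub_texts: list[str] = []
    current_sub_text: list[str] = []
    for token in output_tokens:
        # Special-token set rebuilt for every token
        if skip_special_tokens and token in set(tokenizer.all_special_tokens):
            continue
        # get_added_vocab() called on every iteration just for a membership test
        if token in tokenizer.get_added_vocab():
            if current_sub_text:
                sub_texts.append(
                    tokenizer.convert_tokens_to_string(current_sub_text))
                current_sub_text = []  # a fresh list allocated on every flush
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(tokenizer.convert_tokens_to_string(current_sub_text))
    if spaces_between_special_tokens:
        return " ".join(sub_texts)
    return "".join(sub_texts)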

The Fix: Pre-computing, Hoisting, and Reuse

The optimized version makes a handful of targeted changes:

def _convert_tokens_to_string_with_added_encoders(
    tokenizer: AnyTokenizer,
    output_tokens: list[str],
    skip_special_tokens: bool,
    spaces_between_special_tokens: bool,
) -> str:
    sub_texts: list[str] = []
    current_sub_text: list[str] = []
    # Hoisted out of the loop: the bound method, the added-vocab set,
    # and (when needed) the special-token set
    convert_tokens_to_string = tokenizer.convert_tokens_to_string
    added_vocab_set = set(tokenizer.get_added_vocab())
    all_special_tokens = set(
        tokenizer.all_special_tokens) if skip_special_tokens else ()

    for token in output_tokens:
        # Use precomputed set for skip-special check
        if token in all_special_tokens:
            continue
        if token in added_vocab_set:
            if current_sub_text:
                sub_texts.append(convert_tokens_to_string(current_sub_text))
                current_sub_text.clear()  # reuse the buffer instead of reallocating
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(convert_tokens_to_string(current_sub_text))
    if spaces_between_special_tokens:
        return " ".join(sub_texts)
    return "".join(sub_texts)

Key takeaways:

  • Move invariant work out of the loop. Build sets like added_vocab_set once, rather than rebuilding them on every iteration.
  • Localize method lookups. convert_tokens_to_string = tokenizer.convert_tokens_to_string avoids repeated attribute resolution.
  • Reuse lists. list.clear() beats reallocation when buffers are flushed repeatedly.
  • Optimize membership tests. Sets give O(1) lookups, but only if you don’t recreate them each time.

With these changes, benchmarks dropped to ~0.019 s for 1,000 iterations — a 13.7× speedup.
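
None of these patterns depends on vLLM. Here is a small, self-contained timeit sketch (hypothetical names, not from the PR) that contrasts a naive loop with a hoisted one; the gap should be several-fold even on this toy workload:

import timeit

TOKENS = [f"tok{i}" for i in range(1000)]
SPECIAL_TOKENS = [f"tok{i}" for i in range(0, 1000, 50)]  # pretend these are "special"

def naive(tokens):
    out = []
    for t in tokens:
        if t in set(SPECIAL_TOKENS):   # set rebuilt on every iteration
            continue
        out.append(t.upper())          # attribute looked up on every iteration
    return out

def hoisted(tokens):
    special = set(SPECIAL_TOKENS)      # set built once, outside the loop
    upper = str.upper                  # method lookup cached in a local
    out = []
    append = out.append               # same for the bound append
    for t in tokens:
        if t in special:
            continue
        append(upper(t))
    return out

print("naive  :", timeit.timeit(lambda: naive(TOKENS), number=1_000))
print("hoisted:", timeit.timeit(lambda: hoisted(TOKENS), number=1_000))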

What Python Developers Can Learn

This vLLM story reinforces a few principles that apply to all Python code:

  • Profile before you optimize. Don’t guess. Use a line profiler to see what’s really eating time (a minimal profiling sketch follows this list).
  • Hoist work out of hot loops. If it doesn’t change every iteration, compute it once.
  • Cache expensive lookups. Object attributes and method calls aren’t free. Cache them locally.
  • Reduce allocations. In performance-critical loops, reusing data structures pays off.
  • Sets > lists for membership checks. Especially for repeated lookups.

These aren’t “clever hacks” — they’re repeatable patterns you can apply anywhere in Python.
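
For that first step, a minimal, self-contained profiling sketch using only the standard library might look like the following (hot_path and its token list are placeholders for your own suspect code); a line profiler such as line_profiler, installed separately, can then give per-line timings inside whatever cProfile flags as expensive:

import cProfile
import pstats

def hot_path(tokens: list[str]) -> list[str]:
    # Placeholder for the function you suspect is slow
    return [t for t in tokens if not t.startswith("<")]

tokens = ["<s>", "Hello", "world", "</s>"] * 10_000

# Profile one call and print the 10 most expensive entries by cumulative time
with cProfile.Profile() as profiler:
    hot_path(tokens)
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)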

Why This Matters Beyond vLLM

If even vLLM, a project built by performance-conscious experts, contains opportunities for 10×+ wins, you can bet your codebase does too.

The challenge is scale: no one has time to profile every function, inspect every loop, and test every micro-optimization. That’s where automation comes in.

How Codeflash Helps

At Codeflash, our mission is to make Python code fast by default. We automatically:

  • Benchmark and profile your code.
  • Pinpoint performance bottlenecks.
  • Suggest and verify optimizations like the ones above.

In fact, the vLLM optimization itself was discovered using Codeflash. And we see this pattern across every project we’ve worked on — there are always optimizations waiting to be found.

Whether it’s AI inference pipelines, data processing, or backend services, Codeflash ensures your code runs at peak performance, automatically and continuously.

Final Thoughts

Performance isn’t just about bragging rights. Faster code means:

  • Lower infrastructure bills
  • Better user experience
  • More room for innovation

And as this vLLM case shows, performance wins can be hiding in plain sight.

Want to see what Codeflash can unlock in your codebase? Get in touch with us or install it as a GitHub Action, and let’s make all your Python code fast by default.
