How a 13.7× Speedup Landed in vLLM - and What It Teaches Us About Writing Fast Python

When you think of vLLM, one of the most widely used LLM inference servers, you expect the code to already be highly tuned for performance. And it is. But even in such carefully engineered projects, hidden inefficiencies lurk.
Recently, in Pull Request #20413, Codeflash discovered and fixed one such hotspot, achieving a 13.7× speedup in the detokenization path.
Let’s unpack how this optimization worked, what lessons you can apply to your own code, and how Codeflash can automatically surface optimizations like this in any Python project.
The Problem: Hot Loops Doing Too Much Work
The function _convert_tokens_to_string_with_added_encoders takes a sequence of tokens (including special and “added vocab” tokens) and converts them back into a string, respecting options like skip_special_tokens.
In the original implementation:
- Every token iteration called tokenizer.get_added_vocab(), an expensive dictionary lookup.
- Membership tests against tokenizer.all_special_tokens sometimes converted lists into sets inside the loop.
- Method calls like tokenizer.convert_tokens_to_string(...) were made repeatedly inside the loop.
- Lists like current_sub_text were reallocated over and over instead of reused.
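To make the pattern concrete, here is a simplified sketch of the shape of the original loop. It is illustrative rather than the exact pre-optimization vLLM source, but it shows the per-iteration work described above.

def convert_tokens_slow(tokenizer, output_tokens, skip_special_tokens,
                        spaces_between_special_tokens):
    sub_texts = []
    current_sub_text = []
    for token in output_tokens:
        # A fresh set is built from the special-tokens list on every iteration,
        # and get_added_vocab() rebuilds a dict on every iteration too.
        if skip_special_tokens and token in set(tokenizer.all_special_tokens):
            continue
        if token in tokenizer.get_added_vocab():
            if current_sub_text:
                # Repeated attribute/method lookup, plus a brand-new list
                # allocated on every flush.
                sub_texts.append(
                    tokenizer.convert_tokens_to_string(current_sub_text))
                current_sub_text = []
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(tokenizer.convert_tokens_to_string(current_sub_text))
    if spaces_between_special_tokens:
        return " ".join(sub_texts)
    return "".join(sub_texts)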
Each of these things might sound small. But in a loop running thousands of times, they compound.
The result? ~0.28 s for 1,000 iterations in benchmarks.
The Fix: Pre-computing, Hoisting, and Reuse
The optimized version makes a handful of targeted changes:
def _convert_tokens_to_string_with_added_encoders(
    tokenizer: AnyTokenizer,
    output_tokens: list[str],
    skip_special_tokens: bool,
    spaces_between_special_tokens: bool,
) -> str:
    sub_texts: list[str] = []
    current_sub_text: list[str] = []
    convert_tokens_to_string = tokenizer.convert_tokens_to_string
    added_vocab_set = set(tokenizer.get_added_vocab())
    all_special_tokens = set(
        tokenizer.all_special_tokens) if skip_special_tokens else ()

    for token in output_tokens:
        # Use precomputed set for skip-special check
        if token in all_special_tokens:
            continue
        if token in added_vocab_set:
            if current_sub_text:
                sub_texts.append(convert_tokens_to_string(current_sub_text))
                current_sub_text.clear()
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(convert_tokens_to_string(current_sub_text))
    if spaces_between_special_tokens:
        return " ".join(sub_texts)
    return "".join(sub_texts)
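If you want to poke at the function outside of vLLM, here is a rough standalone usage sketch. It assumes the function above is defined in your module with a stand-in for vLLM’s AnyTokenizer alias in scope (for example, transformers’ PreTrainedTokenizerBase), and uses a Hugging Face GPT-2 tokenizer purely as an example.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Ordinary tokens plus one special token appended by hand for illustration.
tokens = tokenizer.tokenize("Hello, world!") + [tokenizer.eos_token]

text = _convert_tokens_to_string_with_added_encoders(
    tokenizer,
    tokens,
    skip_special_tokens=True,              # drop tokens like <|endoftext|>
    spaces_between_special_tokens=False,
)
print(text)  # Hello, world!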
Key takeaways:
- Move invariant work out of the loop. Cache sets like added_vocab_set once; don’t rebuild them every iteration.
- Localize method lookups. convert_tokens_to_string = tokenizer.convert_tokens_to_string avoids repeated attribute resolution.
- Reuse lists. list.clear() beats reallocation when buffers are flushed repeatedly.
- Optimize membership tests. Sets give O(1) lookups, but only if you don’t recreate them each time.
With these changes, benchmarks dropped to ~0.019 s for 1,000 iterations — a 13.7× speedup.
What Python Developers Can Learn
This vLLM story reinforces a few principles that apply to all Python code:
- Profile before you optimize. Don’t guess. Use line profilers to see what’s really eating time.
- Hoist work out of hot loops. If it doesn’t change every iteration, compute it once.
- Cache expensive lookups. Object attributes and method calls aren’t free. Cache them locally.
- Reduce allocations. In performance-critical loops, reusing data structures pays off.
- Sets > lists for membership checks. Especially for repeated lookups.
These aren’t “clever hacks”; they’re repeatable patterns you can apply anywhere in Python, as the quick sketch below shows.
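Here is a minimal, self-contained timeit sketch of the hoisting and set-membership points. The token data is made up and the exact numbers will vary by machine, but the shape of the result mirrors the vLLM fix.

import timeit

# Hypothetical data standing in for a detokenization-style workload.
special = ["<s>", "</s>", "<pad>", "<unk>"] * 8
tokens = ["tok%d" % i for i in range(1000)] + special

def slow():
    out = []
    for t in tokens:
        if t in set(special):   # set rebuilt on every iteration
            continue
        out.append(t)
    return out

def fast():
    special_set = set(special)  # hoisted: built once, O(1) membership checks
    out = []
    append = out.append         # localized method lookup
    for t in tokens:
        if t in special_set:
            continue
        append(t)
    return out

print("slow:", timeit.timeit(slow, number=1000))
print("fast:", timeit.timeit(fast, number=1000))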
Why This Matters Beyond vLLM
If even vLLM, a project built by performance-conscious experts, contains opportunities for 10×+ wins, you can bet your codebase does too.
The challenge is scale: no one has time to profile every function, inspect every loop, and test every micro-optimization. That’s where automation comes in.
How Codeflash Helps
At Codeflash, our mission is to make Python code fast by default. We automatically:
- Benchmark and profile your code.
- Pinpoint performance bottlenecks.
- Suggest and verify optimizations like the ones above.
In fact, the vLLM optimization itself was discovered using Codeflash. And we see this pattern across every project we’ve worked on — there are always optimizations waiting to be found.
Whether it’s AI inference pipelines, data processing, or backend services, Codeflash ensures your code runs at peak performance — automatically, and continuously.
Final Thoughts
Performance isn’t just about bragging rights. Faster code means:
- Lower infrastructure bills
- Better user experience
- More room for innovation
And as this vLLM case shows, performance wins can be hiding in plain sight.
Want to see what Codeflash can unlock in your codebase? Get in touch with us or install it as a GitHub action and let’s make all your Python code fast by default.