When you think of vLLM, one of the most widely used LLM inference servers, you expect the code to already be highly tuned for performance. And it is. But even in such carefully engineered projects, hidden inefficiencies lurk.
Recently, in Pull Request #20413, Codeflash discovered and fixed one such hotspot, achieving a 13.7× speedup in the detokenization path.
Let’s unpack how this optimization worked, what lessons you can apply to your own code, and how Codeflash can automatically surface optimizations like this in any Python project.
The function _convert_tokens_to_string_with_added_encoders takes a sequence of tokens (including special and “added vocab” tokens) and converts them back into a string, respecting options like skip_special_tokens.
In the original implementation:
- Repeated calls to tokenizer.get_added_vocab(), an expensive dictionary operation, happened on every iteration.
- tokenizer.all_special_tokens was sometimes converted from a list into a set inside the loop.
- Calls to tokenizer.convert_tokens_to_string(...) were made repeatedly inside the loop.
- The current_sub_text buffer was reallocated over and over instead of reused.

Each of these things might sound small. But in a loop running thousands of times, they compound.
The result? ~0.28 s for 1,000 iterations in benchmarks.
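To make those costs concrete, here is a simplified sketch of the pre-optimization shape of the function. It is an illustration of the pattern, not the exact vLLM source:

```python
# Simplified sketch of the original, unoptimized pattern (illustrative, not the
# exact vLLM source). Every iteration repeats work that never changes.
def _convert_tokens_to_string_slow(tokenizer, output_tokens, skip_special_tokens,
                                   spaces_between_special_tokens):
    sub_texts = []
    current_sub_text = []
    for token in output_tokens:
        # all_special_tokens is consulted (and may be converted) per token
        if skip_special_tokens and token in tokenizer.all_special_tokens:
            continue
        # get_added_vocab() is called on every iteration, an expensive dict operation
        if token in tokenizer.get_added_vocab():
            if current_sub_text:
                # attribute lookup + method call repeated inside the loop
                sub_texts.append(
                    tokenizer.convert_tokens_to_string(current_sub_text))
                current_sub_text = []  # fresh list allocated instead of reused
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(tokenizer.convert_tokens_to_string(current_sub_text))
    if spaces_between_special_tokens:
        return " ".join(sub_texts)
    return "".join(sub_texts)
```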
The optimized version makes a handful of targeted changes:
```python
def _convert_tokens_to_string_with_added_encoders(
    tokenizer: AnyTokenizer,
    output_tokens: list[str],
    skip_special_tokens: bool,
    spaces_between_special_tokens: bool,
) -> str:
    sub_texts: list[str] = []
    current_sub_text: list[str] = []
    convert_tokens_to_string = tokenizer.convert_tokens_to_string
    added_vocab_set = set(tokenizer.get_added_vocab())
    all_special_tokens = set(
        tokenizer.all_special_tokens) if skip_special_tokens else ()

    for token in output_tokens:
        # Use precomputed set for skip-special check
        if token in all_special_tokens:
            continue
        if token in added_vocab_set:
            if current_sub_text:
                sub_texts.append(convert_tokens_to_string(current_sub_text))
                current_sub_text.clear()
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(convert_tokens_to_string(current_sub_text))
    if spaces_between_special_tokens:
        return " ".join(sub_texts)
    return "".join(sub_texts)
```

Key takeaways:
- Build sets like added_vocab_set once; don't rebuild them every iteration.
- Binding convert_tokens_to_string = tokenizer.convert_tokens_to_string to a local variable avoids repeated attribute resolution.
- list.clear() beats reallocation when buffers are flushed repeatedly.

With these changes, benchmarks dropped to ~0.019 s for 1,000 iterations — a 13.7× speedup.
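If you want to sanity-check numbers like these yourself, a rough micro-benchmark along the following lines works. The DummyTokenizer is a made-up stand-in for illustration, the import path is our assumption about where the helper lives in recent vLLM versions, and absolute timings on your machine will differ from the ones quoted above:

```python
import timeit

# Assumed import path: in recent vLLM versions this helper lives in the
# detokenizer utilities module. Adjust the path if your version differs.
from vllm.transformers_utils.detokenizer_utils import (
    _convert_tokens_to_string_with_added_encoders)

# Hypothetical stand-in tokenizer, for illustration only. It exposes just the
# attributes the helper touches; a real run would pass an actual HF tokenizer.
class DummyTokenizer:
    def __init__(self):
        self._added_vocab = {f"<extra_{i}>": 32_000 + i for i in range(64)}
        self.all_special_tokens = ["<s>", "</s>", "<pad>"]

    def get_added_vocab(self):
        return dict(self._added_vocab)

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens)

tokenizer = DummyTokenizer()
output_tokens = ["Hello", "▁world", "<extra_1>", "</s>"] * 250

elapsed = timeit.timeit(
    lambda: _convert_tokens_to_string_with_added_encoders(
        tokenizer, output_tokens,
        skip_special_tokens=True,
        spaces_between_special_tokens=True),
    number=1_000)
print(f"1,000 iterations: {elapsed:.3f}s")
```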
This vLLM story reinforces a few principles that apply to all Python code: hoist invariant work out of hot loops, precompute sets for membership checks, bind frequently used methods to local variables, and reuse buffers instead of reallocating them.
These aren’t “clever hacks” — they’re repeatable patterns you can apply anywhere in Python.
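As a quick illustration that these patterns travel well beyond detokenization, here is a hypothetical before/after for an unrelated hot loop; the function names and data are made up for the example:

```python
# Hypothetical example, unrelated to vLLM: stripping stop words from log lines.
STOP_WORDS = ["the", "a", "an", "of", "to", "and"]

def clean_lines_slow(lines: list[str]) -> list[str]:
    cleaned = []
    for line in lines:
        # set(STOP_WORDS) is rebuilt for every word of every line
        kept = [w for w in line.split() if w.lower() not in set(STOP_WORDS)]
        cleaned.append(" ".join(kept))
    return cleaned

def clean_lines_fast(lines: list[str]) -> list[str]:
    stop_words = set(STOP_WORDS)   # built once, outside the loop
    cleaned = []
    append = cleaned.append        # local binding skips repeated attribute lookup
    for line in lines:
        kept = [w for w in line.split() if w.lower() not in stop_words]
        append(" ".join(kept))
    return cleaned
```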
If even vLLM, a project built by performance-conscious experts, contains opportunities for 10×+ wins, you can bet your codebase does too.
The challenge is scale: no one has time to profile every function, inspect every loop, and test every micro-optimization. That’s where automation comes in.
At Codeflash, our mission is to make Python code fast by default. We automatically find optimizations like this one, verify that the faster version behaves identically, and surface the change so you can merge it.
In fact, the vLLM optimization itself was discovered using Codeflash. And we see this pattern across every project we’ve worked on — there are always optimizations waiting to be found.
Whether it’s AI inference pipelines, data processing, or backend services, Codeflash ensures your code runs at peak performance — automatically, and continuously.
Performance isn’t just about bragging rights. Faster code means lower latency for users, lower compute costs, and more headroom to scale.
And as this vLLM case shows, performance wins can be hiding in plain sight.
Want to see what Codeflash can unlock in your codebase? Get in touch with us or install it as a GitHub action and let’s make all your Python code fast by default.
Join our newsletter and stay updated with the latest in performance optimization automation.


