Bioinformatics

Speeding Up UMI-tools with Cython: A 4.6x Performance Win

Overview

You know that feeling when you're running a bioinformatics pipeline and you watch the progress bar crawl along at a glacial pace? You grab a coffee, check your email, maybe start a second analysis on another dataset... and 90 seconds later your first job is still processing the same FASTQ file.

That was me with UMI-tools whitelist. Don't get me wrong—UMI-tools is a fantastic, battle-tested tool for single-cell RNA-seq analysis, and I love using other people's code. But when you're iterating on analysis parameters or processing dozens of samples, those 85-second runs start to add up. My patience is a very finite resource!

So I did what any reasonable person would do: I rewrote the hot paths in Cython and got a 4.6x speedup (85.9s → 18.8s) without changing a single line of the actual algorithm. Same inputs, same outputs, just adding in cython to our project gets us a considerable speedup.

In this post, I'll walk you through how I identified what to optimize, give you a quick Cython primer, and show you the specific changes that made the difference. By the end, you'll know exactly how to apply these techniques to your own Python bottlenecks.

Why UMI-tools? Why `whitelist`?

If you work with single-cell RNA-seq data, you've probably used UMI-tools at some point. The whitelist command is particularly critical—it takes millions of raw cell barcodes from your FASTQ file and figures out which ones are "real" cells versus background noise or sequencing errors.

(More importantly, its the job I was hired to do.)

For a typical 10x Genomics dataset:
- Input: 7.2 million reads with 194,000 unique barcodes
- Task: Identify the top 100 true cells, correct sequencing errors
- Output: A whitelist of valid barcodes with error-correction mappings

This is a perfect candidate for optimization because:
1. It's pure computation (no I/O waiting)
2. It runs on every single-cell dataset (high-impact)
3. The algorithm is sound—we just need it to run faster

Finding the Hot Paths: Profile Before You Optimize

Before you start optimizing, you need to know what to optimize. The cardinal rule: never guess, always profile.

I threw cProfile at the UMI-tools code and got this breakdown:

# Rough profiling breakdown (using cProfile + pstats)
Total runtime: 85.9 seconds

Top functions by cumulative time:
1. FASTQ parsing + barcode counting: 76.0s (88%)
2. Error correction mapping: 8.6s (10%)
3. Knee detection: 1.3s (2%)
4. Everything else: ~0.1s (<1%)

Three functions accounted for 99% of the runtime. These are your hot paths—the only things worth optimizing.

Key insight: The FASTQ parsing loop runs 7.2 million times. The Hamming distance function gets called ~194 million times inside the error-correction BK-tree. Even tiny per-call overhead adds up when you're doing millions of iterations.

This is exactly the kind of code that benefits from Cython: tight loops with predictable types, called millions of times.

What to Look For When Optimizing Code

Here's my mental checklist when I'm hunting for optimization targets:

1. Tight Loops with Many Iterations

# ❌ Python overhead on 7 million iterations
for read in fastq_file:
    cell_barcode = extract_barcode(read)  # function call overhead
    counter[cell_barcode] += 1            # dict lookup

Every function call, every generator yield, every dictionary operation has Python overhead. When you're doing it millions of times, that overhead becomes your bottleneck.

2. String Operations in Hot Paths

# ❌ Called 194 million times in the BK-tree
def hamming_distance(first, second):
    return sum([not a == b for a, b in zip(first, second)])

This innocent-looking function allocates a list, creates a zip iterator, and does a comprehension—all in pure Python. Cython can replace this with a tight C loop that compares raw bytes.

3. Repeated NumPy Allocations

# ❌ Inside a convergence loop (10-15 iterations)
for iteration in range(max_iterations):
    distances = np.vstack(...)      # allocate
    repeated = np.repmat(...)        # allocate
    result = np.outer(...)           # allocate

NumPy is fast for vectorized operations, but allocating new arrays repeatedly? That's slow. Cython memoryviews let you iterate over arrays without creating intermediates.

4. Type-Ambiguous Code

Python's dynamic typing is convenient, but it means the interpreter has to figure out types at runtime:

# Python has to check types on every iteration
for i in range(len(barcodes)):
    if barcodes[i] == target:  # type checks
        count += 1              # type checks

If you know i is always an integer and count is always an integer, why make Python check every time?

(NOTE - this is one of the reasons why Cython is so much faster. You do need to add some type annotations, but once you're there you get a considerable speed up.)

Cython Primer: The Bare Minimum You Need to Know

Cython sits in the sweet spot between "I want to write Python" and "I need C performance." You write Python-like code, but with type annotations that let Cython compile it to C.

I'm a big believer in the magic of Cython. I can get all of my nice Python-y code. I love typer and the rest of the python ecosystem is so convenient. Most of the code that comes to me is written in Python, and I wouldn't want to have to rewrite every line in C++. Typically I'll benchmark the code, find a slow spot, and rewrite that function in Cython. Often I don't even need to do much. Just the compilation step gets me the speedup I need.

The Three Function Types

Depending on your needs you write Cython in several ways. The easiest is to use the Python implementation with def, but if you have a function that lives in a deep dark cave, you can write it in pure Cython using cdef. Since we're not translating from Python <-> C we get a considerable speed up.

Most often I use cpdef, which is the mixing of the two.

Side Note: Py_ssize_t is basically the int type. Its just a special int type the compiler translates to sane instructions so we don't have to care.

1. `def` — Python-accessible functions

def hamming_distance(str first, str second):
    """Can be called from Python code."""
    # ... implementation

Use def when:
- You need to call this function from regular Python code
- It's part of your public API
- You're okay with a tiny bit of Python/C conversion overhead

2. `cdef` — Internal C functions

cdef int _fast_count(char* sequence, int length):
    """Pure C function, not accessible from Python."""
    # ... implementation

Use cdef when:
- This function is only called from other Cython code
- You want maximum speed (no Python wrapper)
- You don't need to call it from regular Python

3. `cpdef` — Both worlds

cpdef int count_bases(str sequence):
    """Callable from Python, but uses C internally."""
    # ... implementation

Use cpdef when:
- You want both Python accessibility AND C speed
- Cython will create a C version for internal calls and a Python wrapper for external calls

Type Annotations: Where the Magic Happens

# Without types (pure Python speed)
def slow_hamming(first, second):
    dist = 0
    for i in range(len(first)):
        if first[i] != second[i]:
            dist += 1
    return dist

# With types (C speed)
def fast_hamming(str first, str second):
    cdef Py_ssize_t dist = 0
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(first)

    for i in range(n):
        if first[i] != second[i]:
            dist += 1
    return dist

The difference:
- slow_hamming: Python checks types on every iteration
- fast_hamming: Compiles to a tight C loop with no type checks

Common type declarations:
- cdef int x — C integer
- cdef Py_ssize_t i — Python-sized integer (for indices)
- cdef str s — Python string (still a Python object, but declared)
- cdef double[:] — Memory view of a double array (no copies!)

Compiler Directives: Safety Off, Speed On

# cython: language_level=3, boundscheck=False, wraparound=False

boundscheck=False: Don't check if array indices are in bounds (assume you got it right)
wraparound=False: Disable Python-style negative indexing (list[-1])

These are dangerous if you have bugs in your indexing, but they eliminate runtime checks. Use them only after your code is working and tested.

The Three Optimizations

Let me walk you through the three hot-path functions I rewrote and exactly what made them faster.

1. FASTQ Parsing + Barcode Counting (4.7x faster)

What it does: Read 7.2 million FASTQ records, extract the 16-character cell barcode from each read, count occurrences.

Original Python approach:

# Three separate steps with overhead between each
def analyze_fastq_original(filepath, pattern):
    counter = Counter()

    # Step 1: Generator yielding FASTQ records
    for read in fastqIterate(filepath):
        # Step 2: Extract barcode via method call
        barcode = getBarcodes(read, pattern)
        # Step 3: Update counter
        counter[barcode] += 1

    return counter

Problems:
- Generator overhead (fastqIterate yields through multiple layers)
- Method call overhead (getBarcodes dispatched every read)
- Dictionary operations in pure Python

Cython optimization:

# _fastq_parser.pyx
def fastq_parse_and_count(object infile, str pattern, str method="reads", 
                          Py_ssize_t subset_reads=100_000_000):
    cdef Py_ssize_t cell_len = 0
    cdef Py_ssize_t umi_start = 0
    cdef Py_ssize_t pattern_len = len(pattern)

    # Pre-compute barcode length from pattern
    cdef Py_ssize_t i
    for i in range(pattern_len):
        if pattern[i] == 'C':
            cell_len += 1

    umi_start = cell_len

    cdef Py_ssize_t n_reads = 0
    cdef Py_ssize_t n_valid = 0

    cell_barcode_counts = collections.Counter()

    cdef str line1, line2, line3, line4
    cdef str seq, cell

    # Fused loop: read + extract + count in one pass
    while True:
        line1 = infile.readline()
        if not line1:
            break
        line2 = infile.readline()
        line3 = infile.readline()
        line4 = infile.readline()

        n_reads += 1

        # Direct string slicing with typed indices
        seq = line2.strip()
        if len(seq) < pattern_len:
            continue

        cell = seq[:cell_len]
        cell_barcode_counts[cell] += 1
        n_valid += 1

        if n_valid >= subset_reads:
            break

    return cell_barcode_counts, n_reads, n_valid

Why it's faster:
- Fused loop: Read → extract → count happens in one pass
- Typed indices: cdef Py_ssize_t i compiles to a C integer
- Direct slicing: seq[:cell_len] becomes pointer arithmetic
- No generator overhead: Direct readline() calls

Result: 76.0s → 16.2s (4.7x faster)

2. Hamming Distance (3.3x faster in context)

What it does: Compare two barcode strings and count mismatches. Called ~194 million times inside the BK-tree during error correction.

Original Python:

def hamming_distance(first, second):
    """Pure Python with list comprehension."""
    return sum([not a == b for a, b in zip(first, second)])

Problems:
- zip() creates an iterator
- List comprehension allocates memory
- sum() iterates over the list
- All in Python with type checks

Cython optimization:

# _hamming.pyx
def hamming_distance(str first, str second):
    cdef Py_ssize_t n = len(first)
    if n != <Py_ssize_t>len(second):
        return float('inf')

    cdef Py_ssize_t dist = 0
    cdef Py_ssize_t i

    # Tight C loop, no allocations
    for i in range(n):
        if first[i] != second[i]:
            dist += 1

    return dist

Why it's faster:
- No allocations: Just a counter and a loop
- Typed loop: i is a C integer, no Python overhead
- Direct character comparison: Compiles to char comparison
- Early exit: Returns inf immediately for length mismatch

Result: Called 194 million times, each call ~3x faster → 8.6s total → 2.6s total (3.3x faster)

3. Knee Detection (13x faster)

What it does: Find the "elbow" in a curve by computing perpendicular distances. Involves iterative refinement with matrix operations.

Original Python:

def getKneeDistance(values, x, y):
    # Build matrices on every iteration
    xy_data = np.vstack([x, y]).T
    first_point = xy_data[0]
    line_vec = xy_data[-1] - first_point
    line_vec_norm = line_vec / np.linalg.norm(line_vec)

    vec_from_first = xy_data - np.repmat(first_point, xy_data.shape[0], 1)
    scalar_product = np.sum(vec_from_first * np.repmat(line_vec_norm, xy_data.shape[0], 1), axis=1)

    # ... more matrix operations

Problems:
- np.vstack allocates a new array
- np.repmat creates repeated copies of data
- np.outer creates another temporary matrix
- All of this happens 10-15 times in the convergence loop

Cython optimization:

# _knee.pyx
def knee_distance(double[:] x, double[:] y):
    cdef Py_ssize_t n = x.shape[0]
    cdef double x0 = x[0]
    cdef double y0 = y[0]
    cdef double x1 = x[n-1]
    cdef double y1 = y[n-1]

    # Pre-compute line parameters
    cdef double dx = x1 - x0
    cdef double dy = y1 - y0
    cdef double line_length = (dx*dx + dy*dy)**0.5

    cdef double[:] distances = np.empty(n, dtype=np.float64)
    cdef Py_ssize_t i

    # Single pass over data
    for i in range(n):
        # Perpendicular distance formula
        distances[i] = abs((dy * (x[i] - x0) - dx * (y[i] - y0)) / line_length)

    return np.asarray(distances)

Why it's faster:
- Memory views: double[:] is a typed view, no copies
- Pre-computed constants: Line parameters computed once
- Single loop: One pass instead of multiple matrix operations
- No temporary arrays: Direct computation into result array

Result: 1.3s → 0.1s (13x faster)

The Results: 4.6x Faster, Same Output

Here's the final breakdown:

Stage	Original	Cython	Speedup
FASTQ parsing + counting	76.0s	16.2s	4.7x
Knee detection	1.3s	0.1s	13x
Error correction	8.6s	2.6s	3.3x
TOTAL	85.9s	18.8s	4.6x

Memory usage: Basically unchanged (~30 MB)

Output verification: All 21 test cases pass—byte-for-byte identical output to the original.

Running the Optimized Version

The full source code is available on GitHub: dabble-of-devops-bioanalyze/umitools-cython

Installation is dead simple:

git clone https://github.com/dabble-of-devops-bioanalyze/umitools-cython.git
cd umitools-cython
pip install -e .

Then run it just like UMI-tools:

# Cython-accelerated version
umitools-cy whitelist \
    --stdin data/hgmm_100_R1.fastq.gz \
    --bc-pattern CCCCCCCCCCCCCCCCNNNNNNNNNN \
    --set-cell-number 100

Output: 18.8 seconds

Compare to the original:

# Original UMI-tools
umi_tools whitelist \
    --stdin data/hgmm_100_R1.fastq.gz \
    --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
    --set-cell-number=100 \
    --log2stderr > whitelist.txt

Output: 85.9 seconds

Same inputs, same outputs, 4.6x faster. No downsides.

Lessons Learned: When Cython Wins

After this project, I've got a pretty clear mental model for when Cython is the right tool:

✅ Use Cython When:

You have tight loops — Anything called thousands or millions of times
Types are predictable — You know i is always an int, s is always a string
You're doing string operations — Character-level operations in hot paths
You want Python compatibility — Drop-in replacements for existing code
NumPy is allocating too much — Intermediate arrays that could be views

❌ Skip Cython When:

It's I/O-bound — Network calls, disk reads, database queries (Cython won't help)
It's already using C libraries — If you're calling NumPy/Pandas/Polars/PyArrow, those are already in C
Types are highly dynamic — If you can't declare types, Cython can't optimize
It's rarely called — One-time setup code doesn't benefit from optimization

Alternative Approaches (And Why I Didn't Use Them)

NumPy vectorization: Limited here—we're parsing text, not doing linear algebra. PyArrow would help for batch operations, but the hot path is still parsing.

PyPy JIT: Incompatible with pybktree, which has C extensions. Would give ~1.5-2x at best.

Rust + PyO3: Probably 5-10x faster, but requires:
- Learning Rust (high barrier)
- Complex build system
- FFI layer maintenance

Raw C extension: Same speed as Cython, but way more painful to write and maintain.

Cython hits the sweet spot: Python-like syntax, compiles to C, works with existing Python packages, 4.6x speedup. That's a win, and I'm happy to move on with my day.

Key Takeaways

Profile first, optimize second — Don't guess what's slow, measure it
Hot paths matter most — 3 functions = 99% of runtime
Cython is practical — Python syntax, C speed, minimal learning curve
Type annotations unlock performance — cdef variables compile to C types
Tight loops benefit most — Millions of iterations = huge gains
Correctness is non-negotiable — Test everything, verify outputs match

This project took me from "UMI-tools is slow" to "UMI-tools is 4.6x faster" in a few hours of focused work. Same algorithm, same outputs, just way less waiting around.

If you've got a Python bottleneck in your bioinformatics pipeline, Cython is probably the right tool for the job. You don't need to become a C expert—you just need to find your slowest functions and add some type annotations.

Then watch those benchmarks fly!

References

umitools-cython GitHub Repository — Full source code and benchmarks
Cython Documentation
UMI-tools GitHub
pybktree: BK-tree implementation
Profiling Python code with cProfile

Speeding Up UMI-tools with Cython: A 4.6x Performance Win

Overview

Why UMI-tools? Why `whitelist`?

Finding the Hot Paths: Profile Before You Optimize

What to Look For When Optimizing Code

1. Tight Loops with Many Iterations

2. String Operations in Hot Paths

3. Repeated NumPy Allocations

4. Type-Ambiguous Code

Cython Primer: The Bare Minimum You Need to Know

The Three Function Types

1. `def` — Python-accessible functions

2. `cdef` — Internal C functions

3. `cpdef` — Both worlds

Type Annotations: Where the Magic Happens

Compiler Directives: Safety Off, Speed On

The Three Optimizations

1. FASTQ Parsing + Barcode Counting (4.7x faster)

2. Hamming Distance (3.3x faster in context)

3. Knee Detection (13x faster)

The Results: 4.6x Faster, Same Output

Running the Optimized Version

Lessons Learned: When Cython Wins

✅ Use Cython When:

❌ Skip Cython When:

Alternative Approaches (And Why I Didn't Use Them)

Key Takeaways

References

Read next

Bioinformatics Pipeline Engineering: Do This Not That: Split Tasks

Bioinformatics Pipeline Engineering: Do This Not That: Large Files

Bioinformatics Solutions on AWS Book - Preview

Overview

Why UMI-tools? Why whitelist?

Finding the Hot Paths: Profile Before You Optimize

What to Look For When Optimizing Code

1. Tight Loops with Many Iterations

2. String Operations in Hot Paths

3. Repeated NumPy Allocations

4. Type-Ambiguous Code

Cython Primer: The Bare Minimum You Need to Know

The Three Function Types

1. def — Python-accessible functions

2. cdef — Internal C functions

3. cpdef — Both worlds

Type Annotations: Where the Magic Happens

Compiler Directives: Safety Off, Speed On

The Three Optimizations

1. FASTQ Parsing + Barcode Counting (4.7x faster)

2. Hamming Distance (3.3x faster in context)

3. Knee Detection (13x faster)

The Results: 4.6x Faster, Same Output

Running the Optimized Version

Lessons Learned: When Cython Wins

✅ Use Cython When:

❌ Skip Cython When:

Alternative Approaches (And Why I Didn't Use Them)

Key Takeaways

References

Read next

Why UMI-tools? Why `whitelist`?

1. `def` — Python-accessible functions

2. `cdef` — Internal C functions

3. `cpdef` — Both worlds