Bioinformatics Pipeline Engineering: Do This Not That: Large Files

Overview

If you've worked with bioinformatics data, you've likely encountered the dreaded experience: your Python script chugging along, consuming memory at an alarming rate, and then—crash. Out of memory. Your computer becomes unresponsive, and you're left frustrated wondering why your beautiful analysis script decided to obliterate your system. Maybe you've even taken down an entire HPC!

Here's the thing: working with large bioinformatics files—FASTQ sequences, whole-genome VCF files, BAM alignments, protein structure files—requires a fundamentally different approach than analyzing a CSV with a few thousand rows. These files often contain millions or billions of records, and the naive approach of "just load it all into memory" is a recipe for a headache.

In this "Do This Not That" series, we're going to explore practical patterns for optimizing bioinformatics software. Today, we're tackling file handling at scale.

Don't: Load Everything Into Memory

The Problem

The worst approach—and unfortunately the most common among newcomers—is to load an entire large file into memory:

# ❌ DON'T DO THIS
def analyze_fastq(filepath):
    """Load entire FASTQ file into memory. Catastrophic for large files."""
    sequences = []
    with open(filepath, 'r') as f:
        lines = f.readlines()  # Load ALL lines into memory

    # Parse fastq into records
    for i in range(0, len(lines), 4):
        header = lines[i].strip()
        seq = lines[i+1].strip()
        plus = lines[i+2].strip()
        qual = lines[i+3].strip()
        sequences.append({
            'header': header,
            'sequence': seq,
            'quality': qual
        })

    return sequences

# This will crash with a 50GB FASTQ file
results = analyze_fastq("huge_genomic_dataset.fastq")

Why this fails:

  • Memory explosion: A 100GB FASTQ file needs 100GB+ of RAM just to store the raw text, then additional memory for parsing into data structures
  • Python object overhead: Each string and dictionary adds metadata overhead (Python objects are not memory-efficient)
  • No streaming: You're waiting for the entire file to load before processing even one record
  • Swap thrashing: When RAM is exceeded, your OS swaps to disk, slowing everything to a crawl
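
The Python object overhead is easy to underestimate. Here's a quick stdlib sketch (illustrative numbers only; exact sizes vary by Python version and platform):

```python
import sys

seq = "GATTACA" * 20  # a 140-character read
record = {'header': '@read1', 'sequence': seq, 'quality': 'I' * 140}

# The sequence is 140 bytes on disk (plus a newline), but the Python str
# object carries header fields on top of the character data
print(sys.getsizeof(seq))     # noticeably more than 140 bytes
print(sys.getsizeof(record))  # dict overhead, not counting the strings it holds
```

Multiply that overhead by hundreds of millions of records and the gap between file size on disk and RAM usage becomes enormous.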

Real-World Impact

On a machine with 32GB of RAM, trying to load a 200GB human genome VCF file will:

  1. Use up all available RAM
  2. Trigger swap, using disk as virtual memory (100x slower)
  3. Freeze your system
  4. Eventually crash with an out-of-memory error

You've now lost all that time and compute with nothing to show for it. You may also be getting annoyed emails from your local sysadmin.

Do: Stream or Chunk Your Data

Approach 1: Streaming (Process Line-by-Line)

The most memory-efficient approach is to process data line-by-line or record-by-record without storing everything:

# ✅ DO THIS - Streaming approach
def analyze_fastq_stream(filepath):
    """Process FASTQ records one at a time."""
    def fastq_reader(file_obj):
        """Generator: yield one FASTQ record at a time."""
        while True:
            header = file_obj.readline().strip()
            if not header:
                break

            seq = file_obj.readline().strip()
            plus = file_obj.readline().strip()
            qual = file_obj.readline().strip()

            yield (header, seq, qual)

    # Process records as they're read (memory-constant)
    stats = {
        'total_records': 0,
        'avg_length': 0,
        'total_bases': 0,
    }

    with open(filepath, 'r') as f:
        for record in fastq_reader(f):
            # Process each record individually
            stats['total_records'] += 1
            stats['total_bases'] += len(record[1])

    stats['avg_length'] = stats['total_bases'] / max(stats['total_records'], 1)
    return stats

# This handles terabyte-scale files with constant memory
results = analyze_fastq_stream("huge_genomic_dataset.fastq")

Advantages:

  • Constant memory: Only one record in memory at any time
  • Immediate processing: Start analyzing data before the file finishes loading
  • Infinite scalability: Handle files larger than your system's RAM
  • Graceful degradation: Works whether your file is 1MB or 1TB
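
The same reader can also be written with itertools.islice, which pulls exactly four lines per record off the file iterator. A minimal sketch (function name is my own):

```python
from itertools import islice

def fastq_records(file_obj):
    """Yield (header, sequence, quality) tuples, four lines at a time."""
    while True:
        chunk = list(islice(file_obj, 4))
        if len(chunk) < 4:
            break  # end of file (or a truncated final record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in chunk)
        yield header, seq, qual
```

Because islice consumes the same underlying iterator on each pass, only four lines are ever held in memory at once.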

Approach 2: Chunked Reading (Process in Batches)

For operations that benefit from batch processing (like machine learning inference), read in managed chunks.

Typically, when I do this, it's because I want to write interim results to Parquet files. We get one Parquet file per chunk, which can then be spread out across a scheduler.

import pyarrow as pa
import pyarrow.parquet as pq

# ✅ DO THIS - Chunked approach
def analyze_fastq_chunked(filepath, chunk_size=10000):
    """Process FASTQ records in fixed-size batches."""
    def batch_fastq_reader(file_obj, batch_size):
        """Generator: yield batches of FASTQ records as parallel lists."""
        headers, seqs, quals = [], [], []
        while True:
            header = file_obj.readline().strip()
            if not header:
                if headers:
                    yield headers, seqs, quals
                break

            seq = file_obj.readline().strip()
            plus = file_obj.readline().strip()
            qual = file_obj.readline().strip()

            headers.append(header)
            seqs.append(seq)
            quals.append(qual)

            if len(headers) >= batch_size:
                yield headers, seqs, quals
                headers, seqs, quals = [], [], []

    # Process in batches (bounded memory usage)
    batch_num = 0
    with open(filepath, 'r') as f:
        for headers, seqs, quals in batch_fastq_reader(f, chunk_size):
            # Process this batch here (e.g., run ML inference), then save
            # additional processing for later by persisting to Parquet
            table = pa.Table.from_pydict({
                'header': headers,
                'sequence': seqs,
                'quality': quals,
            })
            pq.write_table(table, f'fastq-records-{batch_num}.parquet')
            print(f"Processed batch of {len(headers)} records")
            batch_num += 1

    return batch_num  # number of Parquet files written

Once you have your data in an Arrow/pandas table, you can replace many bioinformatics string-matching functions with vectorized equivalents. Running code against all elements in a table column is nearly always faster than running code against all items in a list. (We'll be talking about this next!)

Advantages:

  • Bounded memory: Memory usage stays constant regardless of file size
  • Batch efficiency: Some operations (ML, vectorized computing) are faster in batches
  • Progress tracking: You can report progress as batches complete
  • Parallelizable: Each batch can be processed independently
  • Simple: Keep your code as simple as possible. When running code millions of times even small delays really add up!

Approach 3: Using Libraries Built for Large Files

This one is a bit fuzzy. I'm all for using code that I didn't write. Ok? All about it.

But my dreams just don't always match my reality of needing my code to run as quickly as possible. AWS credits don't grow on trees!

Bioinformatics libraries are built to parse bioinformatics data. This can be a good thing. Some libraries, such as pysam, are optimized and fast.

Other libraries, such as BioPython, less so. Sometimes you need to do what you need to do, but most often you'll be better off rolling your own, or better yet using a library such as pyarrow.

# ✅ SOMETIMES DO THIS - Use specialized libraries
from Bio import SeqIO  # BioPython

# BioPython: Built-in streaming parser
def analyze_with_biopython(fastq_file):
    """BioPython handles streaming automatically."""
    stats = {'count': 0, 'gc_bases': 0}

    for record in SeqIO.parse(fastq_file, "fastq"):
        # Process each record without loading the whole file into memory
        stats['count'] += 1
        stats['gc_bases'] += record.seq.count('G') + record.seq.count('C')

    return stats

For even faster batch operations on accumulated sequences, use PyArrow's vectorized string functions:

# ✅ BEST FOR BATCH PROCESSING: Accumulate and use PyArrow compute functions
import pyarrow as pa
import pyarrow.compute as pc
import pysam

def analyze_with_pyarrow(fastq_file):
    """Accumulate sequences in batches, use vectorized string counting."""
    def batch_gc(seqs):
        """Sum G + C occurrences across a batch with vectorized kernels."""
        seq_array = pa.array(seqs)
        gc = pc.add(pc.count_substring(seq_array, pattern="G"),
                    pc.count_substring(seq_array, pattern="C"))
        return pc.sum(gc).as_py()

    sequences = []
    with pysam.FastqFile(fastq_file) as fq:
        for record in fq:
            sequences.append(record.sequence)

            # Process in batches of 10,000
            if len(sequences) >= 10000:
                print(f"Batch GC count: {batch_gc(sequences)}")
                sequences = []

    # Handle remaining sequences
    if sequences:
        print(f"Final batch GC count: {batch_gc(sequences)}")

pysam is nearly always going to be fastest for file I/O, and it accepts most major bioinformatics file formats.

# ✅ SOMETIMES DO THIS - Use specialized libraries
# pysam: Efficient BAM file iteration
import pysam  # SAM/BAM handling

def analyze_bam_stream(bam_file):
    """pysam streams BAM records efficiently."""
    stats = {'total_reads': 0, 'mapped_reads': 0}

    with pysam.AlignmentFile(bam_file, "rb") as bam:
        for read in bam:
            stats['total_reads'] += 1
            if not read.is_unmapped:
                stats['mapped_reads'] += 1

    return stats

Why this matters:

  • Battle-tested: These libraries are optimized for bioinformatics use cases
  • Zero-copy operations: Many use C/C++ backends that don't allocate excess memory
  • Standard patterns: Your colleagues will recognize the code
  • Feature-rich: Built-in support for filtering, transforming, etc.

Compressed Files: Even More Important

Large bioinformatics files are almost always compressed (gzip, bgzip, zstd). Reading compressed files requires special attention:

# ❌ DON'T: Decompress entire file first
import gzip
import shutil

# This creates a huge temporary uncompressed file
with gzip.open('huge_file.fastq.gz', 'rb') as f_in:
    with open('huge_file.fastq', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)  # Now you have 200GB on disk

# ✅ DO: Use smart_open for transparent decompression
from smart_open import open as smart_open

def stream_compressed_fastq(compressed_file):
    """Read compressed file without decompressing to disk.

    smart_open detects the compression format from the file extension (gzip
    and bzip2 out of the box; other formats can be registered) and handles
    decompression transparently.
    """
    with smart_open(compressed_file, 'r') as f:
        for i, line in enumerate(f):
            # Process line by line, decompression happens on-the-fly
            if i % 4 == 0:  # Header line
                print(f"Processing: {line.strip()}")
            if i >= 15:  # Just demo first 4 records
                break

# smart_open works across compression formats
def process_any_format(filepath):
    """Works with .gz and .bz2 out of the box; other formats via register_compressor."""
    with smart_open(filepath, 'r') as f:
        for record in f:
            # Compression is transparent to your code
            yield record.strip()

Memory savings:

  • Format-agnostic: Handles gzip and bzip2 automatically; additional formats can be registered
  • Streaming decompression: Only decompress what you're actively reading
  • No intermediate files: Saves disk space equal to uncompressed file size
  • No surprises: Same API regardless of compression type
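
If you'd rather avoid a dependency, the stdlib's gzip module streams too: decompression happens as you iterate, so memory stays flat. A minimal sketch:

```python
import gzip

def stream_gzip_lines(path):
    """Yield decompressed lines one at a time from a .gz file."""
    with gzip.open(path, 'rt') as f:  # 'rt' = text mode over the stream
        for line in f:
            yield line.rstrip('\n')
```

The trade-off is one module per format (gzip, bz2, lzma), whereas smart_open picks by extension.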

Performance Comparison

Let's quantify the difference with concrete numbers:

import time
import psutil
import os

def monitor_memory(func, *args, **kwargs):
    """Decorator to measure peak memory usage."""
    process = psutil.Process(os.getpid())
    mem_before = process.memory_info().rss / 1024 / 1024  # MB

    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start

    mem_after = process.memory_info().rss / 1024 / 1024  # MB
    mem_delta = mem_after - mem_before

    print(f"Function: {func.__name__}")
    print(f"  Time: {elapsed:.2f}s")
    print(f"  Memory Delta: {mem_delta:.1f}MB")

    return result

# Hypothetical results with 1GB FASTQ file:
# 
# load_into_memory: Time: 45s, Peak Memory: 1200MB ❌
# stream_approach: Time: 32s, Peak Memory: 2MB ✅
# chunked_approach: Time: 38s, Peak Memory: 150MB ✅
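
RSS deltas from psutil can miss short-lived spikes. For Python-level allocations, the stdlib's tracemalloc reports a true peak; a small sketch (helper name is my own):

```python
import tracemalloc

def peak_python_memory(func, *args, **kwargs):
    """Return (result, peak bytes allocated by Python during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# A million-element list briefly allocates several MB
_result, peak = peak_python_memory(lambda: [0] * 1_000_000)
print(f"peak ≈ {peak / 1024 / 1024:.1f} MB")
```

Note that tracemalloc only sees allocations made through Python's allocator, so memory used inside C extensions (e.g. pysam buffers) won't show up.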

For time tracking, use funcy's print_durations decorator.

from funcy import print_durations
import time

def stream_approach():
    """Simulated streaming approach."""
    time.sleep(2)
    return {"records": 1000000}

def chunked_approach():
    """Simulated chunked approach."""
    time.sleep(2.3)
    return {"records": 1000000}

@print_durations
def timed_stream():
    return stream_approach()

@print_durations
def timed_chunked():
    return chunked_approach()

# Run and get automatic timing output; funcy prints each call's duration
result1 = timed_stream()
result2 = timed_chunked()

Best Practices Checklist

  • ✅ Always stream: Unless you have a very specific reason to load everything, use streaming or chunked approaches. You never know when your file will 200x in size!
  • ✅ Use proven libraries: BioPython, pysam, pandas with chunksize, polars with streaming
  • ✅ Monitor memory: Use psutil or memory_profiler to verify your approach
  • ✅ Profile your code: Some operations you think are expensive are actually fast, and vice versa
  • ✅ Test with real data: Your 1MB test file behaves very differently than a 100GB production file
  • ✅ Plan for growth: Design your system assuming files will get out of hand.
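
The pandas chunksize option mentioned above follows the same pattern — a quick sketch (file path and helper name are made up):

```python
import pandas as pd

def count_rows_chunked(path, chunk_size=100_000):
    """Count CSV rows without loading the whole file.

    chunksize turns read_csv into an iterator of DataFrames,
    so only one chunk is in memory at a time.
    """
    return sum(len(chunk) for chunk in pd.read_csv(path, chunksize=chunk_size))
```

Each chunk is a regular DataFrame, so any per-chunk aggregation works the same way.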

Common Patterns by File Type

FASTQ/FASTA

import pysam

with pysam.FastqFile("file.fastq") as fq:
    for record in fq:
        # record.name: read name
        # record.sequence: DNA sequence
        # record.quality: quality string (get_quality_array() for numeric scores)
        print(f"{record.name}: {record.sequence}")

BAM/SAM

import pysam

with pysam.AlignmentFile("file.bam", "rb") as f:
    for read in f:
        pass  # Process each aligned read

VCF

import pysam

with pysam.VariantFile("file.vcf.gz") as vcf:
    for record in vcf:
        pass  # Process each variant

CSV/Tabular

from pyarrow import csv

# open_csv streams the file in record batches instead of loading it all
reader = csv.open_csv("large_file.csv")
for chunk in reader:
    pass  # Process each pyarrow RecordBatch

Key Takeaway

Large file handling isn't about fancy algorithms—it's about respecting memory constraints and processing data efficiently. The pattern is always the same:

Don't load everything. Stream or chunk instead.

This simple principle will make your bioinformatics software scale from laptop to HPC cluster without modification. Your future self (and your system administrator) will thank you.


Next in the "Do This Not That" series: Parallelization patterns for bioinformatics workflows.