Troubleshooting Long-Running Dorado Basecalling Jobs: A Case Study
The Problem

You're running Oxford Nanopore basecalling with methylation detection using Dorado on an HPC cluster, and your job keeps timing out after 12 hours. Each time you restart with --resume-from, it takes 50 minutes just to scan the existing BAM file before continuing. Meanwhile, you're anxious about whether it's actually making progress or just reprocessing the same reads.

Sound familiar? Here's what I learned from debugging this exact scenario.

Understanding the Scale

First, let's talk about expectations. I was used to working with ~12 million reads from FASTQ files (already basecalled, filtered at Q10). But when working directly with POD5 files, the reality is different:

pod5 view *.pod5 --include "read_id" --output summary.tsv
wc -l summary.tsv   # note: the count includes the TSV header row
# Result: 45,030,095 reads

45 million raw reads before any quality filtering! This explained why basecalling was taking so much longer than expected. With methylation calling (sup,5mC_5hmC model), you're looking at:

  • ~45 million reads to process
  • Complex modified base detection
  • High-accuracy "super" model
  • Potentially hours or days of GPU time
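The bullets above can be turned into a back-of-envelope runtime estimate. This is only a sketch: the reads-per-second figure below is an assumption for illustration, so substitute the throughput you actually observe from the Dorado log or from BAM file growth.

```shell
# Rough GPU-time estimate for 45M reads.
# reads_per_sec is an ASSUMED throughput; measure your own.
reads=45000000
reads_per_sec=500
hours=$(( reads / reads_per_sec / 3600 ))
echo "~${hours} h of GPU time at ${reads_per_sec} reads/s"
```

At the assumed 500 reads/s this works out to roughly a day of GPU time, which is why a 12-hour walltime keeps running out.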

The Resume Process: Why So Slow?

When using --resume-from, Dorado needs to:

  1. Scan the entire existing BAM file to extract all read IDs that were already processed
  2. Index which POD5 files still need processing
  3. Skip completed reads during basecalling

With 12+ million reads already processed, this scanning phase alone takes ~50 minutes. This is normal and expected behavior, not a bug.

The log output shows:

[2025-10-20 11:07:59.651] [info] > Inspecting resume file...
[2025-10-20 11:58:49.769] [info] > 12152968 original read ids found in resume file.

That's 51 minutes just to count what's done.
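The two timestamps above also let you back out the scan rate, which is handy for predicting how long future resumes will take as the BAM keeps growing:

```shell
# ~12.15M read IDs inspected in ~51 minutes (numbers from the log above)
ids=12152968
secs=$(( 51 * 60 ))
echo "scan rate: ~$(( ids / secs )) read IDs/s"
```

At roughly 4,000 read IDs/s, each additional million reads in the BAM adds about four more minutes to the next resume's inspection phase.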

Checking for Duplicate Processing

My biggest concern was: "Is it reprocessing the same reads?" Here's how to check:

# Count duplicates (but wait until the job finishes!)
samtools view PollenDec24_methylbasecall.bam | cut -f1 | sort | uniq -d | wc -l

Important caveat: You can't run this while Dorado is writing to the file. You'll get errors like:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_uncompress] CRC32 checksum mismatch

These errors occur because the BAM file doesn't have proper end markers yet. Wait until the job completes, or make a copy first.

In my case: 0 duplicates - the resume function works correctly!
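The sort | uniq -d idiom used above is easy to sanity-check on synthetic input before pointing it at a multi-gigabyte BAM; here read-1 is the only duplicated ID, so it is the only line printed:

```shell
# uniq -d prints only lines that occur more than once in sorted input
printf 'read-1\nread-2\nread-1\nread-3\n' | sort | uniq -d
# With a real (finished) BAM:
#   samtools view file.bam | cut -f1 | sort | uniq -d
```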

Monitoring Progress Without Breaking Things

Since you can't read the BAM while it's being written, here are safe alternatives:

1. Watch file size growth

watch -n 60 "ls -lh PollenDec24_methylbasecall.bam"

2. Check GPU utilization

nvidia-smi -l 5

If GPU utilization is high (~90-100%), basecalling is happening.

3. Monitor the SLURM output file

tail -f slurm-JOBID.out

4. Check process status

ps aux | grep dorado

The Append vs Overwrite Confusion

Initially, I was using:

--resume-from input.bam ... >> input.bam

This appends to the same file being read from. It seemed to work here, but it is risky: Dorado writes a complete BAM (header included) on every run, so appending embeds a second header mid-file, and reading from and writing to the same file simultaneously invites corruption. A cleaner approach is:

# Read from old file, write to new file
--resume-from input.bam ... > output_continued.bam

# Then concatenate later if needed. Dorado output is unsorted, so use
# samtools cat (simple concatenation) rather than samtools merge,
# which expects position-sorted input
samtools cat -o final.bam input.bam output_continued.bam

Adding Quality Filtering to Speed Things Up

Since we know there are 45 million raw reads but only ~12 million will pass Q10 filtering, we can speed up the pipeline significantly:

dorado basecaller \
  sup,5mC_5hmC \
  --min-qscore 10 \
  --resume-from PollenDec24_methylbasecall.bam \
  pod5/ \
  > PollenDec24_methylbasecall_Q10.bam

Quality score options:

  • --min-qscore 7 - Very permissive (~80% accuracy)
  • --min-qscore 9 - Standard (~87% accuracy)
  • --min-qscore 10 - Good quality (~90% accuracy)
  • --min-qscore 15 - High quality (~97% accuracy)

For methylation calling, Q9-Q10 is a good balance between yield and accuracy.
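The accuracy figures above come straight from the Phred definition, accuracy = 1 - 10^(-Q/10), which you can evaluate for any threshold:

```shell
# Expected per-base accuracy for a given Phred quality score Q
for q in 7 9 10 15; do
  awk -v q="$q" 'BEGIN { printf "Q%d -> %.1f%% accuracy\n", q, (1 - 10^(-q/10)) * 100 }'
done
```

Running this reproduces the table: Q7 is 80.0%, Q9 is 87.4%, Q10 is 90.0%, and Q15 is 96.8%.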

Managing Long Jobs on HPC Systems

Use longer walltime requests

#SBATCH --time=48:00:00

Use screen or tmux for interactive jobs

tmux new -s dorado
# Run your command
# Detach: Ctrl+B then D
# Reattach: tmux attach -t dorado

Pre-download models to avoid repeated downloads

dorado download --model sup,5mC_5hmC
# (some Dorado versions expect a full model name here rather than the
# short complex; run "dorado download --list" to see valid names)

This saves ~2 minutes per job restart.

Key Takeaways

  1. 50 minutes to scan a BAM with 12M reads is normal - be patient
  2. 45M raw reads ≠ 12M filtered reads - check your POD5 file counts
  3. You can't read a BAM file while it's being written - wait or copy first
  4. Add --min-qscore during basecalling to filter early and save time
  5. The --resume-from function works correctly - no duplicate processing
  6. Monitor file size and GPU usage instead of trying to read incomplete BAMs
  7. Request adequate walltime - methylation basecalling is slow

Final Command

Here's my optimized command for resuming:

#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH --gres=gpu:1
#SBATCH --mem=64G

dorado basecaller \
  sup,5mC_5hmC \
  --min-qscore 10 \
  --resume-from PollenDec24_methylbasecall.bam \
  Nanopore_data/pod5/ \
  > PollenDec24_methylbasecall_continued.bam

Conclusion

Long-running bioinformatics jobs on HPC systems require patience and proper monitoring strategies. Understanding the scale of your data (45M reads, not 12M!), the resume process, and using quality filtering appropriately can save hours of compute time and reduce anxiety about whether your job is actually working.

The resume function in Dorado works as designed - trust the process, monitor file growth, and give it the time it needs to complete.


Have you encountered similar issues with long-running Nanopore basecalling? Share your experiences and solutions in the comments!

