The Problem
You're running Oxford Nanopore basecalling with methylation detection using Dorado on an HPC cluster, and your job keeps timing out after 12 hours. Each time you restart with --resume-from, it takes 50 minutes just to scan the existing BAM file before continuing. Meanwhile, you're anxious about whether it's actually making progress or just reprocessing the same reads.
Sound familiar? Here's what I learned from debugging this exact scenario.
Understanding the Scale
First, let's talk about expectations. I was used to working with ~12 million reads from FASTQ files (already basecalled, filtered at Q10). But when working directly with POD5 files, the reality is different:
pod5 view *.pod5 --include "read_id" --output summary.tsv
wc -l summary.tsv
# Result: 45,030,095 reads
45 million raw reads before any quality filtering! This explained why basecalling was taking so much longer than expected. With methylation calling (sup,5mC_5hmC model), you're looking at:
- ~45 million reads to process
- Complex modified base detection
- High-accuracy "super" model
- Potentially hours or days of GPU time
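A back-of-envelope estimate helps set expectations before submitting. The throughput figure below is an assumption for illustration only, not a benchmark — measure your own by basecalling a small POD5 subset first:

```shell
# Rough GPU-time estimate for the full run.
# READS_PER_HOUR is an ASSUMED figure, not a measurement -- calibrate it
# on a subset of your own POD5 files and GPU before trusting the result.
TOTAL_READS=45030095
READS_PER_HOUR=500000
awk -v n="$TOTAL_READS" -v r="$READS_PER_HOUR" \
    'BEGIN { printf "estimated GPU time: %.0f hours\n", n / r }'
```

Even with generous throughput assumptions, the answer lands in days, not hours — which is why walltime planning matters.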
The Resume Process: Why So Slow?
When using --resume-from, Dorado needs to:
- Scan the entire existing BAM file to extract all read IDs that were already processed
- Index which POD5 files still need processing
- Skip completed reads during basecalling
With 12+ million reads already processed, this scanning phase alone takes ~50 minutes. This is normal and expected behavior, not a bug.
The log output shows:
[2025-10-20 11:07:59.651] [info] > Inspecting resume file...
[2025-10-20 11:58:49.769] [info] > 12152968 original read ids found in resume file.
That's 51 minutes just to count what's done.
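You can compute that gap straight from the log timestamps with awk (assuming Dorado's bracketed `[YYYY-MM-DD HH:MM:SS.mmm]` prefix and both lines on the same day):

```shell
# Elapsed time between the first and last log line piped in,
# parsed from Dorado's "[YYYY-MM-DD HH:MM:SS.mmm]" timestamp prefix.
scan_time() {
    awk -F'[][ :.]+' '
        { ts[NR] = $3*3600 + $4*60 + $5 }   # seconds since midnight
        END { d = ts[NR] - ts[1]; printf "%d min %d s\n", d/60, d%60 }'
}

printf '%s\n' \
  '[2025-10-20 11:07:59.651] [info] > Inspecting resume file...' \
  '[2025-10-20 11:58:49.769] [info] > 12152968 original read ids found in resume file.' \
  | scan_time
# -> 50 min 50 s
```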
Checking for Duplicate Processing
My biggest concern was: "Is it reprocessing the same reads?" Here's how to check:
# Count duplicates (but wait until the job finishes!)
samtools view PollenDec24_methylbasecall.bam | cut -f1 | sort | uniq -d | wc -l
Important caveat: You can't run this while Dorado is writing to the file. You'll get errors like:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_uncompress] CRC32 checksum mismatch
These errors occur because a BAM that is still being written has no end-of-file marker yet, and individual blocks may be caught mid-write. Wait until the job completes, or work on a copy — a copy still lacks the EOF marker, but at least its contents won't change under you.
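A finished BAM ends with a fixed 28-byte BGZF "EOF" block defined in the SAM/BAM specification — that's what the first warning is complaining about. You can test for it with coreutils alone; check_bam_eof is just a helper name I'm using here:

```shell
# Check whether a file ends with the 28-byte BGZF EOF block that the
# SAM/BAM spec requires of a complete BAM. Coreutils only.
BAM_EOF='1f8b08040000000000ff0600424302001b0003000000000000000000'

check_bam_eof() {
    actual=$(tail -c 28 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$actual" = "$BAM_EOF" ]; then
        echo "EOF marker present -- safe to read"
    else
        echo "EOF marker absent -- probably still being written"
    fi
}
```

If the marker is absent, expect exactly the truncation warning shown above and come back later.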
In my case: 0 duplicates - the resume function works correctly!
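The uniq -d pipeline itself is easy to sanity-check on synthetic read IDs (stand-ins for the samtools view | cut -f1 output):

```shell
# Synthetic read IDs: read_B appears twice, so one duplicate is reported.
printf 'read_A\nread_B\nread_C\nread_B\n' | sort | uniq -d | wc -l
# -> 1   (a clean resume reports 0)
```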
Monitoring Progress Without Breaking Things
Since you can't read the BAM while it's being written, here are safe alternatives:
1. Watch file size growth
watch -n 60 "ls -lh PollenDec24_methylbasecall.bam"
2. Check GPU utilization
nvidia-smi -l 5
If GPU utilization is high (~90-100%), basecalling is happening.
3. Monitor the SLURM output file
tail -f slurm-JOBID.out
4. Check process status
ps aux | grep dorado
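File-size growth can also be turned into an actual throughput number with a small helper (growth_rate is a name I made up; coreutils only):

```shell
# Report how fast a file is growing, in bytes per second.
# Tries GNU stat (-c %s) first, then falls back to BSD stat (-f %z).
growth_rate() {
    f=$1; interval=${2:-60}
    s1=$(stat -c %s "$f" 2>/dev/null || stat -f %z "$f")
    sleep "$interval"
    s2=$(stat -c %s "$f" 2>/dev/null || stat -f %z "$f")
    echo "$(( (s2 - s1) / interval )) bytes/s"
}

# Example: growth_rate PollenDec24_methylbasecall.bam 60
```

A steady non-zero rate is a good sign; a rate of 0 over several minutes is your cue to check the SLURM log.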
The Append vs Overwrite Confusion
Initially, I was using:
--resume-from input.bam ... >> input.bam
This appends new records onto the very file Dorado is reading from, leaving two BAM headers concatenated in one file. It appeared to work in my case, but a cleaner approach is:
# Read from old file, write to new file
--resume-from input.bam ... > output_continued.bam
# Then merge later if needed
samtools merge final.bam input.bam output_continued.bam
Adding Quality Filtering to Speed Things Up
Since we know there are 45 million raw reads but only ~12 million will pass Q10 filtering, we can speed up the pipeline significantly:
dorado basecaller \
sup,5mC_5hmC \
--min-qscore 10 \
--resume-from PollenDec24_methylbasecall.bam \
pod5/ \
> PollenDec24_methylbasecall_Q10.bam
Quality score options:
- --min-qscore 7 - Very permissive (~80% accuracy)
- --min-qscore 9 - Standard (~87% accuracy)
- --min-qscore 10 - Good quality (~90% accuracy)
- --min-qscore 15 - High quality (~97% accuracy)
For methylation calling, Q9-Q10 is a good balance between yield and accuracy.
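Those accuracy figures aren't arbitrary — they follow from the Phred definition, accuracy = 1 - 10^(-Q/10). A quick check with awk:

```shell
# Convert Phred Q-scores to expected per-base accuracy.
for q in 7 9 10 15; do
    awk -v q="$q" 'BEGIN { printf "Q%d -> %.1f%%\n", q, 100 * (1 - 10^(-q/10)) }'
done
```

This reproduces the ~80/87/90% figures above (Q15 comes out at 96.8%, i.e. roughly 97%).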
Managing Long Jobs on HPC Systems
Use longer walltime requests
#SBATCH --time=48:00:00
Use screen or tmux for interactive jobs
tmux new -s dorado
# Run your command
# Detach: Ctrl+B then D
# Reattach: tmux attach -t dorado
Pre-download models to avoid repeated downloads
dorado download --model sup,5mC_5hmC
This saves ~2 minutes per job restart.
Key Takeaways
- 50 minutes to scan a BAM with 12M reads is normal - be patient
- 45M raw reads ≠ 12M filtered reads - check your POD5 file counts
- You can't read a BAM file while it's being written - wait or copy first
- Add --min-qscore during basecalling to filter early and save time
- The --resume-from function works correctly - no duplicate processing
- Monitor file size and GPU usage instead of trying to read incomplete BAMs
- Request adequate walltime - methylation basecalling is slow
Final Command
Here's my optimized command for resuming:
#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
dorado basecaller \
sup,5mC_5hmC \
--min-qscore 10 \
--resume-from PollenDec24_methylbasecall.bam \
Nanopore_data/pod5/ \
> PollenDec24_methylbasecall_continued.bam
Conclusion
Long-running bioinformatics jobs on HPC systems require patience and proper monitoring strategies. Understanding the scale of your data (45M reads, not 12M!), the resume process, and using quality filtering appropriately can save hours of compute time and reduce anxiety about whether your job is actually working.
The resume function in Dorado works as designed - trust the process, monitor file growth, and give it the time it needs to complete.
Have you encountered similar issues with long-running Nanopore basecalling? Share your experiences and solutions in the comments!