A genome assembly workflow for Oxford Nanopore long reads using hifiasm on CHTC's high-throughput computing infrastructure.
For this tutorial we will be using hifiasm, a fast haplotype-resolved de novo assembler; however, the concepts learned in this tutorial apply to most other genome assembly programs.
This tutorial walks you through assembling the genome of the Pallas's cat (Otocolobus manul), a small wild cat native to the grasslands and montane steppes of Central Asia. The sequencing data comes from a Pallas's cat named Tater, sequenced using Oxford Nanopore's Ligation Sequencing Kit by the University of Minnesota. The expected genome size is approximately 2.4 Gb, comparable to the domestic cat (Felis catus).
This tutorial teaches you how to run a genome assembly on CHTC using hifiasm and scalable, high-throughput compute practices. You will learn how to:
- Understand the genome assembly workflow on CHTC, including how hifiasm maps to CPU and memory resources.
- Prepare and stage large ONT sequencing datasets for use with HTCondor jobs, using OSDF for efficient data transfer.
- Leverage CHTC's high-memory capacity for genome assembly, including selecting appropriate resource requests for large mammalian genomes.
- Use containers and HTCondor data-transfer mechanisms to build reproducible, portable assembly workflows.
- Submit and monitor genome assembly jobs using standard HTCondor patterns and best practices.
All steps run using the HTCondor workload manager and Apptainer containers. The tutorial uses real genomics data and emphasizes performance, reproducibility, and portability.
Start here
- Introduction
- Tutorial Setup
- Understanding the Genome Assembly Workflow
- Running Genome Assembly on CHTC
- Next Steps
- Reference Material
- Getting Help
You will need the following before moving forward with the tutorial:
- A CHTC HTC account. If you do not have one, request access at the CHTC Account Request Page.
- A CHTC "staging" folder with at least 500 GB of available disk space. If you do not have a staging folder, request one by contacting the CHTC Research Computing Facilitator Team.
- Basic familiarity with HTCondor job submission. If you are new to HTCondor, complete the CHTC "Roadmap to getting started" and read the "Practice: Submit HTC Jobs using HTCondor".
- Basic familiarity with genome assembly workflows.
Caution
Genome assembly is a resource-intensive process that can easily fail if you do not have sufficient disk space. Make sure you have a staging folder with at least 500 GB of available disk space before proceeding with this tutorial. If you need to request additional disk space, please request it via our Quota Increase Form.
This tutorial also assumes that you:
- Have basic command-line experience (e.g., navigating directories, using bash, editing text files)
- Have sufficient disk quota and file permissions in your CHTC `/home` and `/staging` directories
Tip
If you are new to running jobs on CHTC, complete the CHTC "Roadmap to getting started" and our "Practice: Submit HTC Jobs using HTCondor" guide before starting this tutorial.
Estimated time: plan ~1-2 hours for the tutorial walkthrough. The assembly step itself typically takes 4-72 hours depending on read coverage, genome complexity, and available compute resources.
1. Log into your CHTC account:

   ```
   ssh user.name@ap####.chtc.wisc.edu
   ```

2. Clone the repository:

   ```
   git clone https://github.com/CHTC/tutorial-CHTC-Genome-Assembly.git
   cd tutorial-CHTC-Genome-Assembly/
   ```
This tutorial uses Oxford Nanopore Ligation Sequencing reads from the Pallas's cat (Otocolobus manul). The sample was taken from Tater, a Pallas's cat living at Utica Zoo in New York, and sequenced by the University of Minnesota's Faulk Lab. Learn more about how Tater made history as the first Pallas's cat to have their genome sequenced here. The reads have been pre-staged on the Open Science Data Federation (OSDF) for use with this tutorial:
osdf:///osg-public/data/tutorial-CHTC-Genome-Assembly/input/SRR22085263
The ONT reads are transferred directly to the execute node by HTCondor as part of the job submission process. You do not need to download the reads manually.
SRA Accession: SRR22085263
If you would like to download the reads locally for inspection or other purposes, you can use the Pelican client:
```
pelican object get osdf:///osg-public/data/tutorial-CHTC-Genome-Assembly/SRR22085263.fastq ./
```

You can also download directly from the SRA public bucket:

```
pelican object get osdf:///aws-opendata/us-east-1/sra-pub-run-odp/sra/SRR22085263/SRR22085263 ./
fasterq-dump ./SRR22085263
```

You can also use the SRA Toolkit to download the reads directly from NCBI:

```
# Install SRA Toolkit if you don't have it already
# Download the reads using fasterq-dump
fasterq-dump SRR22085263 -O ./
```

For more details about the dataset, see Toy_Dataset/README.md.
| Property | Value |
|---|---|
| Species | Otocolobus manul (Pallas's cat) |
| Specimen | Tater |
| Sequencing Platform | Oxford Nanopore Technologies |
| Library Prep | Ligation Sequencing Kit |
| Expected Genome Size | ~2.4 Gb |
Genome assembly is the process of reconstructing a genome from sequencing reads. Oxford Nanopore long reads (typically 10-100+ kb) provide the contiguity needed to span repetitive regions and produce chromosome-scale assemblies.
hifiasm performs de novo assembly by:
- Reading raw ONT reads and correcting errors using all-vs-all read overlaps
- Building an assembly graph that captures the relationships between reads
- Resolving haplotypes to produce separate assemblies for each parental copy of the genome
- Outputting assembly graphs in GFA format, which can be converted to FASTA for downstream analysis
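The GFA-to-FASTA conversion in the last step extracts each segment (S) line's name and sequence. A minimal sketch of that conversion, using a toy two-contig GFA (real GFA files are tab-delimited, but awk's default field splitting handles tabs and spaces alike):

```shell
# Toy GFA: S (segment) lines carry a name and its sequence; L (link) lines are ignored
printf 'S ctg1 ACGTACGT\nS ctg2 TTTTCCCC\nL ctg1 + ctg2 + 0M\n' > toy.gfa

# Keep only S lines; field 2 becomes the FASTA header, field 3 the sequence
awk '/^S/{print ">"$2; print $3}' toy.gfa > toy.fa

cat toy.fa
# → >ctg1
#   ACGTACGT
#   >ctg2
#   TTTTCCCC
```

This is the same awk one-liner the tutorial's assembly script uses for the real contig graphs.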
- CPU-bound and memory-intensive: hifiasm uses multiple threads and requires significant memory for a mammalian-sized genome. For the Palla's Cat (~2.4 Gb), expect to need 64-128 GB of RAM and 16-32 CPU cores.
- No GPU required: The entire assembly pipeline runs on CPU only.
- Data-intensive: ONT read files for a mammalian genome can be 50-200+ GB. The reads are transferred from OSDF to the execute node by HTCondor.
- Runtime: Assembly of a mammalian genome typically takes 4-12 hours depending on read coverage, genome complexity, and available compute resources.
- Large disk usage: Assembly generates large intermediate files. Ensure you request sufficient disk space in your submit file (e.g., 500 GB or more). You will also need sufficient disk space to store the final assembly outputs (e.g., 50-100 GB) as well as the input reads (e.g., 250-750 GB).
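As a back-of-the-envelope budget, sum the input reads, intermediate files, and outputs. The sizes and multiplier below are illustrative assumptions for a ~2.4 Gb genome, not measured values; substitute your own:

```shell
# Illustrative disk budget (assumed sizes; adjust for your dataset)
READS_GB=250          # compressed input reads
INTERMEDIATE_GB=200   # error-correction and overlap files (varies widely)
OUTPUT_GB=75          # tarballed assembly outputs

echo "request_disk should be at least $(( READS_GB + INTERMEDIATE_GB + OUTPUT_GB )) GB"
# → request_disk should be at least 525 GB
```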
CHTC provides a shared Apptainer container for hifiasm. The submit file in this tutorial references a pre-built container image distributed via OSDF:
container_image = osdf:///osg-public/containers/hifiasm_08APR2026_v1.sif
This container includes hifiasm (v0.19+) with ONT-only assembly support. HTCondor automatically transfers the container to the execute node, so no manual setup is required.
Click to expand: Building Your Own hifiasm Apptainer Container (Advanced)
If you need a specific version of hifiasm or want to customize the container, you can build your own:

1. In your CHTC AP `/home/` directory, create an Apptainer definition file titled `hifiasm.def`:

   ```
   Bootstrap: docker
   From: condaforge/miniforge3:latest

   %post
       mamba install -c conda-forge -c bioconda hifiasm
   ```

2. Create a `build.sub` submit file in your directory:

   ```
   # build.sub
   # Include other files that need to be transferred here.
   transfer_input_files = hifiasm.def

   +IsBuildJob = True

   # Make sure you request enough disk for the container image to build
   request_cpus = 8
   request_memory = 16GB
   request_disk = 30GB

   queue
   ```

3. Submit your job as an interactive job:

   ```
   condor_submit build.sub -i
   ```

4. On your CHTC Execution Point, build an Apptainer image:

   ```
   apptainer build hifiasm_08APR2026_v1.sif hifiasm.def
   ```

5. Move your Apptainer image `hifiasm_08APR2026_v1.sif` to your `/staging/` directory:

   ```
   mv hifiasm_08APR2026_v1.sif /staging/<netid>/
   ```

6. Exit the interactive session and return to your normal bash shell:

   ```
   exit
   ```
The ONT reads for Tater should be pre-staged at `/staging/<NetID>/tutorial-CHTC-Genome-Assembly/SRR22085263.fastq`. If the file is not there, you can download the reads from our public data repository using the following command:

```
pelican object get osdf:///osg-public/data/tutorial-CHTC-Genome-Assembly/SRR22085263.fastq /staging/<NetID>/tutorial-CHTC-Genome-Assembly/SRR22085263.fastq
```
Note
You may be asked to create a "Pelican password" for your transfer. This is usually a one-time passphrase, similar to an SSH key passphrase. Create a short but secure passphrase you'll remember. You will also likely be pointed to a URL to authenticate. Copy the URL into your browser, sign in with your University of Wisconsin-Madison NetID, and click approve. Note: the download is about 300 GB and can take ~1 hour, so please be patient.
hifiasm accepts compressed FASTQ files directly, so there is no need to decompress your reads before running the assembly. We highly recommend compressing your reads with gzip or another compression tool to reduce storage and transfer times. If your reads are not compressed, you can compress them with:
```
gzip my_reads.fastq
```

This will reduce the disk requirement for your reads by roughly 4-5x, which can significantly reduce transfer times and storage costs.
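Before staging, you can confirm a compressed file is intact without decompressing it. A quick sanity-check sketch (`my_reads.fastq` here is a tiny stand-in for your real reads):

```shell
# Create a tiny stand-in FASTQ and compress it
printf '@read1\nACGT\n+\nIIII\n' > my_reads.fastq
gzip my_reads.fastq

# gzip -t tests archive integrity and prints nothing on success
gzip -t my_reads.fastq.gz && echo "archive OK"
# → archive OK
```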
Tip
If you have multiple FASTQ files from the same sequencing run, concatenate them before staging:
```
cat run1.fastq.gz run2.fastq.gz run3.fastq.gz > all_reads.fastq.gz
```
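After concatenating, a useful sanity check is to confirm the combined file's line count is a multiple of 4, since each FASTQ record spans exactly four lines (gzip streams can be concatenated directly, and `zcat` reads all members). A sketch with tiny stand-in files:

```shell
# Two tiny stand-in runs, one read each
printf '@r1\nACGT\n+\nIIII\n' | gzip > run1.fastq.gz
printf '@r2\nTTTT\n+\nIIII\n' | gzip > run2.fastq.gz
cat run1.fastq.gz run2.fastq.gz > all_reads.fastq.gz

# Count records: total lines divided by 4
lines=$(zcat all_reads.fastq.gz | wc -l)
echo "reads: $(( lines / 4 ))"
# → reads: 2
```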
1. Change to your `tutorial-CHTC-Genome-Assembly/` directory:

   ```
   cd ~/tutorial-CHTC-Genome-Assembly/
   ```

2. Review the assembly executable script `scripts/assembly.sh`. The script is a simple wrapper that runs hifiasm, converts the output GFA to FASTA, and tarballs the results. For this tutorial, no changes are necessary.

   ```bash
   #!/bin/bash
   set -euo pipefail

   OUTPUT_PREFIX=$1

   # Run hifiasm ONT-only assembly, reading the input FASTQ directly from staging
   hifiasm -t${PYTHON_CPU_COUNT} --ont -o ${OUTPUT_PREFIX}.asm /staging/<NetID>/tutorial-CHTC-Genome-Assembly/SRR22085263.fastq

   # Convert GFA to FASTA
   for gfa_file in *.p_ctg.gfa ; do
       fasta_file="${gfa_file%.gfa}.fa"
       awk '/^S/{print ">"$2; print $3}' "${gfa_file}" > "${fasta_file}"
   done

   # Package outputs
   tar czf assembly_output.tar.gz *.asm*
   ```
3. Review the submit file `assembly.sub`. Make sure to modify the `transfer_output_remaps` attribute to reflect your NetID so that the assembly output is transferred back to your staging folder. You may also need to modify the `container_image` and `transfer_input_files` attributes if you are using a different container or input reads.

   ```
   # Container for hifiasm genome assembler
   container_image = osdf:///osg-public/containers/hifiasm_08APR2026_v1.sif

   executable = scripts/assembly.sh
   arguments = Omalun-Tater

   log = ./logs/assembly.log
   output = assembly_$(Cluster)_$(Process).out
   error = assembly_$(Cluster)_$(Process).err

   # ONT reads pre-staged on OSDF
   # You should generally read data directly from staging for assemblies due to their large input file sizes
   #transfer_input_files = osdf:///osg-public/data/tutorial-CHTC-Genome-Assembly/input/SRR22085263.fastq.gz

   # Transfer assembly output back to the submit node
   transfer_output_files = assembly_output.tar.gz
   transfer_output_remaps = "assembly_output.tar.gz=/staging/<NetID>/tutorial-CHTC-Genome-Assembly/Assembly_Output/assembly_output.tar.gz"

   should_transfer_files = YES
   when_to_transfer_output = ON_EXIT

   Requirements = (Target.HasCHTCStaging == true)

   # hifiasm for a ~2.4 Gb mammalian genome needs substantial resources
   request_memory = 128GB
   request_disk = 500GB
   request_cpus = 16

   queue 1
   ```

   This submit file:
- Uses our pre-built hifiasm container from OSDF
- Reads the ONT input directly from your `/staging/` directory on the execute point
- Requests 128 GB of memory, 16 CPU cores, and 500 GB of disk space
- Returns the assembly output as a tarball to `Assembly_Output/`
- Targets CHTC machines with staging access
4. Submit the assembly job:

   ```
   condor_submit assembly.sub
   ```

5. Track your job progress:

   ```
   condor_watch_q
   ```
Tip
For testing, you can create a subset of reads and adjust the resource requirements. Create a subset with:
```
zcat SRR22085263.fastq.gz | head -n 400000 | gzip > SRR22085263_subset.fastq.gz
```

Then modify `assembly.sub` to use the subset file and reduce `request_memory` to 16GB and `request_cpus` to 8.
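You can estimate what coverage a read set provides by summing the sequence bases and dividing by the expected genome size. A sketch (the 2.4 Gb genome size is the tutorial's assumption; point it at whatever compressed FASTQ you want to check):

```shell
GENOME_SIZE=2400000000   # ~2.4 Gb expected genome size (assumed from the tutorial)

# Sum bases on every sequence line (line 2 of each 4-line FASTQ record)
zcat SRR22085263_subset.fastq.gz \
  | awk -v g="$GENOME_SIZE" 'NR % 4 == 2 { bases += length($0) }
      END { printf "total bases: %d, estimated coverage: %.2fx\n", bases, bases / g }'
```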
Tip
If your assembly jobs are running out of memory, increase the request_memory attribute. Highly repetitive genomes or very high-coverage datasets may require 200+ GB of RAM. You can also use retry_request_memory for automatic retries with more memory. See the CHTC Request variable memory documentation.
Resource requirements for hifiasm depend primarily on genome size and read coverage:
| Genome Size | Coverage | Recommended Disk | Recommended Memory | Recommended CPUs | Estimated Runtime |
|---|---|---|---|---|---|
| < 500 Mb | 30-50x | 100 GB | 16-32 GB | 8-16 | 1-3 hours |
| 500 Mb - 1 Gb | 30-50x | 150 GB | 32-64 GB | 16-32 | 2-6 hours |
| 1 - 3 Gb | 30-50x | 500 GB | 64-128 GB | 16-20 | 4-12 hours |
| > 3 Gb | 30-50x | >500 GB | 128-256 GB | 20+ | 8-24+ hours |
The Pallas's cat genome (~2.4 Gb) falls in the 1-3 Gb range, so this tutorial requests 128 GB of memory and 16 CPUs.
Note
These are guidelines. Actual requirements depend on genome complexity, repeat content, and coverage depth. Large and highly repetitive genomes may need significantly more memory. You can use the retry_request_memory attribute to automatically retry with more memory if your job fails due to insufficient memory.
request_memory = 128GB
retry_request_memory = 256GB
This allows your job to automatically retry with 256 GB of memory if it fails with 128 GB, which can help ensure successful completion without over-requesting resources upfront.
Once the assembly job completes, extract the output:
```
cd Assembly_Output/
tar xzf assembly_output.tar.gz
ls -lh
```

hifiasm produces several output files:
| File | Description |
|---|---|
| `Omalun-Tater.asm.bp.hap1.p_ctg.gfa` | Haplotype 1 primary contigs (assembly graph) |
| `Omalun-Tater.asm.bp.hap2.p_ctg.gfa` | Haplotype 2 primary contigs (assembly graph) |
| `Omalun-Tater.asm.bp.p_ctg.gfa` | Combined primary contigs (assembly graph) |
| `Omalun-Tater.asm.bp.hap1.p_ctg.fa` | Haplotype 1 primary contigs (FASTA, converted by script) |
| `Omalun-Tater.asm.bp.hap2.p_ctg.fa` | Haplotype 2 primary contigs (FASTA, converted by script) |
| `Omalun-Tater.asm.bp.p_ctg.fa` | Combined primary contigs (FASTA, converted by script) |
| `Omalun-Tater.asm.ec.bin` | Error-corrected reads (binary) |
| `Omalun-Tater.asm.ovlp.*.bin` | Overlap information (binary) |
The GFA (Graphical Fragment Assembly) files contain the assembly graph, which captures the full structure of the assembly including potential alternative paths. The FASTA files are derived from the GFA files by extracting contig sequences. The assembly.sh script automatically converts GFA to FASTA.
hifiasm produces haplotype-resolved assemblies, meaning it separates the two parental copies of the genome into distinct assembly outputs (hap1 and hap2). This is particularly valuable for:
- Studying structural variation between haplotypes
- Phasing heterozygous variants
- Generating more complete genome representations
The combined primary contigs (Omalun-Tater.asm.bp.p_ctg.gfa) represent a merged view and are suitable for most downstream applications.
To get a quick summary of the assembly, you can count contigs and compute basic statistics:
```
# Count the number of contigs
grep -c "^>" Omalun-Tater.asm.bp.p_ctg.fa

# View the first few contig headers
head -n 20 Omalun-Tater.asm.bp.p_ctg.fa | grep "^>"
```

For detailed quality metrics (N50, L50, total length), consider running an assembly assessment tool like QUAST in a follow-up job. See Next Steps for details.
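If you'd rather not wait for a QUAST job, total length and N50 can be computed from a FASTA with a short sort/awk pipeline. A minimal sketch, shown on a toy FASTA (substitute your contig file):

```shell
# Toy FASTA with contigs of length 10, 6, and 4 (total 20)
printf '>a\nACGTACGTAC\n>b\nACGTAC\n>c\nACGT\n' > toy_ctgs.fa

# Emit one length per sequence, sort descending, then walk the cumulative
# sum until it reaches half the total: that contig's length is the N50
awk '/^>/{if (len) print len; len=0; next} {len += length($0)} END {if (len) print len}' toy_ctgs.fa \
  | sort -rn \
  | awk '{lens[NR]=$1; total+=$1}
         END {for (i=1; i<=NR; i++) {cum+=lens[i];
              if (cum >= total/2) {print "total:", total, "N50:", lens[i]; exit}}}'
# → total: 20 N50: 10
```

Here half the total is 10, and the longest contig (length 10) already reaches it, so N50 is 10.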
You can visualize hifiasm's GFA assembly graphs using Bandage, a tool for interactive visualization of assembly graphs. Download the GFA files to your local machine and open them in Bandage:
```
# From your local machine
scp <netID>@ap####.chtc.wisc.edu:~/tutorial-CHTC-Genome-Assembly/Assembly_Output/Omalun-Tater.asm.bp.p_ctg.gfa ./
```

Now that you've successfully assembled the Pallas's cat genome on CHTC, here are recommended next steps:
Combine with ONT Basecalling and QC
- If you haven't already, complete the ONT Basecalling and QC tutorial to learn how to basecall raw ONT signal data. This will give you a deeper understanding of the input data and how it impacts assembly quality.
- Basecall your own ONT reads with Dorado on CHTC and use those reads as input for hifiasm assembly.
Use a different assembler
- Try running a different assembler (e.g., Flye, Canu, Shasta) on CHTC using the same input reads to compare results. You can build your own Apptainer container for these tools using the instructions in the Software section.
- Compare assembly metrics (N50, total length, number of contigs) across different assemblers to see how they perform on the same dataset.
Assess Assembly Quality
- Run QUAST to compute contiguity metrics (N50, total length, number of contigs)
- Run BUSCO against the `mammalia_odb10` lineage database to assess gene completeness
- Compare your assembly metrics against published Felidae genomes
Scaffold and Polish
- Use Hi-C data (if available) to scaffold contigs into chromosome-level assemblies with tools like YaHS or SALSA2
- Polish the assembly with additional sequencing data to reduce residual errors
Annotate the Genome
- Run repeat masking (RepeatMasker/RepeatModeler)
- Perform gene prediction and annotation (BRAKER, Augustus, or similar)
- Annotate functional elements
Scale to Multiple Samples
- Adapt the workflow for multiple specimens using HTCondor's `queue ... from` syntax
- Create a manifest file listing sample names and read paths, similar to the AF3 tutorial's approach
- Use DAGMan to chain assembly steps (QC, assembly, assessment) into automated pipelines
Get Help or Collaborate
- Reach out to chtc@cs.wisc.edu for one-on-one help with scaling your research.
- Attend office hours or training sessions -- see the CHTC Help Page for details.
This script, assembly.sh, is a simple wrapper that runs hifiasm in ONT-only mode, converts the output GFA assembly graphs to FASTA, and packages the results into a tarball. It is designed for execution inside an HTCondor container job on CHTC.
The script does three things:
1. Runs hifiasm in ONT-only mode:

   ```
   hifiasm -t${PYTHON_CPU_COUNT} --ont -o ${OUTPUT_PREFIX}.asm /staging/<NetID>/tutorial-CHTC-Genome-Assembly/SRR22085263.fastq
   ```

   The output prefix is passed as the first argument from the submit file (e.g., `Omalun-Tater`). `PYTHON_CPU_COUNT` is automatically set by HTCondor to match `request_cpus` in the submit file.

2. Converts each primary contig assembly graph from GFA to FASTA:

   ```
   for gfa_file in *.p_ctg.gfa ; do
       awk '/^S/{print ">"$2; print $3}' "${gfa_file}" > "${gfa_file%.gfa}.fa"
   done
   ```

3. Packages all outputs into a tarball for transfer back to the submit host:

   ```
   tar czf assembly_output.tar.gz ${OUTPUT_PREFIX}.asm*
   ```
| Term | Definition |
|---|---|
| ONT (Oxford Nanopore Technologies) | Long-read sequencing platform producing reads typically 10-100+ kb in length. |
| Ligation Sequencing | ONT library preparation method that ligates motor proteins to DNA fragments for sequencing. |
| hifiasm | Fast haplotype-resolved de novo assembler supporting PacBio HiFi and Oxford Nanopore reads. |
| Contig | A contiguous assembled sequence derived from overlapping reads. |
| N50 | Minimum contig length such that 50% of the total assembly is in contigs of this length or longer. A common assembly quality metric. |
| GFA (Graphical Fragment Assembly) | Standard format for representing assembly graphs, including contigs and their relationships. |
| FASTA | Standard text format for nucleotide or protein sequences. |
| FASTQ | Sequence format that includes per-base quality scores alongside the sequence. |
| Haplotype | One of two copies of each chromosome in a diploid organism. hifiasm can resolve both haplotypes separately. |
| Coverage / Depth | Average number of sequencing reads covering each position in the genome. Typically 30-50x for de novo assembly. |
| QUAST | Quality Assessment Tool for genome assemblies -- computes metrics like N50, total length, and number of contigs. |
| BUSCO | Benchmarking Universal Single-Copy Orthologs -- assesses genome completeness using conserved gene sets. |
| NanoPlot | Plotting and statistics tool for quality assessment of long-read sequencing data. |
| HTCondor submit file (`.sub`) | Job description file used by HTCondor to submit tasks to the HTC system. |
| Apptainer | Container runtime (formerly Singularity) commonly used on HPC/HTC systems to run reproducible environments. |
| OSDF (Open Science Data Federation) | Federated data delivery infrastructure used for staging and retrieving large files across compute sites. |
| Pelican | Client tool for transferring data to and from OSDF origins and caches. |
Staging (/staging/) |
CHTC shared filesystem for large file storage, accessible from execute nodes with HasCHTCStaging. |
In this tutorial, we use an Apptainer container containing hifiasm for genome assembly. The container is distributed via OSDF and transferred to execute nodes automatically by HTCondor.
Our recommendation for most users is to use Apptainer containers for deploying their software. For instructions on how to build an Apptainer container, see our guide Using Apptainer/Singularity Containers. If you are familiar with Docker, or want to learn how to use Docker, see our guide Using Docker Containers.
This information can also be found in our guide Using Software on CHTC.
Genome assembly involves large datasets, particularly for mammalian genomes. Understanding how data moves through the HTC system is essential for scaling assembly workflows.
- ONT reads (50-200+ GB per sample)
  - Stored on OSDF or in your CHTC `/staging/` directory
  - Transferred to execute nodes via HTCondor's file transfer mechanism
  - Use OSDF paths (`osdf:///...`) in `transfer_input_files` for efficient delivery
- Assembly outputs (~1-10 GB per assembly)
  - GFA assembly graphs and FASTA contig sequences
  - Packaged as `assembly_output.tar.gz` and transferred back to the submit host
- Intermediate files (variable, can be large)
  - Error correction and overlap files are generated during assembly
  - Discarded automatically when the job's scratch directory is cleaned up after the results are packaged
For guides on how data movement works on the HTC system, see our Data Staging and Transfer to Jobs guides.
Important
ONT read files are often too large for the CHTC home directory (which has limited quota). Always store large sequencing files in /staging/ and reference them via OSDF paths in your submit files.
Genome assembly with hifiasm is CPU- and memory-intensive but does not require GPUs. CHTC provides high-memory nodes suitable for genome assembly.
| Genome Size | Coverage | Memory | CPUs | Disk | Estimated Runtime |
|---|---|---|---|---|---|
| < 500 Mb | 30-50x | 16-32 GB | 8-16 | 100 GB | 2-6 hours |
| 500 Mb - 1 Gb | 30-50x | 32-64 GB | 16-32 | 250 GB | 6-12 hours |
| 1 - 3 Gb | 30-50x | 64-128 GB | 16-20 | 500 GB | 12-24 hours |
| > 3 Gb | 30-50x | 128-256 GB | 20+ | 750+ GB | 24+ hours |
- Memory is the primary constraint: hifiasm loads all read overlap information into memory. Insufficient memory is the most common cause of assembly job failure.
- Thread scaling: While hifiasm scales well with multiple threads, requesting beyond 16-20 cores may not add much speedup.
- Disk space: Budget for raw reads (input) + intermediate files + output. Assembly intermediate files can be several times larger than the raw reads.
- Runtime and Queue Time: Assembly can take several hours to days. Use `condor_watch_q` to monitor progress and adjust resource requests if needed. Running large assemblies with high memory requirements may lead to longer queue times, so plan accordingly; very large assemblies requiring 512+ GB of RAM may wait 24-48 hours before matching to a machine. If you would like to learn more about CHTC compute resources, please visit the CHTC Documentation Portal.
The CHTC Research Computing Facilitators are here to help researchers using the CHTC resources for their research. We provide a broad swath of research facilitation services, including:
- Web guides: CHTC Guides - instructions and how-tos for using the CHTC cluster.
- Email support: get help within 1-2 business days by emailing chtc@cs.wisc.edu.
- Virtual office hours: live discussions with facilitators - see the Email, Office Hours, and 1-1 Meetings page for current schedule.
- One-on-one meetings: dedicated meetings to help new users and groups get started on the system; email chtc@cs.wisc.edu to request a meeting.
This information, and more, is provided in our Get Help page.
