EASTR

(\(\
(-.-)
o('')('')

Emending Alignments of Spliced Transcript Reads

EASTR is a tool for detecting and eliminating spuriously spliced alignments in RNA-seq datasets. It improves the accuracy of transcriptome assembly by identifying and removing misaligned spliced alignments. The tool can process GTF, BED, and BAM files as input and can be applied to any RNA-seq dataset regardless of the alignment software used.

This is a C++ reimplementation of the original EASTR, providing ~10x faster performance with identical results.

Installation

Conda (recommended)

conda install -c bioconda eastr-cpp

Build from source

Dependencies:

C++17 compiler (GCC >=7 or Clang >=5)
CMake >=3.16
htslib >=1.15
bowtie2 >=2.5.2 (must be in PATH)

# Install dependencies
# macOS
brew install htslib bowtie2

# Ubuntu/Debian
sudo apt-get install libhts-dev bowtie2

# Conda environment
conda install -c bioconda htslib bowtie2

# Build EASTR
git clone --recurse-submodules https://github.com/ishinder/eastr-cpp.git
cd eastr-cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Optional: install system-wide
sudo make install

Quick Start

eastr --bam input.bam -r genome.fa --out_filtered_bam filtered.bam

EASTR automatically builds a bowtie2 index if one isn't provided. To use an existing index, add -i /path/to/bt2_index.

Usage Examples

Filter multiple BAM files in parallel

# Create a file listing your BAMs (one path per line)
ls /path/to/*.bam > samples.txt

# Filter all samples using 8 threads
eastr --bam samples.txt \
      -r genome.fa \
      --out_filtered_bam filtered_bams/ \
      -p 8 \
      --verbose

Identify spurious junctions in a GTF annotation

eastr --gtf annotation.gtf \
      -r genome.fa \
      -i /path/to/bt2_index \
      --out_removed_junctions spurious_junctions.bed

Note: The -i (bowtie2 index) flag is optional. If omitted, EASTR will automatically build an index from the reference FASTA, which adds a few minutes for large genomes. Providing a pre-built index speeds up repeated runs.

Save removed alignments for inspection

eastr --bam input.bam \
      -r genome.fa \
      --out_filtered_bam filtered.bam \
      --removed_alignments_bam

# Output:
#   filtered.bam                        - alignments with spurious junctions removed
#   filtered_removed_alignments.bam     - the removed alignments

Use trusted junctions that should never be removed

eastr --bam input.bam \
      -r genome.fa \
      --out_filtered_bam filtered.bam \
      --trusted_bed validated_junctions.bed

Options

Required (one input type)

Option	Description
`--bam FILE`	BAM file, or text file listing BAM paths (one per line)
`--gtf FILE`	GTF annotation file
`--bed FILE`	BED file with junction coordinates
`-r, --reference FILE`	Reference genome FASTA (required for all input types)

Common options

Option	Default	Description
`-i, --bowtie2_index PATH`	auto-built	Bowtie2 index prefix (built automatically if not provided)
`-p INT`	1	Number of threads
`--out_filtered_bam PATH`	—	Output filtered BAM file or directory
`--out_removed_junctions PATH`	stdout	Output spurious junctions (BED format)
`--verbose`	off	Show progress information

Advanced options

Algorithm parameters

Option	Default	Description
`--bt2_k INT`	10	Minimum distinct bowtie2 alignments for spurious classification
`-o INT`	50	Flanking sequence length on each side of junction
`-a INT`	7	Minimum anchor length in each exon
`--min_duplicate_exon_length INT`	27	Minimum length for duplicated exon detection
`--min_junc_score INT`	1	Minimum supporting reads per junction
`--trusted_bed FILE`	—	BED file of junctions to never remove

Alignment scoring (minimap2 parameters)

Option	Default	Description
`-A INT`	3	Match score
`-B INT`	4	Mismatch penalty
`-k INT`	3	K-mer length
`-w INT`	2	Minimizer window size
`-m INT`	25	Minimum chain score

Additional output options

Option	Description
`--out_original_junctions PATH`	Write all junctions before filtering
`--out_kept_junctions PATH`	Write non-spurious junctions
`--removed_alignments_bam`	Write removed alignments to separate BAM
`--filtered_bam_suffix STR`	Suffix for output BAMs (default: `_EASTR_filtered`)

Input requirements

BAM files must be coordinate-sorted
Reference FASTA can be uncompressed or bgzip-compressed (.fa, .fa.gz, .fasta, .fasta.gz)
GTF/BED files can be plain text or gzipped

Citation

If you use EASTR in your research, please cite:

Shinder I, Hu R, Ji HJ, Chao KH, Pertea M. EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes. Nat Commun. 2023;14:7223. doi: 10.1038/s41467-023-43017-4

License

MIT License

Performance notes

This C++ implementation provides significant speedups over the original Python version through parallelization at every pipeline stage:

Stage	Parallelization strategy
BAM junction extraction	Inter-file + intra-file tiled parallelism
FASTA sequence retrieval	Multi-threaded with coordinate-sorted access
Self-alignment (minimap2)	Thread pool for parallel alignment
Bowtie2 validation	Native multi-threading
BAM filtering	Parallel files + multi-threaded BGZF compression

Benchmark

Test dataset: 3 Arabidopsis chr4 BAM files (~380MB total), 6 threads

Version	Time	Speedup
C++	5.8s	10x
Python	60s	—

Threading recommendations

The -p option controls parallelism across all stages. For best performance:

Use -p equal to or greater than the number of input BAM files
When -p exceeds the number of BAM files, EASTR uses tiled intra-file parallelism (requires BAM index files)
For large datasets, -p 8 to -p 16 typically provides good throughput

Development

Building with tests

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DEASTR_BUILD_TESTS=ON
make -j$(nproc)
ctest --output-on-failure

Test data

Test data requires Git LFS:

git lfs install
git lfs pull

Project structure

eastr-cpp/
├── include/eastr/     # Header files
├── src/               # Implementation
├── tests/             # Unit tests
├── external/          # Dependencies (minimap2)
├── conda/             # Conda recipe
└── test_data/         # Test datasets (Git LFS)

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
conda		conda
external		external
include/eastr		include/eastr
src		src
test_data		test_data
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EASTR

Installation

Conda (recommended)

Quick Start

Usage Examples

Filter multiple BAM files in parallel

Identify spurious junctions in a GTF annotation

Save removed alignments for inspection

Use trusted junctions that should never be removed

Options

Required (one input type)

Common options

Algorithm parameters

Alignment scoring (minimap2 parameters)

Additional output options

Input requirements

Citation

License

Benchmark

Threading recommendations

Building with tests

Test data

Project structure

About

Uh oh!

Releases 1

Packages

Languages

License

ishinder/eastr-cpp

Folders and files

Latest commit

History

Repository files navigation

EASTR

Installation

Conda (recommended)

Quick Start

Usage Examples

Filter multiple BAM files in parallel

Identify spurious junctions in a GTF annotation

Save removed alignments for inspection

Use trusted junctions that should never be removed

Options

Required (one input type)

Common options

Algorithm parameters

Alignment scoring (minimap2 parameters)

Additional output options

Input requirements

Citation

License

Benchmark

Threading recommendations

Building with tests

Test data

Project structure

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages