Hi-C has emerged as a powerful tool for detecting structural variations (SVs), but its sensitivity remains limited—particularly for SVs lacking canonical contact patterns. Here, we introduce EagleC2, a next-generation deep-learning framework that integrates an ensemble of convolutional neural networks (CNNs) with diverse architectures, trained on over 2.7 million image patches from 51 cancer Hi-C datasets with matched whole-genome sequencing (WGS) data. EagleC2 substantially outperforms its predecessor (EagleC) and other state-of-the-art methods, achieving consistently higher precision and recall across diverse validation datasets. Notably, it enables the discovery of non-canonical SVs—including complex rearrangements and fusions involving extremely small fragments—that are frequently missed by existing tools. In individual cancer genomes, EagleC2 detects over a thousand previously unrecognized SVs, the majority of which are supported by orthogonal evidence. To support clinical and diagnostic applications, EagleC2 also offers a rapid evaluation mode for accurately screening predefined SV lists, even at ultra-low coverage (e.g., 1x depth). When applied to single-cell Hi-C data from glioblastoma before and after erlotinib treatment, EagleC2 reveals extensive SV heterogeneity and dynamic structural changes, including events overlooked by conventional pipelines. These findings establish EagleC2 as a powerful and versatile framework for SV discovery, with broad applications in genome research, cancer biology, diagnostics, and therapeutic development.
Compared with the original EagleC, EagleC2 has the following unique features:
- EagleC2 is able to detect non-canonical SVs, including fine-scale complex rearrangements (multiple SVs clustered within a local window) and fusions involving extremely small fragments
- EagleC2 offers a rapid evaluation mode for accurately screening predefined SV lists, even at ultra-low coverage (e.g., 1x depth)
- EagleC2 supports arbitrary resolutions, without requiring model re-training for each resolution
- EagleC2 enables fast genome-wide SV prediction on large Hi-C datasets without the need for distributed computing across multiple nodes
- EagleC2 supports both CPU and GPU inference
- Installation
- Download pre-trained models
- Overview of the commands
- Quick start
- Visualize local contact patterns around SV breakpoints
- Post-processing and filtering of SV predictions
- Evaluation of predefined SVs
EagleC2 and all required dependencies can be installed using mamba and pip.
After you have installed mamba successfully, you can create a conda environment for EagleC2 by executing the following commands (for Linux users):
$ conda config --add channels defaults $ conda config --add channels bioconda $ conda config --add channels conda-forge $ mamba create -n EagleC2 hdbscan numba statsmodels python=3.11 cooler=0.9 joblib=1.3 numpy=1.26 scikit-learn=1.4 "tensorflow>=2.16" $ mamba activate EagleC2 $ pip install eaglec
This will intall the core dependencies required to run EagleC2.
If you also wish to use the visualization module, please install the following additional packages (pyBigWig is only required if you want to plot signals from BigWig files):
$ mamba install matplotlib pyBigWig
If you plan to use the gene fusion annotation module, please install:
$ mamba install pyensembl
For macOS users (tested on Apple M-series chips only), you can install EagleC2 and its dependencies with:
$ conda config --add channels defaults $ conda config --add channels bioconda $ conda config --add channels conda-forge $ mamba clean --all $ mamba create -n EagleC2gpu python=3.11 hdbscan numba statsmodels joblib=1.3 numpy=1.26 scikit-learn=1.4 $ mamba activate EagleC2gpu $ pip install --no-cache-dir cooler==0.9.1 $ pip install --no-cache-dir tensorflow==2.16.1 keras==3.3.3 $ pip install --no-cache-dir tensorflow-metal==1.1.0 $ pip install eaglec
Similarly, if you would like to use the visualization or gene fusion annotation modules on macOS, please install matplotlib, pyBigWig, and pyensembl as described above.
Before proceeding, please download the pre-trained models for EagleC2.
Unlike EagleC, which relied on separate models trained for specific resolutions (e.g., 5 kb, 10 kb, 50 kb, and 500 kb) and sequencing depths, EagleC2 was trained on a unified dataset that integrates samples across a wide range of resolutions and depths. This allows for seamless application to data at arbitrary resolutions and sequencing depths, without the need for model re-training.
EagleC2 is distributed with eight command-line tools. You can command [-h]
in a
terminal window to view the basic usage of each command.
-
predictSV
predictSV is the core command for predicting SVs from chromatin contact maps.
Required inputs:
- Path to a .mcool file – This is a multi-resolution format for storing contact matrices. See cooler for details. If you only have .hic files (see Juicer), you can convert them to .mcool using hic2cool or HiClift.
- Path to the folder containing the pre-trained models.
Output:
The predicted SVs will be written to a .txt file with 13 columns:
- Breakpoint coordinates (chrom1, pos1, chrom2, pos2)
- Probability values for each SV type (++, +-, -+, --, ++/--, and +-/-+)
- The resolution of the contact matrix from which the SV was originally predicted
- The finest resolution to which the SV can be refined
- The number of bad bins near the SV breakpoints
-
plot-SVbreaks
Plots a local contact map centered on the provided SV breakpoint coordinates. For intra-chromosomal SVs, contact counts will be distance-normalized. All contact matrices will be min-max scaled to the range [0, 1].
The input breakpoint coordinates should follow the format: "chrom1,pos1,chrom2,pos2".
This is useful for visually checking whether the expected contact patterns are present around SV breakpoints, including those identified by short-read or long-read whole-genome sequencing methods.
-
filterSV
Filters the predicted SVs based on probability values.
-
evaluateSV
Evaluates a predefined list of SVs using EagleC2 models.
-
reformatSV
Reformats the output from predictSV into a format compatible with NeoLoopFinder.
-
annotate-gene-fusion
Annotates gene fusion events for a list of SV breakpoints.
-
plot-interSVs
Plots a contact map for a specified set of chromosomes, with predicted SVs marked.
-
plot-intraSVs
Plots a contact map for a specified genomic region, with predicted SVs marked.
As the commands annotate-gene-fusion, plot-interSVs, and plot-intraSVs are directly inherited from the original EagleC, this documentation does not cover them in detail. For more information, please refer to the orignal EagleC documentation
The following steps will guide you through the process of using EagleC2. All commands below are expected to be executed in a terminal window.
Place the downloaded pre-trained models in your working directory and unzip the archive:
$ unzip EagleC2-models.zip
Download the test dataset FY1199.used_for_SVpredict.mcool, which contains ~18 million contact pairs. This dataset is derived from FY1199, a human lymphoblastoid cell line with a known balanced inter-chromosomal translocation between chromosomes 11 and 22 (46,XY,t(11;22)(q23.3;q11.2)). Place the file in the same directory as the pre-trained models.
Execute the following command to perform SV prediction on this Hi-C dataset:
$ predictSV --mcool FY1199.used_for_SVpredict.mcool --resolutions 25000,50000,100000 \ --high-res 25000 --prob-cutoff-1 0.5 --prob-cutoff-2 0.5 -O FY1199_EagleC2 \ -g hg38 --balance-type ICE -p 8 --intra-extend-size 1,1,1 --inter-extend-size 1,1,1
For view a full description of each parameter, run:
$ predictSV -h
This command performs genome-wide SV prediction on ICE-normalized contact matrices
at 50 kb and 100 kb resolutions (as specified by --resolutions
, excluding those
listed in --high-res
). To accelerate computation, pixels with significantly
elevated contact counts are identified and extended by 1 bin on both ends (controlled
by --intra-extend-size
and --inter-extend-size
; the values specified for these
parameters correspond to each resolution listed in --resolutions
) to cover potential
SV breakpoints.
SV predicted at coarser resolutions are progressively refined at higher resolutions.
For example, an SV initially predicted at 100 kb (with a probability cutoff of 0.5,
set by --prob-cutoff-1
) will be refined at 50 kb. If the probability at 50 kb exceeds
the second cutoff (set by --prob-cutoff-2
), the SV will be further refined at 25 kb.
Otherwise, the 50 kb coordinates are reported as final.
SV predictions across all resolutions are merged in a non-redundant manner. For resolutions
specified in --high-res
, the program performs refinement only—not genome-wide scanning.
Computation is parallelized using 8 CPU cores (set via -p 8
). If a GPU is available and
the --cpu
flag is not set, model inference will run on the GPU.
After ~5 minutes (depending on your machine), you will find the predicted SVs in a .txt file named "FY1199_EagleC2.SV_calls.txt" in your working directory:
$ cat FY1199_EagleC2.SV_calls.txt chrom1 pos1 chrom2 pos2 ++ +- -+ -- ++/-- +-/-+ original resolution fine-mapped resolution gap info chr4 52200000 chr4 64400000 0.6095 1.42e-06 1.96e-06 7.996e-09 3.169e-08 9.618e-09 100000 100000 0,0 chr11 116800000 chr22 20300000 2.495e-11 1.013e-06 5.552e-07 7.897e-11 2.943e-12 1 50000 25000 0,0
The known balanced translocation is successfully detected. The final breakpoint
coordinates (chr11:116800000;chr22:20300000) are reported at 25 kb resolution (see
the "fine-mapped resolution" column), while the SV was initially predicted at 50 kb
(see the "original resolution" column). The last column, "gap info", indicates that
there are no problematic bins in reference genome (hg38, as specified by the parameter
-g
) near either breakpoint (0,0).
Note
Valid Options for the --balance-type
parameter are "ICE", "CNV" and "Raw".
- Use "Raw" to process unnormalized contact matrices.
- Use "ICE" only if your matrices have been balanced with cooler balance.
- Use "CNV" only if your matrices have been CNV-corrected using
correct-cnv
from the NeoLoopFinder toolkit.
Different normalization strategies may yield slightly different results. For best sensitivity, we recommend running predictSV on all three types of contact matrices (Raw, CNV, and ICE) and merging the results.
All key parameters that may affect the sensitivity and specificity of SV prediction
are configurable via command-line options. For details, please refer to the help
message of the predictSV command by running predictSV -h
.
Note
During prediction, intermediate files will be stored in a hidden folder named ".eaglec2" within your working directory. You may remove this folder after the process completes to free up disk space.
For the BT-474 Hi-C dataset in Figure 2, we used the following command (for HCC1954 and MCF7, the same parameters were used with different .mcool inputs):
$ predictSV --mcool BT474.used_for_SVpredict.mcool --resolutions 5000,10000,50000 \ --high-res 2000 -O BT474_EagleC2 -g hg38 --balance-type CNV \ -p 8 --intra-extend-size 2,2,1 --inter-extend-size 1,1,1
For the HCC1954 Arima Hi-C dataset in Figure 3, we used the following command
(we also ran with --balance-type Raw
and --balance-type CNV
, keeping all
other parameters unchanged):
$ predictSV --mcool HCC1954-Arima-allReps-filtered.mcool --resolutions 1000,2000,5000,10000,25000,50000,100000,250000,500000,1000000 \ --high-res 500 -O HCC1954-Arima -g hg38 --balance-type ICE \ -p 8 --entropy-cutoff 0.98 --intra-extend-size 3,3,3,2,2,2,1,1,1,1 \ --inter-extend-size 2,2,2,1,1,1,1,1,1,1
For the single-cell Hi-C datasets in Figure 6, we used the following command
to predict SVs from pseudo-bulk contact matrices (we also ran with --balance-type Raw
,
keeping all other parameters unchanged):
$ predictSV --mcool GBM39-pseudo-bulk.mcool --resolutions 5000,10000,25000,50000,100000,250000,500000,1000000 \ --high-res 500 -O GBM39-pseudo-bulk -g hg38 --balance-type ICE \ -p 8 --entropy-cutoff 0.98 --intra-extend-size 3,2,2,2,1,1,1,1 \ --inter-extend-size 2,1,1,1,1,1,1,1
To assess the quality of predicted SVs, you can visualize the local contact patterns around the breakpoints using the plot-SVbreaks command.
For example, the following command plots the contact map centered on the breakpoints of the balanced translocation detected in the previous step (see panel a):
$ plot-SVbreaks --cool-uri FY1199.used_for_SVpredict.mcool::resolutions/25000 \ --balance-type ICE --breakpoint-coords chr11,116800000,chr22,20300000 \ --window-width 15 -O chr11,116800000,chr22,20300000.25kb.png --dpi 800
Similarly, this command visualizes the contact patterns around the breakpoints of another intra-chromosomal SV detected earlier (see panel b):
$ plot-SVbreaks --cool-uri FY1199.used_for_SVpredict.mcool::resolutions/100000 \ --balance-type ICE --breakpoint-coords chr4,52200000,chr4,64400000 \ --window-width 15 -O chr4,52200000,chr4,64400000.100kb.png --max-value 1 --dpi 800
As shown in the figures, the balanced translocation exhibits a clear butterfly-shaped contact pattern, consistent with the highest predicted probability for the "+-/-+" SV type (1). In contrast, the intra-chromosomal SV displays a strong interaction block in the upper-left quadrant, consistent with the highest predicted probability for the "++" SV type (0.6095) at the breakpoints.
As mentioned eariler, by default, predictSV outputs all predicted SVs with a maximum probability score greater than 0.5. However, you may want to filter the results further to reduce the number of false positives. To this end, the filterSV command can be used. This command takes as input a .txt file generated by the predictSV command and outputs a .txt file in the same format, containing only SVs that pass the specified filtering criteria.
In figure 2, we applied the following command to filter SVs:
$ filterSV -i BT474_EagleC2_new.SV_calls.txt -o BT474_EagleC2_new.SV_calls.filtered.txt \ --res-cutoffs 0.5,0.65 --res-list 5000,10000
The key parameters here are --res-cutoffs
and --res-list
. The former specifies the
probability cutoffs for filtering SVs at different resolutions, while the latter specifies
the corresponding resolutions.
In this case, we set a cutoff of 0.5 for 5 kb resolution and 0.65 for 10 kb resolution.
For each line in the input file, the program checks the finest resolution at which the
SV was refined. If the resolution is not listed in --res-list
, the SV will be filtered out.
Otherwise, it will check whether the associated probability score meets the cutoff specified
in --res-cutoffs
for that resolution. If the score is below the cutoff, the SV will also be
excluded.
To evaluate a predefined list of SVs using EagleC2 models, the evaluateSV command can be used. This command takes as input the paths to a .mcool file, a folder containing pre-trained models, and a .txt file with breakpoint coordinates, and outputs a .txt file containing the evaluation results.
To demonstrate its usage, download the example SV file (HCC1954-SVs.txt, which contains a subset of SVs identified from WGS in HCC1954 cells) and the HCC1954 Arima Hi-C .mcool file:
$ head HCC1954-SVs.txt chr2 131643352 chr7 44807635 chr21 41391239 chr8 120011960 chr16 34171022 chr20 30928661 chr22 32887577 chr5 116559449 chr5 118443982 chr8 109240006
The input SV file should contain breakpoint coordinates: chrom1, pos1, chrom2, and pos2 in the first four columns, separated by tabs or spaces. The file may include additional columns, but they will be ignored during evaluation.
Then run the following command:
$ evaluateSV -i HCC1954-SVs.txt -m HCC1954-Arima-allReps-filtered.mcool -O HCC1954-SVs.EagleC2 \ --model-path EagleC2-models --resolutions 5000,10000,50000 --balance-type Raw
This command evaluates the SVs using raw contact signals at 5 kb, 10 kb, and 50 kb resolutions.
After a few minutes, the results will be written to a .txt file named "HCC1954-SVs.EagleC2.txt" in your working directory:
$ head -5 HCC1954-SVs.EagleC2.txt chrom1 pos1 chrom2 pos2 strand probability resolution chr2 131643352 chr7 44807635 ++/-- 2.068e-08 50000 chr2 103773984 chr7 148264670 ++ 3.948e-07 50000 chr2 157420649 chr7 111890623 -+ 3.604e-06 50000 chr2 33059444 chr7 65078783 ++ 2.649e-06 50000
This output file contains 7 columns. For each SV and at each specified resolution, the following information is reported:
- Breakpoint coordinates (same as the first four columns of the input file)
- SV type with the highest probability (one of: ++, +-, -+, --, ++/--, or +-/-+)
- Corresponding probability score
- Resolution at which the SV was evaluated
Note: If there are no Hi-C signals within a 15x15 window centered on an SV at a given resolution, the SV will not be reported at that resolution. For resolutions with sufficient signal, the program performs no filtering—so the output may include SVs with very low probability scores.