Xenomake: Processing Pipeline for Patient Derived Xenograft (PDX) Spatial Transcriptomics Datasets


Install
pip install xenomake==1.0.9

Documentation

  1. About Xenomnake
    1.1. PDX Model
    1.2. Spatial Integration
    1.3. Pipeline Overview
    1.4. Publication
  2. Installation
    2.1. Create conda environment and build dependencies
  3. Running Xenomake
    3.1. QuickStart
            3.1.2 Memory Requirements and Runtimes
    3.2. Step1: Xenomake Init
            3.2.1. Description
            3.2.2. Flags
            3.2.3. Command Line Implementation
            3.2.4. Output
    3.3. Step2: Xenomake Species
            3.3.1 Description
            3.3.2 Flags
            3.3.3. Command Line Implementation
            3.3.4. Output
    3.4 Step3: Xenomake Config
            3.4.1 Decsription
            3.4.2 Flags
            3.4.3 Command Line Implementation
    3.5. Step3: Run Snakemake Pipeline
            3.5.1. Snakemake Description
            3.5.2. Dry Run
            3.5.3. Execute Snakemake Workflow
            3.5.4. Snakemake Standard Outs
            3.5.5. Xenomake Output Structure
  4. Downstream Cell-Cell Interactions
  5. QC
    5.1. Scanpy QC Plots
    5.2. Xengsort Metrics
    5.3. STAR Metrics
    5.4. UMI Metrics
  6. Test Dataset
    6.1. Download Data
    6.2. Run Xenomake Pipeline
  7. Optional Downloads
    7.1. Dropseq
    7.2. Picard

About Xenomake

Processing pipeline for Patient Derived Xenograft (PDX) Spatial Transcriptomics Datasets

PDX Model

Patient-derived xenografts are a widely used component of cancer research
These models consist of grafting a human tumor sample into mice to study interactions between a patient's tumor and stroma, screening compounds that target a patient's tumor, and simulating human tumor biology

Spatial Integration

With the rising popularity of spatially resolved transcriptomic technologies, there is a growing need for processing pipelines that can handle reads from PDX samples. Spatial processing has the potential to analyze tumor/stroma interactions and tumor microenvironments.

Pipeline

To facilitate the adoption of spatial transcriptomics for PDX studies, we thus have developed a pipeline "Xenomake", which is an end-to-end pipeline that includes read alignment and gene quantification steps for reads generated by spatial transcriptomic platforms and uses a xenograft sorting tool (Xengsort) to apportion xenograft reads to the host and graft genomes.
Xenomake's structure is based on Snakemake and includes scripts to handle multi-mapping, downstream analysis, and handling "ambiguous" reads

Publication

Xenomake: a pipeline for processing and sorting xenograft reads from spatial transcriptomic experiments
Benjamin S Strope, Katherine E Pendleton, William Z Bowie, Gloria V Echeverria, Qian Zhu
bioRxiv 2023.09.04.556109; doi: https://doi.org/10.1101/2023.09.04.556109

Installation

Create the conda environment and build dependencies

# Create conda environment with the provided environment file
conda env create -n xenomake -f environment.yaml

# Activate conda environment
conda activate xenomake

#install xenomake
pip install xenomake

Running Xenomake

Quickstart

Quickstart: Initialize Xenomake:

#Initialize Xenomake
xenomake init \
	-r1 <read1> \
	-r2 <read2> \
	-s <sample name> \
	-o <output directory> \
	--spatial_mode <preset spatial mode [visium,dbit-seq,seq-scope] or custom> \
	--run_mode <preset run mode [lenient,prude] or custom> \

Quickstart: Add Species Information:

# Print Help Function Call
xenomake species --help

# Create species folder 
xenomake species \
	-mr <mouse_reference_assembly.fa> \
	-hr <human_reference_assemble.fa> \
	-ma <mouse_annotation.gtf> \
	-ha <human_annotation.gtf> \

Quickstart: Spatial Configuration:

If default run mode and spatial mode, this step can be skipped
If 'custom' run mode or spatial mode, you must input the following parameters

# Print Help Function Call
xenomake spatial --help

# Create configuration file
xenomake spatial \
	--barcode_file <barcode whitelist> \
	--umi <umi structure> \
	--cell_barcode <cell barcode structure> \
	--beads <number of spots/beads> \
	--spot_diameter_um <spot diameter> \
	--slide_size_um <slide size> \
	--ambiguous <re-partition ambiguous reads from xenograft sorting> \
	--downstream <perform downstream data processing> \
	--genic_only <only count genic reads (i.e., no UTR, intergenic,intronic> \
	--mm_reads <if read is mult-imapped, choose most likely position>

Quickstart: Run Snakemake Pipeline

# Dry Run
xenomake run -n 
# Initialize Xenomake Pipeline (species dir and config.yaml need to be created)
xenomake run \
	--cores <n_threads>

Memory Requirements and Runtimes

Process Threads Time RAM Recommended
Xengsort Index 8 33 min 25 Gb
STAR Index 8 ~90 min per genome 128 Gb
Xenomake Pipeline
(Test Dataset: 8.4m reads)
8 37 min 35 Gb
Xenomake Pipeline
(Test Dataset: 8.4m reads)
4 42 min 35 Gb
Xenomake Pipeline
(Medulloblastoma: 88m reads)
8 300 min 35 Gb

Step 1: Project Initialization:

This is a REQUIRED step to initialize your implementation of the xenomake pipeline

Description

xenomake init takes information about your reads, run modes, spatial chemistry, and sample/project names to create a configuration file for your project

Run Modes
===========

Run modes refer to how reads are handled during mapping and xenograft sorting portions of the pipeline. Options are the following:

1. lenient: xenograft ambiguous re-mapped (True), multi-mapped re-aligned (True), only genic reads (True) <br>
2. stringent: xenograft ambiguous re-mapped (False), multi-mapped re-aligned (False), only genic reads (True) <br>
3. custom: User Defined 



Spatial Modes
===============

Spatial modes refer to the spatial technology used to generate reads

1. visium <br>
2. slide-seq <br>
6. dbit-seq <br>
7. custom: User Defined barcode structure, umi structure, spot size, slide size, barcode file, number of beads/spots

Required Flags:

  • --read1: paired-end read in fastq format
  • --read2: second paired-end read
  • --outdir: name of project directory where the output file will be written
  • --sample: name of sample to prepend filenames and create sample directory within the project directory
  • --spatial_mode: preset run modes based upon spatial method used [custom,visium,slide_seq,hdst_seq,stereo_seq,pixel_seq,dbit_seq]
  • --run_mode: preset run modes based upon read processing. This controls multi-mapped, ambiguous, and non-genic read handling [custom, prude, lenient]

Optional Flags:

  • --root_dir: directory of project execution (default: ./)
  • --temp_dir: temporary directory (default: /tmp)
  • --picard: path to user preferred picard.jar tool (default: tools/picard.jar)
  • --dropseq_tools: path to user preferred download of dropseq tools directory (default: tools/Dropseq_2.5.1)
  • --download_species: download mm10 and hg38 genomes and annotations to the current directory (default: False)
  • --download_index: download star indices and xengosort index to the current directory (default: False)

Command Line Implementation:

# Print Help Function Call
xenomake init --help

#Initialize Project
xenomake init \
	-r1 <sample_R1.fastq.gz> \
	-r2 <sample_R2.fastq.gz> \
	-o <project_name> \
	-s <sample_name> \
	--spatial_mode <spatial_name> \
	--run_mode <custom, lenient, spatial>

Output Message

sample name: A1
spatial chemistry: visium
run mode: lenient
project "downsampled" initialized, proceed to species setup and project execution

Example Configuration File

config.yaml

Step 2: Xenomake Species

This is a REQUIRED step to initialize your implementation of the xenomake pipeline

Description

xenomake species links all inputs in a new directory structure that is then easily referenced in project run

Required Flags:

  • --mouse_ref: Mouse Reference Assembly in fasta format
  • --human_ref: Human Reference Assembly in fasta format
  • --mouse_annotation: Mouse Genome Annotation File in .gtf format
  • --human_annotation: Human Genome Annotation File in .gtf format

Optional Flags:

Including optional flags for indices will skip parts in the pipeline such as STAR --runMode CreateIndex and xengsort index
These two rules can be time-consuming and if these files are already generated, it doesn't need to be done again.

  • --human_index: STAR index directory for human genome reference
  • --mouse_index: STAR index directory for mouse genome reference
  • --xengsort_hash: Xengsort hash table filepath to be used in xengsort classify step
  • --xengsort_info: Xengosrt info table filepath to be used in xengsort classify step

Command Line Implementation:

# Print Help Function Call
xenomake species --help

# Create species folder 
xenomake species \
	--mouse_reference <mouse_reference_assembly.fa> \
	--human_reference <human_reference_assemble.fa> \
	--mouse_annotation <mouse_annotation.gtf> \
	--human_annotaion <human_annotation.gtf> \

Output Message

species directory successfully completed, keep project execution in the current directory

Output Directory Structure

├─ species
│ ├── idx.hash
│ ├── idx.info
│ ├── human
│ │ ├── genome.fa
│ │ ├── annotation.gtf
│ │ ├── star_index/
│ ├── mouse
│ │ ├── genome.fa
│ │ ├── annotation.gtf
│ │ ├── star_index/

Step 3: Xenomake Config (custom run)

Description

Only necessary if you selected custom for either run_mode or spatial_mode as a part of xenomake init

Spatial Flags

  • --barcode_file:
  • --umi:
  • --cell_bc:
  • --beads:
  • spot_diameter_um:
  • slide_size_um:

Run Flags

  • downstream:
  • mm_reads:
  • genic_only:
  • --ambiguous:

Command Line Implementation:

if spatial_mode: custom

xenomake config \
	--barcode_file <path to spot/cell barcode file> \
	--umi <umi structure in units of bases> \
	--cell_bc <barcode structure in unit of bases> \
	--beads <number of spots/beads in spatial array> \
	--spot_diameter_um <diameter of spot in um> \
	--slide_size_um <diameter of slide in um>

if run_mode : custom

xenomake config \
	--downstream <downstream processing (True/False)> \
	--mm_reads <count multi-mapped reads (True/False)> \
	--genic_only <only count genic reads in final count matrix (True/False)> \
	--ambiguous <partition xenograft 'ambiguous' reads back into alignments (True/False)>

Step 4: Execute Project

Snakemake

Xenomake uses the Snakemake workflow management system to create reproducible and scalable data analyses
To get an understanding of snakemake, please visit: https://snakemake.github.io/
For additional information, read Snakemake's paper https://doi.org/10.12688/f1000research.29032.1

Dry Run

It is recommended to first execute a dry-run with flag --dry-run, Xenomake will only show the execution plan instead of performing steps.

Flags:

-cores: number of cores to use for project run
--dry-run: Dry-Run showing the workflow
--rerun-incomplete: Force to rerun incompletely generated files if project halted
--keep-going: If a job fails, keep executing independent jobs that don't rely on its output

#Dry Run
xenomake run --cores <n_cores> --dry-run
#Run Project
xenomake run --cores <n_cores> --keep-going

Example Snakemake Standard Outs

Below shows an example of the messages printed by Snakemake. Briefly, they give an overview of the jobs in line for execution and print messages when a rule is being performed. This includes a few important features:

  1. Start time
  2. Finish time
  3. Rule name
  4. Input files
  5. Output file destinations
  6. Wildcard values
  7. Additional params and log destinations

Snakemake will also print any standard-out messages from tools performed. We have saved some of these as log files for subsequent review

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	DGE_Human
	1	Expression_QC
...
	1	Subset_Overlapped
	1	Xengsort_Clasify
	1	all
	28
Select jobs to execute...

[Fri Sep  8 10:36:04 2023]
rule Preprocess:
    input: fastq/sub1.fq, fastq/sub2.fq
    output: out/treated/preprocess/unaligned_bc_umi_tagged.bam, out/treated/logs/Preprocess.summary
    log: out/treated/logs/Preprocess.log
    jobid: 6
    wildcards: OUTDIR=out, sample=treated

[Fri Sep  8 10:36:06 2023]
Finished job 6.
1 of 28 steps (4%) done
Select jobs to execute...

...
...
...

[Fri Sep 8 10:47:41 2023]
localrule all:
    input: out/treated/final/treated_human_counts.tsv.gz, out/treated/final/treated_mouse_counts.tsv.gz, out/treated/final/treated_human.hdf5, out/treated/final/treated_mouse.hdf5, out/treated/final/treated_human.h5ad, out/treated/final/treated_mouse.hdf5, out/treated/final/treated_mouse.h5ad, out/treated/final/treated_human_processed.h5ad, out/treated/final/treated_mouse_processed.h5ad, out/treated/qc/human_qc.png
    jobid: 0
    threads: 16

[Fri Sep 8 14:37:01 2023]
Finished job 0.
29 of 29 steps (100%) done

Output Directory Structure

├─ OUTDIR
│ ├── sample
│ │ ├── final.bam: Mapped, Sorted, Xenograft Processed alignments in bam format
│ │ ├── final
│ │ │ ├── tissue_positions_list.csv
│ │ │ ├── {sample}_counts.tsv.gz: Gene by Barcode umi count matrix generated by dropseq DigitalGeneExpression
│ │ │ ├── {sample}_human.h5ad:Spatial Anndata object
│ │ │ ├── {sample}_human.hdf5: 10X genomics formatted output structure (can be used for downstream spatial processing)
│ ├── xengsort: Xenograft sorted fastq files generated using xengsort toolkit
│ │ │ ├── ambiguous.fq.gz
│ │ │ ├── both.fq.gz
│ │ │ ├── graft.fq.gz
│ │ │ ├── host.fq.gz
│ │ │ ├── neither.fq.gz
│ ├── mapping
│ │ │ ├── STAR aligned genomes: Aligned files without Xenograft Processing
│ ├── logs
│ │ │ ├── Tagging and trimming logs
│ │ │ ├── STAR log files
│ │ │ ├── BAM subsetting logs
│ │ │ ├── Xengsort log
│ │ │ ├── Multi-map filtering logs
│ │ │ ├── Digital Gene Expression logs
│ ├── qc
│ │ │ ├── Xengsort summary files
│ │ │ ├── STAR summary files
│ │ │ ├── Dataset qc plots
│ │ │ ├── Digital Gene Expression summary
│ ├── preprocessing
│ │ │ ├── unaligned_bc_umi_tagged.bam: Unaligned bam file. Cell-barcode/UMI tagged and Adapter/PolyA trimmed using dropseq workflows

Downstream Cell-Cell Interactions

Intro

Using the outputs from Xenomake, we were able to identify group specific expression and differential signalling genes in our model. We provide scripts in the repository CellInteract to perform identical analyses that are present in the Xenomake Publication

Cell-Cell Signalling

Cell Signalling molecules such as chemokines/cytokines are simultaneously secreted by both stromal cell types and epithelial tumor cells in the tumor microenvironment. It is often difficult to attribute signalling expression to stromal or tumor cells in standard models. However in PDX models processed with means similar to Xenomake, we are able to study the differential production of cellular signals and identify biomarkers in stromal and epithelial compartments

Xenomake Publication Figure (Organism Specific Signalling)

Xenomake Plot

Repository

Access further description as well as scripts to perform these downstream analysis: Link to CelllInteract Repository: https://github.com/bernard2012/CellInteract

QC

Scanpy QC Plots

QC Metrics

Xengsort Metrics

Counts number of reads partitioned to each file

prefix host graft ambiguous both neither
test 6652723 13654867 2603727 1297366 154218
## Running time statistics
- Running time in seconds: 225.0 sec
- Running time in minutes: 3.75 min
- Done. [2023-08-17 16:19:49]

STAR Metrics

                                 Started job on |       Aug 16 18:57:26
                             Started mapping on |       Aug 16 18:58:54
                                    Finished on |       Aug 16 19:04:22
       Mapping speed, Million of reads per hour |       968.48

                          Number of input reads |       88239353
                      Average input read length |       108
                                    UNIQUE READS:
                   Uniquely mapped reads number |       36755354
                        Uniquely mapped reads % |       41.65%
                          Average mapped length |       109.05
                       Number of splices: Total |       7555478
            Number of splices: Annotated (sjdb) |       7421066
                       Number of splices: GT/AG |       7487922
                       Number of splices: GC/AG |       23361
                       Number of splices: AT/AC |       1030
               Number of splices: Non-canonical |       43165
                      Mismatch rate per base, % |       1.75%
                         Deletion rate per base |       0.04%
                        Deletion average length |       1.90
                        Insertion rate per base |       0.04%
                       Insertion average length |       1.89
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       10085928
             % of reads mapped to multiple loci |       11.43%
        Number of reads mapped to too many loci |       768576
             % of reads mapped to too many loci |       0.87%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       38624549
                 % of reads unmapped: too short |       43.77%
                Number of reads unmapped: other |       2004946
                     % of reads unmapped: other |       2.27%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

UMI Metrics

Example of top 4 spots with highest expression

CELL_BARCODE NUM_GENIC_READS NUM_TRANSCRIPTS NUM_GENES
CAGTTCCGCGGGTCGA 429279 67931 9287
GAAACTCGTGCGATGC 398339 59154 8748
AAGAGATGAATCGGTA 362017 57447 8723
CCAAGAAAGTGGGCGA 367515 54742 8306

Run Xenomake on Medulloblastoma Test Dataset

Follow Installation Instructions for Xenomake if not done already

Dataset

Paired End Fastq Files Downsamples to 8.4 million reads
Taken from: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-023-01185-4

Download Dataset

wget https://zenodo.org/records/10655403/files/sub1.fq.gz
wget https://zenodo.org/records/10655403/files/sub2.fq.gz

Run Xenomake Pipeline

Initialize Project

xenomake init \
	-r1 sub1.fq.gz \
	-r2 sub2.fq.gz \
	-s downsampled \
	-o test_run \
	--run_mode lenient \
	--spatial_mode visium \
	--download_genome True \
	--download_index True \
	--download_xengsort True

Species Setup

xenomake species \
	-hr GRCh38.p14.fna.gz \
	-ha GRCh38.p14.gtf \
	-mr GRCm39.fna.gz \
	-ma GRCm39.gtf \
	--human_index human_index/ \
	--mouse_index mouse_index/ \
	--xengsort_hash idx.hash \
	--xengsort_info idx.info

Run Project

xenomake run --cores 4

Optional Downloads

Download Dropseq

Current version of Dropseq-tools is 2.5.3 and is present at <repo_dir>/tools/Dropseq-2.5.3/
For updated versions check: https://github.com/broadinstitute/Drop-seq

Download Picard

Version Control

The provided picard.jar file is currently compatible with Java installed in the Xenomake conda environment.
Java: v11.0.15
Picard: v2.27.5

Picard Github

If you choose to use your own Picard executable, ensure compatibility with Java
https://broadinstitute.github.io/picard/