ARC (Antigen Receptor Classifier)

Authors: Austin Crinklaw, Swapnil Mahajan

Requirements:

Linux OS
HMMER3
NCBI Blast+
Python 3+
- Python packages: Pandas, BioPython

Installation:

We provide a Dockerfile for ease of use.

ARC can also be downloaded through PyPI using the following pip command.

pip install bio-arc

Usage:

Input

A FASTA-format file with one or more protein sequences.

>1WBZ_A_alpha I H2-Kb
MVPCTLLLLLAAALAPTQTRAGPHSLRYFVTAVSRPGLGEPRYMEVGYVDDTEFVRFDSDAENPRYEPRARWMEQEGPEYWERETQKAKGNEQSFRVDLRTLLGYYNQSKGGSHTIQVISGCEVGSDGRLLRGYQQYAYDGCDYIALNEDLKTWTAADMAALITKHKWEQAGEAERLRAYLEGTCVEWLRRYLKNGNATLLRTDSPKAHVTHHSRPEDKVTLRCWALGFYPADITLTWQLNGEELIQDMELVETRPAGDGTFQKWASVVVPLGKEQYYTCHVYHQGLPEPLTLRWEPPPSTVSNMATVAVLVVLGAAIVTGAVVAFVMKMRRRNTGGKGGDYALAPGSQTSDLSLPDCKVMVHDPHSLA
>1WBZ_B_b2m I H2-Kb
MARSVTLVFLVLVSLTGLYAIQKTPQIQVYSRHPPENGKPNILNCYVTQFHPPHIEIQMLKNGKKIPKVEMSDMSFSKDWSFYILAHTEFTPTETDTYACRVKHASMAEPKTVYWDRDM
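
To check an input file before running ARC, the records can be read with a few lines of Python. This is a minimal sketch and not part of ARC; the function name `read_fasta` is hypothetical. It returns the full description line, which is what the output's 'ID' column reports.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from FASTA-formatted text lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)
```

For example, `list(read_fasta(open("/path/to/input.fasta")))` gives one pair per sequence in the file.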

Commands

Using a FASTA file as input:

python -m ARC classify -i /path/to/input.fasta -o /path/to/output.csv
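
The same command can be launched from a Python script. The sketch below only wraps the command line shown above with `subprocess`; the helper names `build_arc_command` and `run_arc` are hypothetical and are not part of ARC's API.

```python
import subprocess
import sys
from pathlib import Path


def build_arc_command(fasta_path, output_path):
    """Return the argument list for the documented `ARC classify` call."""
    return [
        sys.executable, "-m", "ARC", "classify",
        "-i", str(Path(fasta_path)),
        "-o", str(Path(output_path)),
    ]


def run_arc(fasta_path, output_path):
    """Run ARC; raises CalledProcessError if the command exits non-zero."""
    subprocess.run(build_arc_command(fasta_path, output_path), check=True)
    return Path(output_path)
```

Using `sys.executable` runs ARC with the same interpreter as the calling script, which assumes bio-arc is installed in that environment.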

Output

The output file is in CSV format and has 4 columns.
The first column, 'ID', is the description provided in the FASTA file for each sequence.
The second column, 'class', is the assigned molecule class for each sequence.
- e.g. MHC-I, MHC-II, BCR or TCR.
The third column, 'chain_type', is the assigned chain type for each sequence.
- e.g. alpha, beta, heavy, lambda, kappa, scFv, TscFv or construct. Chains are also labelled V for a variable domain or C for a constant domain.
The fourth column, 'calc_mhc_allele', is the MHC allele identified using groove domain similarity to MRO alleles.

ID	class	chain_type	calc_mhc_allele
1WBY_A_alpha I H2-Db	MHC-I	alpha V
1WBY_B_b2m I H2-Db
1HQR_A_alpha II HLA-DRA01:01/DRB501:01	MHC-II	alpha C	HLA-DRA*01:01
1HQR_B_beta II HLA-DRA01:01/DRB501:01	MHC-II	beta C	HLA-DRB5*01:01
2CMR_H_heavy	BCR	heavy V
2CMR_L_light	BCR	kappa C
4RFO_L_light	BCR	lambda V
3UZE_A_heavy	BCR	scFv
1FYT_D_alpha	TCR	alpha V
1FYT_E_beta	TCR	beta C
3TF7_C_alpha	TCR	TscFv
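
The output can be loaded for downstream filtering with the standard `csv` module. This is a rough sketch that assumes the four column names listed above. The delimiter is left as a parameter because the text describes CSV while the sample above is shown tab-separated. The helper names and the 'unclassified' label are hypothetical.

```python
import csv
import io


def read_arc_output(handle, delimiter=","):
    """Return one dict per sequence, with empty cells mapped to None."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        rows.append({key: (value or None) for key, value in row.items()})
    return rows


def group_ids_by_class(rows):
    """Map each assigned class (e.g. 'MHC-I') to the IDs that received it."""
    groups = {}
    for row in rows:
        # 'unclassified' is a hypothetical label for rows with no class.
        groups.setdefault(row["class"] or "unclassified", []).append(row["ID"])
    return groups


sample = (
    "ID,class,chain_type,calc_mhc_allele\n"
    "2CMR_H_heavy,BCR,heavy V,\n"
    "1FYT_D_alpha,TCR,alpha V,\n"
    "1HQR_A_alpha II HLA-DRA01:01/DRB501:01,MHC-II,alpha C,HLA-DRA*01:01\n"
)
rows = read_arc_output(io.StringIO(sample))
groups = group_ids_by_class(rows)
```

For a real run, pass `open("/path/to/output.csv", newline="")` instead of the in-memory sample.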

How it works:

BCR and TCR chains are identified using HMMs. A given protein sequence is searched against HMMs built using BCR and TCR chain sequences from IMGT. HMMER is used to align an input sequence to the HMMs.
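
As an illustration of the HMM search step, the sketch below reads HMMER's tabular output (`--tblout`) and keeps the best-scoring HMM for each query sequence. It is not ARC's own parser, which is adapted from ANARCI. Choosing the single top-scoring hit, treating `min_score` as an inclusive cutoff, and the function names are all assumptions.

```python
def parse_tblout(lines):
    """Yield (query, target, bit_score) from HMMER --tblout text lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and the commented header/footer
        fields = line.split()
        # tblout columns: 0 = target name, 2 = query name,
        # 5 = full-sequence bit score.
        yield fields[2], fields[0], float(fields[5])


def best_hit_per_query(lines, min_score=0.0):
    """Return {query: (target, score)} for the top hit at or above min_score."""
    best = {}
    for query, target, score in parse_tblout(lines):
        if score < min_score:
            continue  # assumption: the cutoff is inclusive
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return best
```

A cutoff such as the bit score of 25 quoted for the MHC HMMs could be passed as `min_score=25`.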
MHC class I (alpha1-alpha2 domains) and MHC class II alpha and beta chain HMMs are downloaded from the Pfam website. An input protein sequence is searched against these HMMs. A HMMER bit score threshold of 25 is used to identify MHC chain sequences.
To identify MHC alleles, the MRO repository is downloaded every time the script is run. Groove domains (G-domains) are assigned to new MRO alleles and stored in a CSV file. If this file does not exist, G-domains are assigned to all the MRO alleles (which may slow down the script).
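
The G-domain caching behaviour can be sketched as follows. This is an illustration rather than ARC's code: the cache column names `allele` and `g_domain`, the function name, and the `assign_gdomain` callable are hypothetical stand-ins.

```python
import csv
from pathlib import Path


def update_gdomain_cache(cache_path, alleles, assign_gdomain):
    """Assign G-domains only for alleles that are missing from the CSV cache.

    alleles: dict mapping allele name -> protein sequence.
    assign_gdomain: callable(sequence) -> G-domain string (hypothetical).
    Returns (all cached G-domains, list of newly assigned allele names).
    """
    cache_path = Path(cache_path)
    cached = {}
    if cache_path.exists():
        with cache_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                cached[row["allele"]] = row["g_domain"]
    # With no cache file, every allele counts as new (the slow first run).
    new = [name for name in alleles if name not in cached]
    for name in new:
        cached[name] = assign_gdomain(alleles[name])
    if new or not cache_path.exists():
        with cache_path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["allele", "g_domain"])
            writer.writerows(sorted(cached.items()))
    return cached, new
```

On later runs only alleles added since the last run reach `assign_gdomain`, which is why the first run is the slow one.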

References:

Several methods for HMMER result parsing were sourced from ANARCI.

Dunbar J and Deane CM. ANARCI: Antigen receptor numbering and receptor classification. Bioinformatics (2016)

bio-arc
Release 0.1.1
