Fasta One-Hot Encoder
Simple Python package to lazily one-hot encode FASTA files using multiple processes, either as single bases or as k-mers of arbitrary length.
Installation
Simply run:
pip install fasta_one_hot_encoder
Examples
Bases
One-hot encode a FASTA file base by base.
from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)
The resulting table should look like:
 | a | c | g | t |
---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 |
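To make the table above concrete, here is a minimal pure-Python sketch of what one-hot encoding single bases means. It is not the package's implementation: the function `one_hot_bases` is a hypothetical helper, and mapping unknown symbols to an all-zero row is an assumption modelled on how scikit-learn's `handle_unknown="ignore"` behaves.

```python
def one_hot_bases(sequence, nucleotides="acgt"):
    """Return one row per base, with a single 1 in the column of that base.

    Hypothetical helper for illustration only. Symbols outside
    `nucleotides` (for instance "n") are assumed to give an all-zero row.
    """
    column = {base: i for i, base in enumerate(nucleotides)}
    rows = []
    for base in sequence.lower():  # mirrors lower=True in the example
        row = [0] * len(nucleotides)
        if base in column:
            row[column[base]] = 1
        rows.append(row)
    return rows


# The three rows shown in the table correspond to the bases g, c, c.
rows = one_hot_bases("GCC")
unknown = one_hot_bases("n")
```

Each inner list corresponds to one row of the table, with columns in the order given by `nucleotides`.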
Kmers
One-hot encode k-mers of a given length.
from fasta_one_hot_encoder import FastaOneHotEncoder

encoder = FastaOneHotEncoder(
    nucleotides="acgt",
    kmers_length=2,
    lower=True,
    sparse=False,
    handle_unknown="ignore"
)
path = "test_data/my_test_fasta.fa"
encoder.transform_to_df(path, verbose=True).to_csv(
    "my_result.csv"
)
The resulting table should look like:
 | aa | ac | ag | at | ca | cc | cg | ct | ga | gc | gg | gt | ta | tc | tg | tt |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
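The k-mer table can be illustrated with a similar pure-Python sketch. Again, this is not the package's implementation: `one_hot_kmers` is a hypothetical helper, and it assumes that k-mers are read with a sliding window that advances one base at a time and that k-mers containing unknown symbols give an all-zero row. Under those assumptions, the sequence g, c, c yields the k-mers gc and cc, which matches the two rows above.

```python
from itertools import product


def one_hot_kmers(sequence, k=2, nucleotides="acgt"):
    """Return (column names, rows) for overlapping k-mers of `sequence`.

    Hypothetical helper for illustration only. Assumes a sliding window
    with step 1; k-mers with symbols outside `nucleotides` are assumed
    to give an all-zero row.
    """
    # All len(nucleotides) ** k k-mers in lexicographic order: aa, ac, ag, ...
    kmers = ["".join(p) for p in product(nucleotides, repeat=k)]
    column = {kmer: i for i, kmer in enumerate(kmers)}
    sequence = sequence.lower()
    rows = []
    for start in range(len(sequence) - k + 1):
        row = [0] * len(kmers)
        index = column.get(sequence[start:start + k])
        if index is not None:
            row[index] = 1
        rows.append(row)
    return kmers, rows


kmers, rows = one_hot_kmers("GCC", k=2)
# Recover the k-mer encoded by each row from the position of its 1.
encoded = [kmers[row.index(1)] for row in rows]
```

Note that the number of columns grows as 4^k for a four-letter alphabet, so a sparse representation may be preferable for larger k.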