parsers module
This module provides utility functions for parsing common bioinformatics file formats: FASTA, MMseqs2 m8, GenBank flat files (GBFF), eggNOG-mapper hits, and hmmscan domtblout output. It also provides a function for writing FASTA files from pandas DataFrames.
- plast.parsers.read_fasta(fasta)
Parse a FASTA file handle or string into a dictionary.
- Parameters:
fasta (str or io.BufferedReader) – FASTA content as a string, or an open file handle.
- Returns:
Dictionary mapping sequence IDs to sequences.
- Return type:
dict[str, str]
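The mapping that read_fasta returns can be illustrated with a minimal, standard-library sketch. The function `parse_fasta` below is a hypothetical stand-in, not the plast implementation. It assumes that the sequence ID is the header text up to the first whitespace, which may differ from how read_fasta derives its keys.

```python
import io


def parse_fasta(fasta):
    """Hypothetical stand-in for read_fasta: accept a string or a file handle."""
    # Normalise the input to a single text string.
    if hasattr(fasta, "read"):
        fasta = fasta.read()
    if isinstance(fasta, bytes):
        fasta = fasta.decode("utf-8")

    records = {}
    seq_id = None
    for line in fasta.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the header text up to the first whitespace.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)
    # Join wrapped sequence lines into one string per record.
    return {key: "".join(parts) for key, parts in records.items()}


example = ">seq1 first protein\nMKT\nAYI\n>seq2\nMSD\n"
from_text = parse_fasta(example)
from_handle = parse_fasta(io.BytesIO(example.encode("utf-8")))
```

Both calls produce the same dict[str, str], with wrapped sequence lines joined into one string per ID.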
- plast.parsers.read_m8(m8_filename, res_per_query=1)
Parse an MMseqs2 easy-search output file in m8 format.
- Parameters:
m8_filename (str) – Path to the m8 file.
res_per_query (int) – Number of results to keep per query (default: 1).
- Returns:
DataFrame indexed by query_id.
- Return type:
pandas.DataFrame
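To make the role of res_per_query concrete, here is a dependency-free sketch that keeps the top hits per query from m8 rows. It uses plain dictionaries instead of a pandas DataFrame, and `top_hits_per_query` is a hypothetical helper, not part of plast. The column names are illustrative labels for the default MMseqs2 easy-search columns, and ranking by bit score is an assumption; read_m8 may name columns and choose hits differently. The example rows are made up.

```python
import csv
import io
from collections import defaultdict

# Illustrative names for the 12 default easy-search columns (assumption:
# read_m8 may use different column names in its DataFrame).
M8_COLUMNS = [
    "query_id", "target_id", "identity", "aln_length", "mismatches",
    "gap_openings", "q_start", "q_end", "t_start", "t_end",
    "evalue", "bit_score",
]


def top_hits_per_query(handle, res_per_query=1):
    """Hypothetical helper: group m8 rows by query and keep the best ones."""
    hits = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        record = dict(zip(M8_COLUMNS, row))
        record["evalue"] = float(record["evalue"])
        record["bit_score"] = float(record["bit_score"])
        hits[record["query_id"]].append(record)
    # Assumption: "best" means highest bit score.
    return {
        query: sorted(rows, key=lambda r: r["bit_score"], reverse=True)[:res_per_query]
        for query, rows in hits.items()
    }


# Made-up rows; the weaker q1 hit is listed first on purpose.
data = (
    "q1\tt2\t0.80\t100\t20\t0\t1\t100\t1\t100\t1e-30\t120\n"
    "q1\tt1\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q2\tt3\t0.60\t90\t36\t0\t1\t90\t5\t94\t1e-10\t60\n"
)
best = top_hits_per_query(io.StringIO(data))
best_two = top_hits_per_query(io.StringIO(data), res_per_query=2)
```

With the default of 1, each query keeps a single hit; with res_per_query=2, q1 keeps both of its hits and q2 keeps its only one.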
- plast.parsers.read_gbff(stream)
Parse a GenBank flat file (GBFF) from a file handle into a DataFrame containing CDS features and their attributes.
- Parameters:
stream (io.BufferedReader) – File handle for the GBFF file.
- Returns:
Tuple of (DataFrame of CDS features, accession, length).
- Return type:
tuple[pandas.DataFrame, str, int]
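The shape of the returned tuple can be sketched with a small standard-library parser. `parse_gbff_cds` is hypothetical and far simpler than a real GenBank parser: it handles a single record, ignores multi-line qualifiers, and returns a list of dictionaries where read_gbff returns a DataFrame. The qualifier names kept here (`locus_tag`, `product`) are whatever the record contains, not a statement about the columns read_gbff produces. The demo record is made up.

```python
import io


def parse_gbff_cds(stream):
    """Hypothetical sketch: return (cds_rows, accession, length) for one record."""
    accession, length = None, None
    cds_rows, current = [], None
    in_features = False
    for raw in stream:
        # Accept both binary and text handles.
        line = (raw.decode("utf-8") if isinstance(raw, bytes) else raw).rstrip("\r\n")
        if line.startswith("LOCUS"):
            # LOCUS <name> <length> bp ...
            length = int(line.split()[2])
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_features = False
            current = None
        elif in_features:
            # Feature keys sit in the first 21 columns; the rest is the value.
            key, value = line[:21].strip(), line[21:].strip()
            if key:
                # A new feature starts; only CDS features are kept.
                current = None
                if key == "CDS":
                    current = {"location": value}
                    cds_rows.append(current)
            elif current is not None and value.startswith("/") and "=" in value:
                # Single-line qualifier such as /locus_tag="..."
                name, _, text = value[1:].partition("=")
                current[name] = text.strip('"')
    return cds_rows, accession, length


# Made-up demo record, built with ljust so the column layout is explicit.
example = "\n".join([
    "LOCUS       DEMO0001                 300 bp    DNA     linear   BCT 01-JAN-2000",
    "ACCESSION   DEMO0001",
    "FEATURES             Location/Qualifiers",
    "     gene".ljust(21) + "190..255",
    " " * 21 + '/locus_tag="demo_0001"',
    "     CDS".ljust(21) + "190..255",
    " " * 21 + '/locus_tag="demo_0001"',
    " " * 21 + '/product="hypothetical protein"',
    "ORIGIN",
    "//",
]) + "\n"

cds_rows, accession, length = parse_gbff_cds(io.StringIO(example))
```

The gene feature is skipped and only the CDS feature is collected, alongside the accession and sequence length taken from the record header.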
- plast.parsers.read_emapper_hits(emapper_hits_file, res_per_query=1)
Parse an eggNOG-mapper output file into a DataFrame.
- Parameters:
emapper_hits_file (str) – Path to the eggNOG-mapper output file.
res_per_query (int) – Number of results to keep per query (default: 1).
- Returns:
DataFrame indexed by query_name.
- Return type:
pandas.DataFrame
- plast.parsers.read_hmmscan_output(input_file_path)
Parse an hmmscan domtblout file into a DataFrame similar to eggNOG-mapper output.
- Parameters:
input_file_path (str) – Path to the hmmscan domtblout file.
- Returns:
DataFrame with best hits per query.
- Return type:
pandas.DataFrame
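As an illustration of best-hit selection from a domtblout file, the sketch below reads the whitespace-delimited rows and keeps one hit per query. `best_domtblout_hits` is a hypothetical helper that returns a plain dictionary, not a DataFrame, and it assumes that "best" means the lowest full-sequence E-value; read_hmmscan_output may use a different criterion and different column names. The example rows are made up.

```python
import io


def best_domtblout_hits(handle):
    """Hypothetical helper: keep the lowest E-value hit for each query."""
    best = {}
    for line in handle:
        # Comment and header lines in domtblout start with '#'.
        if line.startswith("#") or not line.strip():
            continue
        # 22 whitespace-delimited fields, then a free-text description.
        fields = line.split(maxsplit=22)
        hit = {
            "target_name": fields[0],
            "query_name": fields[3],
            "evalue": float(fields[6]),   # full-sequence E-value
            "score": float(fields[7]),    # full-sequence bit score
            "description": fields[22].strip() if len(fields) > 22 else "",
        }
        previous = best.get(hit["query_name"])
        # Assumption: the best hit is the one with the lowest E-value.
        if previous is None or hit["evalue"] < previous["evalue"]:
            best[hit["query_name"]] = hit
    return best


# Made-up rows; the weaker prot1 hit is listed first on purpose.
example = (
    "# target name  accession  tlen  query name  ...\n"
    "DomB - 80 prot1 - 300 1e-05 30.0 0.0 1 1 3e-07 1e-05 29.5 0.0 1 80 150 230 148 232 0.90 demo domain B\n"
    "DomA - 120 prot1 - 300 1e-20 75.0 0.1 1 1 2e-22 1e-20 74.5 0.1 1 120 10 130 8 132 0.95 demo domain A\n"
    "DomB - 80 prot2 - 210 1e-08 40.0 0.0 1 1 2e-10 1e-08 39.5 0.0 1 80 20 100 18 102 0.92 demo domain B\n"
)
best_hits = best_domtblout_hits(io.StringIO(example))
```

Splitting with `maxsplit=22` keeps the trailing description intact even though it contains spaces.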