parsers module

This module provides utility functions for parsing common bioinformatics file formats: FASTA, MMseqs2 m8, GenBank flat files (GBFF), eggNOG-mapper hits, and hmmscan domtblout output. It also includes a function for writing FASTA files from pandas DataFrames.

plast.parsers.read_fasta(fasta)

Parse a FASTA file handle or string into a dictionary.

Parameters: fasta (str or io.BufferedReader) – FASTA content as a string, or an open file handle.
Returns: Dictionary mapping sequence IDs to sequences.
Return type: dict[str, str]
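As an illustration of the ID-to-sequence mapping described above, here is a minimal standalone sketch. It is not the implementation used by plast.parsers.read_fasta: the helper name parse_fasta_text is hypothetical, and the sketch assumes the sequence ID is the first whitespace-delimited token of each header line.

```python
def parse_fasta_text(text):
    """Map sequence IDs to sequences for FASTA content given as a string."""
    chunks = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks[seq_id] = []
        elif seq_id is not None:
            # Sequence lines may wrap, so collect them per record.
            chunks[seq_id].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


example = ">seq1 first protein\nMKT\nAYI\n>seq2\nGGSA\n"
records = parse_fasta_text(example)
print(records)  # {'seq1': 'MKTAYI', 'seq2': 'GGSA'}
```

Wrapped sequence lines are joined into a single string, so each value in the dictionary is the full sequence for that ID.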

plast.parsers.read_m8(m8_filename, res_per_query=1)

Parse an MMseqs2 easy-search output file (m8 format).

Parameters:

m8_filename (str) – Path to the m8 file.
res_per_query (int) – Number of results per query (default: 1).

Returns:

DataFrame indexed by query_id.

Return type:

pandas.DataFrame
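To show what limiting results per query can look like, the following sketch uses only the standard library. It is not the code behind read_m8. It assumes the default 12-column tab-separated layout that MMseqs2 easy-search writes, ranks hits by bit score (an assumption; the library may rank differently or rely on file order), and uses hypothetical names (M8_COLUMNS, top_hits). Only query_id is taken from the documentation above.

```python
from collections import defaultdict

# Default MMseqs2 easy-search columns; names other than query_id are illustrative.
M8_COLUMNS = [
    "query_id", "target_id", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def top_hits(m8_text, res_per_query=1):
    """Return the best `res_per_query` hits for each query, keyed by query_id."""
    hits = defaultdict(list)
    for line in m8_text.splitlines():
        if not line.strip():
            continue
        hit = dict(zip(M8_COLUMNS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bits"] = float(hit["bits"])
        hits[hit["query_id"]].append(hit)
    # Assumption: a higher bit score means a better hit.
    return {
        query: sorted(rows, key=lambda h: h["bits"], reverse=True)[:res_per_query]
        for query, rows in hits.items()
    }


m8_example = (
    "q1\tt2\t0.70\t100\t30\t0\t1\t100\t1\t100\t1e-20\t90\n"
    "q1\tt1\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q2\tt3\t0.60\t80\t32\t0\t1\t80\t1\t80\t1e-10\t60\n"
)
best = top_hits(m8_example, res_per_query=1)
print(best["q1"][0]["target_id"])  # t1
```

A result of this shape could be loaded into a pandas DataFrame and indexed by query_id, which matches the return type documented above.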

plast.parsers.read_gbff(stream)

Parse a GenBank flat file (GBFF) from a file handle into a DataFrame containing CDS features and their attributes.

Parameters: stream (io.BufferedReader) – File handle for the GBFF file.
Returns: Tuple of (DataFrame of CDS features, accession, length).
Return type: tuple[pandas.DataFrame, str, int]
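The accession and length parts of the returned tuple can be illustrated with a small standalone sketch that reads only the record header. This is not how read_gbff is implemented, and it does not extract CDS features. The helper name read_locus_summary and the example record are hypothetical, and the sketch assumes the accession comes from the ACCESSION line and the length from the LOCUS line.

```python
def read_locus_summary(text):
    """Return (accession, length) from the header of a single GenBank record."""
    accession = None
    length = None
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            # The sequence length is the token immediately before "bp".
            length = int(fields[fields.index("bp") - 1])
        elif line.startswith("ACCESSION") and accession is None:
            accession = line.split()[1]
        elif line.startswith("FEATURES"):
            # The header ends where the feature table begins.
            break
    return accession, length


# Made-up record header, used purely for illustration.
gbff_header = (
    "LOCUS       EXAMPLE01               5028 bp    DNA     linear   BCT 01-JAN-2000\n"
    "DEFINITION  Hypothetical example record.\n"
    "ACCESSION   EXAMPLE01\n"
    "VERSION     EXAMPLE01.1\n"
    "FEATURES             Location/Qualifiers\n"
)
summary = read_locus_summary(gbff_header)
print(summary)  # ('EXAMPLE01', 5028)
```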

plast.parsers.read_emapper_hits(emapper_hits_file, res_per_query=1)

Parse an eggNOG-mapper output file into a DataFrame.

Parameters:

emapper_hits_file (str) – Path to the eggNOG-mapper output file.
res_per_query (int) – Number of results per query (default: 1).

Returns:

DataFrame indexed by query_name.

Return type:

pandas.DataFrame

plast.parsers.read_hmmscan_output(input_file_path)

Parse an hmmscan domtblout file into a DataFrame similar to eggNOG-mapper output.

Parameters: input_file_path (str) – Path to the hmmscan domtblout file.
Returns: DataFrame with best hits per query.
Return type: pandas.DataFrame
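The idea of keeping one best hit per query can be sketched with the standard library alone. This is an illustration, not the implementation of read_hmmscan_output. It assumes the standard HMMER domtblout layout (22 whitespace-separated columns followed by a free-text description) and picks the hit with the highest full-sequence score, which is an assumption about what "best" means here. The function name best_domtblout_hits and the example rows are hypothetical.

```python
def best_domtblout_hits(text):
    """Return the highest-scoring target per query from domtblout content."""
    best = {}
    for line in text.splitlines():
        # Comment lines start with '#'; blank lines carry no data.
        if not line.strip() or line.startswith("#"):
            continue
        # The last column (description) may contain spaces, so cap the split.
        fields = line.split(None, 22)
        target, query = fields[0], fields[3]
        evalue, score = float(fields[6]), float(fields[7])
        # Assumption: "best" means the highest full-sequence score.
        if query not in best or score > best[query]["score"]:
            best[query] = {"target": target, "evalue": evalue, "score": score}
    return best


# Made-up rows in domtblout layout, used purely for illustration.
domtblout_example = (
    "# target name  accession  tlen  query name  accession  qlen  ...\n"
    "PF00001.1 - 100 protA - 250 1e-30 105.2 0.1 1 1 2e-33 1e-30 104.9 0.1 "
    "1 100 5 104 3 106 0.98 Example domain one\n"
    "PF00002.1 - 120 protA - 250 1e-10 40.0 0.0 1 1 2e-13 1e-10 39.5 0.0 "
    "1 120 110 229 108 231 0.95 Example domain two\n"
    "PF00003.1 - 80 protB - 90 5e-05 22.3 0.2 1 1 1e-07 5e-05 21.9 0.2 "
    "1 80 3 82 1 85 0.90 Example domain three\n"
)
best_hits = best_domtblout_hits(domtblout_example)
print(best_hits["protA"]["target"])  # PF00001.1
```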