Submodules
core.workflows
- bacpipe.core.workflows.cross_model_evaluation(dim_reduction_model, evaluation_task, models, **kwargs)[source]
Generate plots to compare models by the specified tasks.
- Parameters:
dim_reduction_model (str) – name of dimensionality reduction model
evaluation_task (list) – tasks to evaluate models by
models (list) – embedding models
- bacpipe.core.workflows.ensure_models_exist(model_base_path, model_names, repo_id='vskode/bacpipe_models')[source]
Ensure that the model checkpoints for the selected models are available locally. Downloads from Hugging Face Hub if missing.
- Parameters:
model_base_path (Path) – Local base directory where the checkpoints should be stored.
model_names (str or list) – Model name or list of model names to run
repo_id (str, optional) – Hugging Face Hub repo ID, by default “vskode/bacpipe_models”
- Returns:
path to saved models
- Return type:
str
- bacpipe.core.workflows.evaluation_with_settings_already_exists(audio_dir, dim_reduction_model, models, testing=False, **kwargs)[source]
Check if the evaluation with the specified settings already exists. The function checks if the embeddings, dimensionality reduction, probing and clustering evaluation results already exist in the specified directory. If any of these results do not exist, the function returns False. Otherwise, it returns True.
- Parameters:
audio_dir (string) – full path to audio files
dim_reduction_model (string) – name of the dimensionality reduction model to be used
models (list) – embedding models
- Returns:
True if the evaluation with the specified settings already exists, False otherwise
- Return type:
bool
- bacpipe.core.workflows.generate_embeddings(avoid_pipelined_gpu_inference=False, **kwargs)[source]
Run the embedding generation pipeline including classification using the pretrained classifier (if included). All of this will be done for one model. The predefined folder structure will be created so that subsequent processing runs will be very fast, as they then only load the data. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.
- Parameters:
avoid_pipelined_gpu_inference (bool, optional) – set to True to avoid multiprocessing, by default False
- Returns:
loader object to access embeddings and classifier predictions
- Return type:
- Raises:
ValueError – if no model name is provided
- bacpipe.core.workflows.get_model_names(models, audio_dir, main_results_dir, embed_parent_dir, already_computed=False, **kwargs)[source]
Get the names of the models used for processing. This is either done by using already computed embeddings or by using the selected models from the config file. If already computed embeddings are used, the model names are extracted from the directory structure.
- Parameters:
models (list) – list of embedding models
audio_dir (string) – full path to audio files
main_results_dir (string) – top level directory for the results of the embedding evaluation
embed_parent_dir (string) – parent directory for the embeddings
already_computed (bool, optional) – ignore the model list and use only models whose embeddings have already been computed and saved in the results dir, by default False
- Raises:
ValueError – If already computed embeddings are used, but no embeddings are found in the specified directory.
- bacpipe.core.workflows.model_specific_evaluation(loader_dict, evaluation_task, probe_configs, models, dim_reduction_model=False, **kwargs)[source]
Perform evaluation of the embeddings using the specified evaluation task. The evaluation task can be either probing or clustering. The evaluation is performed using the functions from the probing and clustering modules. The results of the evaluation are saved in the directory specified by the audio_dir parameter.
- Parameters:
loader_dict (dict) – dictionary containing the loader objects for each model
evaluation_task (string) – name of the evaluation task to be performed.
probe_configs (dict) – dictionary containing the configuration for the probing tasks. The configurations are specified in the bacpipe/settings.yaml file.
models (list) – embedding models
- bacpipe.core.workflows.play(bool_save_logs=False, **kwargs)[source]
Play the bacpipe! The pipeline will run using the models specified in bacpipe.config.models and generate results in the directory bacpipe.settings.results_dir. For more details see the ReadMe file on the repository page https://github.com/bioacoustic-ai/bacpipe or the documentation under https://bacpipe.readthedocs.io/en/latest/.
- Parameters:
bool_save_logs (bool, optional) – Save logs, config and settings file. This is important if you run into a bug: sharing these files will be very helpful for finding the source of the problem. By default False
- Raises:
FileNotFoundError – If no audio files are found we can’t compute any embeddings. So make sure the path is correct :)
- bacpipe.core.workflows.run_pipeline_for_models(models, audio_dir, dim_reduction_model, **kwargs)[source]
Generate embeddings for each model in the list of model names. The embeddings are generated using the generate_embeddings function from the generate_embeddings module. The embeddings are saved in the directory specified by the audio_dir parameter. The function returns a dictionary containing the loader objects for each model, by which metadata and paths are stored. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.
code example:
```
import bacpipe

loader = bacpipe.run_pipeline_for_models(
    models=['birdnet', 'naturebeats'],
    audio_dir='bacpipe/tests/test_data',
    dim_reduction_model='umap',
)
# This call initiates the embedding generation process. It checks whether
# embeddings already exist for the combination of each model and the dataset
# and, if so, it will be ready to load them. The loader keys are the model
# names and the values are the loader objects for each model. Each object
# contains all the information on the generated embeddings. To access them:
loader['birdnet'].embeddings()
# This gives you a dictionary with the keys corresponding to embedding files
# and the values corresponding to the embeddings as numpy arrays.

loader['birdnet'].metadata_dict
# This gives you a dictionary overview of:
# - where the audio data came from,
# - where the embeddings were saved,
# - all the audio files,
# - the embedding size of the model,
# - the audio file lengths,
# - the number of embeddings for each audio file,
# - the sample rate,
# - the number of samples per window,
# - and the total length of the processed dataset in seconds.
# This dictionary is also saved as a yaml file in the directory of the embeddings.
```
- Parameters:
models (list) – embedding models
audio_dir (string) – full path to audio files
dim_reduction_model (string) – name of the dimensionality reduction model to be used for the embeddings. If “None” is selected, no dimensionality reduction is performed.
- Returns:
loader_dict – dictionary containing the loader objects for each model
- Return type:
dict
- bacpipe.core.workflows.run_pipeline_for_single_model(model_name, audio_dir, dim_reduction_model='None', check_if_already_processed=True, check_if_already_dim_reduced=True, testing=False, **kwargs)[source]
Run the bacpipe pipeline, including embedding generation, classification using the pretrained classifier (if included), dimensionality reduction (if passed), and plotting of visualization to files. All of this will be done for one model. The predefined folder structure will be created so that subsequent processing runs will be very fast, as they then only load the data. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.
- Parameters:
model_name (string) – model name
audio_dir (str) – path to audio data
dim_reduction_model (str, optional) – name of dimensionality reduction model, by default “None”
check_if_already_processed (bool, optional) – set to False if you want to force recomputing of embeddings, by default True
check_if_already_dim_reduced (bool, optional) – set to False if you want to force recomputing of dimensionality reduced embeddings, by default True
overwrite (bool, optional) – set to True if you want default labels and ground truth labels to be processed again, by default False
testing (bool, optional) – set to True for testing, by default False
- Returns:
object to access processed embeddings and classifier predictions
- Return type:
core.experiment_manager
- class bacpipe.core.experiment_manager.Loader(audio_dir, model_name=None, check_if_combination_exists=True, dim_reduction_model=False, use_folder_structure=False, testing=False, **kwargs)[source]
Bases:
object
Initiate the generation of embeddings by creating a Loader object. This object handles paths for loading and saving data. During this process it collects metadata, which can be accessed as an attribute and will be saved after a successful run. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.
- __init__(audio_dir, model_name=None, check_if_combination_exists=True, dim_reduction_model=False, use_folder_structure=False, testing=False, **kwargs)[source]
Initiate the generation of embeddings by creating a Loader object. This object handles paths for loading and saving data. During this process it collects metadata, which can be accessed as an attribute and will be saved after a successful run. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.
- Parameters:
audio_dir (string or pathlib.Path) – path to audio data
model_name (string, optional) – Name of the model that should be used, by default None
check_if_combination_exists (bool, optional) – If false new embeddings are created and the checking is skipped, by default True
dim_reduction_model (bool or str, optional) – Either False if primary embeddings are created, or the name of the dimensionality reduction model if dimensionality reduction should be performed, by default False
use_folder_structure (bool, optional) – If True data will be saved and the output folder structure will be created, by default False
testing (bool, optional) – set to True for testing, by default False
- embeddings(return_type='dict')[source]
Load and return processed embeddings. This method can only be used to return already computed embeddings. Embeddings can be returned as np.array (array) or as dictionary (dict), in which case the keys are the names of the corresponding embedding files. In case of the array, all embeddings are concatenated so that the first dimension corresponds to the timestamp and the second dimension to the embedding dimension.
- Parameters:
return_type (str, optional) – return type either array or dict, by default ‘dict’
- Returns:
depending on return_type argument
- Return type:
array or dict
- static get_audio_files(audio_dir, audio_suffixes=['.wav', '.WAV', '.aif', '.mp3', '.MP3', '.flac', '.ogg'], return_type='pathlib.Path')[source]
Collect all audio files in a given directory that have file endings that can be processed by bacpipe.
- Parameters:
audio_dir (str) – path to audio data
audio_suffixes (list, optional) – list of audio suffixes, by default settings.audio_suffixes
return_type (str, optional) – specify if list should be returned as list of strings or list of pathlib.Path objects which comes in handy for some downstream processing, by default ‘pathlib.Path’
- Returns:
list of audio files
- Return type:
list
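As a rough illustration of what collecting audio files by suffix can look like, the following sketch uses only pathlib. The helper name collect_audio_files and the recursive search through subdirectories are assumptions made for this example and are not taken from the bacpipe source; the suffix list mirrors the default shown in the signature above.

```python
from pathlib import Path

# Suffix list copied from the default in the signature above.
AUDIO_SUFFIXES = [".wav", ".WAV", ".aif", ".mp3", ".MP3", ".flac", ".ogg"]


def collect_audio_files(audio_dir, audio_suffixes=AUDIO_SUFFIXES, return_type="pathlib.Path"):
    """Hypothetical helper: collect files below audio_dir with a matching suffix."""
    # Assumption: subdirectories are searched recursively.
    files = sorted(
        p for p in Path(audio_dir).rglob("*")
        if p.is_file() and p.suffix in audio_suffixes
    )
    if return_type == "str":
        return [str(p) for p in files]
    return files
```

Returning pathlib.Path objects keeps attributes such as .stem and .parent available for downstream processing, while the str variant is convenient for serialization.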
- predictions(return_type='dict')[source]
Load and return classifier predictions. This method can only be used for already processed predictions. Predictions that have been processed will be returned based on the specified return_type:
array – an np.array in which all predictions are concatenated, together with a dictionary mapping each index to the corresponding label.
dict – a dictionary in which the keys are the audio file names corresponding to the annotations and the values are np.arrays with all annotations of that file.
dataframe – a dataframe with columns for each species that was active and columns for filename, start and end times.
- Parameters:
return_type (str, optional) – return either array, dict or dataframe, by default ‘dict’
- Returns:
tuple of (np.array, dict) for array, tuple of (dict, dict) for dict, or pd.DataFrame for dataframe
- Return type:
tuple or pd.DataFrame
core.audio_processor
- class bacpipe.core.audio_processor.AudioHandler(model, padding, audio_dir, bool_slowdown=False, slowdown_rate=None, **kwargs)[source]
Bases:
object
Helper class for all methods related to loading and padding audio.
- __init__(model, padding, audio_dir, bool_slowdown=False, slowdown_rate=None, **kwargs)[source]
Helper class for all methods related to loading and padding audio.
- Parameters:
model (Model object) – has attributes for all the model characteristics like sample rate, segment length etc. as well as the methods to run the model
padding (str) – padding function to use for where padding is necessary
audio_dir (pathlib.Path object) – path to audio dir
- prepare_audio(sample)[source]
Use bacpipe pipeline to load audio file, window it according to model specific window length and preprocess the data, ready for batch inference computation. Also log file length and shape for metadata files.
- Parameters:
sample (pathlib.Path or str) – path to audio file
- Returns:
audio frames preprocessed with model specific preprocessing
- Return type:
torch.Tensor
model_pipelines.runner
- class bacpipe.model_pipelines.runner.Classifier(model, model_name, audio_dir, main_results_dir, classifier_threshold, use_folder_structure=True, save_raven_tables=False, **kwargs)[source]
Bases:
object
- __init__(model, model_name, audio_dir, main_results_dir, classifier_threshold, use_folder_structure=True, save_raven_tables=False, **kwargs)[source]
Class to handle all tasks surrounding classification. Both generating the classifications from embeddings, as well as managing them, collecting them in arrays and creating dataframes and annotation tables from them.
- Parameters:
model (Model object) – has attributes for all the model characteristics like sample rate, segment length etc. as well as the methods to run the model
model_name (str) – name of the model
classifier_threshold (float, optional) – Value under which class predictions are discarded, by default None
- static filter_top_k_classifications(probabilities, class_names, class_indices, class_time_bins, k=50)[source]
Generate a dictionary with the top k classes. Limiting the number of classes to k prevents this step from taking too long, while still producing a dictionary that can be saved as a .json file to quickly get an overview of the species that are well represented within an audio file.
- Parameters:
probabilities (np.array) – Probabilities for each class
class_names (list) – class names
class_indices (np.array) – class indices exceeding the threshold
class_time_bins (np.array) – time bin indices exceeding the threshold
k (int, optional) – number of classes to save in the dict. keep this below 100 otherwise the operation will start slowing the process down a lot, by default 50
- Returns:
dictionary of top k classes with time bin indices exceeding threshold
- Return type:
dict
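To make the idea of a top-k dictionary concrete, here is a minimal sketch using plain Python lists instead of np.arrays. The function name top_k_classes is hypothetical, and ranking classes by their number of above-threshold time bins is an assumption for this example; the actual method may rank differently, for instance by probability.

```python
from collections import defaultdict


def top_k_classes(class_names, class_indices, class_time_bins, k=50):
    """Hypothetical sketch: keep the k classes with the most above-threshold bins."""
    bins_per_class = defaultdict(list)
    # class_indices[i] and class_time_bins[i] together describe one detection.
    for class_idx, time_bin in zip(class_indices, class_time_bins):
        bins_per_class[class_names[class_idx]].append(int(time_bin))
    # Assumption: "top" means most detections; ties keep first-seen order.
    ranked = sorted(bins_per_class.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ranked[:k])
```

Because the result only contains strings and lists of ints, it can be written to disk directly with json.dump.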
- class bacpipe.model_pipelines.runner.Embedder(model_name, loader=None, CustomModel=None, dim_reduction_model=False, **kwargs)[source]
Bases:
AudioHandler
This class takes care of loading the specified model and using it to process the audio data to create embeddings. This class is also used to create dimensionality reductions from embeddings. At the end of instantiation, the selected model is loaded and associated with the specified device. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.
- Parameters:
AudioHandler (class) – Helper class that handles loading of audio
- __init__(model_name, loader=None, CustomModel=None, dim_reduction_model=False, **kwargs)[source]
This class takes care of loading the specified model and using it to process the audio data to create embeddings. This class is also used to create dimensionality reductions from embeddings. At the end of instantiation, the selected model is loaded and associated with the specified device. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.
- Parameters:
model_name (str) – name of selected embedding model
loader (Loader object) – Object that has all the necessary path information and methods to load and save all the processed data
CustomModel (class, optional) – custom model class to use for processing, by default None
dim_reduction_model (bool, optional) – Can be bool or the string corresponding to the dimensionality reduction model, by default False
- embeddings_using_multithreading(array_of_audios)[source]
Generate embeddings for all files in a pipelined manner:
- Producer thread loads and preprocesses audio
- Consumer (main thread) embeds audio while producer prepares next batch
Ensures metadata and embeddings are written exactly like in the sequential version.
- Parameters:
fileloader_obj (Loader object) – contains all metadata of a model specific embedding creation session
- Returns:
updated object with metadata on embedding creation session
- Return type:
Loader object
- get_embeddings_for_audio(sample)[source]
Create a dataloader for the processed audio frames and run batch inference. Both are methods of the self.model class, which can be found in the utils.py file.
- Parameters:
sample (torch.Tensor) – preprocessed audio frames
- Returns:
embeddings from model
- Return type:
np.array
- get_embeddings_from_model(sample)[source]
Run the full embedding generation pipeline, either for generating embeddings from audio data or for generating dimensionality reductions from embedding data. Depending on the case, sample can be an embedding array or an audio file path.
- Parameters:
sample (np.array or string-like) – embedding array or path to audio file
- Returns:
embeddings
- Return type:
np.array
- run_inference_pipeline_using_multithreading()[source]
Generate embeddings for all files in a pipelined manner:
- Producer thread loads and preprocesses audio
- Consumer (main thread) embeds audio while producer prepares next batch
Ensures metadata and embeddings are written exactly like in the sequential version.
- Parameters:
fileloader_obj (Loader object) – contains all metadata of a model specific embedding creation session
- Returns:
updated object with metadata on embedding creation session
- Return type:
Loader object
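The producer and consumer arrangement described for the pipelined methods can be sketched with the standard library alone. This is an illustration under stated assumptions, not the bacpipe implementation: run_pipelined, load_fn, embed_fn and max_prefetch are hypothetical names, and a bounded queue.Queue is just one way to hand work from the loading thread to the main thread.

```python
import queue
import threading


def run_pipelined(files, load_fn, embed_fn, max_prefetch=2):
    """Hypothetical sketch: overlap loading (producer) with embedding (consumer)."""
    buffer = queue.Queue(maxsize=max_prefetch)
    sentinel = object()  # marks the end of the stream

    def producer():
        try:
            for file in files:
                # put() blocks while the buffer is full, limiting memory use.
                buffer.put((file, load_fn(file)))
        finally:
            buffer.put(sentinel)

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()

    results = {}
    while True:
        item = buffer.get()
        if item is sentinel:
            break
        file, frames = item
        # The producer keeps loading the next file while this call runs.
        results[file] = embed_fn(frames)
    thread.join()
    return results
```

Since a single producer feeds a FIFO queue, results are collected in the same order as in a sequential loop, which is what allows metadata and embeddings to be written identically in both modes.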
model_pipelines.model_utils
- class bacpipe.model_pipelines.model_utils.ModelBaseClass(sr, segment_length, model_name, device=None, model_base_path=None, global_batch_size=None, dim_reduction_model=False, **kwargs)[source]
Bases:
object
- __init__(sr, segment_length, model_name, device=None, model_base_path=None, global_batch_size=None, dim_reduction_model=False, **kwargs)[source]
This base class defines key methods and attributes for all feature extractors to ensure that we can use the same processing pipeline to generate embeddings. The idea is to:
1. Initialize the model with prepare_inference, thereby loading the model and moving it onto the selected device.
2. Load and resample audio to the sample rate required by the model.
3. Window the audio into segments corresponding to the required input segment length.
4. Calculate spectrograms (if the model architecture is accessible) to batch preprocess the audio and potentially be able to rebuild the spectrograms in retrospect for investigation.
5. Initialize a torch dataloader object based on the model-specific audio loading characteristics to speed up the inference process and the looping through the segments.
6. Perform batch inference.
If ‘cuda’ has been selected as device, a threading approach is used to load data in parallel while performing inference. The embeddings are returned.
- Parameters:
sr (int) – sample rate
segment_length (int) – segment length in samples
device (str) – ‘cpu’ or ‘cuda’
model_base_path (pathlib.Path) – path to main model checkpoint dir
global_batch_size (int) – global batch size that is then used in conjunction with the segment length to calculate a model-specific batch size that results in approximately equal batches for different models
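The windowing and batching steps of this pipeline can be illustrated with plain Python lists. The sketch below is an assumption-laden simplification: window_audio and batches are hypothetical helpers, zero padding of the final segment is assumed, and the real pipeline operates on tensors through a torch dataloader.

```python
def window_audio(samples, segment_length):
    """Hypothetical sketch: split samples into fixed-length segments."""
    segments = []
    for start in range(0, len(samples), segment_length):
        segment = list(samples[start:start + segment_length])
        # Assumption: a short final segment is zero-padded to full length.
        segment += [0.0] * (segment_length - len(segment))
        segments.append(segment)
    return segments


def batches(segments, batch_size):
    """Hypothetical sketch: yield consecutive groups of segments for inference."""
    for start in range(0, len(segments), batch_size):
        yield segments[start:start + batch_size]
```

With a sample rate sr and a segment_length given in samples, each segment covers segment_length / sr seconds, so the number of segments per file determines the number of embeddings per file.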
embedding_evaluation
- class bacpipe.embedding_evaluation.label_embeddings.DefaultLabels(paths, model, default_label_keys, **kwargs)[source]
Bases:
object
- __init__(paths, model, default_label_keys, **kwargs)[source]
Class to generate default labels based on audio files and number of generated embeddings per file.
- Parameters:
paths (SimpleNamespace) – convenient object for path handling
model (str) – model name
default_label_keys (list) – list of default labels, see settings.yaml
- Raises:
ValueError – if no embeddings were found
- bacpipe.embedding_evaluation.label_embeddings.build_ground_truth_labels_by_file(paths, ind, model, num_embeds, segment_s, metadata, all_labels, label_df=None, label_idx_dict=None, label_column=None, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.collect_ground_truth_labels(paths, files, model, segment_s, metadata, label_df, label_idx_dict, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.concatenate_annotation_files(annotation_src, appendix='.txt', acodet_annotations=False, start_col_name='start', end_col_name='end', lab_col_name='label')[source]
- bacpipe.embedding_evaluation.label_embeddings.create_Raven_annotation_table(df, label_column, high_freq=1000)[source]
- bacpipe.embedding_evaluation.label_embeddings.create_default_labels(audio_dir=None, model=None, paths=None, overwrite=True, **kwargs)[source]
Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.
- Parameters:
audio_dir (str, optional) – path to audio data, by default None
model (str, optional) – model name, by default None
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
overwrite (bool, optional) – if True labels are overwritten, by default True
- Returns:
dictionary with default labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.ensure_audio_files(found_audio_files, annotated_audio_files, audio_dir)[source]
- bacpipe.embedding_evaluation.label_embeddings.fill_all_labels_array(file_labels, all_labels)[source]
- bacpipe.embedding_evaluation.label_embeddings.filter_annotations_by_minimum_number_of_occurrences(df, min_occurrences=150, min_duration=0.65)[source]
Filter the annotations to have at least a minimum number of occurrences and a minimum duration.
- Parameters:
df (pd.DataFrame) – DataFrame containing the annotations.
min_occurrences (int, optional) – Minimum number of occurrences for each label, by default 150.
min_duration (float, optional) – Minimum duration for each label, by default 0.65.
- Returns:
Filtered DataFrame containing the annotations.
- Return type:
pd.DataFrame
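The two filtering criteria can be sketched without pandas on a list of dictionaries. The name filter_annotations is hypothetical, the column names follow the annotation format described for ground_truth_by_model, and the order of operations (duration first, then occurrence count) is an assumption for this example.

```python
from collections import Counter


def filter_annotations(rows, min_occurrences=150, min_duration=0.65, label_key="label:species"):
    """Hypothetical sketch on a list of dicts instead of a pd.DataFrame."""
    # Assumption: short annotations are dropped before occurrences are counted.
    long_enough = [r for r in rows if r["end"] - r["start"] >= min_duration]
    counts = Counter(r[label_key] for r in long_enough)
    return [r for r in long_enough if counts[r[label_key]] >= min_occurrences]
```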
- bacpipe.embedding_evaluation.label_embeddings.filter_df_by_filename(df_to_filer, file_name, file_name_column='audiofilename', model=None)[source]
- bacpipe.embedding_evaluation.label_embeddings.fit_labels_to_embedding_timestamps(df, label_idx_dict, num_embeds, segment_s, label_column=None, single_label=True, min_annotation_length=0.65, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_default_labels(model_name, **kwargs)[source]
Return dictionary of the default labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input. The default labels are calculated based on the default labels specified in the settings.yaml file.
- Parameters:
model_name (str) – model name
- Returns:
dictionary of default labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.get_dim_reduc_path_func(model_name, dim_reduction_model='umap', **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_dt_filename(file)[source]
Return the timestamp within a filename as a datetime object based on the most common naming conventions in bioacoustics. This is not bulletproof, but it works with the vast majority of naming conventions for files.
- Parameters:
file (str) – filename as string
- Returns:
datetime object of the filename
- Return type:
dt.datetime object
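A minimal sketch of extracting a timestamp from a filename with a regular expression is shown below. The function name parse_datetime_from_filename and the two supported patterns (YYYYMMDD_HHMMSS and YYYY-MM-DD_HH-MM-SS) are assumptions chosen for illustration; the conventions that get_dt_filename actually recognizes are not listed here.

```python
import re
from datetime import datetime

# Assumption: timestamps look like YYYYMMDD_HHMMSS or YYYY-MM-DD_HH-MM-SS.
_PATTERN = re.compile(r"(\d{4})-?(\d{2})-?(\d{2})[_T-]?(\d{2})[-:]?(\d{2})[-:]?(\d{2})")


def parse_datetime_from_filename(file):
    """Hypothetical sketch: return the first timestamp found in a filename."""
    match = _PATTERN.search(file)
    if match is None:
        raise ValueError(f"no timestamp found in {file!r}")
    return datetime(*(int(part) for part in match.groups()))
```

Like any pattern-based approach, this can be misled by other long digit runs in a filename, such as device serial numbers.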
- bacpipe.embedding_evaluation.label_embeddings.get_files_if_no_embeds(audio_dir, model, label_df=None)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_ground_truth(model_name)[source]
Return dictionary of the ground truth labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input.
- Parameters:
model_name (str) – model name
- Returns:
dictionary of ground truth labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.ground_truth_by_model(model, audio_dir, label_df=None, label_idx_dict=None, label_column='label:species', paths=None, annotations_filename='annotations.csv', overwrite=True, single_label=True, bool_filter_labels=False, **kwargs)[source]
Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input lengths. This way the embeddings and ground truth labels have the same lengths, and can be used for downstream evaluation like probing or clustering. This function supports single or multi-label generation of ground truth labels. A dictionary is created with a numpy array for the labels and a dictionary to associate the int values with the corresponding label class. The labels are processed based on a single annotation file which requires predefined column names: audiofilename, start, end, label:species (species can be replaced with other things but the label: needs to be consistent). See ‘bacpipe/tests/test_data/annotations.csv’ for an example. After processing the ground truth, the dictionary is saved as a numpy file and upon reexecution is simply loaded for shorter runtime.
- Parameters:
model (str) – model name
audio_dir (str) – path to audio data
label_df (pandas.DataFrame, optional) – ground truth annotations in specified format, by default None
label_idx_dict (dict, optional) – link between int values and class labels can be auto generated, by default None
label_column (str, optional) – name of column in annotation file, by default ‘label:species’
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
annotations_filename (str, optional) – path to annotations csv file, by default “annotations.csv”
overwrite (bool, optional) – If True, the dict will be generated again and saved rather than loaded from a file if already processed, by default True
single_label (bool, optional) – set False if you want multi-label, by default True
bool_filter_labels (bool, optional) – set to True, if you want a minimum number of occurrence for labels to be included in the ground truth. See settings file for more options and descriptions, by default False
- Returns:
dictionary of ground truth labels with numpy array and dict to link int values to class labels
- Return type:
dict
- Raises:
ValueError – if ground truth file is not found
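Mapping annotations onto model timestamps can be pictured as follows for the single-label case. This is a sketch under assumptions, not the library's algorithm: labels_for_segments is a hypothetical name, -1 is assumed to mark windows without a label, and a window is assumed to receive a label when its overlap with an annotation reaches min_annotation_length seconds.

```python
def labels_for_segments(annotations, label_idx_dict, num_embeds, segment_s, min_annotation_length=0.65):
    """Hypothetical sketch: one integer label per embedding window."""
    labels = [-1] * num_embeds  # Assumption: -1 marks windows without a label.
    for start, end, label in annotations:
        for i in range(num_embeds):
            win_start, win_end = i * segment_s, (i + 1) * segment_s
            overlap = min(end, win_end) - max(start, win_start)
            # Assumption: the window takes the label if the overlap is long enough.
            if overlap >= min_annotation_length:
                labels[i] = label_idx_dict[label]
    return labels
```

The returned list has exactly num_embeds entries, so it lines up one-to-one with the embeddings of the file and can be used directly for probing or clustering.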
- bacpipe.embedding_evaluation.label_embeddings.load_labels_and_build_dict(paths, annotations_filename, audio_dir, bool_filter_labels=True, min_label_occurrences=150, main_label_column=None, testing=False, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.make_set_paths_func(audio_dir, main_results_dir=None, dim_reduc_parent_dir='dim_reduced_embeddings', testing=False, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.model_specific_embedding_path(path, model, dim_reduction_model=None, **kwargs)[source]
Get the path to the model specific embeddings. This function searches for the most recent directory containing the embeddings for the specified model and dimensionality reduction model.
- Parameters:
path (Path) – Path to the main embeddings directory.
model (str) – Name of the model used for embedding.
dim_reduction_model (str, optional) – Name of the dimensionality reduction model used, by default None.
kwargs (dict) – Additional keyword arguments.
- Returns:
Path to the model specific embeddings directory.
- Return type:
Path
- Raises:
ValueError – If no embeddings are found for the specified model.
probing
- bacpipe.embedding_evaluation.probing.probe.embeds_array_without_noise(embeds, ground_truth, label_column, **kwargs)[source]
- bacpipe.embedding_evaluation.probing.probe.probing_pipeline(model_name, ground_truth, embeds, paths=None, name='linear', overwrite=True, label_column='species', **kwargs)[source]
Probing pipeline consisting of building the classifier, evaluating it and saving metrics and plots of performance.
- Parameters:
paths (SimpleNamespace object) – dict with attributes corresponding to paths for loading and saving
embeds (np.array) – embeddings
name (string) – Type of Probing
dataset_csv_path (string) – name of Probing dataframe as specified in settings.yaml
overwrite (bool) – whether to overwrite existing probing results, by default True
- bacpipe.embedding_evaluation.probing.probe.run_probe_inference(model, linear_probe, threshold, embeds=None, return_binary_presence=True, callbacks=None)[source]
- bacpipe.embedding_evaluation.probing.inference_probe.prepare_probe_inference(model, probe_path='')[source]
Load a linear probe that was previously trained and saved. The probe is loaded and the state_dict of the model is loaded so that the probe is ready and in the exact same state as after training.
- Parameters:
model (str) – model name of backbone
probe_path (str, optional) – path to probe, will default to the standard bacpipe path, by default ‘’
- Returns:
torch model object – linear probe model
dict – dictionary to associate the columns of the generated predictions array with the corresponding class label
- bacpipe.embedding_evaluation.probing.inference_probe.run_probe_inference(model, linear_probe, threshold, embeds=None, return_binary_presence=True, callbacks=None, device='cpu')[source]
Apply a previously trained linear probe to data. This requires either that the embeddings were already created using the backbone and saved using the bacpipe folder structure, or that the embeddings are directly passed to this function. See the example notebooks for an example use case. This function then loads the embeddings and applies the linear probe to classify the data.
- Parameters:
model (str) – model name
linear_probe (torch model) – linear probe torch model object
threshold (float) – float value to process the predictions
embeds (torch.Tensor, optional) – embeddings array, by default None
return_binary_presence (bool, optional) – if true a binary presence array is returned, by default True
callbacks (function, optional) – use to have custom progress bars increment, by default None
device (str, optional) – select device to process the probe, by default ‘cpu’
- Returns:
generated probe predictions
- Return type:
np.ndarray
- bacpipe.embedding_evaluation.probing.evaluate_probe.accuracy_per_class(y_true, y_pred, label2index, items_per_class)[source]
Accuracy per class
- Parameters:
y_true (list) – ground truth
y_pred (list) – predictions
label2index (dict) – link labels to ints
items_per_class (list) – number of items per class
- Returns:
classwise accuracy
- Return type:
dict
- bacpipe.embedding_evaluation.probing.evaluate_probe.auc(y_true, probability_scores)[source]
Compute the AUC
- bacpipe.embedding_evaluation.probing.evaluate_probe.compute_task_metrics(y_pred, y_true, probability_scores, label2index)[source]
Compute the evaluation metrics
- bacpipe.embedding_evaluation.probing.evaluate_probe.eval_probe(probe, embeds, df, label2index, device='cuda:0', config='linear', paths=None, save_probe=False, **kwargs)[source]
Perform inference using probe.
- Parameters:
probe (object) – trained classification object
test_dataloader (DataLoader object) – dataset iterator
device (str, optional) – ‘cpu’ or ‘cuda’, by default “cuda:0”
config (str, optional) – type of classification, by default “linear”
- Returns:
list – prediction values in ints corresponding to labels
list – ground truth values in ints
np.array – probabilities for each class and each embedding
- bacpipe.embedding_evaluation.probing.evaluate_probe.macro_accuracy(y_true, y_pred)[source]
Compute macro accuracy.
- Parameters:
y_true (list) – ground truth
y_pred (list) – predictions
- Returns:
balanced accuracy score
- Return type:
float
- bacpipe.embedding_evaluation.probing.evaluate_probe.macro_f1(y_true, y_pred)[source]
Compute the macro f1 score
- bacpipe.embedding_evaluation.probing.evaluate_probe.micro_f1(y_true, y_pred)[source]
Compute the micro f1 score
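The difference between the macro and micro F1 scores can be sketched with the standard definitions: macro averages the per-class F1 values, while micro pools true positives, false positives and false negatives over all classes first. The helper `f1_scores` below is hypothetical and written without dependencies; whether bacpipe computes these scores itself or delegates to a library is not stated here.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label integer predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class_f1) / len(per_class_f1)
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    return macro, micro


# Made-up, imbalanced example: four items of class 0, two of class 1.
example_macro, example_micro = f1_scores(
    y_true=[0, 0, 0, 0, 1, 1],
    y_pred=[0, 0, 0, 1, 1, 0],
)
```

In the single-label case the micro F1 equals plain accuracy, which is why the macro score is usually the more informative one for imbalanced classes.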
- bacpipe.embedding_evaluation.probing.evaluate_probe.save_probe_results(paths, config, metrics, **kwargs)[source]
Save a dict with all performance metrics.
- Parameters:
paths (SimpleNamespace object) – dict with attributes of paths for loading and saving
config (string) – type of classification (linear or knn)
metrics (dict) – performance
- class bacpipe.embedding_evaluation.probing.train_probe.KNNProbe(n_neighbors=15, testing=False, **kwargs)[source]
Bases:
Module
- class bacpipe.embedding_evaluation.probing.train_probe.LinearProbe(in_dim, out_dim, device='cpu', **kwargs)[source]
Bases:
Module
- __init__(in_dim, out_dim, device='cpu', **kwargs)[source]
Linear classification layer.
- Parameters:
in_dim (int) – number of input dimensions (dictated by embeddings)
out_dim (int) – number of output dimensions (dictated by classes in ground truth)
- forward(x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- bacpipe.embedding_evaluation.probing.train_probe.train_knn_probe(knn_classifier, train_dataloader, device='cpu', **kwargs)[source]
Pipeline for knn classifier training.
- Parameters:
knn_classifier (object) – classifier object
train_dataloader (DataLoader object) – iterator for dataset
device (str, optional) – ‘cpu’ or ‘cuda’, by default “cpu”
- Returns:
classifier object
- Return type:
object
- bacpipe.embedding_evaluation.probing.train_probe.train_linear_probe(linear_classifier, train_dataloader, learning_rate, num_epochs, device='cuda:0', **kwargs)[source]
Linear classification training pipeline. Hyperparameters are specified in settings.yaml file and passed to this function.
- Parameters:
linear_classifier (object) – classification object
train_dataloader (DataLoader object) – dataset loader to iterate over
learning_rate (float) – learning rate
num_epochs (int) – number of epochs for training
device (str, optional) – ‘cpu’ or ‘cuda’, by default “cuda:0”
- Returns:
trained linear classification object
- Return type:
object
- bacpipe.embedding_evaluation.probing.train_probe.train_probe(embeds, df, label2index, config='linear', learning_rate=None, num_epochs=None, n_neighbors=None, **kwargs)[source]
Classification pipeline. First the classification dataframe is loaded, then a dict is created to link labels to ints, then the dataset loaders are created to iterate over. Next, depending on the specified config, a linear or KNN classification is performed. Finally the classifiers are used for inference, and performance metrics are computed from the results.
- Parameters:
paths (SimpleNamespace dict) – dictionary object containing paths for loading and saving
dataset_csv_path (string) – name of classification dataframe as specified in the settings.yaml file
embeds (np.array) – the embeddings
config (str, optional) – type of classification, by default ‘linear’
- Returns:
performance dictionary
- Return type:
dict
- class bacpipe.embedding_evaluation.probing.dataset_probe.ProbeDatasetLoader(class_df, embeds, label2index, set_name=None, **kwargs)[source]
Bases:
Dataset
- __getitem__(idx)[source]
Iterate through dataset.
- Parameters:
idx (int) – index of training step
- Returns:
(embedding, true label)
- Return type:
tuple
- __init__(class_df, embeds, label2index, set_name=None, **kwargs)[source]
Class to initialize and iterate through classification dataset.
- Parameters:
class_df (pd.DataFrame) – classification dataframe
embeds (np.array) – embeddings
label2index (dict) – linking labels to integers
set_name (string, optional) – train, test or val set, by default None
- bacpipe.embedding_evaluation.probing.dataset_probe.generate_annotations_for_probing_task(ground_truth, paths, label_column, dataset_csv_path='probe_annotations.csv', train_ratio=None, test_ratio=None, **kwargs)[source]
- bacpipe.embedding_evaluation.probing.dataset_probe.probe_dataset_loader(set_name, clean_df, embeds, label2index, batch_size=64, shuffle=False, **kwargs)[source]
Create dataset loader object for classification.
- Parameters:
set_name (string) – train, test or val set
clean_df (pd.DataFrame) – classification dataframe
embeds (np.array) – embeddings
label2index (dict) – link labels to ints
batch_size (int, optional) – number of embeddings per batch, by default 64
shuffle (bool, optional) – shuffle or not, by default False
- Returns:
dataset loader object to iterate over during training
- Return type:
DataLoader obj
cluster
- bacpipe.embedding_evaluation.clustering.cluster.clustering_pipeline(model_name, ground_truth, embeds, paths=None, overwrite=True, label_column='species', **kwargs)[source]
Clustering pipeline, generating clusterings based on the settings file. Clusterings are then evaluated and a dictionary with the evaluation scores is saved and returned.
- Parameters:
model_name (str) – name of model backbone
ground_truth (dict) – ground truth labels and a label2dict dictionary
embeds (np.array) – embeddings
paths (SimpleNamespace object) – dict with path attributes for saving and loading
overwrite (bool, optional) – whether to overwrite existing clustering files, by default True
label_column (str, optional) – name of column in annotations file, defaults to bacpipe.settings.label_column
- bacpipe.embedding_evaluation.clustering.cluster.eval_clustering(clusterings, ground_truth=[], embeds=None, default_labels=None, label_column=None, **kwargs)[source]
Evaluate clustering performance.
- Parameters:
clusterings (dict) – dictionary with clusterings
ground_truth (list) – ground truth labels
default_labels (dict) – default labels for the dataset
label_column (string) – label type defined in annotations.csv file
- Returns:
performance metrics
- Return type:
dict
- bacpipe.embedding_evaluation.clustering.cluster.eval_with_silhouette(embeds, ground_truth, metrics=None)[source]
Evaluate clustering using Silhouette Score.
- Parameters:
embeds (np.ndarray) – embeddings
ground_truth (list) – ground truth array
metrics (dict, optional) – already generated evaluation metrics, if any, by default None
- Returns:
evaluation metrics including Silhouette score
- Return type:
dict
- bacpipe.embedding_evaluation.clustering.cluster.get_clustering_models(clust_params)[source]
Initialize the clustering models specified in settings.yaml
- Parameters:
clust_params (dict) – clusterings specified in settings.yaml
- Returns:
clustering objects to run the data on
- Return type:
dict
- bacpipe.embedding_evaluation.clustering.cluster.get_nr_of_clusters(labels, clust_configs, **kwargs)[source]
Get the number of clusters either from the ground truth or, if that doesn’t exist, from settings.yaml
- Parameters:
labels (list) – ground truth labels
clust_configs (dict) – clusterings specified in settings.yaml
- Returns:
clustering dict with correct number of clusters
- Return type:
dict
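The fallback logic described above can be sketched as follows. The helper name `set_cluster_count`, the config layout (`{"kmeans": {"n_clusters": 8}}`) and the `n_clusters` key are assumptions for illustration only; the actual structure of the clustering settings is not shown in this reference.

```python
import copy


def set_cluster_count(labels, clust_configs):
    """Hypothetical helper: prefer the label count over configured values."""
    configs = copy.deepcopy(clust_configs)  # leave the caller's dict untouched
    if len(labels) > 0:
        # With ground truth available, use the number of distinct labels.
        n_clusters = len(set(labels))
        for params in configs.values():
            params["n_clusters"] = n_clusters
    # Without ground truth, the configured values are returned unchanged.
    return configs


# Assumed settings layout with a made-up default of 8 clusters.
example_configs = {"kmeans": {"n_clusters": 8}}
with_labels = set_cluster_count([0, 1, 2, 1, 0], example_configs)
without_labels = set_cluster_count([], example_configs)
```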
- bacpipe.embedding_evaluation.clustering.cluster.run_clustering(embeds, cluster_configs, label_column=None, ground_truth=[])[source]
Fit clustering algorithms to embeddings.
- Parameters:
embeds (np.array) – embeddings
cluster_configs (dict) – clustering algorithm objects
label_column (string) – label type defined in annotations.csv file
ground_truth (list) – ground truth labels
- Returns:
labels according to clustering algorithms
- Return type:
dict
- bacpipe.embedding_evaluation.clustering.cluster.save_clustering_performance(paths, clusterings, metrics, label_column)[source]
Save the clustering performance. A json file for the performance metrics and a npy file with the cluster labels for visualizations.
- Parameters:
paths (SimpleNamespace object) – dict with path attributes
clusterings (np.array) – clustering labels
metrics (dict) – clustering performance
label_column (str) – label as defined in annotation.csv file
benchmarking
- bacpipe.embedding_evaluation.benchmark.benchmark(model, dataset, annotations_file=None, CustomModel=None, check_if_already_processed=True, **kwargs)[source]
Benchmark a model’s classifier performance for a dataset. The dataset requires an annotation file that is located in the root directory of the dataset. This annotation file needs to have the column names: start, end, audiofilename, label:species so that the ground truth can be extracted. Ground truth is mapped to the timestamps so that predictions and ground_truth have the same shape. If predictions have already been produced, this function runs very quickly as it uses the saved data.
Finally the sklearn.metrics.classification_report function is used to quantify the performance. The results are printed as a report and returned as a dictionary. This function expects a threshold. Threshold-independent performance evaluation is currently not supported.
- Parameters:
model (string) – model name
dataset (string) – path to audio dataset
annotations_file (string, optional) – file name of annotations, by default None
CustomModel (class, optional) – Custom model to use for the predictions, by default None
check_if_already_processed (bool, optional) – set to False if you want to force embeddings to be generated again, defaults to True
- Returns:
dictionary containing report results, ground truth array, predictions array, index to label dict and a list of the species that weren’t found in the classifier class list
- Return type:
dict
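Before running a benchmark it can be useful to check that the annotation file has the required columns. The sketch below does this with the standard library only; the helper name `missing_annotation_columns`, the file names and the species names are made up for illustration.

```python
import csv
import io

# Column names required by the description above.
REQUIRED_COLUMNS = {"start", "end", "audiofilename", "label:species"}


def missing_annotation_columns(csv_text):
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# Made-up annotation rows in the expected layout.
example_csv = (
    "start,end,audiofilename,label:species\n"
    "0.0,2.5,rec_001.wav,species_a\n"
    "4.0,5.2,rec_001.wav,species_b\n"
)
missing = missing_annotation_columns(example_csv)
```

An empty set means the header is complete; for a file on disk, the same check works on the text returned by reading the file.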
label_embeddings
- class bacpipe.embedding_evaluation.label_embeddings.DefaultLabels(paths, model, default_label_keys, **kwargs)[source]
Bases:
object
- __init__(paths, model, default_label_keys, **kwargs)[source]
Class to generate default labels based on audio files and number of generated embeddings per file.
- Parameters:
paths (SimpleNamespace) – convenient object for path handling
model (str) – model name
default_label_keys (list) – list of default labels, see settings.yaml
- Raises:
ValueError – if no embeddings were found
- bacpipe.embedding_evaluation.label_embeddings.build_ground_truth_labels_by_file(paths, ind, model, num_embeds, segment_s, metadata, all_labels, label_df=None, label_idx_dict=None, label_column=None, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.collect_ground_truth_labels(paths, files, model, segment_s, metadata, label_df, label_idx_dict, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.concatenate_annotation_files(annotation_src, appendix='.txt', acodet_annotations=False, start_col_name='start', end_col_name='end', lab_col_name='label')[source]
- bacpipe.embedding_evaluation.label_embeddings.create_Raven_annotation_table(df, label_column, high_freq=1000)[source]
- bacpipe.embedding_evaluation.label_embeddings.create_default_labels(audio_dir=None, model=None, paths=None, overwrite=True, **kwargs)[source]
Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.
- Parameters:
audio_dir (str, optional) – path to audio data, by default None
model (str, optional) – model name, by default None
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
overwrite (bool, optional) – if True labels are overwritten, by default True
- Returns:
dictionary with default labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.ensure_audio_files(found_audio_files, annotated_audio_files, audio_dir)[source]
- bacpipe.embedding_evaluation.label_embeddings.fill_all_labels_array(file_labels, all_labels)[source]
- bacpipe.embedding_evaluation.label_embeddings.filter_annotations_by_minimum_number_of_occurrences(df, min_occurrences=150, min_duration=0.65)[source]
Filter the annotations to have at least a minimum number of occurrences and a minimum duration.
- Parameters:
df (pd.DataFrame) – DataFrame containing the annotations.
min_occurrences (int, optional) – Minimum number of occurrences for each label, by default 150.
min_duration (float, optional) – Minimum duration for each label, by default 0.65.
- Returns:
Filtered DataFrame containing the annotations.
- Return type:
pd.DataFrame
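A dependency-free sketch of this kind of filtering, operating on a list of dicts rather than a DataFrame. The helper name `filter_annotations`, the `label` key, the order of the two steps (duration first, then occurrence count) and the inclusive comparisons are all assumptions; the documented function may differ on any of these points.

```python
from collections import Counter


def filter_annotations(rows, min_occurrences=150, min_duration=0.65):
    """Hypothetical helper operating on dicts with start, end and label keys."""
    # Step 1: discard annotations that are too short.
    long_enough = [r for r in rows if r["end"] - r["start"] >= min_duration]
    # Step 2: keep only labels that still occur often enough.
    counts = Counter(r["label"] for r in long_enough)
    return [r for r in long_enough if counts[r["label"]] >= min_occurrences]


# Made-up annotations: label "a" has three long and one short entry,
# label "b" has a single long entry.
example_rows = [
    {"start": 0.0, "end": 1.0, "label": "a"},
    {"start": 2.0, "end": 3.5, "label": "a"},
    {"start": 4.0, "end": 4.3, "label": "a"},
    {"start": 5.0, "end": 6.0, "label": "a"},
    {"start": 7.0, "end": 8.0, "label": "b"},
]
kept = filter_annotations(example_rows, min_occurrences=2)
```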
- bacpipe.embedding_evaluation.label_embeddings.filter_df_by_filename(df_to_filer, file_name, file_name_column='audiofilename', model=None)[source]
- bacpipe.embedding_evaluation.label_embeddings.fit_labels_to_embedding_timestamps(df, label_idx_dict, num_embeds, segment_s, label_column=None, single_label=True, min_annotation_length=0.65, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_default_labels(model_name, **kwargs)[source]
Return dictionary of the default labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input. The default labels are calculated based on the default labels specified in the settings.yaml file.
- Parameters:
model_name (str) – model name
- Returns:
dictionary of default labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.get_dim_reduc_path_func(model_name, dim_reduction_model='umap', **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_dt_filename(file)[source]
Return the timestamp within a filename as a datetime object based on the most common naming conventions in bioacoustics. This is not bulletproof, but it works with the vast majority of naming conventions for files.
- Parameters:
file (str) – filename as string
- Returns:
datetime object of the filename
- Return type:
dt.datetime object
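To illustrate the idea, the sketch below extracts a timestamp for one common convention, `YYYYMMDD` followed by `HHMMSS` with an optional separator. The pattern and the helper name `parse_filename_timestamp` are assumptions; the conventions actually supported by `get_dt_filename` are not listed here and are likely broader.

```python
import datetime as dt
import re

# Assumed pattern: YYYYMMDD, an optional separator, then HHMMSS.
TIMESTAMP_RE = re.compile(r"(\d{8})[_T-]?(\d{6})")


def parse_filename_timestamp(filename):
    """Hypothetical helper: return a datetime, or None if nothing matches."""
    match = TIMESTAMP_RE.search(filename)
    if match is None:
        return None
    try:
        return dt.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:
        # Fourteen digits that do not form a valid date, e.g. month 13.
        return None


# Made-up filename following the assumed convention.
example_timestamp = parse_filename_timestamp("site1_20230415_063000.wav")
```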
- bacpipe.embedding_evaluation.label_embeddings.get_files_if_no_embeds(audio_dir, model, label_df=None)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_ground_truth(model_name)[source]
Return dictionary of the ground truth labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input.
- Parameters:
model_name (str) – model name
- Returns:
dictionary of ground truth labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.ground_truth_by_model(model, audio_dir, label_df=None, label_idx_dict=None, label_column='label:species', paths=None, annotations_filename='annotations.csv', overwrite=True, single_label=True, bool_filter_labels=False, **kwargs)[source]
Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input lengths. This way the embeddings and ground truth labels have the same length and can be used for downstream evaluation such as probing or clustering. This function supports single-label or multi-label generation of ground truth labels. A dictionary is created with a numpy array for the labels and a dictionary to associate the int values with the corresponding label class. The labels are read from a single annotation file, which requires predefined column names: audiofilename, start, end, label:species (species can be replaced with another label type, but the label: prefix must be kept). See ‘bacpipe/tests/test_data/annotations.csv’ for an example. After processing, the dictionary is saved as a numpy file and, on re-execution, is simply loaded to shorten the runtime.
- Parameters:
model (str) – model name
audio_dir (str) – path to audio data
label_df (pandas.DataFrame, optional) – ground truth annotations in specified format, by default None
label_idx_dict (dict, optional) – link between int values and class labels can be auto generated, by default None
label_column (str, optional) – name of column in annotation file, by default ‘label:species’
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
annotations_filename (str, optional) – path to annotations csv file, by default “annotations.csv”
overwrite (bool, optional) – If True, the dict will be generated again and saved rather than loaded from a file if already processed, by default True
single_label (bool, optional) – set False if you want multi-label, by default True
bool_filter_labels (bool, optional) – set to True, if you want a minimum number of occurrence for labels to be included in the ground truth. See settings file for more options and descriptions, by default False
- Returns:
dictionary of ground truth labels with numpy array and dict to link int values to class labels
- Return type:
dict
- Raises:
ValueError – if ground truth file is not found
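The mapping of annotations onto model timestamps can be pictured as follows for the single-label case: each embedding covers a fixed window of `segment_s` seconds, and a window receives the label of an annotation that overlaps it sufficiently. This is only a sketch under stated assumptions: the helper name `labels_per_segment`, the value `-1` for unlabeled windows, the largest-overlap rule, and the reuse of 0.65 s (the `min_annotation_length` default shown in the `fit_labels_to_embedding_timestamps` signature) as an overlap threshold are all illustrative choices, not bacpipe's documented behaviour.

```python
def labels_per_segment(annotations, num_embeds, segment_s,
                       min_overlap_s=0.65, no_label=-1):
    """Hypothetical helper: one integer label per fixed-length segment.

    annotations is a list of (start, end, label_idx) tuples in seconds.
    """
    labels = [no_label] * num_embeds
    for i in range(num_embeds):
        seg_start, seg_end = i * segment_s, (i + 1) * segment_s
        best_overlap = 0.0
        for start, end, label_idx in annotations:
            # Length of the intersection; negative when there is no overlap.
            overlap = min(end, seg_end) - max(start, seg_start)
            # Keep the label with the largest sufficient overlap.
            if overlap >= min_overlap_s and overlap > best_overlap:
                best_overlap = overlap
                labels[i] = label_idx
    return labels


# Made-up example: four 3-second windows and two annotations.
example_labels = labels_per_segment(
    annotations=[(0.5, 2.0, 0), (5.0, 7.5, 1)],
    num_embeds=4,
    segment_s=3.0,
)
```

The resulting list has one entry per embedding, which is what allows embeddings and labels to be paired directly in probing or clustering.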
- bacpipe.embedding_evaluation.label_embeddings.load_labels_and_build_dict(paths, annotations_filename, audio_dir, bool_filter_labels=True, min_label_occurrences=150, main_label_column=None, testing=False, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.make_set_paths_func(audio_dir, main_results_dir=None, dim_reduc_parent_dir='dim_reduced_embeddings', testing=False, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.model_specific_embedding_path(path, model, dim_reduction_model=None, **kwargs)[source]
Get the path to the model specific embeddings. This function searches for the most recent directory containing the embeddings for the specified model and dimensionality reduction model.
- Parameters:
path (Path) – Path to the main embeddings directory.
model (str) – Name of the model used for embedding.
dim_reduction_model (str) – Name of the dimensionality reduction model used. Default is ‘umap’.
kwargs (dict) – Additional keyword arguments.
- Returns:
Path to the model specific embeddings directory.
- Return type:
Path
- Raises:
ValueError – If no embeddings are found for the specified model.
visualization
- bacpipe.embedding_evaluation.visualization.visualize.clustering_overview(path_func, label_by, no_noise, model_list, label_column, **kwargs)[source]
Create overview plots for clustering metrics.
- Parameters:
path_func (function) – function to return the paths when model name is given
label_by (str) – key of default_labels dict
no_noise (bool) – whether to plot the metrics with or without noise
model_list (list) – list of models
label_column (str) – label as defined in the annotations.csv file
kwargs (dict) – additional arguments for plotting
- Returns:
figure handle
- Return type:
plt.plot object
- bacpipe.embedding_evaluation.visualization.visualize.generate_bar_plot(metrics, fig, ax, x_label='Metric value', no_legend=False, **kwargs)[source]
- bacpipe.embedding_evaluation.visualization.visualize.iterate_through_subtasks(plot_func, plot_path, task_name, model_list, metrics)[source]
For classification multiple subtasks exist (linear and knn). Iterate over each of the subtasks and call the plotting functions to create the visualizations.
- Parameters:
plot_func (function) – returns model specific paths when model name is passed
plot_path (pathlib.Path object) – path to store overview plots
task_name (str) – name of task
model_list (list) – list of models
metrics (dict) – performance dictionary
- bacpipe.embedding_evaluation.visualization.visualize.plot_clusterings(path_func, model_name, label_by, no_noise, fig=None, ax=None, **kwargs)[source]
Plot the clustering metrics for a given model and label type.
- Parameters:
path_func (function) – function to return the paths when model name is given
model_name (str) – name of model
label_by (str) – key of default_labels dict
no_noise (bool) – whether to plot the metrics with or without noise
fig (plt.plot object, optional) – figure handle, by default None
ax (plt.plot object, optional) – axes handle, by default None
- Returns:
figure handle
- Return type:
plt.plot object
- bacpipe.embedding_evaluation.visualization.visualize.plot_overview_metrics(plot_path, task_name, model_list, metrics, path_func=None, return_fig=False, sort_string='kmeans-audio_file_name')[source]
Visualization of task performance by model across all classes. The resulting plot is stored in the plot path.
- Parameters:
plot_path (pathlib.Path object) – path to store overview plots
task_name (str) – name of task
model_list (list) – list of models
metrics (dict) – performance dictionary
sort_string (str) – string to sort the metrics by, defaults to “kmeans-audio_file_name”
- bacpipe.embedding_evaluation.visualization.visualize.visualise_results_across_models(plot_path, task_name, model_list)[source]
Create visualizations to compare models by specified tasks.
- Parameters:
path_func (function) – return the paths when given a model name
plot_path (pathlib.Path object) – path to overview plots
task_name (str) – name of task
model_list (list) – list of models
- class bacpipe.embedding_evaluation.visualization.dashboard.DashBoard(model_names, audio_dir, main_results_dir, default_label_keys, evaluation_task, dim_reduction_model, dim_reduc_parent_dir, **kwargs)[source]
Bases:
DashBoardHelper
- build_layout()[source]
Builds the layout for the dashboard with two models and a single model page. The layout consists of a single model page, a two-models comparison page, and a page showing all models. Each page contains sidebars with model-specific information and content areas for visualizations.
- bacpipe.embedding_evaluation.visualization.dashboard.visualize_using_dashboard(models, dashboard_port=5006, dashboard_address='localhost', dashboard_websocket_origin=False, **kwargs)[source]
Create and serve the dashboard for visualization. To color-code embeddings by labels other than the default ones, create an annotations file with timestamps. An example file can be found in ‘bacpipe/tests/test_data/annotations.csv’. Multiple dashboards can be opened; the port simply increments.
- Parameters:
models (list) – embedding models
kwargs (dict) – Dictionary with parameters for dashboard creation
test_embedding_creation
- bacpipe.tests.test_embedding_creation.test_benchmarking(model, device, only_embed_annotations, kwargs)[source]