API Reference

This page lists all the objects that are part of the public API of bacpipe. These are either defined in the package or explicitly re-exported in __init__.py.

Constants

bacpipe.supported_models()

Type: list.

bacpipe.models_needing_checkpoint()

Type: list.

bacpipe.TF_MODELS()

Type: list.

bacpipe.EMBEDDING_DIMENSIONS()

Type: dict.

bacpipe.NEEDS_CHECKPOINT()

Type: list.

Main Processing Classes

class bacpipe.Loader(audio_dir, model_name=None, check_if_combination_exists=True, dim_reduction_model=False, use_folder_structure=False, testing=False, **kwargs)[source]

Bases: object

Initiate the generation of embeddings by creating a Loader object. This object handles the paths for loading and saving data. During this process it collects metadata, which can be accessed as an attribute and is saved after a successful run. Keyword arguments that are not passed explicitly are taken from bacpipe.config and bacpipe.settings.

__init__(audio_dir, model_name=None, check_if_combination_exists=True, dim_reduction_model=False, use_folder_structure=False, testing=False, **kwargs)[source]

Initiate the generation of embeddings by creating a Loader object. This object handles the paths for loading and saving data. During this process it collects metadata, which can be accessed as an attribute and is saved after a successful run. Keyword arguments that are not passed explicitly are taken from bacpipe.config and bacpipe.settings.

Parameters:
  • audio_dir (string or pathlib.Path) – path to audio data

  • model_name (string, optional) – Name of the model that should be used, by default None

  • check_if_combination_exists (bool, optional) – If False, the check is skipped and new embeddings are created, by default True

  • dim_reduction_model (bool, optional) – False if primary embeddings are created, or the name of the dimensionality reduction model if dimensionality reduction should be performed, by default False

  • use_folder_structure (bool, optional) – If True, data will be saved and the output folder structure will be created, by default False

  • testing (bool, optional) – Set to True to run in testing mode, by default False

classifier_should_be_run(paths, run_pretrained_classifier, testing, **kwargs)[source]
embeddings(return_type='dict')[source]

Load and return processed embeddings. This method can only return embeddings that have already been computed. Embeddings are returned either as an np.array (array) or as a dictionary (dict) whose keys are the embedding file names. For array, all embeddings are concatenated so that the first dimension corresponds to the timestamps and the second to the embedding dimension.

Parameters:

return_type (str, optional) – return type either array or dict, by default ‘dict’

Returns:

depending on return_type argument

Return type:

array or dict
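The difference between the two return types can be sketched without bacpipe. In this illustration plain nested lists stand in for np.array, and the file names and values are made up; it is an assumption-laden sketch, not the library's implementation.

```python
# Hypothetical per-file embeddings: each file holds a list of time steps,
# and each time step is a vector of embedding values (dimension 3 here).
embeds_by_file = {
    "rec_a.npy": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "rec_b.npy": [[0.7, 0.8, 0.9]],
}


def as_array(per_file):
    """Concatenate the embeddings of all files along the first (time) axis."""
    rows = []
    for name in per_file:  # dicts preserve insertion order
        rows.extend(per_file[name])
    return rows


stacked = as_array(embeds_by_file)
n_timestamps, n_dims = len(stacked), len(stacked[0])
print(n_timestamps, n_dims)
```

The dict form keeps the link between each embedding and its file, while the concatenated form is convenient when a downstream step expects a single two-dimensional input.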

get_annotations_parquet(**kwargs)[source]
static get_audio_files(audio_dir, audio_suffixes=['.wav', '.WAV', '.aif', '.mp3', '.MP3', '.flac', '.ogg'], return_type='pathlib.Path')[source]

Collect all audio files in a given directory that have file endings that can be processed by bacpipe.

Parameters:
  • audio_dir (str) – path to audio data

  • audio_suffixes (list, optional) – list of audio suffixes, by default settings.audio_suffixes

  • return_type (str, optional) – specify if list should be returned as list of strings or list of pathlib.Path objects which comes in handy for some downstream processing, by default ‘pathlib.Path’

Returns:

list of audio files

Return type:

list
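A minimal standard-library sketch of suffix-based file collection follows. The helper name collect_audio_files, the as_strings flag and the recursive search are assumptions made for illustration; only the suffix list is copied from the signature above, and bacpipe's own implementation may differ.

```python
import tempfile
from pathlib import Path

# Suffix list copied from the documented default.
AUDIO_SUFFIXES = [".wav", ".WAV", ".aif", ".mp3", ".MP3", ".flac", ".ogg"]


def collect_audio_files(audio_dir, audio_suffixes=AUDIO_SUFFIXES, as_strings=False):
    """Recursively collect files whose suffix is listed in audio_suffixes."""
    files = sorted(
        p for p in Path(audio_dir).rglob("*")
        if p.is_file() and p.suffix in audio_suffixes
    )
    return [str(p) for p in files] if as_strings else files


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for rel in ["a.wav", "sub/b.flac", "notes.txt"]:
        (root / rel).touch()
    found = [p.name for p in collect_audio_files(root)]

print(found)
```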

get_embedding_dir()[source]
get_preds_array(return_type='dict', **kwargs)[source]
predictions(return_type='dict')[source]

Load and return classifier predictions. This method can only be used for already processed predictions. Predictions are returned according to the specified return_type:

  • array – an np.array in which all predictions are concatenated, together with a dictionary mapping each index to the corresponding label

  • dict – a dictionary whose keys are the names of the audio files the annotations belong to and whose values are np.arrays with all annotations of that file

  • dataframe – a dataframe with one column for each species that was active, plus columns for filename, start and end times

Parameters:

return_type (str, optional) – return either array, dict or dataframe, by default ‘dict’

Returns:

either tuples of (np.array, dict) for array or tuple of (dict, dict) for dict or pd.DataFrame

Return type:

tuple or pd.DataFrame

read_embedding_file(file)[source]
save_embedding_file(file, embeds)[source]
update_files()[source]
write_metadata_file()[source]
class bacpipe.Embedder(model_name, loader=None, CustomModel=None, dim_reduction_model=False, **kwargs)[source]

Bases: AudioHandler

This class takes care of loading the specified model and using it to process the audio data to create embeddings. This class is also used to create dimensionality reductions from embeddings. At the end of instantiation, the selected model is loaded and associated with the specified device. Keyword arguments that are not passed explicitly are taken from bacpipe.config and bacpipe.settings.

Parameters:

AudioHandler (class) – Helper class that handles loading of audio

__init__(model_name, loader=None, CustomModel=None, dim_reduction_model=False, **kwargs)[source]

This class takes care of loading the specified model and using it to process the audio data to create embeddings. This class is also used to create dimensionality reductions from embeddings. At the end of instantiation, the selected model is loaded and associated with the specified device. Keyword arguments that are not passed explicitly are taken from bacpipe.config and bacpipe.settings.

Parameters:
  • model_name (str) – name of selected embedding model

  • loader (Loader object) – Object that has all the necessary path information and methods to load and save all the processed data

  • CustomModel (class, optional) – custom model class to use for processing, by default None

  • dim_reduction_model (bool, optional) – Can be bool or the string corresponding to the dimensionality reduction model, by default False

batch_inference(batched_samples, callback=None)[source]
embeddings_using_multithreading(array_of_audios)[source]

Generate embeddings for all files in a pipelined manner:

  • Producer thread loads and preprocesses audio

  • Consumer (main thread) embeds audio while the producer prepares the next batch

Ensures metadata and embeddings are written exactly as in the sequential version.

Parameters:

fileloader_obj (Loader object) – contains all metadata of a model specific embedding creation session

Returns:

updated object with metadata on embedding creation session

Return type:

Loader object
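The producer and consumer arrangement described for this method can be illustrated with the standard library alone. The functions load_and_preprocess and embed are hypothetical stand-ins for audio loading and model inference, and the bounded queue size is an assumption; this is a sketch of the general pattern, not bacpipe's code.

```python
import queue
import threading


def load_and_preprocess(path):
    """Stand-in for audio loading and preprocessing (hypothetical)."""
    return [len(path)] * 4  # fake "frames"


def embed(frames):
    """Stand-in for model inference (hypothetical)."""
    return [x * 2 for x in frames]


def run_pipelined(paths, max_prefetch=2):
    """Embed in the main thread while a producer thread prepares the next item."""
    q = queue.Queue(maxsize=max_prefetch)  # bounded: producer blocks when full
    sentinel = object()

    def producer():
        for path in paths:
            q.put((path, load_and_preprocess(path)))
        q.put(sentinel)  # signal that no more items follow

    threading.Thread(target=producer, daemon=True).start()

    results = {}
    while True:
        item = q.get()
        if item is sentinel:
            break
        path, frames = item
        results[path] = embed(frames)  # consumer work happens here
    return results


results = run_pipelined(["a.wav", "bb.wav"])
print(results)
```

Because a single producer feeds a first-in, first-out queue, items are consumed in the order they were submitted, which is one way to keep the output identical to a sequential run.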

get_embeddings_for_audio(sample)[source]

Create a dataloader for the processed audio frames and run batch inference. Both are methods of the self.model class, which can be found in the utils.py file.

Parameters:

sample (torch.Tensor) – preprocessed audio frames

Returns:

embeddings from model

Return type:

np.array

get_embeddings_from_model(sample)[source]

Run the full embedding generation pipeline, either generating embeddings from audio data or generating dimensionality reductions from embedding data. Accordingly, sample is either an embedding array or the path to an audio file.

Parameters:

sample (np.array or string-like) – embedding array or path to audio file

Returns:

embeddings

Return type:

np.array

get_reduced_dimensionality_embeddings(embeds)[source]
init_dataloader(audio)[source]
prepare_audio(sample)

Use bacpipe pipeline to load audio file, window it according to model specific window length and preprocess the data, ready for batch inference computation. Also log file length and shape for metadata files.

Parameters:

sample (pathlib.Path or str) – path to audio file

Returns:

audio frames preprocessed with model specific preprocessing

Return type:

torch.Tensor

run_dimensionality_reduction_pipeline()[source]
run_inference_pipeline_sequentially()[source]
run_inference_pipeline_using_multithreading()[source]

Generate embeddings for all files in a pipelined manner:

  • Producer thread loads and preprocesses audio

  • Consumer (main thread) embeds audio while the producer prepares the next batch

Ensures metadata and embeddings are written exactly as in the sequential version.

Parameters:

fileloader_obj (Loader object) – contains all metadata of a model specific embedding creation session

Returns:

updated object with metadata on embedding creation session

Return type:

Loader object

Main Pipeline Functions

bacpipe.play(bool_save_logs=False, **kwargs)[source]

Play the bacpipe! The pipeline will run using the models specified in bacpipe.config.models and generate results in the directory bacpipe.settings.results_dir. For more details see the ReadMe file on the repository page https://github.com/bioacoustic-ai/bacpipe or the documentation under https://bacpipe.readthedocs.io/en/latest/.

Parameters:

bool_save_logs (bool, optional) – Save the logs, config and settings files. If you run into a bug, sharing these files helps to locate the source of the problem, by default False

Raises:

FileNotFoundError – If no audio files are found we can’t compute any embeddings. So make sure the path is correct :)

bacpipe.run_pipeline_for_single_model(model_name, audio_dir, dim_reduction_model='None', check_if_already_processed=True, check_if_already_dim_reduced=True, testing=False, **kwargs)[source]

Run the bacpipe pipeline, including embedding generation, classification using the pretrained classifier (if included), dimensionality reduction (if passed), and plotting of visualization to files. All of this will be done for one model. The predefined folder structure will be created so that subsequent processing runs will be very fast, as they then only load the data. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.

Parameters:
  • model_name (string) – model name

  • audio_dir (str) – path to audio data

  • dim_reduction_model (str, optional) – name of dimensionality reduction model, by default “None”

  • check_if_already_processed (bool, optional) – set to False if you want to force recomputing of embeddings, by default True

  • check_if_already_dim_reduced (bool, optional) – set to False if you want to force recomputing of dimensionality reduced embeddings, by default True

  • overwrite (bool, optional) – set to True if you want default labels and ground truth labels to be processed again, by default False

  • testing (bool, optional) – set to True for testing, by default False

Returns:

object to processed embeddings and classifier predictions

Return type:

bacpipe.Loader

bacpipe.run_pipeline_for_models(models, audio_dir, dim_reduction_model, **kwargs)[source]

Generate embeddings for each model in the list of model names. The embeddings are generated using the generate_embeddings function from the generate_embeddings module. The embeddings are saved in the directory specified by the audio_dir parameter. The function returns a dictionary containing the loader objects for each model, by which metadata and paths are stored. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.

code example:

```
import bacpipe

loader = bacpipe.run_pipeline_for_models(
    models=['birdnet', 'naturebeats'],
    audio_dir='bacpipe/tests/test_data',
    dim_reduction_model='umap'
)

# This call initiates the embedding generation process. It checks whether
# embeddings already exist for the combination of each model and the dataset
# and, if so, it will be ready to load them. The loader keys are the model
# names and the values are the loader objects for each model. Each object
# contains all the information on the generated embeddings. To access them:
loader['birdnet'].embeddings()
# This will give you a dictionary with the keys corresponding to embedding
# files and the values corresponding to the embeddings as numpy arrays.

loader['birdnet'].metadata_dict
# This will give you a dictionary overview of:
# - where the audio data came from,
# - where the embeddings were saved,
# - all the audio files,
# - the embedding size of the model,
# - the audio file lengths,
# - the number of embeddings for each audio file,
# - the sample rate,
# - the number of samples per window,
# - and the total length of the processed dataset in seconds.
# This dictionary is also saved as a yaml file in the directory of the embeddings.
```

Parameters:
  • models (list) – embedding models

  • audio_dir (string) – full path to audio files

  • dim_reduction_model (string) – name of the dimensionality reduction model to be used for the embeddings. If “None” is selected, no dimensionality reduction is performed.

Returns:

loader_dict – dictionary containing the loader objects for each model

Return type:

dict

bacpipe.generate_embeddings(avoid_pipelined_gpu_inference=False, **kwargs)[source]

Run the embedding generation pipeline including classification using the pretrained classifier (if included). All of this will be done for one model. The predefined folder structure will be created so that subsequent processing runs will be very fast, as they then only load the data. kwargs that are not specifically passed will be taken from bacpipe.config and bacpipe.settings.

Parameters:

avoid_pipelined_gpu_inference (bool, optional) – set to True to avoid multiprocessing, by default False

Returns:

loader object to access embeddings and classifier predictions

Return type:

bacpipe.Loader

Raises:

ValueError – if no model name is provided

Return audio files in specified dir

bacpipe.get_audio_files(audio_dir, audio_suffixes=['.wav', '.WAV', '.aif', '.mp3', '.MP3', '.flac', '.ogg'], return_type='pathlib.Path')

Collect all audio files in a given directory that have file endings that can be processed by bacpipe.

Parameters:
  • audio_dir (str) – path to audio data

  • audio_suffixes (list, optional) – list of audio suffixes, by default settings.audio_suffixes

  • return_type (str, optional) – specify if list should be returned as list of strings or list of pathlib.Path objects which comes in handy for some downstream processing, by default ‘pathlib.Path’

Returns:

list of audio files

Return type:

list

Automatic creation of labels and ground truth

bacpipe.DefaultLabels(paths, model, default_label_keys, **kwargs)[source]
bacpipe.create_default_labels(audio_dir=None, model=None, paths=None, overwrite=True, **kwargs)[source]

Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.

Parameters:
  • audio_dir (str, optional) – path to audio data, by default None

  • model (str, optional) – model name, by default None

  • paths (SimpleNamespace, optional) – convenient object for path handling, by default None

  • overwrite (bool, optional) – if True labels are overwritten, by default True

Returns:

dictionary with default labels

Return type:

dict

bacpipe.ground_truth_by_model(model, audio_dir, label_df=None, label_idx_dict=None, label_column='label:species', paths=None, annotations_filename='annotations.csv', overwrite=True, single_label=True, bool_filter_labels=False, **kwargs)[source]

Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input lengths. This way the embeddings and ground truth labels have the same lengths, and can be used for downstream evaluation like probing or clustering. This function supports single or multi-label generation of ground truth labels. A dictionary is created with a numpy array for the labels and a dictionary to associate the int values with the corresponding label class. The labels are processed based on a single annotation file which requires predefined column names: audiofilename, start, end, label:species (species can be replaced with other things but the label: needs to be consistent). See ‘bacpipe/tests/test_data/annotations.csv’ for an example. After processing the ground truth, the dictionary is saved as a numpy file and upon reexecution is simply loaded for shorter runtime.

Parameters:
  • model (str) – model name

  • audio_dir (str) – path to audio data

  • label_df (pandas.DataFrame, optional) – ground truth annotations in specified format, by default None

  • label_idx_dict (dict, optional) – link between int values and class labels can be auto generated, by default None

  • label_column (str, optional) – name of column in annotation file, by default ‘label:species’

  • paths (SimpleNamespace, optional) – convenient object for path handling, by default None

  • annotations_filename (str, optional) – path to annotations csv file, by default “annotations.csv”

  • overwrite (bool, optional) – If True, the dict will be generated again and saved rather than loaded from a file if already processed, by default True

  • single_label (bool, optional) – set False if you want multi-label, by default True

  • bool_filter_labels (bool, optional) – set to True, if you want a minimum number of occurrence for labels to be included in the ground truth. See settings file for more options and descriptions, by default False

Returns:

dictionary of ground truth labels with numpy array and dict to link int values to class labels

Return type:

dict

Raises:

ValueError – if ground truth file is not found
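The mapping of annotations onto model timestamps described above can be sketched for the single-label case as follows. The function name labels_per_window, the overlap rule, the use of -1 for windows without an annotation and the dropping of a trailing partial window are all assumptions chosen for illustration, not bacpipe's documented behaviour.

```python
def labels_per_window(annotations, file_length_s, window_s, no_label=-1):
    """Assign one label index per model window (single-label case).

    annotations: list of (start_s, end_s, label) tuples for one audio file.
    A window receives the label of the first annotation it overlaps.
    """
    # Build the int-to-label link in order of first appearance.
    label_to_idx = {}
    for _, _, label in annotations:
        label_to_idx.setdefault(label, len(label_to_idx))

    n_windows = int(file_length_s // window_s)
    labels = [no_label] * n_windows
    for i in range(n_windows):
        win_start, win_end = i * window_s, (i + 1) * window_s
        for start, end, label in annotations:
            if start < win_end and end > win_start:  # intervals overlap
                labels[i] = label_to_idx[label]
                break
    return labels, label_to_idx


annotations = [(0.5, 2.0, "robin"), (6.2, 7.0, "wren")]
labels, label_to_idx = labels_per_window(
    annotations, file_length_s=9.0, window_s=3.0
)
print(labels, label_to_idx)
```

The returned list has one entry per window, so it lines up with an embedding array that has one row per window of the same length.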

bacpipe.get_default_labels(model_name, **kwargs)[source]

Return dictionary of the default labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input. The default labels are calculated based on the default labels specified in the settings.yaml file.

Parameters:

model_name (str) – model name

Returns:

dictionary of default labels

Return type:

dict

bacpipe.get_dt_filename(file)[source]

Return the timestamp within a filename as a datetime object based on the most common naming conventions in bioacoustics. This is not bulletproof, but it works with the vast majority of naming conventions for files.

Parameters:

file (str) – filename as string

Returns:

datetime object of the filename

Return type:

dt.datetime object
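As an illustration of timestamp extraction from a filename, the sketch below handles one assumed convention only: eight date digits, a separator, then six time digits. The helper name datetime_from_filename and the regular expression are hypothetical; the conventions that bacpipe actually recognises are not listed on this page.

```python
import re
from datetime import datetime

# Assumed pattern: YYYYMMDD, one separator (_, - or T), then HHMMSS.
_TIMESTAMP = re.compile(r"(\d{8})[_\-T](\d{6})")


def datetime_from_filename(filename):
    """Return the first YYYYMMDD<sep>HHMMSS timestamp in filename, or None."""
    match = _TIMESTAMP.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")


parsed = datetime_from_filename("SITE01_20230514_063000.wav")
missing = datetime_from_filename("recording.wav")
print(parsed, missing)
```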

Probing functions

bacpipe.probing_pipeline(model_name, ground_truth, embeds, paths=None, name='linear', overwrite=True, label_column='species', **kwargs)[source]

Probing pipeline consisting of building the classifier, evaluating it and saving metrics and plots of performance.

Parameters:
  • paths (SimpleNamespace object) – dict with attributes corresponding to paths for loading and saving

  • embeds (np.array) – embeddings

  • name (string) – Type of Probing

  • dataset_csv_path (string) – name of Probing dataframe as specified in settings.yaml

  • overwrite (bool) – whether to overwrite existing probing results, defaults to True

bacpipe.run_probe_inference(model, linear_probe, threshold, embeds=None, return_binary_presence=True, callbacks=None, device='cpu')[source]

Apply a previously trained linear probe to data. This requires either that the embeddings were already created using the backbone and saved using the bacpipe folder structure, or that the embeddings are directly passed to this function. See the examples notebooks for an example use case. This function then loads the embeddings and applies the linear probe to classify the data.

Parameters:
  • model (str) – model name

  • linear_probe (torch model) – linear probe torch model object

  • threshold (float) – float value to process the predictions

  • embeds (torch.Tensor, optional) – embeddings array, by default None

  • return_binary_presence (bool, optional) – if true a binary presence array is returned, by default True

  • callbacks (function, optional) – use to have custom progress bars increment, by default None

  • device (str, optional) – select device to process the probe, by default ‘cpu’

Returns:

generated probe predictions

Return type:

np.ndarray
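How a threshold can turn probe scores into a binary presence array is sketched below with plain lists. The function name binary_presence, the example scores and the choice of "greater than or equal" as the comparison are assumptions for illustration; the exact rule used by run_probe_inference is not stated here.

```python
def binary_presence(scores, threshold):
    """Turn per-window class scores into 0/1 presence values."""
    return [[int(s >= threshold) for s in row] for row in scores]


# Hypothetical scores: three windows (rows) and two classes (columns).
scores = [[0.91, 0.10], [0.40, 0.75], [0.05, 0.49]]
presence = binary_presence(scores, threshold=0.5)
print(presence)
```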

bacpipe.prepare_probe_inference(model, probe_path='')[source]

Load a linear probe that was previously trained and saved. The probe is loaded and the state_dict of the model is loaded so that the probe is ready and in the exact same state as after training.

Parameters:
  • model (str) – model name of backbone

  • probe_path (str, optional) – path to probe, will default to the standard bacpipe path, by default ‘’

Returns:

  • torch model object – linear probe model

  • dict – dictionary to associate the columns of the generated predictions array with the corresponding class label

Clustering functions

bacpipe.clustering_pipeline(model_name, ground_truth, embeds, paths=None, overwrite=True, label_column='species', **kwargs)[source]

Clustering pipeline that generates clusterings based on the settings file. The clusterings are then evaluated, and a dictionary with the evaluation scores is saved and returned.

Parameters:
  • model_name (str) – name of model backbone

  • ground_truth (dict) – ground truth labels and a label2dict dictionary

  • embeds (np.array) – embeddings

  • paths (SimpleNamespace object) – object with path attributes for saving and loading

  • overwrite (bool, optional) – whether to overwrite existing clustering files, by default True

  • label_column (str, optional) – name of column in annotations file, defaults to bacpipe.settings.label_column

bacpipe.run_clustering(embeds, cluster_configs, label_column=None, ground_truth=[])[source]

Fit clustering algorithms to embeddings.

Parameters:
  • embeds (np.array) – embeddings

  • cluster_configs (dict) – clustering algorithm objects

  • label_column (string) – label type defined in annotations.csv file

  • ground_truth (list) – ground truth labels

Returns:

labels according to the clustering algorithms

Return type:

dict

bacpipe.eval_clustering(clusterings, ground_truth=[], embeds=None, default_labels=None, label_column=None, **kwargs)[source]

Evaluate clustering performance.

Parameters:
  • clusterings (dict) – dictionary with clusterings

  • ground_truth (list) – ground truth labels

  • default_labels (dict) – default labels for the dataset

  • label_column (string) – label type defined in annotations.csv file

Returns:

performance metrics

Return type:

dict

bacpipe.eval_with_silhouette(embeds, ground_truth, metrics=None)[source]

Evaluate clustering using Silhouette Score.

Parameters:
  • embeds (np.ndarray) – embeddings

  • ground_truth (list) – ground truth array

  • metrics (dict, optional) – already generated evaluation metrics, if any, by default None

Returns:

evaluation metrics including Silhouette score

Return type:

dict

Evaluation pipelines

bacpipe.benchmark(model, dataset, annotations_file=None, CustomModel=None, check_if_already_processed=True, **kwargs)[source]

Benchmark a model’s classifier performance for a dataset. The dataset requires an annotation file that is located in the root directory of the dataset. This annotation file needs to have the column names start, end, audiofilename and label:species so that the ground truth can be extracted. Ground truth is mapped to the timestamps so that predictions and ground_truth have the same shape. If predictions have already been produced, this function runs very quickly as it uses the saved data.

Finally the sklearn.metrics.classification_report function is used to quantify the performance. The results are printed as a report and returned as a dictionary. This function expects a threshold. Threshold-independent performance evaluation is currently not supported.

Parameters:
  • model (string) – model name

  • dataset (string) – path to audio dataset

  • annotations_file (string, optional) – file name of annotations, by default None

  • CustomModel (class, optional) – Custom model to use for the predictions, by default None

  • check_if_already_processed (bool, optional) – set to False if you want to force embeddings to be generated again, by default True

Returns:

dictionary containing report results, ground truth array, predictions array, index to label dict and a list of the species that weren’t found in the classifier class list

Return type:

dict

bacpipe.model_specific_evaluation(loader_dict, evaluation_task, probe_configs, models, dim_reduction_model=False, **kwargs)[source]

Perform evaluation of the embeddings using the specified evaluation task. The evaluation task can be either probing or clustering. The evaluation is performed using the functions from the probing and clustering modules. The results of the evaluation are saved in the directory specified by the audio_dir parameter.

Parameters:
  • loader_dict (dict) – dictionary containing the loader objects for each model

  • evaluation_task (string) – name of the evaluation task to be performed.

  • probe_configs (dict) – dictionary containing the configuration for the probing tasks. The configurations are specified in the bacpipe/settings.yaml file.

  • models (list) – embedding models

bacpipe.cross_model_evaluation(dim_reduction_model, evaluation_task, models, **kwargs)[source]

Generate plots to compare models by the specified tasks.

Parameters:
  • dim_reduction_model (str) – name of dimensionality reduction model

  • evaluation_task (list) – tasks to evaluate models by

  • models (list) – embedding models

Experiment managing functions

bacpipe.ensure_models_exist(model_base_path, model_names, repo_id='vskode/bacpipe_models')[source]

Ensure that the model checkpoints for the selected models are available locally. Downloads from Hugging Face Hub if missing.

Parameters:
  • model_base_path (Path) – Local base directory where the checkpoints should be stored.

  • model_names (str or list) – Model name or list of model names to run

  • repo_id (str, optional) – Hugging Face Hub repo ID, by default “vskode/bacpipe_models”

Returns:

path to saved models

Return type:

str
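The check-then-download behaviour can be outlined with a small sketch in which the download step is injected as a callable. The names ensure_checkpoints and fake_download, and the assumption of one directory per model under the base path, are hypothetical; the real function downloads from the Hugging Face Hub and its directory layout may differ.

```python
import tempfile
from pathlib import Path


def ensure_checkpoints(model_base_path, model_names, download):
    """Call download(name, target_dir) for every checkpoint directory that is missing."""
    if isinstance(model_names, str):
        model_names = [model_names]
    base = Path(model_base_path)
    fetched = []
    for name in model_names:
        target = base / name  # assumed layout: one directory per model
        if not target.exists():
            download(name, target)
            fetched.append(name)
    return str(base), fetched


def fake_download(name, target):
    """Stand-in for a real download: create the directory and an empty file."""
    target.mkdir(parents=True)
    (target / "weights.bin").write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "birdnet").mkdir()  # pretend this checkpoint is already local
    _, fetched = ensure_checkpoints(tmp, ["birdnet", "naturebeats"], fake_download)

print(fetched)
```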

bacpipe.evaluation_with_settings_already_exists(audio_dir, dim_reduction_model, models, testing=False, **kwargs)[source]

Check if the evaluation with the specified settings already exists. The function checks if the embeddings, dimensionality reduction, probing and clustering evaluation results already exist in the specified directory. If any of these results do not exist, the function returns False. Otherwise, it returns True.

Parameters:
  • audio_dir (string) – full path to audio files

  • dim_reduction_model (string) – name of the dimensionality reduction model to be used

  • models (list) – embedding models

Returns:

True if the evaluation with the specified settings already exists, False otherwise

Return type:

bool

bacpipe.get_model_names(models, audio_dir, main_results_dir, embed_parent_dir, already_computed=False, **kwargs)[source]

Get the names of the models used for processing. This is either done by using already computed embeddings or by using the selected models from the config file. If already computed embeddings are used, the model names are extracted from the directory structure.

Parameters:
  • models (list) – list of embedding models

  • audio_dir (string) – full path to audio files

  • main_results_dir (string) – top level directory for the results of the embedding evaluation

  • embed_parent_dir (string) – parent directory for the embeddings

  • already_computed (bool, Default is False) – ignore the model list and use only models whose embeddings have already been computed and are saved in the results dir

Raises:

ValueError – If already computed embeddings are used, but no embeddings are found in the specified directory.

bacpipe.make_set_paths_func(audio_dir, main_results_dir=None, dim_reduc_parent_dir='dim_reduced_embeddings', testing=False, **kwargs)[source]

Visualization function to start dashboard

bacpipe.visualize_using_dashboard(models, dashboard_port=5006, dashboard_address='localhost', dashboard_websocket_origin=False, **kwargs)[source]

Create and serve the dashboard for visualization. To color-code embeddings by labels other than the default ones, create an annotations file with timestamps. An example file can be found in ‘bacpipe/tests/test_data/annotations.csv’. Multiple dashboards can be opened; the port will simply increment.

Parameters:
  • models (list) – embedding models

  • kwargs (dict) – Dictionary with parameters for dashboard creation
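The incrementing port behaviour mentioned above can be pictured with the following standard-library sketch. The helper first_free_port and its max_tries parameter are hypothetical, and only the default port and address are taken from the signature; how the dashboard actually selects its port is not specified here.

```python
import socket


def first_free_port(start_port=5006, address="localhost", max_tries=50):
    """Return the first port >= start_port that can be bound on address."""
    for port in range(start_port, start_port + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((address, port))
            except OSError:
                continue  # port is in use, try the next one
            return port
    raise RuntimeError("no free port found")


# Demonstration: occupy one port, then search starting from it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as busy:
    busy.bind(("127.0.0.1", 0))  # let the OS pick a free port
    busy.listen()
    taken = busy.getsockname()[1]
    found = first_free_port(start_port=taken, address="127.0.0.1")

print(taken, found)
```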