bacpipe.embedding_evaluation package

Subpackages

Submodules

bacpipe.embedding_evaluation.benchmark module

bacpipe.embedding_evaluation.benchmark.benchmark(model, dataset, annotations_file=None, CustomModel=None, check_if_already_processed=True, **kwargs)[source]

Benchmark a model’s classifier performance on a dataset. The dataset requires an annotation file located in the root directory of the dataset. This annotation file needs to have the column names start, end, audiofilename and label:species so that the ground truth can be extracted. Ground truth is mapped to the timestamps so that predictions and ground_truth have the same shape. If predictions have already been produced, this function runs very quickly because it uses the saved data.

Finally the sklearn.metrics.classification_report function is used to quantify the performance. The results are printed as a report and returned as a dictionary. This function expects a threshold. Threshold-independent performance evaluation is currently not supported.

Parameters:

model (string) – model name
dataset (string) – path to audio dataset
annotations_file (string, optional) – file name of annotations, by default None
CustomModel (class, optional) – Custom model to use for the predictions, by default None
check_if_already_processed (bool, optional) – if True, previously saved results are reused; set to False to force embeddings to be generated again, by default True

Returns:

dictionary containing report results, ground truth array, predictions array, index to label dict and a list of the species that weren’t found in the classifier class list

Return type:

dict
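As a minimal sketch of the annotation file described above, the following writes a CSV with the four required column names into a dataset root directory. The helper name write_example_annotations, the file name annotations.csv as default, the example rows and the use of seconds for start and end are assumptions for illustration only; they are not part of bacpipe.

```python
import csv
from pathlib import Path

# Column names required by the benchmark docstring.
REQUIRED_COLUMNS = ["start", "end", "audiofilename", "label:species"]


def write_example_annotations(dataset_root, rows, file_name="annotations.csv"):
    """Write annotation rows to <dataset_root>/<file_name> and return the path.

    Hypothetical helper for illustration, not a bacpipe function.
    """
    path = Path(dataset_root) / file_name
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=REQUIRED_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Invented example rows; start and end are assumed to be in seconds.
EXAMPLE_ROWS = [
    {"start": 0.0, "end": 2.5, "audiofilename": "rec_001.wav", "label:species": "species_a"},
    {"start": 4.0, "end": 6.0, "audiofilename": "rec_001.wav", "label:species": "species_b"},
]
```

The resulting file name could then be passed as annotations_file, with the dataset root passed as dataset.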

bacpipe.embedding_evaluation.label_embeddings module

class bacpipe.embedding_evaluation.label_embeddings.DefaultLabels(paths, model, default_label_keys, **kwargs)[source]

Bases: object

__init__(paths, model, default_label_keys, **kwargs)[source]

Class to generate default labels based on audio files and number of generated embeddings per file.

Parameters:

paths (SimpleNamespace) – convenient object for path handling
model (str) – model name
default_label_keys (list) – list of default labels, see settings.yaml

Raises:

ValueError – if no embeddings were found

audio_file_name()[source]

continuous_timestamp()[source]

day_of_year()[source]

default_classifier()[source]

fill_remaining_labels(df)[source]

generate()[source]

get_datetimes()[source]

parent_directory()[source]

time_of_day()[source]

bacpipe.embedding_evaluation.label_embeddings.assign_global_get_paths_function(audio_dir)[source]

bacpipe.embedding_evaluation.label_embeddings.build_ground_truth_labels_by_file(paths, ind, model, num_embeds, segment_s, metadata, all_labels, label_df=None, label_idx_dict=None, label_column=None, **kwargs)[source]

bacpipe.embedding_evaluation.label_embeddings.collect_ground_truth_labels(paths, files, model, segment_s, metadata, label_df, label_idx_dict, **kwargs)[source]

bacpipe.embedding_evaluation.label_embeddings.concatenate_annotation_files(annotation_src, appendix='.txt', acodet_annotations=False, start_col_name='start', end_col_name='end', lab_col_name='label')[source]

bacpipe.embedding_evaluation.label_embeddings.create_Raven_annotation_table(df, label_column, high_freq=1000)[source]

bacpipe.embedding_evaluation.label_embeddings.create_default_labels(audio_dir=None, model=None, paths=None, overwrite=True, **kwargs)[source]

Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.

Parameters:

audio_dir (str, optional) – path to audio data, by default None
model (str, optional) – model name, by default None
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
overwrite (bool, optional) – if True labels are overwritten, by default True

Returns:

dictionary with default labels

Return type:

dict

bacpipe.embedding_evaluation.label_embeddings.ensure_audio_files(found_audio_files, annotated_audio_files, audio_dir)[source]

bacpipe.embedding_evaluation.label_embeddings.fill_all_labels_array(file_labels, all_labels)[source]

bacpipe.embedding_evaluation.label_embeddings.filter_annotations_by_minimum_number_of_occurrences(df, min_occurrences=150, min_duration=0.65)[source]

Filter the annotations to have at least a minimum number of occurrences and a minimum duration.

Parameters:

df (pd.DataFrame) – DataFrame containing the annotations.
min_occurrences (int, optional) – Minimum number of occurrences for each label, by default 150.
min_duration (float, optional) – Minimum duration for each label, by default 0.65.

Returns:

Filtered DataFrame containing the annotations.

Return type:

pd.DataFrame
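The filtering logic can be sketched roughly as follows. This is a standard-library illustration on a list of dictionaries, not the pandas-based implementation in bacpipe; the function name filter_annotations, the order of the two steps, the inclusive comparisons and the label:species key are all assumptions.

```python
from collections import Counter


def filter_annotations(rows, min_occurrences=150, min_duration=0.65,
                       label_key="label:species"):
    """Keep rows that are long enough and whose label occurs often enough.

    Hypothetical helper; each row is a dict with "start", "end" and a label.
    """
    # Step 1 (assumed order): drop annotations shorter than min_duration.
    long_enough = [r for r in rows if r["end"] - r["start"] >= min_duration]
    # Step 2: count the remaining annotations per label.
    counts = Counter(r[label_key] for r in long_enough)
    # Step 3: keep only labels that reach min_occurrences.
    return [r for r in long_enough if counts[r[label_key]] >= min_occurrences]
```

With small thresholds such as min_occurrences=2, a label that appears only once after the duration filter is removed entirely.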

bacpipe.embedding_evaluation.label_embeddings.filter_df_by_filename(df_to_filer, file_name, file_name_column='audiofilename', model=None)[source]

bacpipe.embedding_evaluation.label_embeddings.fit_labels_to_embedding_timestamps(df, label_idx_dict, num_embeds, segment_s, label_column=None, single_label=True, min_annotation_length=0.65, **kwargs)[source]

bacpipe.embedding_evaluation.label_embeddings.get_default_labels(model_name, **kwargs)[source]

Return dictionary of the default labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input. The default labels are calculated based on the default labels specified in the settings.yaml file.

Parameters:

model_name (str) – model name

Returns:

dictionary of default labels

Return type:

dict

bacpipe.embedding_evaluation.label_embeddings.get_dim_reduc_path_func(model_name, dim_reduction_model='umap', **kwargs)[source]

bacpipe.embedding_evaluation.label_embeddings.get_dt_filename(file)[source]

Return the timestamp within a filename as a datetime object, based on the most common naming conventions in bioacoustics. This is not bulletproof, but it works with the vast majority of file naming conventions.

Parameters:

file (str) – filename as string

Returns:

datetime object of the filename

Return type:

dt.datetime
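To illustrate the idea of extracting a timestamp from a filename, here is a minimal sketch using a regular expression. It is not the implementation of get_dt_filename; the helper name parse_timestamp, the supported patterns (for example YYYYMMDD_HHMMSS and YYYY-MM-DDTHH-MM-SS) and the choice to return None on failure are assumptions.

```python
import datetime as dt
import re

# Assumed patterns: year, month, day, hour, minute, second with optional
# separators between the fields. Real recorders may use other layouts.
_PATTERN = re.compile(
    r"(\d{4})[-_]?(\d{2})[-_]?(\d{2})[T_-]?(\d{2})[-_:]?(\d{2})[-_:]?(\d{2})"
)


def parse_timestamp(file_name):
    """Return the first timestamp found in file_name, or None.

    Hypothetical helper for illustration, not a bacpipe function.
    """
    match = _PATTERN.search(file_name)
    if match is None:
        return None
    try:
        return dt.datetime(*(int(g) for g in match.groups()))
    except ValueError:
        # The digits matched but do not form a valid date and time.
        return None
```

A pattern-based approach like this fails on filenames that carry no timestamp or an unusual layout, which is consistent with the caveat in the description above.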

bacpipe.embedding_evaluation.label_embeddings.get_files_if_no_embeds(audio_dir, model, label_df=None)[source]

bacpipe.embedding_evaluation.label_embeddings.get_ground_truth(model_name)[source]

Return dictionary of the ground truth labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input.

Parameters:

model_name (str) – model name

Returns:

dictionary of ground truth labels

Return type:

dict

bacpipe.embedding_evaluation.label_embeddings.ground_truth_by_model(model, audio_dir, label_df=None, label_idx_dict=None, label_column='label:species', paths=None, annotations_filename='annotations.csv', overwrite=True, single_label=True, bool_filter_labels=False, **kwargs)[source]

Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input length. This way the embeddings and the ground truth labels have the same length and can be used for downstream evaluation such as probing or clustering. Both single-label and multi-label ground truth generation are supported. The result is a dictionary containing a numpy array of the labels and a dictionary that associates the int values with the corresponding label classes.

The labels are read from a single annotation file, which requires predefined column names: audiofilename, start, end, label:species (species can be replaced with another name, but the label: prefix needs to be kept). See ‘bacpipe/tests/test_data/annotations.csv’ for an example. After the ground truth has been processed, the dictionary is saved as a numpy file; on re-execution it is simply loaded, which shortens the runtime.

Parameters:

model (str) – model name
audio_dir (str) – path to audio data
label_df (pandas.DataFrame, optional) – ground truth annotations in specified format, by default None
label_idx_dict (dict, optional) – link between int values and class labels can be auto generated, by default None
label_column (str, optional) – name of column in annotation file, by default ‘label:species’
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
annotations_filename (str, optional) – path to annotations csv file, by default “annotations.csv”
overwrite (bool, optional) – If True, the dict will be generated again and saved rather than loaded from a file if already processed, by default True
single_label (bool, optional) – set False if you want multi-label, by default True
bool_filter_labels (bool, optional) – set to True, if you want a minimum number of occurrence for labels to be included in the ground truth. See settings file for more options and descriptions, by default False

Returns:

dictionary of ground truth labels with numpy array and dict to link int values to class labels

Return type:

dict

Raises:

ValueError – if ground truth file is not found
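The mapping of annotations onto model timestamps can be sketched for the single-label case as follows. This is an illustration under stated assumptions, not the bacpipe implementation: the function name labels_for_segments, the use of -1 for segments without a label, the rule of picking the annotation with the largest overlap, and the interpretation of min_annotation_length as a minimum overlap in seconds are all hypothetical.

```python
def labels_for_segments(annotations, label_idx_dict, num_embeds, segment_s,
                        min_annotation_length=0.65, no_label=-1):
    """Return one integer label per embedding segment (single-label case).

    annotations is a list of (start, end, label) tuples in seconds.
    Hypothetical helper for illustration, not a bacpipe function.
    """
    labels = [no_label] * num_embeds
    for i in range(num_embeds):
        # Segment i covers [i * segment_s, (i + 1) * segment_s).
        seg_start, seg_end = i * segment_s, (i + 1) * segment_s
        best_overlap = 0.0
        for start, end, label in annotations:
            overlap = min(end, seg_end) - max(start, seg_start)
            # Require a minimum overlap and keep the best-overlapping label.
            if overlap >= min_annotation_length and overlap > best_overlap:
                best_overlap = overlap
                labels[i] = label_idx_dict[label]
    return labels
```

Because the output has exactly num_embeds entries, it lines up one-to-one with the embeddings of a file, which is the property the description above relies on.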

bacpipe.embedding_evaluation.label_embeddings.load_labels_and_build_dict(paths, annotations_filename, audio_dir, bool_filter_labels=True, min_label_occurrences=150, main_label_column=None, testing=False, **kwargs)[source]

bacpipe.embedding_evaluation.label_embeddings.make_set_paths_func(audio_dir, main_results_dir=None, dim_reduc_parent_dir='dim_reduced_embeddings', testing=False, **kwargs)[source]

bacpipe.embedding_evaluation.label_embeddings.model_specific_embedding_path(path, model, dim_reduction_model=None, **kwargs)[source]

Get the path to the model specific embeddings. This function searches for the most recent directory containing the embeddings for the specified model and dimensionality reduction model.

Parameters:

path (Path) – Path to the main embeddings directory.
model (str) – Name of the model used for embedding.
dim_reduction_model (str, optional) – Name of the dimensionality reduction model used, by default None.
kwargs (dict) – Additional keyword arguments.

Returns:

Path to the model specific embeddings directory.

Return type:

Path

Raises:

ValueError – If no embeddings are found for the specified model.
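A search for the most recent matching directory might look like the sketch below. It is not the bacpipe implementation; the helper name most_recent_model_dir and, in particular, the assumption that directory names begin with a sortable timestamp and contain the model name are hypothetical.

```python
from pathlib import Path


def most_recent_model_dir(main_dir, model):
    """Return the newest sub-directory of main_dir whose name contains model.

    Hypothetical helper for illustration, not a bacpipe function.
    """
    candidates = [
        d for d in Path(main_dir).iterdir() if d.is_dir() and model in d.name
    ]
    if not candidates:
        raise ValueError(f"No embeddings found for model {model!r} in {main_dir}")
    # Assumption: names start with a sortable timestamp, so the
    # lexicographically largest name is the most recent directory.
    return max(candidates, key=lambda d: d.name)
```

If directory names did not encode a timestamp, sorting by modification time would be an alternative, at the cost of depending on file system metadata.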

bacpipe.embedding_evaluation.label_embeddings.raven_tables_sanity_check(embed_timestamps, segment_s, paths, audio_file, label_df, label_idx_dict, label_column, file_labels, **kwargs)[source]
