bacpipe.embedding_evaluation.label_embeddings

Functions

assign_global_get_paths_function(audio_dir)

build_ground_truth_labels_by_file(paths, ...)

collect_ground_truth_labels(paths, files, ...)

concatenate_annotation_files(annotation_src)

create_Raven_annotation_table(df, label_column)

create_default_labels([audio_dir, model, ...])

Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.

ensure_audio_files(found_audio_files, ...)

fill_all_labels_array(file_labels, all_labels)

filter_annotations_by_minimum_number_of_occurrences(df)

Filter the annotations so that each remaining label has at least a minimum number of occurrences and a minimum duration.

filter_df_by_filename(df_to_filer, file_name)

fit_labels_to_embedding_timestamps(df, ...)

get_default_labels(model_name, **kwargs)

Return dictionary of the default labels based on the files that were already processed and saved.

get_dim_reduc_path_func(model_name[, ...])

get_dt_filename(file)

Return the timestamp within a filename as a datetime object based on the most common naming conventions in bioacoustics.

get_duration(*[, y, sr, S, n_fft, ...])

Compute the duration (in seconds) of an audio time series, feature matrix, or filename.

get_files_if_no_embeds(audio_dir, model[, ...])

get_ground_truth(model_name)

Return dictionary of the ground truth labels based on the files that were already processed and saved.

ground_truth_by_model(model, audio_dir[, ...])

Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input lengths.

import_module(name[, package])

Import a module.

load_labels_and_build_dict(paths, ...[, ...])

make_set_paths_func(audio_dir[, ...])

model_specific_embedding_path(path, model[, ...])

Get the path to the model specific embeddings.

raven_tables_sanity_check(embed_timestamps, ...)

Classes

DefaultLabels(paths, model, ...)

Path(*args, **kwargs)

PurePath subclass that can make system calls.

SimpleNamespace

A simple attribute-based namespace.

tqdm(*_, **__)

Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.

class bacpipe.embedding_evaluation.label_embeddings.DefaultLabels(paths, model, default_label_keys, **kwargs)[source]

Bases: object

__init__(paths, model, default_label_keys, **kwargs)[source]

Class to generate default labels based on audio files and number of generated embeddings per file.

Parameters:
  • paths (SimpleNamespace) – convenient object for path handling

  • model (str) – model name

  • default_label_keys (list) – list of default labels, see settings.yaml

Raises:

ValueError – if no embeddings were found

audio_file_name()[source]
continuous_timestamp()[source]
day_of_year()[source]
default_classifier()[source]
fill_remaining_labels(df)[source]
generate()[source]
get_datetimes()[source]
parent_directory()[source]
time_of_day()[source]
bacpipe.embedding_evaluation.label_embeddings.assign_global_get_paths_function(audio_dir)[source]
bacpipe.embedding_evaluation.label_embeddings.build_ground_truth_labels_by_file(paths, ind, model, num_embeds, segment_s, metadata, all_labels, label_df=None, label_idx_dict=None, label_column=None, **kwargs)[source]
bacpipe.embedding_evaluation.label_embeddings.collect_ground_truth_labels(paths, files, model, segment_s, metadata, label_df, label_idx_dict, **kwargs)[source]
bacpipe.embedding_evaluation.label_embeddings.concatenate_annotation_files(annotation_src, appendix='.txt', acodet_annotations=False, start_col_name='start', end_col_name='end', lab_col_name='label')[source]
bacpipe.embedding_evaluation.label_embeddings.create_Raven_annotation_table(df, label_column, high_freq=1000)[source]
bacpipe.embedding_evaluation.label_embeddings.create_default_labels(audio_dir=None, model=None, paths=None, overwrite=True, **kwargs)[source]

Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.

Parameters:
  • audio_dir (str, optional) – path to audio data, by default None

  • model (str, optional) – model name, by default None

  • paths (SimpleNamespace, optional) – convenient object for path handling, by default None

  • overwrite (bool, optional) – if True, existing labels are overwritten, by default True

Returns:

dictionary with default labels

Return type:

dict
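The idea of producing one default label per embedding can be illustrated with a minimal, standalone sketch. It does not call bacpipe: the function sketch_default_labels, its arguments and the label formats are assumptions made for illustration, and only the key names mirror the DefaultLabels methods listed above.

```python
from datetime import datetime, timedelta
from pathlib import Path


def sketch_default_labels(audio_file, file_start, num_embeds, segment_s):
    """Build one label per embedding for a single audio file (illustrative only)."""
    labels = {
        "audio_file_name": [],
        "parent_directory": [],
        "time_of_day": [],
        "day_of_year": [],
        "continuous_timestamp": [],
    }
    path = Path(audio_file)
    for i in range(num_embeds):
        # Assumption: embedding i starts i * segment_s seconds into the file.
        t = file_start + timedelta(seconds=i * segment_s)
        labels["audio_file_name"].append(path.stem)
        labels["parent_directory"].append(path.parent.name)
        labels["time_of_day"].append(t.strftime("%H:%M:%S"))
        labels["day_of_year"].append(t.timetuple().tm_yday)
        labels["continuous_timestamp"].append(t.isoformat())
    return labels


# Hypothetical recording that starts two seconds before midnight.
labels = sketch_default_labels(
    "site_A/rec_20230615_235958.wav",
    file_start=datetime(2023, 6, 15, 23, 59, 58),
    num_embeds=3,
    segment_s=3.0,
)
```

Every list has as many entries as there are embeddings, which is what allows the labels to be paired with the embeddings for visualization and clustering.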

bacpipe.embedding_evaluation.label_embeddings.ensure_audio_files(found_audio_files, annotated_audio_files, audio_dir)[source]
bacpipe.embedding_evaluation.label_embeddings.fill_all_labels_array(file_labels, all_labels)[source]
bacpipe.embedding_evaluation.label_embeddings.filter_annotations_by_minimum_number_of_occurrences(df, min_occurrences=150, min_duration=0.65)[source]

Filter the annotations so that each remaining label has at least a minimum number of occurrences and a minimum duration.

Parameters:
  • df (pd.DataFrame) – DataFrame containing the annotations.

  • min_occurrences (int, optional) – Minimum number of occurrences for each label, by default 150.

  • min_duration (float, optional) – Minimum duration for each label, by default 0.65.

Returns:

Filtered DataFrame containing the annotations.

Return type:

pd.DataFrame
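As a rough illustration of this kind of filtering, the sketch below uses plain dictionaries instead of a pd.DataFrame so that it runs without dependencies. The order of the two steps (duration first, then occurrence count), the interpretation of duration as end minus start, and the names sketch_filter_annotations and rows are assumptions, not the library's implementation.

```python
from collections import Counter


def sketch_filter_annotations(rows, min_occurrences=150, min_duration=0.65):
    """Keep annotations that are long enough and whose label is frequent enough."""
    # Assumption: duration is end - start, in the same unit as min_duration.
    long_enough = [r for r in rows if r["end"] - r["start"] >= min_duration]
    # Count how often each label remains after the duration filter.
    counts = Counter(r["label"] for r in long_enough)
    return [r for r in long_enough if counts[r["label"]] >= min_occurrences]


# Hypothetical annotations; a small threshold keeps the example short.
rows = [
    {"label": "robin", "start": 0.0, "end": 1.0},
    {"label": "robin", "start": 2.0, "end": 3.0},
    {"label": "robin", "start": 4.0, "end": 4.5},      # too short
    {"label": "blackbird", "start": 0.0, "end": 2.0},  # too rare
]
kept = sketch_filter_annotations(rows, min_occurrences=2)
```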

bacpipe.embedding_evaluation.label_embeddings.filter_df_by_filename(df_to_filer, file_name, file_name_column='audiofilename', model=None)[source]
bacpipe.embedding_evaluation.label_embeddings.fit_labels_to_embedding_timestamps(df, label_idx_dict, num_embeds, segment_s, label_column=None, single_label=True, min_annotation_length=0.65, **kwargs)[source]
bacpipe.embedding_evaluation.label_embeddings.get_default_labels(model_name, **kwargs)[source]

Return dictionary of the default labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input. The default labels are calculated based on the default labels specified in the settings.yaml file.

Parameters:

model_name (str) – model name

Returns:

dictionary of default labels

Return type:

dict

bacpipe.embedding_evaluation.label_embeddings.get_dim_reduc_path_func(model_name, dim_reduction_model='umap', **kwargs)[source]
bacpipe.embedding_evaluation.label_embeddings.get_dt_filename(file)[source]

Return the timestamp within a filename as a datetime object based on the most common naming conventions in bioacoustics. This is not bulletproof, but it works with the vast majority of file naming conventions.

Parameters:

file (str) – filename as string

Returns:

datetime object of the filename

Return type:

dt.datetime object
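One way such filename parsing could work is sketched below. The two patterns, the helper name sketch_parse_timestamp and the choice to return None when nothing matches are assumptions for illustration; the patterns actually supported by get_dt_filename are not listed here.

```python
import re
from datetime import datetime

# Assumed naming conventions: YYYYMMDD_HHMMSS and YYYY-MM-DD_HH-MM-SS.
_PATTERNS = [
    (re.compile(r"(\d{8})[_T-](\d{6})"), "%Y%m%d %H%M%S"),
    (re.compile(r"(\d{4}-\d{2}-\d{2})[_T](\d{2}-\d{2}-\d{2})"), "%Y-%m-%d %H-%M-%S"),
]


def sketch_parse_timestamp(file):
    """Return the first timestamp found in a filename, or None if none matches."""
    for pattern, fmt in _PATTERNS:
        match = pattern.search(file)
        if match is None:
            continue
        try:
            # Join date and time with a space so they parse unambiguously.
            return datetime.strptime(" ".join(match.groups()), fmt)
        except ValueError:
            # Digits matched the pattern but are not a valid date or time.
            continue
    return None


compact = sketch_parse_timestamp("rec_20230615_235958.wav")
dashed = sketch_parse_timestamp("site_2023-06-15_23-59-58.wav")
missing = sketch_parse_timestamp("noise.wav")
```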

bacpipe.embedding_evaluation.label_embeddings.get_files_if_no_embeds(audio_dir, model, label_df=None)[source]
bacpipe.embedding_evaluation.label_embeddings.get_ground_truth(model_name)[source]

Return dictionary of the ground truth labels based on the files that were already processed and saved. This is model dependent, as the input length is model dependent and therefore this function requires a model name as input.

Parameters:

model_name (str) – model name

Returns:

dictionary of ground truth labels

Return type:

dict

bacpipe.embedding_evaluation.label_embeddings.ground_truth_by_model(model, audio_dir, label_df=None, label_idx_dict=None, label_column='label:species', paths=None, annotations_filename='annotations.csv', overwrite=True, single_label=True, bool_filter_labels=False, **kwargs)[source]

Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input lengths. This way the embeddings and the ground truth labels have the same length and can be used for downstream evaluation such as probing or clustering. This function supports single-label or multi-label generation of ground truth labels. A dictionary is created containing a numpy array of the labels and a dictionary that associates the int values with the corresponding label classes. The labels are processed from a single annotation file, which requires predefined column names: audiofilename, start, end, label:species (species can be replaced with another name, but the label: prefix must be kept). See ‘bacpipe/tests/test_data/annotations.csv’ for an example. After the ground truth has been processed, the dictionary is saved as a numpy file and, upon re-execution, is simply loaded to shorten the runtime.

Parameters:
  • model (str) – model name

  • audio_dir (str) – path to audio data

  • label_df (pandas.DataFrame, optional) – ground truth annotations in specified format, by default None

  • label_idx_dict (dict, optional) – mapping between int values and class labels, which can be auto-generated, by default None

  • label_column (str, optional) – name of column in annotation file, by default ‘label:species’

  • paths (SimpleNamespace, optional) – convenient object for path handling, by default None

  • annotations_filename (str, optional) – path to annotations csv file, by default “annotations.csv”

  • overwrite (bool, optional) – If True, the dict will be generated again and saved rather than loaded from a file if already processed, by default True

  • single_label (bool, optional) – set False if you want multi-label, by default True

  • bool_filter_labels (bool, optional) – set to True if you want labels to require a minimum number of occurrences to be included in the ground truth. See the settings file for more options and descriptions, by default False

Returns:

dictionary of ground truth labels with numpy array and dict to link int values to class labels

Return type:

dict

Raises:

ValueError – if ground truth file is not found
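The mapping of annotations onto model timestamps can be pictured with the following standalone sketch for the single-label case. It uses the column names described above, but everything else is an assumption for illustration: the helper sketch_labels_for_file, the use of -1 for windows without a label, the rule that an annotation must overlap a window by at least min_annotation_length, and the plain list standing in for the numpy array.

```python
import csv
import io

# Hypothetical annotation file in the required column layout.
ANNOTATIONS_CSV = """audiofilename,start,end,label:species
rec_01.wav,0.5,2.5,blackbird
rec_01.wav,6.2,8.9,robin
rec_01.wav,7.0,7.3,blackbird
"""


def sketch_labels_for_file(rows, num_embeds, segment_s, label_idx_dict,
                           label_column="label:species",
                           min_annotation_length=0.65):
    """Return one int label per embedding window; -1 marks unlabeled windows."""
    labels = [-1] * num_embeds
    for row in rows:
        start, end = float(row["start"]), float(row["end"])
        for i in range(num_embeds):
            win_start, win_end = i * segment_s, (i + 1) * segment_s
            # Length of the intersection of annotation and window (may be < 0).
            overlap = min(end, win_end) - max(start, win_start)
            # Single-label assumption: the first sufficient annotation wins.
            if overlap >= min_annotation_length and labels[i] == -1:
                labels[i] = label_idx_dict[row[label_column]]
    return labels


rows = list(csv.DictReader(io.StringIO(ANNOTATIONS_CSV)))
classes = sorted({r["label:species"] for r in rows})
label_idx_dict = {name: idx for idx, name in enumerate(classes)}
labels = sketch_labels_for_file(rows, num_embeds=3, segment_s=3.0,
                                label_idx_dict=label_idx_dict)
```

With three embeddings of 3 seconds each, the result has exactly three entries, so labels and embeddings line up one to one for probing or clustering.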

bacpipe.embedding_evaluation.label_embeddings.load_labels_and_build_dict(paths, annotations_filename, audio_dir, bool_filter_labels=True, min_label_occurrences=150, main_label_column=None, testing=False, **kwargs)[source]
bacpipe.embedding_evaluation.label_embeddings.make_set_paths_func(audio_dir, main_results_dir=None, dim_reduc_parent_dir='dim_reduced_embeddings', testing=False, **kwargs)[source]
bacpipe.embedding_evaluation.label_embeddings.model_specific_embedding_path(path, model, dim_reduction_model=None, **kwargs)[source]

Get the path to the model specific embeddings. This function searches for the most recent directory containing the embeddings for the specified model and dimensionality reduction model.

Parameters:
  • path (Path) – Path to the main embeddings directory.

  • model (str) – Name of the model used for embedding.

  • dim_reduction_model (str, optional) – Name of the dimensionality reduction model used, by default None.

  • kwargs (dict) – Additional keyword arguments.

Returns:

Path to the model specific embeddings directory.

Return type:

Path

Raises:

ValueError – If no embeddings are found for the specified model.
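A search for the most recent matching directory could look roughly like the sketch below. The directory naming scheme (a sortable timestamp prefix followed by the model name), the substring matching and the helper name sketch_latest_embedding_dir are assumptions for illustration and may differ from how bacpipe names and selects its directories.

```python
from pathlib import Path


def sketch_latest_embedding_dir(path, model, dim_reduction_model=None):
    """Return the newest subdirectory of path whose name mentions the model."""
    path = Path(path)
    candidates = [
        d for d in path.iterdir()
        if d.is_dir()
        and model in d.name
        and (dim_reduction_model is None or dim_reduction_model in d.name)
    ]
    if not candidates:
        raise ValueError(f"No embeddings found for model {model!r} in {path}")
    # Assumption: names start with a sortable timestamp, so the
    # lexicographically largest name is the most recent run.
    return max(candidates, key=lambda d: d.name)
```

Calling the helper with a model name that matches no directory raises ValueError, in line with the behaviour documented above.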

bacpipe.embedding_evaluation.label_embeddings.raven_tables_sanity_check(embed_timestamps, segment_s, paths, audio_file, label_df, label_idx_dict, label_column, file_labels, **kwargs)[source]