bacpipe.embedding_evaluation.label_embeddings
Functions
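build_ground_truth_labels_by_file |
collect_ground_truth_labels |
concatenate_annotation_files |
create_Raven_annotation_table |
create_default_labels | Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.
ensure_audio_files |
fill_all_labels_array |
filter_annotations_by_minimum_number_of_occurrences | Filter the annotations to have at least a minimum number of occurrences and a minimum duration.
filter_df_by_filename |
fit_labels_to_embedding_timestamps |
get_default_labels | Return dictionary of the default labels based on the files that were already processed and saved.
get_dim_reduc_path_func |
get_dt_filename | Return the timestamp within a filename as a datetime object based on the most common naming conventions in bioacoustics.
get_duration | Compute the duration (in seconds) of an audio time series, feature matrix, or filename.
get_files_if_no_embeds |
get_ground_truth | Return dictionary of the ground truth labels based on the files that were already processed and saved.
ground_truth_by_model | Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input lengths.
import_module | Import a module.
load_labels_and_build_dict |
make_set_paths_func |
model_specific_embedding_path | Get the path to the model specific embeddings.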
Classes
DefaultLabels |
Path | PurePath subclass that can make system calls.
SimpleNamespace | A simple attribute-based namespace.
tqdm | Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.
- class bacpipe.embedding_evaluation.label_embeddings.DefaultLabels(paths, model, default_label_keys, **kwargs)[source]
Bases: object
- __init__(paths, model, default_label_keys, **kwargs)[source]
Class to generate default labels based on audio files and number of generated embeddings per file.
- Parameters:
paths (SimpleNamespace) – convenient object for path handling
model (str) – model name
default_label_keys (list) – list of default labels, see settings.yaml
- Raises:
ValueError – if no embeddings were found
- bacpipe.embedding_evaluation.label_embeddings.build_ground_truth_labels_by_file(paths, ind, model, num_embeds, segment_s, metadata, all_labels, label_df=None, label_idx_dict=None, label_column=None, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.collect_ground_truth_labels(paths, files, model, segment_s, metadata, label_df, label_idx_dict, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.concatenate_annotation_files(annotation_src, appendix='.txt', acodet_annotations=False, start_col_name='start', end_col_name='end', lab_col_name='label')[source]
- bacpipe.embedding_evaluation.label_embeddings.create_Raven_annotation_table(df, label_column, high_freq=1000)[source]
- bacpipe.embedding_evaluation.label_embeddings.create_default_labels(audio_dir=None, model=None, paths=None, overwrite=True, **kwargs)[source]
Create default labels based on audio files and model timestamps to match the number of embeddings created per file for visualization and clustering purposes.
- Parameters:
audio_dir (str, optional) – path to audio data, by default None
model (str, optional) – model name, by default None
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
overwrite (bool, optional) – if True, existing labels are overwritten, by default True
- Returns:
dictionary with default labels
- Return type:
dict
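The idea of matching labels to the number of embeddings per file can be sketched as follows. This is a minimal illustration and not bacpipe's implementation: the function name `default_labels_sketch`, the label keys `audio_file_name` and `parent_directory`, and the input mapping are all hypothetical (the real keys come from settings.yaml).

```python
from pathlib import Path


def default_labels_sketch(embeds_per_file):
    """Repeat file-level labels once per embedding of each file.

    `embeds_per_file` maps an audio file path to the number of embeddings
    generated for it. Both label keys below are hypothetical.
    """
    labels = {"audio_file_name": [], "parent_directory": []}
    for file, num_embeds in embeds_per_file.items():
        path = Path(file)
        # Every embedding of a file receives the same file-level labels.
        labels["audio_file_name"].extend([path.stem] * num_embeds)
        labels["parent_directory"].extend([path.parent.name] * num_embeds)
    return labels


# Made-up input: two files with 3 and 2 embeddings respectively.
embeds_per_file = {"site_a/rec_01.wav": 3, "site_b/rec_02.wav": 2}
labels = default_labels_sketch(embeds_per_file)
```

Each label list ends up with one entry per embedding, so it can be used directly to color points in a visualization or to compare against cluster assignments.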
- bacpipe.embedding_evaluation.label_embeddings.ensure_audio_files(found_audio_files, annotated_audio_files, audio_dir)[source]
- bacpipe.embedding_evaluation.label_embeddings.fill_all_labels_array(file_labels, all_labels)[source]
- bacpipe.embedding_evaluation.label_embeddings.filter_annotations_by_minimum_number_of_occurrences(df, min_occurrences=150, min_duration=0.65)[source]
Filter the annotations to have at least a minimum number of occurrences and a minimum duration.
- Parameters:
df (pd.DataFrame) – DataFrame containing the annotations.
min_occurrences (int, optional) – Minimum number of occurrences for each label, by default 150.
min_duration (float, optional) – Minimum duration for each label, by default 0.65.
- Returns:
Filtered DataFrame containing the annotations.
- Return type:
pd.DataFrame
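A minimal sketch of this kind of filtering, using plain Python records instead of a DataFrame so that it runs without extra dependencies. The function name `filter_annotations`, the column names, the order of the two steps, and the interpretation of `min_duration` as `end - start` are assumptions, not taken from the bacpipe source.

```python
from collections import Counter


def filter_annotations(rows, min_occurrences=150, min_duration=0.65):
    """Keep annotations that are long enough and whose label is frequent enough."""
    # Step 1 (assumed order): drop annotations shorter than min_duration.
    long_enough = [r for r in rows if r["end"] - r["start"] >= min_duration]
    # Step 2: count the remaining annotations per label.
    counts = Counter(r["label"] for r in long_enough)
    # Step 3: keep only labels that occur at least min_occurrences times.
    return [r for r in long_enough if counts[r["label"]] >= min_occurrences]


# Made-up annotations; thresholds lowered so the effect is visible.
rows = [
    {"start": 0.0, "end": 1.0, "label": "a"},
    {"start": 2.0, "end": 3.5, "label": "a"},
    {"start": 4.0, "end": 4.2, "label": "a"},  # shorter than min_duration
    {"start": 0.0, "end": 2.0, "label": "b"},  # label occurs only once
]
filtered = filter_annotations(rows, min_occurrences=2, min_duration=0.65)
```

With the default thresholds, rare classes are removed entirely, which can matter for downstream probing where each class needs enough examples.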
- bacpipe.embedding_evaluation.label_embeddings.filter_df_by_filename(df_to_filer, file_name, file_name_column='audiofilename', model=None)[source]
- bacpipe.embedding_evaluation.label_embeddings.fit_labels_to_embedding_timestamps(df, label_idx_dict, num_embeds, segment_s, label_column=None, single_label=True, min_annotation_length=0.65, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_default_labels(model_name, **kwargs)[source]
Return a dictionary of the default labels for the files that were already processed and saved. The result is model dependent because the input length is model dependent, so this function requires a model name as input. The labels are computed from the default labels specified in the settings.yaml file.
- Parameters:
model_name (str) – model name
- Returns:
dictionary of default labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.get_dim_reduc_path_func(model_name, dim_reduction_model='umap', **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_dt_filename(file)[source]
Return the timestamp within a filename as a datetime object based on the most common naming conventions in bioacoustics. This is not bulletproof, but it works with the vast majority of file naming conventions.
- Parameters:
file (str) – filename as string
- Returns:
timestamp parsed from the filename
- Return type:
dt.datetime
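One way such filename parsing could work is to try a short list of regular expressions in turn. The sketch below is an illustration under stated assumptions: the function name `parse_filename_timestamp` and the three patterns are hypothetical, and the actual function may recognize different or additional conventions.

```python
import re
from datetime import datetime

# Hypothetical (regex, strptime format) pairs, tried in order.
PATTERNS = [
    (re.compile(r"(\d{8}_\d{6})"), "%Y%m%d_%H%M%S"),
    (re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})"), "%Y-%m-%dT%H-%M-%S"),
    (re.compile(r"(\d{14})"), "%Y%m%d%H%M%S"),
]


def parse_filename_timestamp(file):
    """Return the first timestamp found in `file`, or None if nothing matches."""
    for regex, fmt in PATTERNS:
        match = regex.search(file)
        if match is None:
            continue
        try:
            return datetime.strptime(match.group(1), fmt)
        except ValueError:
            # Digits matched the pattern but are not a valid date; try the next one.
            continue
    return None


parsed = parse_filename_timestamp("SITE01_20230615_043000.wav")
```

Returning None for unmatched names is a choice made for this sketch; how the real function handles filenames without a recognizable timestamp is not stated here.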
- bacpipe.embedding_evaluation.label_embeddings.get_files_if_no_embeds(audio_dir, model, label_df=None)[source]
- bacpipe.embedding_evaluation.label_embeddings.get_ground_truth(model_name)[source]
Return a dictionary of the ground truth labels for the files that were already processed and saved. The result is model dependent because the input length is model dependent, so this function requires a model name as input.
- Parameters:
model_name (str) – model name
- Returns:
dictionary of ground truth labels
- Return type:
dict
- bacpipe.embedding_evaluation.label_embeddings.ground_truth_by_model(model, audio_dir, label_df=None, label_idx_dict=None, label_column='label:species', paths=None, annotations_filename='annotations.csv', overwrite=True, single_label=True, bool_filter_labels=False, **kwargs)[source]
Generate ground truth labels that are mapped onto the timestamps of a model, based on the model-specific input length. The embeddings and the ground truth labels then have the same length and can be used for downstream evaluation such as probing or clustering. Both single-label and multi-label ground truth are supported. The result is a dictionary containing a numpy array with the labels and a dictionary that associates the int values with the corresponding label classes.
The labels are read from a single annotation file, which requires the predefined column names audiofilename, start, end and label:species (species can be replaced by another name, but the label: prefix must be kept). See ‘bacpipe/tests/test_data/annotations.csv’ for an example.
After processing, the dictionary is saved as a numpy file; on re-execution it is loaded from that file to shorten the runtime.
- Parameters:
model (str) – model name
audio_dir (str) – path to audio data
label_df (pandas.DataFrame, optional) – ground truth annotations in specified format, by default None
label_idx_dict (dict, optional) – link between int values and class labels; can be generated automatically, by default None
label_column (str, optional) – name of column in annotation file, by default ‘label:species’
paths (SimpleNamespace, optional) – convenient object for path handling, by default None
annotations_filename (str, optional) – path to the annotations CSV file, by default “annotations.csv”
overwrite (bool, optional) – If True, the dict will be generated again and saved rather than loaded from a file if already processed, by default True
single_label (bool, optional) – set False if you want multi-label, by default True
bool_filter_labels (bool, optional) – set to True if labels should need a minimum number of occurrences to be included in the ground truth. See the settings file for more options and descriptions, by default False
- Returns:
dictionary of ground truth labels with numpy array and dict to link int values to class labels
- Return type:
dict
- Raises:
ValueError – if the ground truth file is not found
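The mapping of annotations onto fixed-length model segments can be sketched as follows for the single-label case. This is a simplified illustration, not bacpipe's code: the helper names `read_annotations` and `labels_per_segment`, the use of -1 for unlabeled segments, and the rule that the longest sufficient overlap wins are assumptions. Only the column names and the parameter names `label_idx_dict`, `num_embeds`, `segment_s`, and `min_annotation_length` come from the signatures in this module.

```python
import csv
import io

# Annotation table in the required column layout (values are made up).
CSV_TEXT = """audiofilename,start,end,label:species
rec_01.wav,0.5,2.5,frog
rec_01.wav,3.2,3.5,bird
rec_01.wav,4.0,6.0,bird
rec_02.wav,0.0,1.0,frog
"""


def read_annotations(text, file_name, label_column="label:species"):
    """Return (start, end, label) tuples for one audio file."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        (float(r["start"]), float(r["end"]), r[label_column])
        for r in rows
        if r["audiofilename"] == file_name
    ]


def labels_per_segment(annotations, label_idx_dict, num_embeds, segment_s,
                       min_annotation_length=0.65):
    """Return one int label per embedding; -1 marks unlabeled segments."""
    labels = [-1] * num_embeds
    for i in range(num_embeds):
        seg_start, seg_end = i * segment_s, (i + 1) * segment_s
        best_overlap = 0.0
        for start, end, name in annotations:
            overlap = min(end, seg_end) - max(start, seg_start)
            # Assumed rule: the overlap must reach min_annotation_length,
            # and the annotation with the longest overlap wins.
            if overlap >= min_annotation_length and overlap > best_overlap:
                best_overlap = overlap
                labels[i] = label_idx_dict[name]
    return labels


annotations = read_annotations(CSV_TEXT, "rec_01.wav")
# Auto-generate the label-to-int mapping from the labels that occur.
label_idx_dict = {name: i for i, name in enumerate(sorted({a[2] for a in annotations}))}
segment_labels = labels_per_segment(annotations, label_idx_dict, num_embeds=3, segment_s=2.0)
```

With these made-up values and 2-second segments, the first segment is labeled frog, the second stays unlabeled because neither overlap reaches 0.65, and the third is labeled bird. The resulting list has exactly `num_embeds` entries, which is the property that lets labels and embeddings be paired one to one.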
- bacpipe.embedding_evaluation.label_embeddings.load_labels_and_build_dict(paths, annotations_filename, audio_dir, bool_filter_labels=True, min_label_occurrences=150, main_label_column=None, testing=False, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.make_set_paths_func(audio_dir, main_results_dir=None, dim_reduc_parent_dir='dim_reduced_embeddings', testing=False, **kwargs)[source]
- bacpipe.embedding_evaluation.label_embeddings.model_specific_embedding_path(path, model, dim_reduction_model=None, **kwargs)[source]
Get the path to the model specific embeddings. This function searches for the most recent directory containing the embeddings for the specified model and dimensionality reduction model.
- Parameters:
path (Path) – Path to the main embeddings directory.
model (str) – Name of the model used for embedding.
dim_reduction_model (str, optional) – Name of the dimensionality reduction model used, by default None.
kwargs (dict) – Additional keyword arguments.
- Returns:
Path to the model specific embeddings directory.
- Return type:
Path
- Raises:
ValueError – If no embeddings are found for the specified model.
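A search for the most recent matching directory could look like the sketch below. It is an illustration only: the function name `latest_embedding_dir`, the directory naming scheme (a sortable timestamp followed by the model name), and the model names in the demo are assumptions, not bacpipe's actual layout.

```python
import tempfile
from pathlib import Path


def latest_embedding_dir(path, model, dim_reduction_model=None):
    """Return the most recent directory under `path` that matches `model`."""
    candidates = [
        d for d in Path(path).iterdir()
        if d.is_dir()
        and model in d.name
        and (dim_reduction_model is None or dim_reduction_model in d.name)
    ]
    if not candidates:
        raise ValueError(f"No embeddings found for model {model!r} in {path}")
    # Assumption: names start with a sortable timestamp, so the
    # lexicographically largest name belongs to the most recent run.
    return max(candidates, key=lambda d: d.name)


# Demo on a throwaway directory tree with made-up directory names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2024-01-05_10-00___birdnet",
                 "2024-03-01_09-00___birdnet",
                 "2024-04-01_12-00___perch"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_embedding_dir(tmp, "birdnet").name
    try:
        latest_embedding_dir(tmp, "missing_model")
        raised = False
    except ValueError:
        raised = True
```

Sorting by name keeps the sketch deterministic; sorting by modification time would be an alternative if directory names carry no timestamp.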