utils

Module to handle all utility functions for training, testing and evaluation of a model.

IMAGERY_CONFIG_PATH

Path to the imagery config YAML file.

Type:

str | Sequence[str]

DATA_CONFIG_PATH

Path to the data config YAML file.

Type:

str | Sequence[str]

IMAGERY_CONFIG

Config defining the properties of the imagery used in the experiment.

Type:

dict[str, Any]

DATA_CONFIG

Config defining the properties of the data used in the experiment.

Type:

dict[str, Any]

DATA_DIR

Path to directory holding dataset.

Type:

str

CACHE_DIR

Path to cache directory.

Type:

str

RESULTS_DIR

Path to directory to output plots to.

Type:

str

BAND_IDS

Band IDs and position in sample image.

Type:

list[int] | tuple[int, …] | dict[str, Any]

IMAGE_SIZE

Defines the shape of the images.

Type:

int | tuple[int, int] | list[int]

CLASSES

Mapping of class labels to class names.

Type:

dict[str, Any]

CMAP_DICT

Mapping of class labels to colours.

Type:

dict[str, Any]

WGS84

WGS84 co-ordinate reference system acting as a default CRS for transformations.

Type:

CRS

batch_flatten(x: Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]) ndarray[Any, dtype[Any]]

Flattens the supplied array with numpy.ndarray.flatten().

Parameters:

x (ArrayLike) – Array to be flattened.

Returns:

Flattened ndarray.

Return type:

ndarray[Any]

calc_norm_euc_dist(a: Tensor, b: Tensor) Tensor

Calculates the normalised Euclidean distance between two vectors.

Parameters:
  • a (Tensor) – Vector A.

  • b (Tensor) – Vector B.

Returns:

Normalised Euclidean distance between vectors A and B.

Return type:

Tensor

check_dict_key(dictionary: dict[Any, Any], key: Any) bool

Checks that a key exists in a dictionary and that its value is neither None nor False.

Parameters:
  • dictionary (dict[Any, Any]) – Dictionary to check key for.

  • key (Any) – Key to be checked.

Returns:

True if key exists and its value is neither None nor False. False otherwise.

Return type:

bool
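
The behaviour described above can be sketched roughly as follows. This is an illustrative stand-in, not the library's implementation; the name check_dict_key_sketch is hypothetical, and the handling of other falsy values such as 0 is an assumption.

```python
from typing import Any


def check_dict_key_sketch(dictionary: dict[Any, Any], key: Any) -> bool:
    """Hypothetical stand-in: True only if key exists and its value is not None or False."""
    if key not in dictionary:
        return False
    value = dictionary[key]
    # Assumption: only None and False count as "unset"; 0 or "" are kept.
    return value is not None and value is not False
```

A typical use would be guarding an optional config entry, e.g. `if check_dict_key_sketch(params, "plot"): ...`.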

check_len(param: Any, comparator: Any) Any | Sequence[Any]

Checks the length of one object against a comparator object.

Parameters:
  • param (Any) – Object to have length checked.

  • comparator (Any) – Object to compare length of param to.

Returns:

  • param if the length of param equals the length of comparator,

  • or a list of param[0] repeated to the length of comparator if the lengths differ,

  • or a list of param repeated to the length of comparator if param does not have __len__.

Return type:

Any | Sequence[Any]
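
The three return cases can be illustrated with a small sketch. The function name check_len_sketch is hypothetical, and it assumes comparator always supports len(); the library's own code may differ.

```python
from typing import Any


def check_len_sketch(param: Any, comparator: Any) -> Any:
    """Hypothetical stand-in mirroring the three documented return cases."""
    if hasattr(param, "__len__"):
        if len(param) == len(comparator):
            # Lengths already match: return param unchanged.
            return param
        # Lengths differ: repeat the first element to match comparator.
        return [param[0]] * len(comparator)
    # No __len__ (e.g. a scalar): repeat param itself.
    return [param] * len(comparator)
```

For example, a single learning rate passed as a scalar would be broadcast to one entry per item in comparator.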

check_optional_import_exist(package: str) bool

Checks if a package is installed. Useful for optional dependencies.

Parameters:

package (str) – Name of the package to check if installed.

Returns:

True if package installed, False if not.

Return type:

bool
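
One common way to make such a check, shown here only as a sketch, is to ask the import system whether it can locate the package without importing it. The name package_installed_sketch is hypothetical and the library may use a different mechanism.

```python
import importlib.util


def package_installed_sketch(package: str) -> bool:
    """Hypothetical stand-in: True if a top-level package can be located."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(package) is not None
```

This lets optional features be switched on or off at import time without wrapping every import in try/except.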

check_substrings_in_string(string: str, *substrings, all_true: bool = False) bool

Checks if either any or all substrings are in the provided string.

Parameters:
  • string (str) – String to check for substrings in.

  • substrings (str | tuple(str, ...)) – Substrings to check for in string.

  • all_true (bool) – Optional; Only returns True if all substrings are in string. Defaults to False.

Returns:

True if any substring is in string when all_true==False. True only if all substrings are in string when all_true==True. False otherwise.

Return type:

bool
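
The any/all switch described above can be sketched in a few lines. The name check_substrings_sketch is hypothetical; this is an illustration of the documented behaviour rather than the library's code.

```python
def check_substrings_sketch(string: str, *substrings: str, all_true: bool = False) -> bool:
    """Hypothetical stand-in: any-match by default, all-match if all_true is set."""
    matches = (sub in string for sub in substrings)
    return all(matches) if all_true else any(matches)
```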

check_test_empty(pred: Sequence[int] | ndarray[Any, dtype[int64]], labels: Sequence[int] | ndarray[Any, dtype[int64]], class_labels: dict[int, str] | None = None, p_dist: bool = True) tuple[ndarray[Any, dtype[int64]], ndarray[Any, dtype[int64]], dict[int, str]]

Checks if any of the classes in the dataset were not present in both the predictions and ground truth labels. Returns corrected and re-ordered predictions, labels and class labels.

Parameters:
  • pred (Sequence[int] | ndarray[int]) – List of predicted labels.

  • labels (Sequence[int] | ndarray[int]) – List of corresponding ground truth labels.

  • class_labels (dict[int, str]) – Optional; Dictionary mapping class labels to class names.

  • p_dist (bool) – Optional; Whether to print to screen the distribution of classes within each dataset.

Returns:

tuple of:
  • List of predicted labels transformed to new classes.

  • List of corresponding ground truth labels transformed to new classes.

  • Dictionary mapping new class labels to class names.

Return type:

tuple[ndarray[int], ndarray[int], dict[int, str]]

check_within_bounds(bbox: BoundingBox, bounds: BoundingBox) BoundingBox

Ensures that a bounding box is within another.

Parameters:
  • bbox (BoundingBox) – First bounding box that needs to be within the second.

  • bounds (BoundingBox) – Second outer bounding box to use as the bounds.

Returns:

Copy of bbox if it is within bounds or a new bounding box that has been limited to the dimensions of bounds if those of bbox exceeded them.

Return type:

BoundingBox
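
The clamping described above can be illustrated with a simplified, purely spatial box. Box and clamp_to_bounds are hypothetical names introduced for this sketch; the real BoundingBox type may carry additional fields that are ignored here.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical 2D stand-in for BoundingBox (spatial extents only)."""

    minx: float
    maxx: float
    miny: float
    maxy: float


def clamp_to_bounds(bbox: Box, bounds: Box) -> Box:
    """Return a new Box limited to the extents of bounds."""
    return Box(
        minx=max(bbox.minx, bounds.minx),
        maxx=min(bbox.maxx, bounds.maxx),
        miny=max(bbox.miny, bounds.miny),
        maxy=min(bbox.maxy, bounds.maxy),
    )
```

A box already inside bounds comes back with identical values, matching the "copy of bbox" case in the description.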

class_dist_transform(class_dist: list[tuple[int, int]], matrix: dict[int, int]) list[tuple[int, int]]

Transforms the class distribution from an old schema to a new one.

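Parameters:
  • class_dist (list[tuple[int, int]]) – Class distribution to be transformed.

  • matrix (dict[int, int]) – Dictionary mapping old labels to new.

Returns: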

Class distribution updated to new labels.

Return type:

list[tuple[int, int]]

class_frac(patch: Series) dict[Any, Any]

Computes the fractional sizes of the classes of the given patch and returns a dict of the results.

Parameters:

patch (Series) – Row of DataFrame representing the entry for a patch.

Returns:

Dictionary-like object with keys as class numbers and associated values of fractional size of class plus a key-value pair for the patch ID.

Return type:

Mapping

class_transform(label: int, matrix: dict[int, int]) int

Transforms labels from one schema to another mapped by a supplied dictionary.

Parameters:
  • label (int) – Label to be transformed.

  • matrix (dict[int, int]) – Dictionary mapping old labels to new.

Returns:

Label transformed by matrix.

Return type:

int
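
A minimal sketch of label remapping with a dictionary is given here. The function names are hypothetical, and the (label, count) ordering of the distribution tuples is an assumption made for illustration.

```python
def class_transform_sketch(label: int, matrix: dict[int, int]) -> int:
    """Hypothetical stand-in: look up the new label for an old one."""
    return matrix[label]


def class_dist_transform_sketch(
    class_dist: list[tuple[int, int]], matrix: dict[int, int]
) -> list[tuple[int, int]]:
    """Remap the label of every (label, count) pair, leaving counts untouched."""
    # Assumption: each tuple is ordered as (class label, number of samples).
    return [(class_transform_sketch(label, matrix), count) for label, count in class_dist]
```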

class_weighting(class_dist: list[tuple[int, int]], normalise: bool = False) dict[int, float]

Constructs weights for each class defined by the distribution provided.

Note

Each class weight is the inverse of the number of samples of that class. This will most likely mean that the weights will not sum to unity.

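Parameters:
  • class_dist (list[tuple[int, int]]) – Class distribution to construct weights from.

  • normalise (bool) – Optional; Whether to normalise the class weights. Defaults to False.

Returns: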

Dictionary mapping class number to its weight.

Return type:

dict[int, float]
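
The inverse-count weighting in the note above can be sketched as follows. The name inverse_count_weights is hypothetical, the (label, count) tuple ordering is an assumption, and the normalise option is deliberately left out because its exact behaviour is not described here.

```python
def inverse_count_weights(class_dist: list[tuple[int, int]]) -> dict[int, float]:
    """Hypothetical stand-in: weight each class by 1 / (number of samples)."""
    # Assumption: each tuple is ordered as (class label, number of samples).
    return {label: 1.0 / count for label, count in class_dist}


weights = inverse_count_weights([(0, 50), (1, 25), (2, 5)])
# Rare classes receive larger weights, and the total is generally not 1.
total = sum(weights.values())
```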

compile_dataset_paths(data_dir: Path | str, in_paths: list[Path | str] | Path | str) list[str]

Ensures that a list of paths is returned with the data directory prepended, even if a single string is supplied.

Parameters:
  • data_dir (Path | str) – The parent data directory for all paths.

  • in_paths (list[Path | str] | Path | str) – Paths to the data to be compiled.

Returns:

Compiled paths to the data.

Return type:

list[str]
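
As a rough sketch of the idea, a single path can be wrapped in a list before the data directory is prepended. The name compile_paths_sketch is hypothetical; the library may additionally resolve or expand the paths, which this sketch does not attempt.

```python
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def compile_paths_sketch(data_dir: PathLike, in_paths: Union[list[PathLike], PathLike]) -> list[str]:
    """Hypothetical stand-in: always return a list of data_dir/path strings."""
    if isinstance(in_paths, (str, Path)):
        # A lone path is treated as a one-element list.
        in_paths = [in_paths]
    return [str(Path(data_dir) / path) for path in in_paths]
```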

compute_roc_curves(probs: ndarray[Any, dtype[float64]], labels: Sequence[int] | ndarray[Any, dtype[int64]], class_labels: list[int], micro: bool = True, macro: bool = True) tuple[dict[Any, float], dict[Any, float], dict[Any, float]]

Computes the false-positive rate, true-positive rate and AUCs for each class using a one-vs-all approach. The micro and macro averages for each of these variables are also computed.

Adapted from scikit-learn's example at: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

Parameters:
  • probs (ndarray[float]) – Array of probabilistic predicted classes from model where each sample should have a list of the predicted probability for each class.

  • labels (list[int]) – List of corresponding ground truth labels.

  • class_labels (list[int]) – List of class label numbers.

  • micro (bool) – Optional; Whether to compute the micro average ROC curves.

  • macro (bool) – Optional; Whether to compute the macro average ROC curves.

Returns:

tuple of:
  • Dictionary of false-positive rates for each class and micro and macro averages.

  • Dictionary of true-positive rates for each class and micro and macro averages.

  • Dictionary of AUCs for each class and micro and macro averages.

Return type:

tuple[dict[Any, float], dict[Any, float], dict[Any, float]]

datetime_reformat(timestamp: str, fmt1: str, fmt2: str) str

Takes a str representing a time stamp in one format and returns it reformatted into a second.

Parameters:
  • timestamp (str) – Datetime string to be reformatted.

  • fmt1 (str) – Format of original datetime.

  • fmt2 (str) – New format for datetime.

Returns:

Datetime reformatted to fmt2.

Return type:

str
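
Reformatting of this kind is usually a parse followed by a format with the standard library. The sketch below uses a hypothetical name, datetime_reformat_sketch, and is not taken from the library's source.

```python
from datetime import datetime


def datetime_reformat_sketch(timestamp: str, fmt1: str, fmt2: str) -> str:
    """Hypothetical stand-in: parse with fmt1, then write out with fmt2."""
    return datetime.strptime(timestamp, fmt1).strftime(fmt2)
```

Both fmt1 and fmt2 use the usual strftime/strptime format codes such as %Y, %m and %d.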

dec2deg(dec_co: Sequence[float] | ndarray[Any, dtype[float64]], axis: str = 'lat') list[str]

Wrapper for deg_to_dms().

Parameters:
  • dec_co (list[float]) – Array of either latitude or longitude co-ordinates in decimal degrees.

  • axis (str) – Identifier between latitude ("lat") or longitude ("lon") for N-S, E-W identifier.

Returns:

List of formatted strings in degrees, minutes and seconds.

Return type:

list[str]

deg_to_dms(deg: float, axis: str = 'lat') str

Converts between decimal degrees of lat/lon to degrees, minutes, seconds.

Credit to Gustavo Gonçalves on Stack Overflow. https://stackoverflow.com/questions/2579535/convert-dd-decimal-degrees-to-dms-degrees-minutes-seconds-in-python

Parameters:
  • deg (float) – Decimal degrees of latitude or longitude.

  • axis (str) – Identifier between latitude ("lat") or longitude ("lon") for N-S, E-W direction identifier.

Returns:

String of inputted deg in degrees, minutes and seconds in the form Degreesº Minutes Seconds Hemisphere.

Return type:

str
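
The arithmetic behind the conversion can be sketched as below. The name dms_parts_sketch is hypothetical; it returns the separate components instead of the formatted string, and rounding to whole seconds is an assumption of this sketch.

```python
def dms_parts_sketch(deg: float, axis: str = "lat") -> tuple[int, int, int, str]:
    """Hypothetical stand-in: split decimal degrees into (degrees, minutes, seconds, hemisphere)."""
    # Work in whole seconds to avoid floating-point drift (assumed precision).
    total_seconds = round(abs(deg) * 3600)
    total_minutes, seconds = divmod(total_seconds, 60)
    degrees, minutes = divmod(total_minutes, 60)
    if axis == "lat":
        hemisphere = "S" if deg < 0 else "N"
    else:
        hemisphere = "W" if deg < 0 else "E"
    return degrees, minutes, seconds, hemisphere
```

The sign of the input only selects the hemisphere letter; the numeric parts are always non-negative.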

dublicator(cls)

Duplicates decorated transform object to handle paired samples.

eliminate_classes(empty_classes: list[int] | tuple[int, ...] | ndarray[Any, dtype[int64]], old_classes: dict[int, str], old_cmap: dict[int, str] | None = None) tuple[dict[int, str], dict[int, int], dict[int, str] | None]

Eliminates empty classes from the class text label and class colour dictionaries and re-normalises.

This should ensure that the remaining list of classes is still a linearly spaced list of numbers.

Parameters:
  • empty_classes (list[int]) – List of classes not found in class_dist and thus empty/not present in the dataset.

  • old_classes (dict[int, str]) – Previous mapping of class labels to class names.

  • old_cmap (dict[int, str]) – Optional; Previous mapping of class labels to colours.

Returns:

tuple of dictionaries:
  • Mapping of remaining class labels to class names.

  • Mapping from old to new classes.

  • Mapping of remaining class labels to RGB colours.

Return type:

tuple[dict[int, str], dict[int, int], dict[int, str]]

exist_delete_check(fn: str | Path) None

Checks if given file exists then deletes if true.

Parameters:

fn (str | Path) – Path to file to have existence checked then deleted.

Returns:

None

extract_class_type(var: Any) type

Ensures that a class type is returned from a variable whether it is one already or not.

Parameters:

var (Any) – Variable to get class type from. May already be a class type.

Returns:

Class type of var.

Return type:

type

fallback_params(key: str, params_a: dict[str, Any], params_b: dict[str, Any], fallback: Any | None = None) Any

Search for a value associated with key from

Parameters:
  • key (str) – _description_

  • params_a (dict[str, Any]) – _description_

  • params_b (dict[str, Any]) – _description_

  • fallback (Any) – Optional; _description_. Defaults to None.

Returns:

_description_

Return type:

Any

find_best_of(patch_id: str, manifest: ~pandas.core.frame.DataFrame, selector: ~typing.Callable[[~pandas.core.frame.DataFrame], list[str]] = <function threshold_scene_select>, **kwargs) list[str]

Finds the scenes sorted by cloud cover using selector function supplied.

Parameters:
  • patch_id (str) – Unique patch ID.

  • manifest (DataFrame) – DataFrame outlining cloud cover percentages for all scenes in the patches desired.

  • selector (Callable[[DataFrame], list[str]]) – Optional; Function to use to select scenes. Must take an appropriately constructed DataFrame.

  • **kwargs – Kwargs for selector.

Returns:

List of strings representing dates of the selected scenes in YY_MM_DD format.

Return type:

list[str]

find_empty_classes(class_dist: list[tuple[int, int]], class_names: dict[int, str]) list[int]

Finds which classes defined by config files are not present in the dataset.

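Parameters:
  • class_dist (list[tuple[int, int]]) – Class distribution of the dataset.

  • class_names (dict[int, str]) – Mapping of class labels to class names.

Returns:

List of classes not found in class_dist and thus empty/not present in the dataset.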

Return type:

list[int]
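
A short sketch of how empty classes could be detected is given here. The name find_empty_classes_sketch is hypothetical and the (label, count) ordering of class_dist entries is an assumption.

```python
def find_empty_classes_sketch(
    class_dist: list[tuple[int, int]], class_names: dict[int, str]
) -> list[int]:
    """Hypothetical stand-in: labels that are named but never counted."""
    # Assumption: each tuple is ordered as (class label, number of samples).
    present = {label for label, _count in class_dist}
    return [label for label in class_names if label not in present]
```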

find_geo_similar(bbox: BoundingBox, max_r: int = 256) BoundingBox

Finds an image that is less than or equal to the geo-spatial distance max_r from the initial image.

Based on the work of GeoCLR https://arxiv.org/abs/2108.06421v1.

Parameters:
  • bbox (BoundingBox) – Original bounding box.

  • max_r (int) – Optional; Maximum distance new bounding box can be from original. Defaults to 256.

Returns:

New bounding box translated a random displacement from original.

Return type:

BoundingBox

find_modes(labels: Iterable[int], plot: bool = False, classes: dict[int, str] | None = None, cmap_dict: dict[int, str] | None = None) list[tuple[int, int]]

Finds the modal distribution of the classes within the labels provided.

Can plot the results as a pie chart if plot=True.

Parameters:
  • labels (Iterable[int]) – Class labels describing the data to be analysed.

  • plot (bool) – Plots distribution of subpopulations if True.

Returns:

Modal distribution of classes in input in order of most common class.

Return type:

list[tuple[int, int]]
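
Counting labels and ordering them by frequency can be sketched with the standard library. The name find_modes_sketch is hypothetical, and the plotting option is omitted from this illustration.

```python
from collections import Counter
from typing import Iterable


def find_modes_sketch(labels: Iterable[int]) -> list[tuple[int, int]]:
    """Hypothetical stand-in: (label, count) pairs, most common first."""
    return Counter(labels).most_common()
```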

find_tensor_mode(mask: LongTensor) LongTensor

Finds the mode value in a LongTensor.

Parameters:

mask (LongTensor) – Tensor to find modal value in.

Returns:

A 0D, 1-element tensor containing the modal value.

Return type:

LongTensor

Added in version 0.22.

func_by_str(module_path: str, func: str) Callable[[...], Any]

Gets the constructor or callable within a module defined by the names supplied.

Parameters:
  • module_path (str) – Name (and path to) of module desired function or class is within.

  • func (str) – Name of function or class desired.

Returns:

Pointer to the constructor or function requested.

Return type:

Callable[[Any], Any]
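
Looking up a callable by module path and name is typically done with importlib. The sketch below uses a hypothetical name, func_by_str_sketch, and is an illustration of the technique only.

```python
import importlib
from typing import Any, Callable


def func_by_str_sketch(module_path: str, func: str) -> Callable[..., Any]:
    """Hypothetical stand-in: import module_path and fetch the attribute named func."""
    module = importlib.import_module(module_path)
    return getattr(module, func)
```

This pattern lets a config file name a class or function as plain strings, for example a module path and a model class name, which are resolved only at run time.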

get_centre_loc(bounds: BoundingBox) tuple[float, float]

Gets the centre co-ordinates of the supplied bounding box.

Parameters:

bounds (BoundingBox) – Bounding box to find the centre co-ordinates.

Returns:

tuple of the centre x, y co-ordinates of the bounding box.

Return type:

tuple[float, float]

get_cuda_device(device_sig: int | str = 'cuda:0') device

Finds and returns the CUDA device, if one is available. Otherwise, returns the CPU as the device. Assumes there is at most one CUDA device.

Parameters:

device_sig (int | str) – Optional; Either the GPU number or string representing the torch device to find. Defaults to 'cuda:0'.

Returns:

CUDA device, if found. Else, CPU device.

Return type:

device

is_notebook() bool

Checks if this code is being executed from a Jupyter Notebook or not.

Adapted from https://gist.github.com/thomasaarholt/e5e2da71ea3ee412616b27d364e3ae82

Returns:

True if executed by Jupyter kernel. False if not.

Return type:

bool

labels_to_ohe(labels: Sequence[int], n_classes: int) ndarray[Any, dtype[Any]]

Convert an iterable of indices to one-hot encoded (OHE) labels.

Parameters:
  • labels (Sequence[int]) – Sequence of class number labels to be converted to OHE.

  • n_classes (int) – Number of classes to determine length of OHE label.

Returns:

Labels in OHE form.

Return type:

ndarray[Any]

lat_lon_to_loc(lat: str | float, lon: str | float) str

Takes a latitude - longitude co-ordinate and returns a string of the semantic location.

Parameters:
  • lat (str | float) – Latitude of location.

  • lon (str | float) – Longitude of location.

Returns:

Semantic location of co-ordinates e.g. “Belper, Derbyshire, UK”.

Return type:

str

make_classification_report(pred: Sequence[int] | ndarray[Any, dtype[int64]], labels: Sequence[int] | ndarray[Any, dtype[int64]], class_labels: dict[int, str] | None = None, print_cr: bool = True, p_dist: bool = False) DataFrame

Generates a DataFrame of the precision, recall, f-1 score and support of the supplied predictions and ground truth labels.

Uses scikit-learn's classification_report to calculate the metrics: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Parameters:
  • pred (list[int] | ndarray[int]) – List of predicted labels.

  • labels (list[int] | ndarray[int]) – List of corresponding ground truth labels.

  • class_labels (dict[int, str]) – Dictionary mapping class labels to class names.

  • print_cr (bool) – Optional; Whether to print a copy of the classification report DataFrame put through tabulate.

  • p_dist (bool) – Optional; Whether to print to screen the distribution of classes within each dataset.

Returns:

Classification report with the precision, recall, f-1 score and support for each class in a DataFrame.

Return type:

DataFrame

mask_to_ohe(mask: LongTensor, n_classes: int) LongTensor

Converts a segmentation mask to one-hot-encoding (OHE).

Parameters:
  • mask (LongTensor) – Segmentation mask to convert.

  • n_classes (int) – Optional; Number of classes in total across dataset. If not provided, the number of classes is inferred from those found in mask.

Note

It is advised that one provides n_classes as there is a fair chance that not all possible classes are in mask. Inferring from the classes present in mask is therefore likely to result in shaping issues between masks in a batch.

Returns:

mask converted to OHE. The one-hot-encoding is placed in the leading dimension. (CxHxW) where C is the number of classes.

Return type:

LongTensor

Added in version 0.23.

mask_transform(array: ndarray[Any, dtype[int64]], matrix: dict[int, int]) ndarray[Any, dtype[int64]]
mask_transform(array: LongTensor, matrix: dict[int, int]) LongTensor

Transforms all labels of an N-dimensional array from one schema to another mapped by a supplied dictionary.

Parameters:
  • array (ndarray[int] | LongTensor) – N-dimensional array containing labels to be transformed.

  • matrix (dict[int, int]) – Dictionary mapping old labels to new.

Returns:

Array of transformed labels.

Return type:

ndarray[int] | LongTensor

mkexpdir(name: str, results_dir: Path | str = 'results') None

Makes a new directory below the results directory with name provided. If directory already exists, no action is taken.

Parameters:
  • name (str) – Name of new directory.

  • results_dir (Path | str) – Path to the results directory. Defaults to results.

Returns:

None

modes_from_manifest(manifest: DataFrame, classes: dict[int, str], plot: bool = False, cmap_dict: dict[int, str] | None = None) list[tuple[int, int]]

Uses the dataset manifest to calculate the fractional size of the classes.

Parameters:
  • manifest (DataFrame) – DataFrame containing the fractional sizes of classes and centre pixel labels of all samples of the dataset to be used.

  • plot (bool) – Optional; Whether to plot the class distribution pie chart.

Returns:

Modal distribution of classes in the dataset provided.

Return type:

list[tuple[int, int]]

pair_collate(func: Callable[[Any], Any]) Callable[[Any], Any]

Wraps a collator function so that it can handle paired samples.

Warning

NOT compatible with DistributedDataParallel due to its use of pickle. Use stack_sample_pairs() instead as a direct replacement for stack_samples().

Parameters:

func (Callable[[Any], Any]) – Collator function to be wrapped.

Returns:

Wrapped collator function.

Return type:

Callable[[Any], Any]

pair_return(cls)

Wrapper for GeoDataset classes to be able to handle pairs of queries and returns.

Warning

NOT compatible with DistributedDataParallel due to its use of pickle. Use PairedGeoDataset directly instead, supplying the dataset to wrap on init.

Raises:

AttributeError – If an attribute cannot be found in either the Wrapper or the wrapped dataset.

print_class_dist(class_dist: list[tuple[int, int]], class_labels: dict[int, str] | None = None) None

Prints the supplied class_dist in a pretty table format using tabulate.

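Parameters:
  • class_dist (list[tuple[int, int]]) – Class distribution to print.

  • class_labels (dict[int, str]) – Optional; Dictionary mapping class labels to class names.

print_config(conf: DictConfig) None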

Print function for the configuration file using YAML dump.

Parameters:

conf (dict[str, Any]) – Optional; Config file to print. If None, uses the global config.

return_updated_kwargs(func: Callable[[...], tuple[Any, ...]]) Callable[[...], tuple[Any, ...]]

Decorator that allows the kwargs supplied to the wrapped function to be returned with updated values.

Assumes that the wrapped function returns a dict in the last position of the tuple of returns with keys in kwargs that have new values.

Parameters:

func (Callable[..., tuple[Any, ...]]) – Function to be wrapped. Must take kwargs and return a dict with updated kwargs in the last position of the tuple.

Returns:

Wrapped function.

Return type:

Callable[…, tuple[Any, …]]
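
A decorator with the behaviour described above might look roughly like this. The names return_updated_kwargs_sketch and step are hypothetical and exist only for illustration; the library's own decorator may differ in detail.

```python
import functools
from typing import Any, Callable


def return_updated_kwargs_sketch(
    func: Callable[..., tuple[Any, ...]]
) -> Callable[..., tuple[Any, ...]]:
    """Hypothetical stand-in: swap the trailing dict of updates for the full, updated kwargs."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> tuple[Any, ...]:
        results = func(*args, **kwargs)
        # The last element of the returned tuple holds the new values.
        kwargs.update(results[-1])
        return results[:-1] + (kwargs,)

    return wrapper


@return_updated_kwargs_sketch
def step(x: int, **kwargs: Any) -> tuple[Any, ...]:
    """Toy function that halves its input count and reports a new learning rate."""
    return x * 2, {"lr": kwargs["lr"] / 10}
```

Calling `step(3, lr=0.1, epochs=5)` would hand back the doubled value together with all the original kwargs, with only lr replaced.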

run_tensorboard(exp_name: str, path: str | list[str] | tuple[str, ...] | Path = '', env_name: str = 'env', host_num: str | int = 6006, _testing: bool = False) int | None

Runs the TensorBoard logs and hosts on a local webpage.

Parameters:
  • exp_name (str) – Unique name of the experiment to run the logs of.

  • path (str | list[str] | tuple[str, ...] | Path) – Path to the directory holding the log. Can be a string or a list of strings for each sub-directory.

  • env_name (str) – Name of the conda environment to run TensorBoard in.

  • host_num (str | int) – Local host number TensorBoard will be hosted on.

Raises:

KeyError – If path is None but the default cannot be found in config, return None.

Returns:

Exitcode for testing purposes. None under normal use.

Return type:

int | None

set_seeds(seed: int) None

Set torch, numpy and random seeds for reproducibility.

Parameters:

seed (int) – Seed number to set all seeds to.

tg_to_torch(cls, keys: Sequence[str] | None = None)

Ensures wrapped transform can handle both Tensor and torchgeo style dict inputs.

Warning

NOT compatible with DistributedDataParallel due to its use of pickle. This functionality is now handled within MinervaCompose.

Parameters:

keys (Optional[Sequence[str]]) – Keys to fields within dict inputs to transform values in. Defaults to None.

Raises:

TypeError – If input is not a dict or Tensor.

threshold_scene_select(df: DataFrame, thres: float = 0.3) list[str]

Selects all scenes in a patch with a cloud cover less than the threshold provided.

Parameters:
  • df (DataFrame) – DataFrame containing all scenes and their cloud cover percentages.

  • thres (float) – Optional; Fractional limit of cloud cover below which scenes shall be selected.

Returns:

List of strings representing dates of the selected scenes in YY_MM_DD format.

Return type:

list[str]

timestamp_now(fmt: str = '%d-%m-%Y_%H%M') str

Gets the timestamp of the datetime now.

Parameters:

fmt (str) – Format of the returned timestamp.

Returns:

Timestamp of the datetime now.

Return type:

str

transform_coordinates(x: Sequence[float], y: Sequence[float], src_crs: CRS, new_crs: CRS = WGS84) tuple[Sequence[float], Sequence[float]]
transform_coordinates(x: Sequence[float], y: float, src_crs: CRS, new_crs: CRS = WGS84) tuple[Sequence[float], Sequence[float]]
transform_coordinates(x: float, y: Sequence[float], src_crs: CRS, new_crs: CRS = WGS84) tuple[Sequence[float], Sequence[float]]
transform_coordinates(x: float, y: float, src_crs: CRS, new_crs: CRS = WGS84) tuple[float, float]

Transforms co-ordinates from one CRS to another.

Parameters:
  • x (Sequence[float] | float) – The x co-ordinate(s).

  • y (Sequence[float] | float) – The y co-ordinate(s).

  • src_crs (CRS) – The source co-ordinate reference system (CRS).

  • new_crs (CRS) – Optional; The new CRS to transform co-ordinates to. Defaults to WGS84.

Returns:

The transformed co-ordinates. A tuple if only one x and y were provided, sequence of tuples if sequence of x and y provided.

Return type:

tuple[Sequence[float], Sequence[float]] | tuple[float, float]

tsne_cluster(embeddings: ndarray[Any, dtype[Any]], n_dim: int = 2, lr: str = 'auto', n_iter: int = 1000, verbose: int = 1, perplexity: int = 30) Any

Trains a TSNE algorithm on the embeddings passed.

Parameters:
  • embeddings (ndarray[Any]) – Embeddings outputted from the model.

  • n_dim (int, optional) – Number of dimensions to reduce embeddings to. Defaults to 2.

  • lr (str, optional) – Learning rate. Defaults to “auto”.

  • n_iter (int, optional) – Number of iterations. Defaults to 1000.

  • verbose (int, optional) – Verbosity. Defaults to 1.

  • perplexity (int, optional) – Relates to number of nearest neighbours used. Must be less than the length of embeddings.

Returns:

Embeddings transformed to n_dim dimensions using TSNE.

Return type:

Any