Utilities

Client

Tasks related to connecting to a Tamr instance

tamr_toolbox.utils.client.health_check(client)

Query the health check API and check if each service is healthy (returns True)

Parameters

client (Client) – the tamr client

Return type

bool

Returns

True if all services are healthy, False if unhealthy

tamr_toolbox.utils.client.create(*, username, password, host, port=9100, protocol='http', store_auth_cookie=False, enforce_healthy=False)

Creates a Tamr client from the provided configuration values

Parameters
  • username (str) – The username to access Tamr as

  • password (str) – the password for the user

  • host (str) – The IP address of Tamr

  • port (Union[str, int, None]) – The port of the Tamr UI. Pass a value of None to specify an address with no port

  • protocol (str) – https or http

  • store_auth_cookie (bool) – If True, allows the Tamr authentication cookie to be stored and reused

  • enforce_healthy (bool) – If True, enforces a healthy state upon creation

Return type

Client

Returns

Tamr client
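
The effect of passing port=None can be pictured with a small sketch. The helper build_origin below is hypothetical and not part of tamr_toolbox; it only illustrates, under that assumption, how a protocol, host, and optional port might combine into an address.

```python
from typing import Optional, Union


def build_origin(protocol: str, host: str, port: Optional[Union[str, int]] = 9100) -> str:
    """Hypothetical helper: compose an address, omitting the port when it is None."""
    if protocol not in ("http", "https"):
        raise ValueError(f"protocol must be 'http' or 'https', got {protocol!r}")
    if port is None:
        # No port segment at all, e.g. when Tamr sits behind a reverse proxy.
        return f"{protocol}://{host}"
    return f"{protocol}://{host}:{port}"


print(build_origin("http", "localhost"))                 # uses the default port
print(build_origin("https", "tamr.example.com", None))   # address without a port
```

The host name tamr.example.com is a placeholder chosen for the illustration.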

tamr_toolbox.utils.client.get_with_connection_retry(client, api_endpoint, *, timeout_seconds=600, sleep_seconds=20)

Will handle exceptions when attempting to connect to the Tamr API.

This is used to handle connection issues when Tamr restarts due to a restore.

Parameters
  • client (Client) – A Tamr client object

  • api_endpoint (str) – Tamr API endpoint

  • timeout_seconds (int) – Amount of time before a timeout error is thrown. Default is 600 seconds

  • sleep_seconds (int) – Amount of time in between attempts to connect to Tamr.

Return type

Response

Returns

A response object from API request.
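
The retry behaviour described above follows a common pattern that can be sketched with the standard library alone. This is not the toolbox implementation: call_with_connection_retry and flaky_request are hypothetical names, and the built-in ConnectionError stands in for whatever exceptions the real function handles.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_connection_retry(
    request: Callable[[], T],
    *,
    timeout_seconds: float = 600,
    sleep_seconds: float = 20,
) -> T:
    """Retry `request` on ConnectionError until it succeeds or the timeout elapses."""
    started = time.monotonic()
    while True:
        try:
            return request()
        except ConnectionError as error:
            # Give up once the retry budget is spent.
            if time.monotonic() - started >= timeout_seconds:
                raise TimeoutError(f"No connection after {timeout_seconds} seconds") from error
            time.sleep(sleep_seconds)


# Stand-in for a server that refuses the first two connection attempts.
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("server is restarting")
    return "ok"


result = call_with_connection_retry(flaky_request, timeout_seconds=5, sleep_seconds=0.01)
print(result, attempts["count"])
```

Measuring elapsed time with time.monotonic() keeps the timeout unaffected by system clock adjustments.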

tamr_toolbox.utils.client.poll_endpoint(client, api_endpoint, *, poll_interval_seconds=3, polling_timeout_seconds=None, connection_retry_timeout_seconds=600)

Waits until job has a state of Canceled, Succeeded, or Failed.

Parameters
  • client (Client) – A Tamr client object

  • api_endpoint (str) – Tamr API endpoint

  • poll_interval_seconds (int) – Amount of time in between polls of job state.

  • polling_timeout_seconds (Optional[int]) – Amount of time before a timeout error is thrown.

  • connection_retry_timeout_seconds (int) – Amount of time before timeout error is thrown during connection retry.

Return type

Response

Returns

A response object from API request.
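
A polling loop of this kind can be sketched as follows. The sketch makes several assumptions that this reference does not confirm: the job payload is a dictionary with a "state" key, state names are upper case, and a timeout of None means waiting indefinitely. The name poll_until_resolved is hypothetical.

```python
import time
from typing import Callable, Dict, Optional

# Assumed spelling of the terminal states named in the description above.
TERMINAL_STATES = {"CANCELED", "SUCCEEDED", "FAILED"}


def poll_until_resolved(
    fetch_job: Callable[[], Dict[str, str]],
    *,
    poll_interval_seconds: float = 3,
    polling_timeout_seconds: Optional[float] = None,
) -> Dict[str, str]:
    """Call `fetch_job` repeatedly until the job reports a terminal state."""
    started = time.monotonic()
    while True:
        job = fetch_job()
        if job["state"] in TERMINAL_STATES:
            return job
        # Assumption: a timeout of None means "wait indefinitely".
        if (
            polling_timeout_seconds is not None
            and time.monotonic() - started >= polling_timeout_seconds
        ):
            raise TimeoutError(f"Job still {job['state']} after {polling_timeout_seconds} seconds")
        time.sleep(poll_interval_seconds)


# Stand-in for an endpoint whose job finishes on the third poll.
states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final = poll_until_resolved(lambda: {"state": next(states)}, poll_interval_seconds=0.01)
print(final["state"])
```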

Configuration

Tasks related to loading and using configuration files

tamr_toolbox.utils.config.from_yaml(path_to_file, *, default_path_to_file=None)

Reads a yaml file and creates a dictionary. Input values can be retrieved from environment variables

Parameters
  • path_to_file (Union[str, Path, None]) – Path to config yaml file

  • default_path_to_file (Union[str, Path, None]) – Path to use if path_to_file is null or empty

Return type

Dict[str, Any]

Returns

All configuration variables in a dictionary
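
The fallback from path_to_file to default_path_to_file can be illustrated with a short sketch. The helper resolve_config_path is hypothetical, and raising ValueError when neither path is given is an assumption; the reference does not say what the real function does in that case. YAML parsing itself is left out.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_config_path(
    path_to_file: Optional[PathLike],
    default_path_to_file: Optional[PathLike] = None,
) -> Path:
    """Hypothetical helper: pick the explicit path if given, else the default."""
    # None and "" both count as "not provided".
    for candidate in (path_to_file, default_path_to_file):
        if candidate is not None and str(candidate) != "":
            return Path(candidate)
    # Assumption: with no usable path there is nothing to load.
    raise ValueError("No configuration path was provided")


print(resolve_config_path(None, default_path_to_file="conf/default.yaml"))
print(resolve_config_path("conf/project.yaml", default_path_to_file="conf/default.yaml"))
```

A pattern like this lets a script accept an optional command-line path while still running with a checked-in default configuration.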

Logging

Tasks related to logging within scripts

tamr_toolbox.utils.logger.create(name, *, log_to_terminal=True, log_directory=None, log_prefix='', date_format='%Y-%m-%d')

Return logger object with pre-defined format. Log file will be located under log_directory with file name <log_prefix>_<date>.log, quashing extra separating underscores. Defaults to <date>.log.

For use in scripts only. To log in module files, use the standard library logging module with a module-level logger and enable package logging. See https://docs.python.org/3/howto/logging.html#advanced-logging-tutorial

>>> import logging
>>> log = logging.getLogger(__name__)

Parameters
  • name (str) – This sets the name of your logger instance. It does not affect the file name. To change the filename use log_prefix

  • log_to_terminal (bool) – Boolean indicating whether or not to log messages to the terminal.

  • log_directory (Optional[str]) – The directory to place log files inside

  • log_prefix (str) – The string to prepend to the date in the log file name.

  • date_format (str) – format string for date suffix on log file name

Return type

Logger

Returns

Logger object
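
The file naming rule (<log_prefix>_<date>.log, falling back to <date>.log) can be sketched as below. The function log_file_name and its today parameter are hypothetical and exist only to make the rule concrete; the toolbox may build the name differently.

```python
from datetime import date
from typing import Optional


def log_file_name(
    log_prefix: str = "",
    date_format: str = "%Y-%m-%d",
    today: Optional[date] = None,
) -> str:
    """Hypothetical helper: build '<log_prefix>_<date>.log' without stray underscores."""
    stamp = (today or date.today()).strftime(date_format)
    # Trim underscores at the edges of each part, then skip empty parts,
    # so an empty or underscore-terminated prefix never doubles the separator.
    parts = [part.strip("_") for part in (log_prefix, stamp)]
    return "_".join(part for part in parts if part) + ".log"


fixed_day = date(2021, 3, 4)
print(log_file_name("", today=fixed_day))
print(log_file_name("ingest", today=fixed_day))
print(log_file_name("ingest_", today=fixed_day))
```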

tamr_toolbox.utils.logger.set_logging_level(logger_name, level)

Sets the logging level for a given logger and all of its handlers.

Parameters
  • logger_name (str) – the name of the logger for which to set the level

  • level (str) – log level to use. The set available from core logging package is ‘debug’, ‘info’, ‘warning’, ‘error’

Return type

None
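
Setting a level on both a logger and its handlers can be done with the standard logging module alone. The sketch below shows the general idea; set_level_everywhere is a hypothetical name and is not claimed to match the toolbox implementation.

```python
import logging


def set_level_everywhere(logger_name: str, level: str) -> None:
    """Apply `level` to the named logger and to each handler attached to it."""
    numeric_level = getattr(logging, level.upper())  # e.g. "debug" -> logging.DEBUG
    logger = logging.getLogger(logger_name)
    logger.setLevel(numeric_level)
    for handler in logger.handlers:
        handler.setLevel(numeric_level)


demo_logger = logging.getLogger("demo_script")
demo_logger.addHandler(logging.StreamHandler())
set_level_everywhere("demo_script", "debug")
print(demo_logger.level == logging.DEBUG)
```

Handlers filter records independently of the logger, which is why changing only the logger level can still leave messages suppressed.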

tamr_toolbox.utils.logger.enable_package_logging(package_name, *, log_to_terminal=True, log_directory=None, level=None, log_prefix='', date_format='%Y-%m-%d')

A helper function to enable package logging for any package following python best practices for logging names (i.e. logger name == package.module.submodule).

Parameters
  • package_name (str) – the name of the package for which to enable logging

  • log_to_terminal (bool) – Boolean indicating whether or not to log messages to the terminal

  • log_directory (Optional[str]) – optional log directory which the package will write logs

  • level (Optional[str]) – optional level to specify, default is WARNING (inherited from base logging package)

  • log_prefix (str) – Optional prefix for log files, if None will be blank string

  • date_format (str) – Optional date format for log file

Return type

None

tamr_toolbox.utils.logger.enable_toolbox_logging(*, log_to_terminal=True, log_directory=None, level=None, log_prefix='', date_format='%Y-%m-%d')

A simple wrapper around enable_package_logging that enables logging for the tamr_toolbox package.

Parameters
  • log_to_terminal (bool) – Boolean indicating whether or not to log messages to the terminal

  • log_directory (Optional[str]) – optional directory to which to write tamr_toolbox logs

  • level (Optional[str]) – Optional logging level to specify, default is WARNING (inherited from base logging package)

  • log_prefix (str) – Optional prefix for log files, if None will be blank string

  • date_format (str) – Optional date format for log file

Return type

None

Operation

Tasks related to Tamr operations (or jobs)

tamr_toolbox.utils.operation.enforce_success(operation)

Raises an error if an operation fails

Parameters

operation (Operation) – A Tamr operation

Return type

None

tamr_toolbox.utils.operation.from_resource_id(tamr, *, job_id)

Create an operation from a job id

Parameters
  • tamr (Client) – A Tamr client

  • job_id (Union[int, str]) – A job ID

Return type

Operation

Returns

A Tamr operation

tamr_toolbox.utils.operation.get_latest(tamr)

Get the latest operation

Parameters

tamr (Client) – A Tamr client

Return type

Operation

Returns

The latest job

tamr_toolbox.utils.operation.get_details(*, operation)

Returns text describing the details of a job

Parameters

operation (Operation) – A Tamr operation

Return type

str

Returns

Text describing the details of a job

tamr_toolbox.utils.operation.get_all(tamr)

Get a list of all jobs or operations.

Parameters

tamr (Client) – A Tamr client

Return type

List[Operation]

Returns

A list of Operation objects.

tamr_toolbox.utils.operation.get_active(tamr)

Get a list of pending and running jobs.

Parameters

tamr (Client) – A Tamr client

Return type

List[Operation]

Returns

A list of Operation objects
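
Selecting pending and running jobs amounts to a filter on job state. The sketch below uses a stand-in FakeOperation class and assumed upper-case state names; neither is taken from the Tamr client, and select_active is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List

# Assumed state names for jobs that have not yet resolved.
ACTIVE_STATES = {"PENDING", "RUNNING"}


@dataclass
class FakeOperation:
    """Stand-in for an operation; only the fields the filter needs."""

    resource_id: str
    state: str


def select_active(operations: List[FakeOperation]) -> List[FakeOperation]:
    """Keep only operations that are still pending or running."""
    return [operation for operation in operations if operation.state in ACTIVE_STATES]


jobs = [
    FakeOperation("1", "SUCCEEDED"),
    FakeOperation("2", "RUNNING"),
    FakeOperation("3", "PENDING"),
    FakeOperation("4", "FAILED"),
]
active_ids = [operation.resource_id for operation in select_active(jobs)]
print(active_ids)
```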

tamr_toolbox.utils.operation.wait(operation, *, poll_interval_seconds=3, timeout_seconds=None)

Continuously polls for this operation’s server-side state.

Parameters
  • operation (Operation) – Operation to be polled.

  • poll_interval_seconds (int) – Time interval (in seconds) between subsequent polls.

  • timeout_seconds (Optional[int]) – Time (in seconds) to wait for operation to resolve.

Raises

TimeoutError – If operation takes longer than timeout_seconds to resolve.

Return type

Operation

tamr_toolbox.utils.operation.monitor(operation, *, poll_interval_seconds=1, timeout_seconds=300)

Continuously polls for this operation’s server-side state and returns operation when there is a state change

Parameters
  • operation (Operation) – Operation to be monitored.

  • poll_interval_seconds (float) – Time interval (in seconds) between subsequent polls.

  • timeout_seconds (float) – Time (in seconds) to wait for operation to resolve.

Raises

TimeoutError – If operation takes longer than timeout_seconds to resolve.

Return type

Operation
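
Returning on the first state change, rather than waiting for the operation to resolve, can be sketched as follows. The function wait_for_state_change and the state names are hypothetical; the callable read_state stands in for re-fetching the operation from the server.

```python
import time
from typing import Callable


def wait_for_state_change(
    read_state: Callable[[], str],
    *,
    poll_interval_seconds: float = 1,
    timeout_seconds: float = 300,
) -> str:
    """Return the new state as soon as it differs from the state seen at the start."""
    initial_state = read_state()
    started = time.monotonic()
    while time.monotonic() - started < timeout_seconds:
        time.sleep(poll_interval_seconds)
        current_state = read_state()
        if current_state != initial_state:
            return current_state
    raise TimeoutError(f"State stayed {initial_state} for {timeout_seconds} seconds")


# The job stays PENDING for one more poll, then starts RUNNING.
states = iter(["PENDING", "PENDING", "RUNNING", "SUCCEEDED"])
new_state = wait_for_state_change(
    lambda: next(states), poll_interval_seconds=0.01, timeout_seconds=5
)
print(new_state)
```

Because the loop stops at the first change, the caller sees RUNNING here and would call the function again to learn the final outcome.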

Testing

Tasks related to testing code

tamr_toolbox.utils.testing.mock_api(*, response_logs_dir=None, enforce_online_test=False, asynchronous=False)

Decorator for pytest tests that mocks API requests by reading a file of pre-generated responses. Will generate responses file based on a real connection if pre-generated responses are not found.

Parameters
  • response_logs_dir (Union[str, Path, None]) – Directory to read/write response logs

  • enforce_online_test (bool) – Whether an online test should be run, even if a response log already exists

  • asynchronous (bool) – Whether or not to wait for Operations called during the running of tests

Return type

Callable

Returns

Decorated function

Downstream

tamr_toolbox.utils.downstream.datasets(dataset, *, include_dependencies_by_name=False)

Returns a dataset’s downstream datasets.

Parameters
  • dataset (Dataset) – The target dataset.

  • include_dependencies_by_name (bool) – Whether to include datasets based on name similarity. No dependencies will be found by name if the dataset is not a unified dataset, as determined either by the backend pipeline (if the project still exists) or by regex (the dataset name has the suffix ‘unified_dataset’).

Return type

List[Dataset]

Returns

List of Dataset objects ordered by the number of their downstream dependencies.

Note that dependencies can be bidirectional, so datasets with the same number of dependencies can depend on each other.
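
How a bidirectional dependency produces a tie can be seen in a small graph sketch. Everything here is illustrative: the dependency map, the dataset names, and the choice to list datasets with the most downstream dependencies first are assumptions, since this reference does not state the sort direction.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: dataset name -> datasets built directly from it.
DIRECT_DOWNSTREAM: Dict[str, List[str]] = {
    "raw": ["unified"],
    "unified": ["clusters", "published"],
    "clusters": ["published"],
    "published": ["clusters"],  # bidirectional dependency with "clusters"
}


def downstream_of(name: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Collect every dataset reachable from `name`, tolerating cycles."""
    seen: Set[str] = set()
    queue = deque(graph.get(name, []))
    while queue:
        current = queue.popleft()
        if current in seen or current == name:
            continue  # already visited, or a cycle back to the start
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def ordered_downstream(name: str, graph: Dict[str, List[str]]) -> List[str]:
    """Assumed ordering: most downstream dependencies first, ties broken by name."""
    found = downstream_of(name, graph)
    return sorted(found, key=lambda item: (-len(downstream_of(item, graph)), item))


print(ordered_downstream("raw", DIRECT_DOWNSTREAM))
```

In this map, "clusters" and "published" each have exactly one downstream dataset, namely each other, which is the tie the note describes.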

tamr_toolbox.utils.downstream.projects(dataset, *, include_dependencies_by_name=False)

Returns a list of downstream projects for a dataset.

Parameters
  • dataset (Dataset) – The target dataset.

  • include_dependencies_by_name (bool) – Whether to include datasets based on name similarity. No dependencies will be found by name if the dataset is not a unified dataset, as determined either by the backend pipeline (if the project still exists) or by regex (the dataset name has the suffix ‘unified_dataset’).

Return type

List[Project]

Returns

List of downstream projects in order, including the project the target dataset is part of.

Upstream

Functions related to projects upstream of a specified project

tamr_toolbox.utils.upstream.datasets(dataset)

Check for upstream datasets associated with a specified dataset

Parameters

dataset (Dataset) – the Tamr dataset for which associated upstream datasets are retrieved

Return type

List[Dataset]

Returns

List of Tamr datasets upstream of the target dataset

tamr_toolbox.utils.upstream.projects(project)

Check for upstream projects associated with a specified project

Parameters

project (Project) – the tamr project for which associated upstream projects are retrieved

Return type

List[Project]

Returns

List of tamr projects upstream of the target project

Version

Tasks related to the version of Tamr instances

tamr_toolbox.utils.version.current(client)

Gets the version of Tamr for provided client

Parameters

client (Client) – Tamr client

Return type

str

Returns

String representation of Tamr version

tamr_toolbox.utils.version.is_version_condition_met(*, tamr_version, min_version, max_version=None, exact_version=False, raise_error=False)

Checks whether a Tamr version satisfies the given version condition.

Parameters
  • tamr_version (str) – The version of Tamr being considered

  • min_version (str) – The earliest version of Tamr

  • max_version (Optional[str]) – The latest version of Tamr. Default None, in which case no max version is tested for.

  • exact_version (bool) – Compare against only one release of Tamr. Default is False

  • raise_error (bool) – If True, raise an error if the version condition is not met. Default is False.

Raises
  • ValueError – if min_version is greater than max_version

  • EnvironmentError – if raise_error is True, and the condition is not met

Notes

Patch versions (major.minor.patch) are excluded from the comparison.

If exact_version is True, max_version will be ignored.

See also

utils.version.requires_tamr_version

Return type

bool
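
The comparison rules in the notes above (patch excluded, max_version ignored when exact_version is True, ValueError for an inverted range) can be sketched as below. This is an illustration under the assumption that versions are dot-separated integers such as "2021.002.1"; the names major_minor and version_condition_met are hypothetical, and the raise_error option is left out.

```python
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse 'major.minor[.patch]' and keep only (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def version_condition_met(
    *,
    tamr_version: str,
    min_version: str,
    max_version: Optional[str] = None,
    exact_version: bool = False,
) -> bool:
    """Sketch of a range check on (major, minor), ignoring the patch component."""
    current = major_minor(tamr_version)
    lowest = major_minor(min_version)
    if exact_version:
        # max_version is ignored when an exact match is requested.
        return current == lowest
    if max_version is None:
        return current >= lowest
    highest = major_minor(max_version)
    if lowest > highest:
        raise ValueError(f"min_version {min_version} is greater than max_version {max_version}")
    return lowest <= current <= highest


print(version_condition_met(tamr_version="2021.002.1", min_version="2021.002"))
print(version_condition_met(tamr_version="2020.016.0", min_version="2021.002"))
```

Comparing integer tuples avoids the pitfalls of comparing version strings character by character.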

tamr_toolbox.utils.version.requires_tamr_version(min_version, max_version=None, exact_version=False)

Decorator (applied with the @ syntax) for Tamr version checking

Parameters
  • min_version (str) – The earliest version of Tamr that supports the function

  • max_version (Optional[str]) – The latest version of Tamr that supports the function. Default None, supporting all latest releases of Tamr

  • exact_version (bool) – If True, only support one release of Tamr. Default is False

Examples

>>> @requires_tamr_version(min_version="2021.002")
... def refresh_dataset(tamr_dataset, *args, **kwargs):
...     return tamr_dataset.refresh()

Raises
  • ValueError – if min_version is greater than max_version

  • EnvironmentError – if the version condition is not met

Notes

This decorator only inspects the Tamr version of arguments going into the function, and not new instances of Tamr referred to within functional code

Patch versions (major.minor.patch) are excluded from the comparison

See also

utils.version.is_version_condition_met

Return type

Callable
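
The note that only arguments passed into the function are inspected can be made concrete with a toy decorator. This is a loose sketch, not the toolbox code: requires_version, FakeDataset, and the version attribute read from each argument are all hypothetical stand-ins for however the real decorator discovers the Tamr version.

```python
import functools
from typing import Any, Callable, Tuple


def _major_minor(version: str) -> Tuple[int, int]:
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def requires_version(min_version: str) -> Callable:
    """Hypothetical decorator: check a `version` attribute on the call's arguments."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Only objects handed to the function are inspected.
            for value in (*args, *kwargs.values()):
                version = getattr(value, "version", None)
                if version is not None and _major_minor(version) < _major_minor(min_version):
                    raise EnvironmentError(f"Version {version} is older than {min_version}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


class FakeDataset:
    """Stand-in object carrying the version of the instance it came from."""

    def __init__(self, version: str) -> None:
        self.version = version

    def refresh(self) -> str:
        return "refreshed"


@requires_version(min_version="2021.002")
def refresh_dataset(tamr_dataset: FakeDataset) -> str:
    return tamr_dataset.refresh()


print(refresh_dataset(FakeDataset("2021.005.0")))
```

A client created inside the body of refresh_dataset would never pass through wrapper, so it would go unchecked, which mirrors the limitation described in the notes.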

tamr_toolbox.utils.version.enforce_after_or_equal(client, *, compare_version)

Raises an exception if the version of the Tamr client is before the provided compare version

Will be deprecated in favour of raise_warn_tamr_version()

Parameters
  • client (Client) – Tamr client

  • compare_version (str) – String representation of Tamr version

Return type

None

Returns

None

See also

raise_warn_tamr_version, ensure_tamr_version