Categorization

Jobs

Tasks related to running jobs for Tamr Categorization projects

tamr_toolbox.project.categorization.jobs.run(project, *, run_apply_feedback=False, process_asynchronously=False)[source]

Run the project

Parameters
  • project (CategorizationProject) – The target categorization project

  • run_apply_feedback (bool) – Whether train should be called on the categorization model

  • process_asynchronously (bool) – If True, return without waiting for the job to finish; must be set to True for concurrent workflows

Return type

List[Operation]

Returns

The operations that were run
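
A minimal sketch of calling `run` synchronously and checking the returned operations. It assumes `tamr_toolbox` is installed, that `project` is an existing CategorizationProject, and that each returned operation exposes a `succeeded()` method as in `tamr_unify_client`. The helper names `failed_operations` and `run_and_check` are hypothetical and not part of the toolbox.

```python
from typing import Any, Iterable, List


def failed_operations(operations: Iterable[Any]) -> List[Any]:
    """Return the operations that did not report success."""
    # Assumption: each operation has a succeeded() method.
    return [op for op in operations if not op.succeeded()]


def run_and_check(project: Any) -> List[Any]:
    """Run the project synchronously and raise if any job did not succeed."""
    # Imported lazily so this sketch can be loaded without tamr_toolbox installed.
    import tamr_toolbox as tbox

    operations = tbox.project.categorization.jobs.run(
        project, run_apply_feedback=False, process_asynchronously=False
    )
    failed = failed_operations(operations)
    if failed:
        raise RuntimeError(f"{len(failed)} operation(s) did not succeed")
    return operations
```

The check is kept in a separate helper so it can be reused with the list returned by any of the job functions in this module.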

tamr_toolbox.project.categorization.jobs.update_unified_dataset(project, *, process_asynchronously=False)[source]

Updates the unified dataset for a categorization project

Parameters
  • project (CategorizationProject) – Target categorization project

  • process_asynchronously (bool) – If True, return without waiting for the job to finish; must be set to True for concurrent workflows

Return type

List[Operation]

Returns

The operations that were run

tamr_toolbox.project.categorization.jobs.apply_feedback(project, *, process_asynchronously=False)[source]

Trains the model only.

Parameters
  • project (CategorizationProject) – Target categorization project

  • process_asynchronously (bool) – If True, return without waiting for the job to finish; must be set to True for concurrent workflows

Return type

List[Operation]

Returns

The operations that were run

tamr_toolbox.project.categorization.jobs.apply_feedback_and_update_results(project, *, process_asynchronously=False)[source]

Trains the model and updates the categorization predictions of a categorization project

Parameters
  • project (CategorizationProject) – Target categorization project

  • process_asynchronously (bool) – If True, return without waiting for the job to finish; must be set to True for concurrent workflows

Return type

List[Operation]

Returns

The operations that were run

tamr_toolbox.project.categorization.jobs.update_results_only(project, *, process_asynchronously=False)[source]

Updates the categorization predictions based on the existing model of a categorization project

Parameters
  • project (CategorizationProject) – Target categorization project

  • process_asynchronously (bool) – If True, return without waiting for the job to finish; must be set to True for concurrent workflows

Return type

List[Operation]

Returns

The operations that were run
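
The individual job functions above share the same calling convention (a project plus the keyword-only `process_asynchronously` flag) and each returns a list of operations, so they can be chained. The sketch below shows one way to do that; `run_steps` is a hypothetical helper, and the step order in the usage comment is an assumption about a typical workflow rather than something the toolbox prescribes.

```python
from typing import Any, Callable, List, Sequence


def run_steps(project: Any, steps: Sequence[Callable[..., List[Any]]]) -> List[Any]:
    """Call each job function in order and collect the operations they return."""
    operations: List[Any] = []
    for step in steps:
        # Each step blocks until its job finishes because the flag is False.
        operations.extend(step(project, process_asynchronously=False))
    return operations


# Possible usage, assuming tamr_toolbox is installed and `project` exists:
#   import tamr_toolbox as tbox
#   jobs = tbox.project.categorization.jobs
#   ops = run_steps(
#       project,
#       [jobs.update_unified_dataset, jobs.apply_feedback, jobs.update_results_only],
#   )
```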

Metrics

Tasks related to metrics for Tamr Categorization projects

tamr_toolbox.project.categorization.metrics.get_tier_confidence(project, *, tier=-1, allow_dataset_refresh=False)[source]

Extracts the tier-specific average confidence from the Tamr internal dataset <unified dataset name>_classifications_average_confidences and returns it as a dictionary

Parameters
  • project (Project) – Tamr project object

  • tier (int) – Tier from which to extract the average confidence; the default value (-1) returns the average confidence at all leaf categories

  • allow_dataset_refresh (bool) – if True, allows running a job to refresh dataset to make it streamable

Return type

Dict[str, Any]

Returns

Dictionary whose keys are category paths (joined by ‘|’ for a multi-level taxonomy) and whose values are the average confidence of the corresponding category

Raises
  • RuntimeError – if dataset is not streamable and allow_dataset_refresh is False

  • TypeError – if tier is not of type int; or if the project type is not classification

  • ValueError – if tier is less than -1 or equal to 0
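
The returned dictionary can be post-processed with plain Python. The sketch below lists categories whose average confidence falls below a threshold and splits each key back into its path components. `categories_below` is a hypothetical helper, the sample values are invented for illustration, and skipping `None` values is a defensive assumption rather than documented behavior.

```python
from typing import Any, Dict, List, Tuple


def categories_below(
    tier_confidence: Dict[str, Any], threshold: float
) -> List[Tuple[List[str], float]]:
    """Return (path, confidence) pairs below the threshold, lowest confidence first."""
    results = []
    for joined_path, confidence in tier_confidence.items():
        if confidence is None:
            # Assumption: treat a missing confidence as "nothing to report".
            continue
        if confidence < threshold:
            # Keys are category paths joined by '|'.
            results.append((joined_path.split("|"), confidence))
    return sorted(results, key=lambda item: item[1])


# Illustrative input shaped like the documented return value.
sample = {"Parts|Bolts": 0.42, "Parts|Nuts": 0.91, "Tools": 0.3}
low = categories_below(sample, threshold=0.5)
```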

Taxonomy Management

Tasks related to editing the taxonomy for a Tamr categorization project

tamr_toolbox.project.categorization.taxonomy.delete_node(client, project_id, path, force_delete=False)[source]

Deletes a node from a taxonomy.

Parameters
  • client (Client) – Tamr client connected to target instance

  • project_id (str) – ID of the categorization project

  • path (list) – Full path of the node to be deleted

  • force_delete (bool) – Optional flag. Default is False. If True, deletes the node even if there are still records assigned to that category. If False, the operation fails with an error in that case.

Returns: None

tamr_toolbox.project.categorization.taxonomy.rename_node(client, project_id, new_name, path)[source]

Renames an existing node in the taxonomy.

Parameters
  • client (Client) – Tamr client connected to target instance

  • project_id (str) – ID of the categorization project

  • new_name (str) – New name to assign to the leaf node

  • path (list) – Full path of the existing leaf node to rename

Returns: None

tamr_toolbox.project.categorization.taxonomy.create_node(client, project_id, path)[source]

Creates a category with the specified path in the project taxonomy.

Parameters
  • client (Client) – Tamr client connected to target instance

  • project_id (str) – ID of the categorization project

  • path (list) – Full path of the new category to be added

Returns: None
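
The `path` argument is the full list of category names from the top tier down to the node itself. If intermediate categories must exist before a child can be created (an assumption, not stated above), the paths to create can be derived as in this sketch. `ancestor_paths` is a hypothetical helper.

```python
from typing import List


def ancestor_paths(path: List[str]) -> List[List[str]]:
    """Return every prefix of a category path, shortest first, ending with the path itself."""
    return [path[: depth + 1] for depth in range(len(path))]


# Possible usage, assuming tamr_toolbox is installed and `client` / `project_id` exist:
#   import tamr_toolbox as tbox
#   for p in ancestor_paths(["Parts", "Fasteners", "Bolts"]):
#       tbox.project.categorization.taxonomy.create_node(client, project_id, p)
# Whether create_node tolerates a path that already exists is not stated here,
# so check against the current taxonomy before calling it in a loop.
```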

tamr_toolbox.project.categorization.taxonomy.get_taxonomy_as_dataframe(client, project_id)[source]

Returns the taxonomy for a project given the project ID.

Parameters
  • client (Client) – Tamr client connected to target instance

  • project_id (str) – ID of the categorization project

Return type

DataFrame

Returns

Current taxonomy categories as a dataframe

Raises

RuntimeError – if project is not a categorization project or if the taxonomy does not exist

tamr_toolbox.project.categorization.taxonomy.move_node(client, project_id, old_node_path, new_node_path, move_verifications=True)[source]

Function to move a node in a taxonomy to a new path. By default, the function will also move any verified categorizations under the old node to the new paths.

Parameters
  • client (Client) – Tamr client connected to the target instance.

  • project_id (str) – Project ID of categorization project.

  • old_node_path (list) – List of the full path for the node to be moved.

  • new_node_path (list) – List of the full path for where the node is to be moved to.

  • move_verifications (bool) – Optional boolean argument to move verifications to the new path. Setting to False may result in loss of work.

Returns: None
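
Because `new_node_path` is the full destination path including the node's own name, it can be convenient to derive it from the old path and the new parent. The sketch below does that; `path_under_new_parent` is a hypothetical helper, and the subtree guard is a sanity check added for illustration, not documented toolbox behavior.

```python
from typing import List, Sequence


def path_under_new_parent(
    old_node_path: Sequence[str], new_parent_path: Sequence[str]
) -> List[str]:
    """Build the destination path for a node that keeps its name under a new parent."""
    old = list(old_node_path)
    parent = list(new_parent_path)
    if not old:
        raise ValueError("old_node_path must not be empty")
    # Guard: a node cannot become a descendant of itself.
    if parent[: len(old)] == old:
        raise ValueError("cannot move a node into its own subtree")
    return parent + [old[-1]]


# Possible usage, assuming tamr_toolbox is installed and `client` / `project_id` exist:
#   import tamr_toolbox as tbox
#   old = ["Parts", "Bolts"]
#   new = path_under_new_parent(old, ["Hardware"])
#   tbox.project.categorization.taxonomy.move_node(client, project_id, old, new)
```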

Schema

tamr_toolbox.project.categorization.schema.map_attribute(project, *, source_attribute_name, source_dataset_name, unified_attribute_name)

Maps source_attribute in source_dataset to unified_attribute in unified_dataset. If the mapping already exists it will log a warning and return the existing AttributeMapping from the project’s collection.

Parameters
  • source_attribute_name (str) – Source attribute name to map

  • source_dataset_name (str) – Source dataset containing the source attribute

  • unified_attribute_name (str) – Unified attribute to which to map the source attribute

  • project (Project) – The project in which to perform the mapping

Return type

AttributeMapping

Returns

The created AttributeMapping

Raises

ValueError – if source_attribute_name, source_dataset_name, or unified_attribute_name is an empty string; or if the dataset source_dataset_name is not found on Tamr; or if source_attribute_name is missing from the attributes of source_dataset_name
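
All arguments after `project` are keyword-only. The sketch below maps several source attributes from one dataset in a loop; `map_attributes` is a hypothetical helper that receives the mapping function as a parameter, and the dataset and attribute names in the usage comment are invented.

```python
from typing import Any, Callable, Dict, List


def map_attributes(
    project: Any,
    source_dataset_name: str,
    name_pairs: Dict[str, str],
    map_attribute: Callable[..., Any],
) -> List[Any]:
    """Map each source attribute to its unified attribute and collect the results."""
    mappings = []
    for source_name, unified_name in name_pairs.items():
        mappings.append(
            map_attribute(
                project,
                source_attribute_name=source_name,
                source_dataset_name=source_dataset_name,
                unified_attribute_name=unified_name,
            )
        )
    return mappings


# Possible usage, assuming tamr_toolbox is installed and `project` exists:
#   import tamr_toolbox as tbox
#   map_attributes(
#       project,
#       "my_source.csv",
#       {"first": "first_name", "last": "last_name"},
#       tbox.project.categorization.schema.map_attribute,
#   )
```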

tamr_toolbox.project.categorization.schema.unmap_attribute(project, *, source_attribute_name, source_dataset_name, unified_attribute_name)

Unmaps a source attribute.

Parameters
  • source_attribute_name (str) – the name of the source attribute to unmap

  • source_dataset_name (str) – the name of the source dataset containing that source attribute

  • unified_attribute_name (str) – the unified attribute from which to unmap

  • project (Project) – the project in which to unmap the attribute

Return type

None

Returns

None

tamr_toolbox.project.categorization.schema.bootstrap_dataset(project, *, source_dataset, force_add_dataset_to_project=False)

Bootstraps a dataset (i.e. maps all source columns to themselves)

Parameters
  • source_dataset (Dataset) – the source dataset (a Dataset object not a string)

  • project (Project) – the project in which to do the mapping

  • force_add_dataset_to_project (bool) – boolean whether to add the dataset to the project if it is not already a part of it

Return type

List[AttributeMapping]

Returns

List of the AttributeMappings generated

Raises

RuntimeError – if source_dataset is not part of the given project; set the ‘force_add_dataset_to_project’ flag to True to add it automatically

tamr_toolbox.project.categorization.schema.unmap_dataset(project, *, source_dataset, remove_dataset_from_project=False, skip_if_missing=False)

Wholly unmaps a dataset and optionally removes it from a project.

Parameters
  • source_dataset (Dataset) – the source dataset (Dataset object not a string) to unmap

  • project (Project) – the project in which to unmap the dataset

  • remove_dataset_from_project (bool) – boolean to also remove the dataset from the project

  • skip_if_missing (bool) – boolean to skip if the dataset is not in the project. If set to False and the dataset is not in the project, a RuntimeError is raised

Return type

None

Returns

None

Raises

RuntimeError – if source_dataset is not in project and skip_if_missing is not set to True

Transformations

class tamr_toolbox.project.categorization.transformations.InputTransformation(transformation, datasets=<factory>)

A transformation scoped to input datasets

Version:

Requires Tamr 2020.009.0 or later

Parameters
  • transformation (str) – The text of a transformations script

  • datasets (List[Dataset]) – The list of input datasets that the script should be applied to

class tamr_toolbox.project.categorization.transformations.TransformationGroup(input_scope=<factory>, unified_scope=<factory>)

A group of input transformations and unified transformations

Version:

Requires Tamr 2020.009.0 or later

Parameters
  • input_scope (List[InputTransformation]) – A list of transformations to apply to input datasets

  • unified_scope (List[str]) – A list of transformation scripts to apply to the unified dataset

tamr_toolbox.project.categorization.transformations.get_all(project)

Get the transformations of a Project

Version:

Requires Tamr 2020.009.0 or later

Parameters

project (Project) – Project containing transformations

Return type

TransformationGroup

Returns

All input transformations and unified transformations of a project

tamr_toolbox.project.categorization.transformations.set_all(project, tx, *, allow_overwrite=True)

Set the transformations of a Project

Version:

Requires Tamr 2020.009.0 or later

Parameters
  • project (Project) – Project to place transformations within

  • tx (TransformationGroup) – Transformations to put into project

  • allow_overwrite (bool) – Whether existing transformations can be overwritten

Return type

Response

Returns

Response object created when transformations of a project are replaced

Raises
  • RuntimeError – if allow_overwrite is set to False but transformations already exist in project

  • ValueError – if provided tx are invalid

tamr_toolbox.project.categorization.transformations.get_all_unified(project)

Get the unified transformations of a Project

Version:

Requires Tamr 2020.009.0 or later

Parameters

project (Project) – Project containing transformations

Return type

List[str]

Returns

All unified transformations of a project

tamr_toolbox.project.categorization.transformations.set_all_unified(project, tx, *, allow_overwrite=True)

Set the unified transformations of a Project. Any input transformations will not be altered

Version:

Requires Tamr 2020.009.0 or later

Parameters
  • project (Project) – Project to place transformations within

  • tx (List[str]) – Unified transformations to put into project

  • allow_overwrite (bool) – Whether existing unified transformations can be overwritten

Return type

Response

Returns

Response object created when transformations of a project are replaced

Raises

RuntimeError – if allow_overwrite is set to False but transformations already exist in project
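
Since `get_all_unified` returns a list of script strings and `set_all_unified` replaces that list, adding one script is a read-modify-write. The sketch below shows that pattern; `append_unified_script` is a hypothetical helper that takes the getter and setter as parameters, and the script text in the usage comment is a placeholder.

```python
from typing import Any, Callable, List, Optional


def append_unified_script(
    project: Any,
    script: str,
    get_all_unified: Callable[[Any], List[str]],
    set_all_unified: Callable[..., Any],
) -> Optional[Any]:
    """Append a unified transformation unless it is already present."""
    scripts = list(get_all_unified(project))
    if script in scripts:
        # Nothing to change, so no request is sent.
        return None
    scripts.append(script)
    # allow_overwrite=True because the full list, including existing scripts, is replaced.
    return set_all_unified(project, scripts, allow_overwrite=True)


# Possible usage, assuming tamr_toolbox is installed and `project` exists:
#   import tamr_toolbox as tbox
#   tx = tbox.project.categorization.transformations
#   append_unified_script(project, "<script text>", tx.get_all_unified, tx.set_all_unified)
```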