Dataset

Manage

tamr_toolbox.dataset.manage.exists(*, client, dataset_name)[source]

Check if a dataset exists in a Tamr instance

Parameters
  • client (Client) – Tamr Python client object for the target instance

  • dataset_name (str) – The dataset name

Return type

bool

Returns

True if the dataset exists in the target instance, False otherwise
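
A common use of exists is to guard dataset creation. The following is a minimal sketch of that pattern, not part of tamr_toolbox: get_or_create is a hypothetical helper, and its manage argument is assumed to behave like tamr_toolbox.dataset.manage (it is passed in so the logic can be tried with a stub instead of a live instance). It also assumes the client exposes datasets.by_name, as used in the update example on this page.

```python
def get_or_create(manage, *, client, dataset_name, primary_keys):
    """Return the named dataset, creating it first if it does not exist.

    `manage` is assumed to behave like tamr_toolbox.dataset.manage
    (hypothetical wiring, chosen so the helper can be exercised with a stub).
    """
    if manage.exists(client=client, dataset_name=dataset_name):
        # The dataset is already present: fetch it by name.
        return client.datasets.by_name(dataset_name)
    # The dataset is missing: create it with the given primary key(s).
    return manage.create(
        client=client, dataset_name=dataset_name, primary_keys=primary_keys
    )
```

With the real library, the call would look like get_or_create(tbox.dataset.manage, client=tamr_client, dataset_name="my_new_dataset", primary_keys=["unique_id"]).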

tamr_toolbox.dataset.manage.create(*, client, dataset_name, dataset=None, primary_keys=None, attributes=None, attribute_types=None, attribute_descriptions=None, description=None, external_id=None, tags=None)[source]

Flexibly create a source dataset in Tamr

A template dataset object can be passed in to create a duplicate dataset with a new name. If the template dataset is not provided, the primary_keys must be defined for the dataset to be created. Additional attributes can be added in the attributes argument. The default attribute type will be ARRAY STRING. Non-default attribute types can be specified in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary.
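
The way default and non-default types combine can be pictured with a small stand-alone model. This is only an illustration under stated assumptions, not the library's implementation: plan_attribute_types is a hypothetical name, and plain strings stand in for the real AttributeType objects.

```python
from typing import Dict, Iterable, Optional

# Stand-in label for the documented default type (assumption: a plain
# string is used here instead of a real AttributeType object).
DEFAULT_TYPE = "ARRAY STRING"


def plan_attribute_types(
    attributes: Iterable[str],
    attribute_types: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map each attribute name to its type, falling back to the default."""
    overrides = attribute_types or {}
    return {name: overrides.get(name, DEFAULT_TYPE) for name in attributes}


planned = plan_attribute_types(
    ["name", "address", "total_sales"],
    {"total_sales": "ARRAY DOUBLE"},
)
```

In this model only total_sales receives a non-default type; name and address fall back to the default.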

Parameters
  • client (Client) – TUC client object

  • dataset_name (str) – name for the new dataset being created

  • dataset (Optional[Dataset]) – optional dataset TUC object to use as a template for the new dataset

  • primary_keys (Optional[List[str]]) – one or more attributes for primary key(s) of the new dataset

  • attributes (Optional[Iterable[str]]) – a list of attribute names to create in the new dataset

  • attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value

  • attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value

  • description (Optional[str]) – description of the new dataset

  • external_id (Optional[str]) – external_id for the dataset; if None, Tamr will create one for you

  • tags (Optional[List[str]]) – the list of tags for the new dataset

Return type

Dataset

Returns

Dataset created in Tamr

Raises

Example

>>> import tamr_toolbox as tbox
>>> tamr_client = tbox.utils.client.create(**instance_connection_info)
>>> tbox.dataset.manage.create(
...     client=tamr_client,
...     dataset_name="my_new_dataset",
...     primary_keys=["unique_id"],
...     attributes=["name", "address"],
...     description="My new dataset",
... )

tamr_toolbox.dataset.manage.update(dataset, *, attributes=None, attribute_types=None, attribute_descriptions=None, description=None, tags=None, override_existing_types=False)[source]

Flexibly update a source dataset in Tamr

All the attributes that should exist in the dataset must be defined in the attributes argument. This function adds and removes attributes until the dataset's attributes match the set of attributes passed in as an argument. The default attribute type will be ARRAY STRING. To set non-default attribute types, they must be defined in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary. By default, existing attribute types will not change unless override_existing_types is set to True. When it is False, attribute type changes are only logged, not applied.
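
The add/remove behaviour described above amounts to comparing the existing attribute names with the desired ones. The sketch below models that comparison in plain Python; plan_attribute_changes is a hypothetical name and the code is not taken from tamr_toolbox.

```python
from typing import Iterable, List, Tuple


def plan_attribute_changes(
    existing: Iterable[str], desired: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) so that existing ends up matching desired."""
    existing_list = list(existing)
    desired_list = list(desired)
    existing_set = set(existing_list)
    desired_set = set(desired_list)
    # Attributes requested but not yet present are added.
    to_add = [name for name in desired_list if name not in existing_set]
    # Attributes present but not requested are removed.
    to_remove = [name for name in existing_list if name not in desired_set]
    return to_add, to_remove


to_add, to_remove = plan_attribute_changes(
    existing=["unique_id", "name", "phone"],
    desired=["unique_id", "name", "address", "total_sales"],
)
```

Here address and total_sales would be added and phone would be removed, which is why the attributes argument must list every attribute that should remain.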

Parameters
  • dataset (Dataset) – An existing TUC dataset

  • attributes (Optional[Iterable[str]]) – Complete list of attribute names that should exist in the updated dataset

  • attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value

  • attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value

  • description (Optional[str]) – updated description of the dataset; if None, the description is not updated

  • tags (Optional[List[str]]) – updated tags for the dataset; if None, the tags are not updated

  • override_existing_types (bool) – boolean flag; when True, the types of existing attributes are altered

Return type

Dataset

Returns

Updated Dataset

Raises

Example

>>> import tamr_toolbox as tbox
>>> from tamr_toolbox.models import attribute_type
>>> tamr_client = tbox.utils.client.create(**instance_connection_info)
>>> dataset = tamr_client.datasets.by_name("my_dataset_name")
>>> tbox.dataset.manage.update(
...     dataset,
...     attributes=["unique_id", "name", "address", "total_sales"],
...     attribute_types={"total_sales": attribute_type.ARRAY(attribute_type.DOUBLE)},
...     override_existing_types=True,
... )

tamr_toolbox.dataset.manage.create_attributes(*, dataset, attributes, attribute_types=None, attribute_descriptions=None)[source]

Create new attributes in a dataset

The default attribute type will be ARRAY STRING. To set non-default attribute types, they must be defined in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary.

Parameters
  • dataset (Dataset) – An existing TUC dataset

  • attributes (Iterable[str]) – list of attribute names to be added to dataset

  • attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value

  • attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value

Return type

Dataset

Returns

Updated Dataset

Raises
  • requests.HTTPError – If any HTTP error is encountered

  • TypeError – If the attributes argument is not an Iterable

  • ValueError – If the dataset is a unified dataset

  • ValueError – If an attribute passed in already exists in the dataset

tamr_toolbox.dataset.manage.edit_attributes(*, dataset, attribute_types=None, attribute_descriptions=None, override_existing_types=True)[source]

Edit existing attributes in a dataset

The attribute type and/or descriptions can be updated to new values. Attributes that will be updated must be in either the attribute_types or attribute_descriptions dictionaries or both. The default attribute type will be ARRAY STRING. To set non-default attribute types, they must be defined in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary. If only the attribute_descriptions dictionary is defined, the attribute type will not be updated.
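
One way to read the rule above is that each attribute is edited only in the fields that were supplied for it. The sketch below models that reading with plain dictionaries. It is an assumption-laden illustration, not the library's code: plan_edits is a hypothetical name, strings stand in for AttributeType objects, and the empty-input check is one interpretation of the "no updates" ValueError listed under Raises.

```python
from typing import Dict, Optional


def plan_edits(
    attribute_types: Optional[Dict[str, str]] = None,
    attribute_descriptions: Optional[Dict[str, str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Collect, per attribute, only the fields that were actually supplied."""
    types = attribute_types or {}
    descriptions = attribute_descriptions or {}
    plan: Dict[str, Dict[str, str]] = {}
    for name in [*types, *descriptions]:
        change = plan.setdefault(name, {})
        if name in types:
            change["type"] = types[name]
        if name in descriptions:
            change["description"] = descriptions[name]
    if not plan:
        # Modeled on the documented ValueError for a call with no updates.
        raise ValueError("No attribute updates were provided")
    return plan


plan = plan_edits(
    attribute_types={"total_sales": "ARRAY DOUBLE"},
    attribute_descriptions={"name": "Customer name", "total_sales": "Sum of sales"},
)
```

In this model name receives only a new description, so its type is left alone, while total_sales receives both a type and a description.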

Parameters
  • dataset (Dataset) – An existing TUC dataset

  • attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value

  • attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value

  • override_existing_types (bool) – boolean flag; when True, the types of existing attributes are altered

Return type

Dataset

Returns

Updated Dataset

Raises
  • requests.HTTPError – If any HTTP error is encountered

  • ValueError – If the dataset is not a source dataset

  • ValueError – If a passed attribute does not exist in the dataset

  • ValueError – If a passed attribute is a primary key and can’t be removed

  • ValueError – If there are no updates to attributes in attribute_types or attribute_descriptions arguments

tamr_toolbox.dataset.manage.delete_attributes(*, dataset, attributes=None)[source]

Remove attributes from dataset by attribute name

Parameters
  • dataset (Dataset) – An existing TUC dataset

  • attributes (Optional[Iterable[str]]) – list of attribute names to delete from dataset

Return type

Dataset

Returns

Updated Dataset

Raises
  • ValueError – If the dataset is not a source dataset

  • ValueError – If a passed attribute does not exist in the dataset

  • ValueError – If a passed attribute is a primary key and can’t be removed

  • TypeError – If the attributes argument is not an Iterable
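
The documented error conditions can be summarized as a set of checks made before anything is deleted. The following stand-alone sketch mirrors those checks on plain lists of names. check_deletable is a hypothetical helper, and the order of the checks is an assumption rather than the library's actual behaviour.

```python
from collections.abc import Iterable
from typing import List


def check_deletable(existing_names, primary_keys, attributes) -> List[str]:
    """Validate a deletion request and return the names that may be deleted."""
    if not isinstance(attributes, Iterable):
        raise TypeError("attributes must be an Iterable of attribute names")
    existing = set(existing_names)
    keys = set(primary_keys)
    targets = list(attributes)
    for name in targets:
        if name not in existing:
            raise ValueError(f"Attribute {name!r} does not exist in the dataset")
        if name in keys:
            raise ValueError(f"Attribute {name!r} is a primary key and cannot be removed")
    return targets


deletable = check_deletable(
    existing_names=["unique_id", "name", "address"],
    primary_keys=["unique_id"],
    attributes=["address"],
)
```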

tamr_toolbox.dataset.manage.update_records(dataset, *, updates=None, delete_all=False, primary_keys, primary_key_name)[source]

Flexibly update the records of a dataset

The user supplies a list of primary keys for a subset of the dataset's records, along with a list of updates describing how each record should be altered. An update should either be the string “delete” or a dictionary in “attribute: value” format. In the first case, the record with the corresponding primary key is deleted. In the second case, the data in the dictionary replaces the record with the corresponding primary key; if no such record exists, a new record is created. Alternatively, the user can set the delete_all flag to specify that all records indicated by the list of primary keys should be deleted.

Parameters
  • dataset (Dataset) – An existing TUC dataset

  • updates (Optional[list]) – List of updates to push to the dataset

  • delete_all (bool) – Whether all indicated records should be deleted

  • primary_keys (List[str]) – List of primary key values for all target records

  • primary_key_name (str) – Name of the primary key of the target dataset

Returns

Updated dataset

Raises
  • KeyError – If an indicated attribute does not exist

  • TypeError – If an update in the list is not “delete” or a dict

  • ValueError – If updates and primary_keys have differing lengths
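
The update semantics can be tried out on an in-memory stand-in for a dataset. The sketch below is a toy model under stated assumptions, not the library's implementation: apply_updates is a hypothetical function, records are held in a dict keyed by primary key value, and storing the key under primary_key_name in each replaced record is an assumption made for illustration.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def apply_updates(
    records: Dict[str, Record],
    *,
    primary_keys: List[str],
    primary_key_name: str,
    updates: Optional[list] = None,
    delete_all: bool = False,
) -> Dict[str, Record]:
    """Return a copy of `records` with the updates applied (toy model)."""
    result = dict(records)
    if delete_all:
        # Every record named in primary_keys is deleted; updates is ignored.
        for key in primary_keys:
            result.pop(key, None)
        return result
    updates = updates or []
    if len(updates) != len(primary_keys):
        raise ValueError("updates and primary_keys must have the same length")
    for key, update in zip(primary_keys, updates):
        if isinstance(update, dict):
            # Replace the record, or create it if the key is new.
            result[key] = {**update, primary_key_name: key}
        elif update == "delete":
            result.pop(key, None)
        else:
            raise TypeError('each update must be "delete" or a dict')
    return result


records = {
    "1": {"unique_id": "1", "name": "Alice"},
    "2": {"unique_id": "2", "name": "Bob"},
}
updated = apply_updates(
    records,
    primary_keys=["1", "3"],
    primary_key_name="unique_id",
    updates=["delete", {"name": "Carol"}],
)
```

In this model record 1 is deleted, record 2 is untouched, and record 3 is created because no record with that key existed.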

Profile

Additional functions to manipulate the profile of the dataset.

tamr_toolbox.dataset._dataset.get_profile(dataset, allow_create_or_refresh=False)[source]

Returns a dataset profile object. Optionally can refresh or create profile if missing or out-of-date.

Parameters
  • dataset (Dataset) – Tamr dataset object

  • allow_create_or_refresh (bool) – optional bool to allow creation/refreshing of profile info

Return type

DatasetProfile

Returns

DatasetProfile object. A warning is issued if the profile information is out of date and allow_create_or_refresh is False.

Raises

RuntimeError – if profile has not been created and allow_create_or_refresh is False
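
The interaction between the profile's state and allow_create_or_refresh can be laid out as a small decision function. This is a sketch of the documented behaviour only: plan_profile_action and its boolean inputs are hypothetical, and the real function works on a Dataset object rather than on flags.

```python
import logging

LOGGER = logging.getLogger(__name__)


def plan_profile_action(
    *, profile_exists: bool, up_to_date: bool, allow_create_or_refresh: bool
) -> str:
    """Return which step the documented behaviour implies for a given state."""
    if not profile_exists:
        if not allow_create_or_refresh:
            # Documented: RuntimeError when the profile is missing and
            # creation is not allowed.
            raise RuntimeError("Profile has not been created")
        return "create"
    if not up_to_date:
        if allow_create_or_refresh:
            return "refresh"
        # Documented: only a warning when the profile is stale and
        # refreshing is not allowed; the stale profile is still returned.
        LOGGER.warning("Profile information is out of date")
    return "return_existing"
```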