Dataset
Manage
- tamr_toolbox.dataset.manage.exists(*, client, dataset_name)
Check if a dataset exists in a Tamr instance
- tamr_toolbox.dataset.manage.create(*, client, dataset_name, dataset=None, primary_keys=None, attributes=None, attribute_types=None, attribute_descriptions=None, description=None, external_id=None, tags=None)
Flexibly create a source dataset in Tamr
A template dataset object can be passed in to create a duplicate dataset with a new name. If the template dataset is not provided, the primary_keys must be defined for the dataset to be created. Additional attributes can be added in the attributes argument. The default attribute type will be ARRAY STRING. Non-default attribute types can be specified in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary.
- Parameters
client (Client) – TUC client object
dataset_name (str) – name for the new dataset being created
dataset (Optional[Dataset]) – optional dataset TUC object to use as a template for the new dataset
primary_keys (Optional[List[str]]) – one or more attributes for primary key(s) of the new dataset
attributes (Optional[Iterable[str]]) – a list of attribute names to create in the new dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
description (Optional[str]) – description of the new dataset
external_id (Optional[str]) – external_id for dataset, if None Tamr will create one for you
tags (Optional[List[str]]) – the list of tags for the new dataset
- Return type
- Returns
Dataset created in Tamr
- Raises
requests.HTTPError – If any HTTP error is encountered
ValueError – If neither dataset nor primary_keys is defined
ValueError – If the dataset already exists
TypeError – If the attributes argument is not an Iterable
Example
>>> import tamr_toolbox as tbox
>>> tamr_client = tbox.utils.client.create(**instance_connection_info)
>>> tbox.dataset.manage.create(
...     client=tamr_client,
...     dataset_name="my_new_dataset",
...     primary_keys=["unique_id"],
...     attributes=["name", "address"],
...     description="My new dataset",
... )
- tamr_toolbox.dataset.manage.update(dataset, *, attributes=None, attribute_types=None, attribute_descriptions=None, description=None, tags=None, override_existing_types=False)
Flexibly update a source dataset in Tamr
All the attributes that should exist in the dataset must be defined in the attributes argument. This function adds and removes attributes until the dataset's attributes match the set passed in. The default attribute type is ARRAY STRING. Non-default attribute types must be defined in the attribute_types dictionary. Attribute descriptions can be specified in the attribute_descriptions dictionary. By default, existing attribute types are not changed: when override_existing_types is False, attribute type updates are only logged, and they are applied only when it is set to True.
- Parameters
dataset (Dataset) – An existing TUC dataset
attributes (Optional[Iterable[str]]) – Complete list of attribute names that should exist in the updated dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
description (Optional[str]) – updated description of dataset, if None will not update the description
tags (Optional[List[str]]) – updated tags for the dataset, if None will not update tags
override_existing_types (bool) – boolean flag, when true will alter existing attribute’s types
- Return type
- Returns
Updated Dataset
- Raises
requests.HTTPError – If any HTTP error is encountered
ValueError – If the dataset is not a source dataset
TypeError – If the attributes argument is not an Iterable
Example
>>> import tamr_toolbox as tbox
>>> from tamr_toolbox.models import attribute_type
>>> tamr_client = tbox.utils.client.create(**instance_connection_info)
>>> dataset = tamr_client.datasets.by_name("my_dataset_name")
>>> tbox.dataset.manage.update(
...     dataset=dataset,
...     attributes=["unique_id", "name", "address", "total_sales"],
...     attribute_types={"total_sales": attribute_type.ARRAY(attribute_type.DOUBLE)},
...     override_existing_types=True,
... )
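The add/remove reconciliation described for update can be pictured as a set difference between the attribute names already on the dataset and the complete list passed in. The sketch below is a plain-Python illustration only, not the library's implementation; the helper name plan_attribute_changes is hypothetical, and it works on attribute names without contacting Tamr.

```python
from typing import Iterable, List, Tuple


def plan_attribute_changes(
    existing: Iterable[str], desired: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) so that `existing` ends up matching `desired`.

    Hypothetical helper for illustration; it does not call Tamr.
    """
    existing_set = set(existing)
    desired_set = set(desired)
    # Attributes requested but not yet present would need to be created.
    to_add = sorted(desired_set - existing_set)
    # Attributes present but no longer requested would need to be deleted.
    to_remove = sorted(existing_set - desired_set)
    return to_add, to_remove


to_add, to_remove = plan_attribute_changes(
    existing=["unique_id", "name", "phone"],
    desired=["unique_id", "name", "address", "total_sales"],
)
print(to_add)     # ['address', 'total_sales']
print(to_remove)  # ['phone']
```

Because the attributes argument is treated as the complete list, leaving out an attribute that is currently on the dataset places it in the removal set.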
- tamr_toolbox.dataset.manage.create_attributes(*, dataset, attributes, attribute_types=None, attribute_descriptions=None)
Create new attributes in a dataset
The default attribute type will be ARRAY STRING. To set non-default attribute types, they must be defined in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary.
- Parameters
dataset (Dataset) – An existing TUC dataset
attributes (Iterable[str]) – list of attribute names to be added to dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
- Return type
- Returns
Updated Dataset
- Raises
requests.HTTPError – If any HTTP error is encountered
TypeError – If the attributes argument is not an Iterable
ValueError – If the dataset is a unified dataset
ValueError – If an attribute passed in already exists in the dataset
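As a rough illustration of how the three arguments of create_attributes relate, the sketch below pairs each attribute name with its type and description, falling back to the documented ARRAY STRING default. It is an assumption-laden model rather than the library's code: resolve_attribute_specs is a hypothetical helper, and plain strings stand in for the real AttributeType objects.

```python
from typing import Dict, Iterable, List, Optional

# Placeholder label for the documented default type; the real library uses
# AttributeType objects, not strings.
DEFAULT_TYPE = "ARRAY STRING"


def resolve_attribute_specs(
    attributes: Iterable[str],
    attribute_types: Optional[Dict[str, str]] = None,
    attribute_descriptions: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Pair every attribute name with its type and description (hypothetical helper)."""
    attribute_types = attribute_types or {}
    attribute_descriptions = attribute_descriptions or {}
    specs = []
    for name in attributes:
        specs.append(
            {
                "name": name,
                # Use the explicit type if one was given, else the default.
                "type": attribute_types.get(name, DEFAULT_TYPE),
                # Descriptions are optional; None means no description was given.
                "description": attribute_descriptions.get(name),
            }
        )
    return specs


specs = resolve_attribute_specs(
    ["name", "total_sales"],
    attribute_types={"total_sales": "ARRAY DOUBLE"},
    attribute_descriptions={"name": "Customer name"},
)
for spec in specs:
    print(spec)
```

Only attributes that need a non-default type or a description have to appear in the two dictionaries; every other name in attributes takes the default.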
- tamr_toolbox.dataset.manage.edit_attributes(*, dataset, attribute_types=None, attribute_descriptions=None, override_existing_types=True)
Edit existing attributes in a dataset
The attribute type and/or descriptions can be updated to new values. Attributes that will be updated must be in either the attribute_types or attribute_descriptions dictionaries or both. The default attribute type will be ARRAY STRING. To set non-default attribute types, they must be defined in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary. If only the attribute_descriptions dictionary is defined, the attribute type will not be updated.
- Parameters
dataset (Dataset) – An existing TUC dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
override_existing_types (bool) – bool flag, when true will alter existing attributes
- Return type
- Returns
Updated Dataset
- Raises
requests.HTTPError – If any HTTP error is encountered
ValueError – If the dataset is not a source dataset
ValueError – If a passed attribute does not exist in the dataset
ValueError – If a passed attribute is a primary key and can’t be removed
ValueError – If there are no updates to attributes in attribute_types or attribute_descriptions arguments
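The ValueError conditions listed for edit_attributes can be read as a pre-flight check on the attribute names appearing in the two dictionaries. The sketch below models that check in plain Python. It is not the library's implementation: check_edit_request is a hypothetical helper, the order of the checks and the error messages are assumptions, and strings stand in for AttributeType values.

```python
from typing import Dict, Iterable, List, Optional


def check_edit_request(
    existing: Iterable[str],
    primary_keys: Iterable[str],
    attribute_types: Optional[Dict[str, object]] = None,
    attribute_descriptions: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the sorted attribute names an edit would touch, or raise ValueError."""
    # An attribute is a target if it appears in either dictionary.
    targets = set(attribute_types or {}) | set(attribute_descriptions or {})
    if not targets:
        raise ValueError("No attribute updates were supplied")
    missing = targets - set(existing)
    if missing:
        raise ValueError(f"Attributes not in dataset: {sorted(missing)}")
    protected = targets & set(primary_keys)
    if protected:
        raise ValueError(f"Primary key attributes cannot be changed: {sorted(protected)}")
    return sorted(targets)


targets = check_edit_request(
    existing=["unique_id", "name", "total_sales"],
    primary_keys=["unique_id"],
    attribute_types={"total_sales": "ARRAY DOUBLE"},
    attribute_descriptions={"name": "Customer name"},
)
print(targets)  # ['name', 'total_sales']
```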
- tamr_toolbox.dataset.manage.delete_attributes(*, dataset, attributes=None)
Remove attributes from dataset by attribute name
- Parameters
- Return type
- Returns
Updated Dataset
- Raises
ValueError – If the dataset is not a source dataset
ValueError – If a passed attribute does not exist in the dataset
ValueError – If a passed attribute is a primary key and can’t be removed
TypeError – If the attributes argument is not an Iterable
- tamr_toolbox.dataset.manage.update_records(dataset, *, updates=None, delete_all=False, primary_keys, primary_key_name)
Flexibly update the records of a dataset. The user supplies a list of primary keys for a subset of the dataset's records, along with a list of updates describing how each record should be altered. Each update is either the string “delete” or a dictionary in “attribute: value” format. In the first case, the record with the corresponding primary key is deleted. In the second case, the data in the dictionary replaces the record with the corresponding primary key; if no such record exists, a new record is created. Alternatively, the user can set a flag to specify that all records indicated by the list of primary keys should be deleted.
- Parameters
dataset (Dataset) – An existing TUC dataset
updates (Optional[list]) – List of updates to push to the dataset
delete_all (bool) – Whether all indicated records should be deleted
primary_keys (List[str]) – List of primary key values for all target records
primary_key_name (str) – Name of the primary key of the target dataset
- Returns
Updated dataset
- Raises
KeyError – If an indicated attribute does not exist
TypeError – If an update in the list is not “delete” or a dict
ValueError – If updates and primary_keys have differing lengths
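The pairing of primary_keys with updates described for update_records can be modelled on an in-memory mapping. The sketch below is illustrative only and makes several assumptions: apply_updates is a hypothetical helper, records are held as a dict keyed by primary key value, and delete_all is assumed to take precedence over any updates supplied. It does not contact Tamr or reproduce the library's code.

```python
from typing import Dict, List, Optional, Union

Update = Union[str, dict]


def apply_updates(
    records: Dict[str, dict],
    primary_keys: List[str],
    updates: Optional[List[Update]] = None,
    delete_all: bool = False,
) -> Dict[str, dict]:
    """Return a new {primary_key: record} mapping with the updates applied."""
    if delete_all:
        # Assumption: the flag is equivalent to a "delete" for every listed key.
        updates = ["delete"] * len(primary_keys)
    updates = updates or []
    if len(updates) != len(primary_keys):
        raise ValueError("updates and primary_keys must have the same length")
    result = dict(records)  # shallow copy so the input mapping is left untouched
    for key, update in zip(primary_keys, updates):
        if isinstance(update, dict):
            # Replaces an existing record, or creates one if the key is new.
            result[key] = dict(update)
        elif update == "delete":
            result.pop(key, None)
        else:
            raise TypeError('each update must be "delete" or a dict')
    return result


records = {"1": {"name": "Alice"}, "2": {"name": "Bob"}}
updated = apply_updates(
    records,
    primary_keys=["1", "2", "3"],
    updates=[{"name": "Alicia"}, "delete", {"name": "Carol"}],
)
print(updated)  # {'1': {'name': 'Alicia'}, '3': {'name': 'Carol'}}
```

The i-th update applies to the i-th primary key, which is why the two lists must have the same length.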
Profile
Additional functions to manipulate the profile of the dataset.
- tamr_toolbox.dataset._dataset.get_profile(dataset, allow_create_or_refresh=False)
Returns a dataset profile object. Optionally can refresh or create the profile if it is missing or out of date.
- Parameters
dataset (Dataset) – Tamr dataset object
allow_create_or_refresh (bool) – optional bool to allow creation/refreshing of profile info
- Return type
- Returns
DatasetProfile object. A warning is issued if the profile information is out of date and allow_create_or_refresh is False
- Raises
RuntimeError – if profile has not been created and allow_create_or_refresh is False
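The behaviour described for get_profile amounts to a small decision table over whether a profile exists, whether it is current, and the allow_create_or_refresh flag. The sketch below spells that table out in plain Python. It is a reading of the description above rather than the library's code; decide_profile_action and its return labels are hypothetical.

```python
import warnings


def decide_profile_action(
    profile_exists: bool,
    is_up_to_date: bool,
    allow_create_or_refresh: bool = False,
) -> str:
    """Return 'create', 'refresh', or 'use' for a dataset profile (hypothetical helper)."""
    if not profile_exists:
        if not allow_create_or_refresh:
            # Mirrors the documented RuntimeError for a missing profile.
            raise RuntimeError("Profile has not been created")
        return "create"
    if not is_up_to_date:
        if allow_create_or_refresh:
            return "refresh"
        # Mirrors the documented warning; the stale profile is still returned.
        warnings.warn("Profile information is out of date")
    return "use"


print(decide_profile_action(True, True))                                  # use
print(decide_profile_action(True, False, allow_create_or_refresh=True))   # refresh
print(decide_profile_action(False, False, allow_create_or_refresh=True))  # create
```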