Dataset
Manage
- tamr_toolbox.dataset.manage.exists(*, client, dataset_name)
Check if a dataset exists in a Tamr instance
- tamr_toolbox.dataset.manage.create(*, client, dataset_name, dataset=None, primary_keys=None, attributes=None, attribute_types=None, attribute_descriptions=None, description=None, external_id=None, tags=None)
Flexibly create a source dataset in Tamr
A template dataset object can be passed in to create a duplicate dataset with a new name. If the template dataset is not provided, the primary_keys must be defined for the dataset to be created. Additional attributes can be added in the attributes argument. The default attribute type will be ARRAY STRING. Non-default attribute types can be specified in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary.
- Parameters
client (Client) – TUC client object
dataset_name (str) – name for the new dataset being created
dataset (Optional[Dataset]) – optional dataset TUC object to use as a template for the new dataset
primary_keys (Optional[List[str]]) – one or more attributes for primary key(s) of the new dataset
attributes (Optional[Iterable[str]]) – a list of attribute names to create in the new dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
description (Optional[str]) – description of the new dataset
external_id (Optional[str]) – external_id for dataset, if None Tamr will create one for you
tags (Optional[List[str]]) – the list of tags for the new dataset
- Return type
- Returns
Dataset created in Tamr
- Raises
requests.HTTPError – If any HTTP error is encountered
ValueError – If neither dataset nor primary_keys is defined
ValueError – If the dataset already exists
TypeError – If the attributes argument is not an Iterable
Example
>>> import tamr_toolbox as tbox
>>> tamr_client = tbox.utils.client.create(**instance_connection_info)
>>> tbox.dataset.manage.create(
...     client=tamr_client,
...     dataset_name="my_new_dataset",
...     primary_keys=["unique_id"],
...     attributes=["name", "address"],
...     description="My new dataset",
... )
- tamr_toolbox.dataset.manage.update(dataset, *, attributes=None, attribute_types=None, attribute_descriptions=None, description=None, tags=None, override_existing_types=False)
Flexibly update a source dataset in Tamr
All the attributes that should exist in the dataset must be defined in the attributes argument. This function adds and removes attributes until the dataset's attributes match the set passed in. The default attribute type is ARRAY STRING. To set non-default attribute types, define them in the attribute_types dictionary. Attribute descriptions can be specified in the attribute_descriptions dictionary. By default, existing attribute types are not changed; set override_existing_types to True to change them. When it is False, attribute type updates are only logged, not applied.
- Parameters
dataset (Dataset) – An existing TUC dataset
attributes (Optional[Iterable[str]]) – Complete list of attribute names that should exist in the updated dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
description (Optional[str]) – updated description of dataset, if None will not update the description
tags (Optional[List[str]]) – updated tags for the dataset, if None will not update tags
override_existing_types (bool) – boolean flag, when true will alter existing attribute’s types
- Return type
- Returns
Updated Dataset
- Raises
requests.HTTPError – If any HTTP error is encountered
ValueError – If the dataset is not a source dataset
TypeError – If the attributes argument is not an Iterable
Example
>>> import tamr_toolbox as tbox
>>> from tamr_toolbox.models import attribute_type
>>> tamr_client = tbox.utils.client.create(**instance_connection_info)
>>> dataset = tamr_client.datasets.by_name("my_dataset_name")
>>> tbox.dataset.manage.update(
...     dataset=dataset,
...     attributes=["unique_id", "name", "address", "total_sales"],
...     attribute_types={"total_sales": attribute_type.ARRAY(attribute_type.DOUBLE)},
...     override_existing_types=True,
... )
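The add/remove behaviour described for update amounts to a set difference between the dataset's current attributes and the requested list. The sketch below illustrates that reconciliation in plain Python. plan_attribute_changes is a hypothetical name, and the code shows the documented behaviour under that reading rather than the library's actual implementation.

```python
from typing import Iterable, List, Tuple


def plan_attribute_changes(
    existing: Iterable[str], target: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) so that `existing` ends up matching `target`."""
    existing_list = list(existing)
    target_list = list(target)
    existing_set = set(existing_list)
    target_set = set(target_list)
    # Keep the caller's ordering so the plan is easy to read in logs.
    to_add = [name for name in target_list if name not in existing_set]
    to_remove = [name for name in existing_list if name not in target_set]
    return to_add, to_remove


to_add, to_remove = plan_attribute_changes(
    existing=["unique_id", "name", "phone"],
    target=["unique_id", "name", "address", "total_sales"],
)
print(to_add)     # ['address', 'total_sales']
print(to_remove)  # ['phone']
```

Previewing the plan this way can be useful before calling update, since any attribute left out of the attributes argument is removed from the dataset.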
- tamr_toolbox.dataset.manage.create_attributes(*, dataset, attributes, attribute_types=None, attribute_descriptions=None)
Create new attributes in a dataset
The default attribute type will be ARRAY STRING. To set non-default attribute types, they must be defined in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary.
- Parameters
dataset (Dataset) – An existing TUC dataset
attributes (Iterable[str]) – list of attribute names to be added to dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
- Return type
- Returns
Updated Dataset
- Raises
requests.HTTPError – If any HTTP error is encountered
TypeError – If the attributes argument is not an Iterable
ValueError – If the dataset is a unified dataset
ValueError – If an attribute passed in already exists in the dataset
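The interplay between the default type and the attribute_types dictionary can be pictured as a lookup with a fallback. The sketch below uses plain strings as stand-ins for the real attribute type objects. resolve_attribute_types and DEFAULT_TYPE are hypothetical names chosen for illustration, not part of tamr_toolbox.

```python
from typing import Dict, Iterable, Optional

# Stand-in label for the documented default type (an assumption; the real
# library uses attribute type objects, not strings).
DEFAULT_TYPE = "ARRAY STRING"


def resolve_attribute_types(
    attributes: Iterable[str],
    attribute_types: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Pair every attribute with its explicit type, else the default."""
    overrides = attribute_types or {}
    return {name: overrides.get(name, DEFAULT_TYPE) for name in attributes}


resolved = resolve_attribute_types(
    ["name", "address", "total_sales"],
    {"total_sales": "ARRAY DOUBLE"},
)
```

Only total_sales needs an entry in the dictionary here; name and address fall back to the default.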
- tamr_toolbox.dataset.manage.edit_attributes(*, dataset, attribute_types=None, attribute_descriptions=None, override_existing_types=True)
Edit existing attributes in a dataset
The attribute type and/or descriptions can be updated to new values. Attributes that will be updated must be in either the attribute_types or attribute_descriptions dictionaries or both. The default attribute type will be ARRAY STRING. To set non-default attribute types, they must be defined in the attribute_types dictionary. Any attribute descriptions can be specified in the attribute_descriptions dictionary. If only the attribute_descriptions dictionary is defined, the attribute type will not be updated.
- Parameters
dataset (Dataset) – An existing TUC dataset
attribute_types (Optional[Dict[str, Union[PrimitiveType, Array, Map, Record]]]) – dictionary for non-default types, attribute name is the key and AttributeType is the value
attribute_descriptions (Optional[Dict[str, str]]) – dictionary for attribute descriptions, attribute name is the key and the attribute description is the value
override_existing_types (bool) – bool flag, when true will alter existing attributes
- Return type
- Returns
Updated Dataset
- Raises
requests.HTTPError – If any HTTP error is encountered
ValueError – If the dataset is not a source dataset
ValueError – If a passed attribute does not exist in the dataset
ValueError – If a passed attribute is a primary key and can’t be removed
ValueError – If there are no updates to attributes in attribute_types or attribute_descriptions arguments
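One way to read the rules above: the attributes to edit are the union of the keys in the two dictionaries, and an attribute that appears only in attribute_descriptions keeps its type. The sketch below works that out in plain Python. attributes_to_edit is a hypothetical name, the string type labels are stand-ins, and raising ValueError for two empty arguments is one interpretation of the "no updates" condition, not a confirmed detail of the library.

```python
from typing import Dict, Optional, Set


def attributes_to_edit(
    attribute_types: Optional[Dict[str, object]] = None,
    attribute_descriptions: Optional[Dict[str, str]] = None,
) -> Dict[str, Set[str]]:
    """Map each attribute name to the fields ("type", "description") to change."""
    changes: Dict[str, Set[str]] = {}
    for name in attribute_types or {}:
        changes.setdefault(name, set()).add("type")
    for name in attribute_descriptions or {}:
        changes.setdefault(name, set()).add("description")
    if not changes:
        # Assumed reading of the documented "no updates" ValueError.
        raise ValueError("no attribute updates were provided")
    return changes


changes = attributes_to_edit(
    attribute_types={"total_sales": "ARRAY DOUBLE"},
    attribute_descriptions={"total_sales": "Sum of sales", "name": "Customer name"},
)
```

In this example total_sales would have both its type and description edited, while name would only receive a new description.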
- tamr_toolbox.dataset.manage.delete_attributes(*, dataset, attributes=None)
Remove attributes from dataset by attribute name
- Parameters
- Return type
- Returns
Updated Dataset
- Raises
ValueError – If the dataset is not a source dataset
ValueError – If a passed attribute does not exist in the dataset
ValueError – If a passed attribute is a primary key and can’t be removed
TypeError – If the attributes argument is not an Iterable
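The error conditions listed for delete_attributes can be checked up front before calling it. The sketch below mirrors the documented exception types in plain Python. check_deletable is a hypothetical helper, and the lists of existing attributes and primary keys are supplied by the caller as an assumption, since how the library looks them up is not shown here.

```python
from collections.abc import Iterable


def check_deletable(existing, primary_keys, to_delete):
    """Raise the documented error types before any attribute is removed."""
    if not isinstance(to_delete, Iterable):
        raise TypeError("attributes must be an Iterable of attribute names")
    names = list(to_delete)  # materialize once so generators are not consumed twice
    existing_set = set(existing)
    key_set = set(primary_keys)
    for name in names:
        if name not in existing_set:
            raise ValueError(f"attribute {name!r} does not exist in the dataset")
        if name in key_set:
            raise ValueError(f"attribute {name!r} is a primary key and cannot be removed")
    return names


deletable = check_deletable(["unique_id", "name", "phone"], ["unique_id"], ["phone"])

# Collect the exception type raised by each invalid request.
errors = []
for bad_request in (["unique_id"], ["missing"], 42):
    try:
        check_deletable(["unique_id", "name", "phone"], ["unique_id"], bad_request)
    except (TypeError, ValueError) as exc:
        errors.append(type(exc).__name__)
```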
Profile
Additional functions to manipulate the profile of the dataset.
- tamr_toolbox.dataset._dataset.get_profile(dataset, allow_create_or_refresh=False)
Returns a dataset profile object. Optionally creates or refreshes the profile if it is missing or out of date.
- Parameters
dataset (Dataset) – Tamr dataset object
allow_create_or_refresh (bool) – optional bool to allow creation/refreshing of profile info
- Return type
- Returns
DatasetProfile object. A warning is issued if the profile information is out of date and allow_create_or_refresh is False.
- Raises
RuntimeError – if profile has not been created and allow_create_or_refresh is False
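The outcomes described for get_profile depend on three conditions: whether a profile exists, whether it is up to date, and whether allow_create_or_refresh is set. The sketch below lays out that decision table in plain Python. profile_action and its return labels are hypothetical, and the code illustrates the documented behaviour rather than the library's implementation.

```python
import warnings


def profile_action(
    has_profile: bool, is_up_to_date: bool, allow_create_or_refresh: bool
) -> str:
    """Return "create", "refresh", or "use" following the documented rules."""
    if not has_profile:
        if not allow_create_or_refresh:
            raise RuntimeError("profile has not been created")
        return "create"
    if not is_up_to_date:
        if allow_create_or_refresh:
            return "refresh"
        # A stale profile is still returned, but the caller is warned.
        warnings.warn("profile information is out of date")
    return "use"


created = profile_action(False, True, allow_create_or_refresh=True)
refreshed = profile_action(True, False, allow_create_or_refresh=True)

# Capture the warning issued for a stale profile that may not be refreshed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    stale = profile_action(True, False, allow_create_or_refresh=False)
```

Passing allow_create_or_refresh=True to get_profile avoids both the RuntimeError and the warning, at the cost of possibly triggering a profiling job.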