Dataframe

This module requires an optional dependency. See Installation for details.

Dataframe operations

Tasks related to moving data into or out of Tamr using pandas DataFrames.

tamr_toolbox.data_io.dataframe.flatten(df, *, delimiter='|', columns=None, force=False)[source]

Converts list-typed DataFrame columns to delimited strings and returns a copy of the DataFrame with this change. Tamr often produces datasets with list-typed columns, which are usually easier to consume as single-valued columns.

Parameters
  • df (DataFrame) – DataFrame from Tamr dataset

  • delimiter (str) – string to use as delimiter for concatenating lists

  • columns (Optional[List[str]]) – optional, list of columns to flatten

  • force (bool) – if True, will force non-string inner types to string

Return type

DataFrame

Returns

flattened DataFrame
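The behavior described above can be approximated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name `flatten_sketch` is hypothetical, and it assumes that non-list cells pass through unchanged and that lists with non-string items are left alone unless `force=True`.

```python
import pandas as pd


def flatten_sketch(df, *, delimiter="|", columns=None, force=False):
    """Hypothetical stand-in that mirrors the documented signature of flatten."""
    result = df.copy()  # the documented function returns a copy
    targets = list(result.columns) if columns is None else list(columns)

    def join_cell(value):
        if not isinstance(value, list):
            return value  # assumption: non-list cells are left unchanged
        if force:
            return delimiter.join(str(item) for item in value)
        if all(isinstance(item, str) for item in value):
            return delimiter.join(value)
        return value  # assumption: non-string lists are skipped unless force=True

    for name in targets:
        result[name] = result[name].map(join_cell)
    return result


df = pd.DataFrame(
    {
        "id": ["1", "2"],
        "aliases": [["Acme", "Acme Corp"], ["Globex"]],
        "scores": [[1, 2], [3]],
    }
)

flat = flatten_sketch(df, columns=["aliases"])  # "Acme|Acme Corp", "Globex"
forced = flatten_sketch(df, columns=["scores"], force=True)  # "1|2", "3"

# With tamr_toolbox installed, the documented call would take the same arguments:
# tamr_toolbox.data_io.dataframe.flatten(df, delimiter="|", columns=["aliases"])
```

Because the result is a copy, the original `df` still holds its list values after both calls.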

tamr_toolbox.data_io.dataframe.from_dataset(dataset, *, columns=None, flatten_delimiter=None, flatten_columns=None, force_flatten=False, nrows=None, allow_dataset_refresh=False)[source]

Creates a DataFrame from a Tamr Dataset

Parameters
  • dataset (Dataset) – Tamr Dataset object

  • columns (Optional[List[str]]) – optional, ordered list of columns to keep

  • flatten_delimiter (Optional[str]) – if set, flatten list types to strings by concatenating with this delimiter

  • flatten_columns (Optional[List[str]]) – optional, list of columns to flatten

  • force_flatten (bool) – if False, arrays with inner types other than string will not be flattened. If True, all inner types are forced to strings when flattening values.

  • nrows (Optional[int]) – number of rows to read. The default, None, reads all rows

  • allow_dataset_refresh (bool) – if True, allows running a job to refresh the dataset so that it is streamable

Return type

DataFrame

Returns

DataFrame

Raises

ValueError – if columns or flatten_columns contain columns that are not present in the dataset
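Calling `from_dataset` needs a live Tamr Dataset, so the sketch below only illustrates the documented `columns`, `nrows`, and `ValueError` semantics against an in-memory stand-in. It assumes records arrive as an iterable of dicts; the helper `frame_from_records` and its `available` argument are hypothetical and are not part of the library.

```python
from itertools import islice

import pandas as pd


def frame_from_records(records, available, *, columns=None, nrows=None):
    """Hypothetical helper: build a DataFrame from an iterable of record dicts."""
    requested = list(available) if columns is None else list(columns)
    missing = [name for name in requested if name not in available]
    if missing:
        # mirrors the documented ValueError for columns absent from the dataset
        raise ValueError(f"Columns not present in dataset: {missing}")
    # islice with a stop of None consumes every record, matching nrows=None
    rows = list(islice(records, nrows))
    # passing columns= keeps only the requested columns, in the requested order
    return pd.DataFrame(rows, columns=requested)


records = iter(
    [
        {"id": "1", "name": "Acme", "city": "Boston"},
        {"id": "2", "name": "Globex", "city": "Springfield"},
        {"id": "3", "name": "Initech", "city": "Austin"},
    ]
)

df = frame_from_records(
    records, ["id", "name", "city"], columns=["name", "id"], nrows=2
)

# With tamr_toolbox installed and a Dataset object in hand, the documented call is:
# tamr_toolbox.data_io.dataframe.from_dataset(dataset, columns=["name", "id"], nrows=2)
```

Checking the requested columns before reading any rows means a typo in a column name fails fast rather than after a long stream.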

tamr_toolbox.data_io.dataframe.profile(df)[source]

Computes profile statistics from an input DataFrame and returns them in a new DataFrame. Intended for validation checks on a DataFrame before upserting records to a Tamr Dataset.

Parameters

df (DataFrame) – DataFrame

Return type

DataFrame

Returns

DataFrame with profile statistics
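The description above does not list which statistics are computed. As an illustration only, the sketch below builds a per-column summary with plain pandas; the statistic names (`row_count`, `null_count`, `distinct_count`) and the helper `profile_sketch` are assumptions, not the library's output format.

```python
import pandas as pd


def profile_sketch(df):
    """Hypothetical stand-in: one row of statistics per input column."""
    stats = {
        name: {
            "row_count": len(df),
            "null_count": int(df[name].isna().sum()),
            "distinct_count": int(df[name].nunique(dropna=True)),
        }
        for name in df.columns
    }
    # orient="index" places the input column names on the index of the result
    return pd.DataFrame.from_dict(stats, orient="index")


df = pd.DataFrame({"id": ["1", "2", "3"], "city": ["Boston", None, "Boston"]})
summary = profile_sketch(df)
```

A summary like this can be compared against expectations before an upsert, for example by confirming that a key column has as many distinct values as rows. List-valued cells are not hashable, so a sketch of this kind would need list columns flattened first.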

tamr_toolbox.data_io.dataframe.validate(df, *, raise_error=True, require_present_columns=None, require_unique_columns=None, require_nonnull_columns=None)[source]

Performs validation checks on a DataFrame. Returns a ValidationCheck recording which columns fail each check, and optionally raises an error. Intended to be used on a DataFrame prior to upserting records into a Tamr dataset.

Parameters
  • df (DataFrame) – DataFrame

  • raise_error (bool) – if True, raises a ValueError when any check fails. If False, prints a warning and returns the failing columns

  • require_present_columns (Optional[List[str]]) – list of columns that are checked to be present

  • require_unique_columns (Optional[List[str]]) – list of columns that are checked to have all unique values, e.g. a primary key column

  • require_nonnull_columns (Optional[List[str]]) – list of columns that are checked to have all non-null values

Return type

ValidationCheck

Returns

ValidationCheck object, with bool for whether all checks passed and dict of failing columns

Raises

ValueError – if raise_error is True and any check fails
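The three checks can be sketched with plain pandas as follows. This is an illustration under stated assumptions, not the library's code: `CheckResult` is a hypothetical stand-in for ValidationCheck with assumed field names, and the sketch assumes that the uniqueness and non-null checks skip columns that are absent from the DataFrame.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class CheckResult:
    """Hypothetical stand-in for ValidationCheck; field names are assumptions."""

    passed: bool
    failed: dict = field(default_factory=dict)


def validate_sketch(
    df,
    *,
    raise_error=True,
    require_present_columns=None,
    require_unique_columns=None,
    require_nonnull_columns=None,
):
    present = require_present_columns or []
    unique = require_unique_columns or []
    nonnull = require_nonnull_columns or []
    checks = {
        "present": [c for c in present if c not in df.columns],
        # assumption: columns missing from df are skipped by the next two checks
        "unique": [c for c in unique if c in df.columns and not df[c].is_unique],
        "nonnull": [c for c in nonnull if c in df.columns and df[c].isna().any()],
    }
    failed = {check: cols for check, cols in checks.items() if cols}
    if failed and raise_error:
        raise ValueError(f"DataFrame validation failed: {failed}")
    if failed:
        print(f"Warning: DataFrame validation failed: {failed}")
    return CheckResult(passed=not failed, failed=failed)


df = pd.DataFrame({"id": ["1", "1", "3"], "name": ["Acme", None, "Initech"]})

result = validate_sketch(
    df,
    raise_error=False,
    require_present_columns=["id", "city"],
    require_unique_columns=["id"],
    require_nonnull_columns=["name"],
)
```

With `raise_error=False` the caller can inspect `result.failed` and decide how to proceed; with the default `raise_error=True` the same input raises a ValueError, which suits pipelines that should stop before an upsert.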