Dataframe

This module requires an optional dependency. See Installation for details.

Dataframe operations

Tasks related to moving data in or out of Tamr using pandas DataFrames.

tamr_toolbox.data_io.dataframe.flatten(df, *, delimiter='|', columns=None, force=False)

Converts DataFrame columns of list type to strings and returns a copy of the DataFrame with this change. Tamr often produces datasets with list-type columns, which are usually easier to consume as single-valued columns.

Parameters
  • df (DataFrame) – DataFrame from Tamr dataset

  • delimiter (str) – string to use as delimiter for concatenating lists

  • columns (Optional[List[str]]) – optional, list of columns to flatten

  • force (bool) – if True, will force non-string inner types to string

Return type

DataFrame

Returns

flattened DataFrame
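The flattening rule described above can be sketched in plain Python. This is an illustration of the idea, not the library's implementation: `flatten_records` is a hypothetical helper, plain dicts stand in for DataFrame rows, and two behaviors are assumptions: lists with non-string items are left unchanged unless `force` is set, and forcing converts each item with `str()`.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def flatten_records(
    records: List[Record],
    *,
    delimiter: str = "|",
    columns: Optional[List[str]] = None,
    force: bool = False,
) -> List[Record]:
    """Return copies of the records with list values joined into strings."""
    flattened = []
    for record in records:
        row = dict(record)  # copy, so the input records are left unchanged
        targets = columns if columns is not None else list(row)
        for name in targets:
            value = row.get(name)
            if not isinstance(value, list):
                continue  # already single-valued, or column not in this row
            if all(isinstance(item, str) for item in value):
                row[name] = delimiter.join(value)
            elif force:
                # Assumption: forcing converts every item with str()
                row[name] = delimiter.join(str(item) for item in value)
            # Otherwise the non-string list is left as it was
        flattened.append(row)
    return flattened


records = [
    {"id": "1", "name": ["Ann", "Anne"], "scores": [1, 2]},
    {"id": "2", "name": ["Bob"], "scores": [3]},
]
default_result = flatten_records(records)
forced_result = flatten_records(records, force=True)
```

With the default settings only the string list in `name` is joined. Passing `force=True` also joins the integer list in `scores`.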

tamr_toolbox.data_io.dataframe.from_dataset(dataset, *, columns=None, flatten_delimiter=None, flatten_columns=None, force_flatten=False, nrows=None, allow_dataset_refresh=False)

Creates a DataFrame from a Tamr Dataset

Parameters
  • dataset (Dataset) – Tamr Dataset object

  • columns (Optional[List[str]]) – optional, ordered list of columns to keep

  • flatten_delimiter (Optional[str]) – if set, flatten list types to strings by concatenating with this delimiter

  • flatten_columns (Optional[List[str]]) – optional, list of columns to flatten

  • force_flatten (bool) – if False, arrays with inner types other than string are not flattened. If True, all inner types are converted to strings when flattening values, and flatten_delimiter must be specified.

  • nrows (Optional[int]) – number of rows to read. The default, None, reads all rows

  • allow_dataset_refresh (bool) – if True, allows running a job to refresh the dataset so that it becomes streamable

Return type

DataFrame

Returns

DataFrame

Raises

ValueError – if columns or flatten_columns contain columns that are not present in dataset
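The column selection, row limit, and error behavior described above can be sketched without a Tamr connection. This is a hedged illustration only: `select_rows` and its `available_columns` argument are hypothetical, an iterator of dicts stands in for the dataset's record stream, and the error message is invented.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Optional


def select_rows(
    records: Iterable[Dict[str, Any]],
    available_columns: List[str],
    *,
    columns: Optional[List[str]] = None,
    nrows: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Keep the requested columns, in order, for at most nrows records."""
    keep = columns if columns is not None else available_columns
    missing = [name for name in keep if name not in available_columns]
    if missing:
        # Mirrors the documented ValueError for unknown columns
        raise ValueError(f"Columns not present in dataset: {missing}")
    # islice(records, None) reads every record; a number stops early
    limited = islice(records, nrows)
    return [{name: record.get(name) for name in keep} for record in limited]


stream = iter(
    [
        {"id": "1", "name": ["Ann"], "city": "Oslo"},
        {"id": "2", "name": ["Bob"], "city": "Rome"},
    ]
)
rows = select_rows(stream, ["id", "name", "city"], columns=["city", "id"], nrows=1)
```

Here `rows` holds one record with the keys `city` and `id`, in that order. Going by the signature above, the equivalent library call would look something like `from_dataset(dataset, columns=["city", "id"], nrows=1)`, where `dataset` is a Tamr Dataset object obtained elsewhere.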

tamr_toolbox.data_io.dataframe.profile(df)

Computes profile statistics from an input DataFrame and returns them in another DataFrame. Intended to be used for validation checks on a DataFrame before upserting records to a Tamr Dataset.

Parameters

df (DataFrame) – DataFrame

Return type

DataFrame

Returns

DataFrame with profile statistics
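Which statistics the profile contains is not stated here. As a rough sketch of per-column profiling, the example below computes three common ones. The helper name `profile_records`, the statistic names, and the use of plain dicts in place of a DataFrame are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def profile_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Compute simple per-column statistics for a list of records."""
    # Collect column names in first-seen order
    column_names: List[str] = []
    for record in records:
        for name in record:
            if name not in column_names:
                column_names.append(name)

    stats = []
    for name in column_names:
        values = [record.get(name) for record in records]
        non_null = [value for value in values if value is not None]
        stats.append(
            {
                "column": name,
                "record_count": len(values),
                "null_count": len(values) - len(non_null),
                # Assumes hashable (single-valued) cells
                "distinct_count": len(set(non_null)),
            }
        )
    return stats


records = [
    {"id": "1", "city": "Oslo"},
    {"id": "2", "city": None},
    {"id": "3", "city": "Oslo"},
]
stats = profile_records(records)
```

A profile like this makes it easy to spot, before upserting, that `id` has as many distinct values as records while `city` contains a null.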

tamr_toolbox.data_io.dataframe.validate(df, *, raise_error=True, require_present_columns=None, require_unique_columns=None, require_nonnull_columns=None, custom_checks=())

Performs validation checks on a DataFrame. Returns the columns that fail each check and, if raise_error is True, raises an error when any check fails. Intended to be used on a DataFrame prior to upserting records into a Tamr dataset.

Parameters
  • df (DataFrame) – DataFrame

  • raise_error (bool) – if True, raises a ValueError when any check fails. If False, prints a warning and returns the check results

  • require_present_columns (Optional[List[str]]) – list of columns that are checked to be present

  • require_unique_columns (Optional[List[str]]) – list of columns that are checked to have all unique values, e.g. a primary key column

  • require_nonnull_columns (Optional[List[str]]) – list of columns that are checked to have all non-null values

  • custom_checks (Iterable[Tuple[Callable[[Any], bool], List[str]]]) – collection of tuples, each containing a custom function and the list of columns to apply it to

Return type

ValidationCheck

Returns

ValidationCheck object, with bool for whether all checks passed and dict of failing columns

Raises

ValueError – if raise_error is set True, and any checks fail
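The four kinds of check can be sketched in plain Python as follows. This is not the library's code: `validate_records`, the keys of the returned dict, and the warning text are hypothetical. The sketch also assumes that each custom function is applied to every value of a column, and it returns a simple `(passed, failed)` tuple in place of a ValidationCheck object.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

Check = Tuple[Callable[[Any], bool], List[str]]


def validate_records(
    records: List[Dict[str, Any]],
    *,
    raise_error: bool = True,
    require_present_columns: Optional[List[str]] = None,
    require_unique_columns: Optional[List[str]] = None,
    require_nonnull_columns: Optional[List[str]] = None,
    custom_checks: Iterable[Check] = (),
) -> Tuple[bool, Dict[str, List[str]]]:
    """Run the checks and return (passed, failing columns per check)."""
    present = {name for record in records for name in record}
    failed: Dict[str, List[str]] = {}

    def column(name: str) -> List[Any]:
        return [record.get(name) for record in records]

    for name in require_present_columns or []:
        if name not in present:
            failed.setdefault("present", []).append(name)
    for name in require_unique_columns or []:
        values = column(name)
        if len(set(values)) != len(values):
            failed.setdefault("unique", []).append(name)
    for name in require_nonnull_columns or []:
        if any(value is None for value in column(name)):
            failed.setdefault("nonnull", []).append(name)
    for index, (check, names) in enumerate(custom_checks):
        for name in names:
            # Assumption: the function is called once per value
            if not all(check(value) for value in column(name)):
                failed.setdefault(f"custom_{index}", []).append(name)

    passed = not failed
    if not passed:
        if raise_error:
            raise ValueError(f"Validation failed: {failed}")
        print(f"Warning: validation failed: {failed}")
    return passed, failed


records = [
    {"id": "1", "email": "a@example.org"},
    {"id": "1", "email": None},
]
passed, failed = validate_records(
    records,
    raise_error=False,
    require_present_columns=["id", "phone"],
    require_unique_columns=["id"],
    require_nonnull_columns=["email"],
    custom_checks=[(lambda value: value is None or "@" in value, ["email"])],
)
```

With `raise_error=False` the failures are reported and returned, so a caller can decide whether to proceed with the upsert. With the default `raise_error=True` the same input raises a ValueError.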