Dataframe
This module requires an optional dependency. See Installation for details.
Dataframe operations
Tasks related to moving data into or out of Tamr using pandas DataFrames.
- tamr_toolbox.data_io.dataframe.flatten(df, *, delimiter='|', columns=None, force=False)
Converts DataFrame columns of list type to strings and returns a copy of the DataFrame with this change. Tamr often produces datasets with list-type columns, which are usually easier to consume as single-valued columns.
- Parameters
- Return type
- Returns
flattened DataFrame
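The flattening idea can be illustrated with a minimal sketch on plain Python records (a list of dicts standing in for DataFrame rows), so it runs without the optional dependency. The helper name `flatten_records` is hypothetical, and this is not the toolbox's implementation. The handling of `force` follows the description of `force_flatten` in `from_dataset` and is assumed to apply here as well.

```python
from typing import Any, Dict, List, Optional


def flatten_records(
    records: List[Dict[str, Any]],
    *,
    delimiter: str = "|",
    columns: Optional[List[str]] = None,
    force: bool = False,
) -> List[Dict[str, Any]]:
    """Return a copy of `records` with list values joined into strings.

    Hypothetical helper for illustration only; not part of tamr_toolbox.
    """
    flattened = []
    for record in records:
        new_record = dict(record)  # copy so the input is left unchanged
        for name, value in record.items():
            if columns is not None and name not in columns:
                continue  # only flatten the requested columns
            if not isinstance(value, list):
                continue  # single-valued entries stay as they are
            if all(isinstance(item, str) for item in value):
                new_record[name] = delimiter.join(value)
            elif force:
                # Assumption: force converts non-string items to strings first
                new_record[name] = delimiter.join(str(item) for item in value)
        flattened.append(new_record)
    return flattened


rows = [
    {"id": "1", "name": ["Acme", "Acme Inc"], "scores": [1, 2]},
    {"id": "2", "name": ["Globex"], "scores": [3]},
]
flat = flatten_records(rows)
forced = flatten_records(rows, force=True)
print(flat[0]["name"])      # Acme|Acme Inc
print(flat[0]["scores"])    # [1, 2] (non-string items are left alone)
print(forced[0]["scores"])  # 1|2
```

With the toolbox installed, the corresponding call would take a DataFrame instead of a list of dicts, using the signature documented above.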
- tamr_toolbox.data_io.dataframe.from_dataset(dataset, *, columns=None, flatten_delimiter=None, flatten_columns=None, force_flatten=False, nrows=None, allow_dataset_refresh=False)
Creates a DataFrame from a Tamr Dataset.
- Parameters
dataset (Dataset) – Tamr Dataset object
columns (Optional[List[str]]) – optional, ordered list of columns to keep
flatten_delimiter (Optional[str]) – if set, flatten list types to strings by concatenating with this delimiter
flatten_columns (Optional[List[str]]) – optional, list of columns to flatten
force_flatten (bool) – if False, arrays with inner types other than string will not be flattened. If True, all inner types are forced to strings when flattening values. If True, flatten_delimiter must be specified.
nrows (Optional[int]) – number of rows to read. The default, None, reads all rows.
allow_dataset_refresh (bool) – if True, allows running a job to refresh the dataset to make it streamable
- Return type
- Returns
DataFrame
- Raises
ValueError – if columns or flatten_columns contain columns that are not present in dataset
- tamr_toolbox.data_io.dataframe.profile(df)
Computes profile statistics for an input DataFrame and returns them in another DataFrame. Intended for validation checks on a DataFrame before upserting records to a Tamr Dataset.
- tamr_toolbox.data_io.dataframe.validate(df, *, raise_error=True, require_present_columns=None, require_unique_columns=None, require_nonnull_columns=None, custom_checks=())
Performs validation checks on a DataFrame. Returns a dict of columns that fail each check, and optionally raises an error. Intended to be used on a DataFrame prior to upserting records into a Tamr Dataset.
- Parameters
df (DataFrame) – DataFrame
raise_error (bool) – if True, raises a ValueError on failing checks. If False, prints a warning and returns a dict
require_present_columns (Optional[List[str]]) – list of columns that are checked to be present
require_unique_columns (Optional[List[str]]) – list of columns that are checked to have all unique values, e.g. a primary key column
require_nonnull_columns (Optional[List[str]]) – list of columns that are checked to have all non-null values
custom_checks (Iterable[Tuple[Callable[[Any], bool], List[str]]]) – collection of tuples, each containing a custom function and the list of columns on which to apply it
- Return type
- Returns
ValidationCheck object, with bool for whether all checks passed and dict of failing columns
- Raises
ValueError – if raise_error is True and any check fails
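The three built-in checks (present, unique, non-null) can be sketched on plain Python records, again without the optional dependency. Everything named here is hypothetical: `validate_records`, the `CheckResult` type, and the keys of the failure dict are illustrative stand-ins, not the toolbox's `ValidationCheck` object or its actual field names. Custom checks are omitted because the source does not say how the function is applied to each column.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class CheckResult(NamedTuple):
    """Hypothetical stand-in for the documented ValidationCheck object."""

    passed: bool
    details: Dict[str, List[str]]


def validate_records(
    records: List[Dict[str, Any]],
    *,
    raise_error: bool = True,
    require_present_columns: Optional[List[str]] = None,
    require_unique_columns: Optional[List[str]] = None,
    require_nonnull_columns: Optional[List[str]] = None,
) -> CheckResult:
    """Run presence, uniqueness and non-null checks on a list of dict rows."""
    present = set().union(*(r.keys() for r in records))
    failed: Dict[str, List[str]] = {}

    missing = [c for c in (require_present_columns or []) if c not in present]
    if missing:
        failed["failed_present_columns"] = missing

    # Values are assumed to be hashable so they can be placed in a set
    not_unique = []
    for column in require_unique_columns or []:
        values = [r.get(column) for r in records]
        if len(set(values)) != len(values):
            not_unique.append(column)
    if not_unique:
        failed["failed_unique_columns"] = not_unique

    has_null = [
        c
        for c in (require_nonnull_columns or [])
        if any(r.get(c) is None for r in records)
    ]
    if has_null:
        failed["failed_nonnull_columns"] = has_null

    if failed and raise_error:
        raise ValueError(f"Validation failed: {failed}")
    return CheckResult(passed=not failed, details=failed)


rows = [{"id": "1", "name": "a"}, {"id": "1", "name": None}]
result = validate_records(
    rows,
    raise_error=False,
    require_present_columns=["id", "city"],
    require_unique_columns=["id"],
    require_nonnull_columns=["name"],
)
print(result.passed)   # False
print(result.details)
```

Passing `raise_error=False` is convenient when collecting every failure for a report; the default stops a pipeline before any records are upserted.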