DF-Connect

Client

Tasks related to interacting with the Tamr auxiliary service DF-Connect

class tamr_toolbox.data_io.df_connect.client.Client(host, port, protocol, base_path, tamr_username, tamr_password, jdbc_info, cert)[source]

A data class for interacting with df_connect via jdbc.

Parameters
  • host (str) – the host where df_connect is running

  • port (str) – the port on which df_connect is listening

  • protocol (str) – http or https

  • base_path (str) – if using nginx-like proxy this is the redirect path

  • tamr_username (str) – the tamr account to use

  • tamr_password (str) – the password for the tamr account to use

  • jdbc_info (JdbcInfo) – configuration information for the jdbc connection

  • cert (Optional[str]) – optional path to a certfile for authentication

tamr_toolbox.data_io.df_connect.client.from_config(config, config_key='df_connect', jdbc_key='ingest')[source]

Constructs a Client object from a json dictionary.

Parameters
  • config (Dict[str, Any]) – A json dictionary of configuration values

  • config_key (str) – block of the config to parse for values. Defaults to ‘df_connect’

  • jdbc_key (str) – the key that selects which block under df_connect -> jdbc in the configuration supplies the database connection information. Defaults to ‘ingest’

Return type

Client

Returns

A Client object

tamr_toolbox.data_io.df_connect.client.create(*, host, port='', protocol, base_path='', tamr_username, tamr_password, jdbc_dict, cert=None)[source]

Simple wrapper for creating an instance of the Client dataclass.

Parameters
  • host (str) – the host where df_connect is running

  • port (str) – the port on which df_connect is listening

  • protocol (str) – http or https

  • base_path (str) – if using nginx-like proxy this is the redirect path

  • tamr_username (str) – the tamr account to use

  • tamr_password (str) – the password for the tamr account to use

  • jdbc_dict (Dict[str, Any]) – configuration information for the jdbc connection

  • cert (Optional[str]) – optional path to a certfile for authentication

Return type

Client

Returns

An instance of tamr_toolbox.data_io.df_connect.Client

tamr_toolbox.data_io.df_connect.client.get_connect_session(connect_info)[source]

Returns an authenticated session using Tamr credentials from configuration. Raises an exception if df_connect is not installed or running correctly.

Parameters

connect_info (Client) – An instance of a Client object

Return type

Session

Returns

An authenticated session

Raises

RuntimeError – if a connection to df_connect cannot be established
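As a minimal sketch of handling the documented RuntimeError, the helper below wraps any session factory that has the same call shape as get_connect_session. The name try_get_session and the stub used to exercise it are hypothetical and not part of tamr_toolbox. In real use the first argument would be tamr_toolbox.data_io.df_connect.client.get_connect_session.

```python
from typing import Any, Callable, Optional, Tuple


def try_get_session(
    get_session: Callable[[Any], Any], connect_info: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Return (session, None) on success, or (None, message) on RuntimeError."""
    try:
        return get_session(connect_info), None
    except RuntimeError as exc:
        # Documented failure mode: df_connect is not installed or not running.
        return None, str(exc)


def _unreachable_stub(connect_info: Any) -> Any:
    # Hypothetical stand-in that mimics df_connect being unreachable.
    raise RuntimeError("df_connect is not reachable")


session, error = try_get_session(_unreachable_stub, connect_info=None)
print(session, error)
```

Returning the message instead of re-raising lets a calling script decide whether to retry or abort. Whether that is preferable to letting the exception propagate depends on the pipeline.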

tamr_toolbox.data_io.df_connect.client.ingest_dataset(connect_info, *, dataset_name, query, primary_key=None)[source]

Ingest a dataset into Tamr via df_connect given a dataset name, a query string, and an optional list of columns for the primary key.

Parameters
  • dataset_name (str) – Name of dataset

  • query (str) – JDBC query to execute in the database, the results of which will be loaded into Tamr

  • connect_info (Client) – A Client object for establishing session and loading jdbc parameters

  • primary_key (Optional[List[str]]) – list of columns to use as primary key. If None then df_connect will generate its own primary key

Return type

Dict[str, Any]

Returns

JSON response from API call

Raises

HTTPError – if the call to ingest the dataset was unsuccessful
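The keyword-only arguments documented above can be assembled ahead of the call. The sketch below is an illustration only: build_ingest_kwargs, the table name and the column names are hypothetical, and the check that primary key columns appear in the query is an assumption of this example rather than documented df_connect behaviour.

```python
from typing import Any, Dict, List, Optional


def build_ingest_kwargs(
    table: str, columns: List[str], primary_key: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Assemble the keyword arguments documented for ingest_dataset."""
    if primary_key is not None:
        # Assumption of this sketch: key columns must be part of the query result.
        missing = [col for col in primary_key if col not in columns]
        if missing:
            raise ValueError(f"primary key columns not selected: {missing}")
    return {
        "dataset_name": table,
        "query": f"SELECT {', '.join(columns)} FROM {table}",
        "primary_key": primary_key,
    }


kwargs = build_ingest_kwargs("customers", ["id", "name"], primary_key=["id"])
# With a real Client instance this would be passed as:
#   client.ingest_dataset(connect_info, **kwargs)
```

Leaving primary_key as None keeps the documented default, in which df_connect generates its own primary key.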

tamr_toolbox.data_io.df_connect.client.export_dataset(connect_info, *, dataset_name, target_table_name, truncate_before_load=False, **kwargs)[source]

Export a dataset via jdbc to a target database.

Parameters
  • dataset_name (str) – the name of the dataset to export

  • target_table_name (str) – the table in the database to update

  • truncate_before_load (bool) – whether or not to truncate the database table before load

  • connect_info (Client) – A Client object for establishing session and loading jdbc parameters

  • jdbc_key – optional keyword argument, accepted through **kwargs, giving the key for picking up the relevant block for export from the config file. See examples directory for usage

Return type

Dict[str, Any]

Returns

JSON response from API call

Raises

HTTPError – if the call to export the dataset was unsuccessful

tamr_toolbox.data_io.df_connect.client.execute_statement(connect_info, statement)[source]

Calls the execute statement endpoint of df_connect.

Parameters
  • statement (str) – the SQL statement to be executed

  • connect_info (Client) – A Client object for establishing session and loading jdbc parameters

Return type

Dict[str, Any]

Returns

JSON response from API call

Raises

HTTPError – if the call to df_connect was unsuccessful

tamr_toolbox.data_io.df_connect.client.profile_query_results(connect_info, *, dataset_name, queries)[source]

Profile the contents of JDBC queries via df_connect and write results to a Tamr dataset. For example, the query “select * from table A” means that all rows from table A will be profiled, while “select * from table A where name = 'my_name'” will only profile rows meeting the name = 'my_name' condition. The same Tamr dataset can be used for profile results from multiple queries.

Parameters
  • dataset_name (str) – Name of Tamr dataset for the profiling results

  • queries (List[str]) – list of JDBC queries to execute in the database, the results of which will be profiled

  • connect_info (Client) – A Client object for establishing session and loading jdbc parameters

Return type

Dict[str, Any]

Returns

JSON response from API call

Raises

HTTPError – if the call to profile the dataset was unsuccessful
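Since queries is a plain list of strings, it can be generated from a mapping of tables to optional filter conditions. The helper below is a hypothetical sketch, and the table names and condition are placeholders. It interpolates strings directly, so it is only suitable for trusted, hard-coded inputs.

```python
from typing import Dict, List, Optional


def build_profile_queries(filters: Dict[str, Optional[str]]) -> List[str]:
    """Turn {table: optional WHERE condition} into a list of SELECT queries."""
    queries = []
    for table, condition in filters.items():
        query = f"SELECT * FROM {table}"
        if condition:
            # Only rows matching the condition will be profiled.
            query += f" WHERE {condition}"
        queries.append(query)
    return queries


queries = build_profile_queries({"table_a": None, "table_b": "name = 'my_name'"})
# With a real Client instance this would be passed as:
#   client.profile_query_results(connect_info, dataset_name="profiles", queries=queries)
```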

tamr_toolbox.data_io.df_connect.client.export_dataset_avro_schema(connect_info, *, url, dataset_name, fs_type)[source]

Takes a connect info object and writes the avro schema of the specified dataset to the specified url. The target file system is selected by fs_type: HDFS or the server's local file system.

Parameters
  • connect_info (Client) – The connect client to use

  • url (str) – the location in the relevant file system to which to write the avro schema

  • dataset_name (str) – the name of the dataset

  • fs_type (Enum) – the remote filesystem type. Currently supports ‘HDFS’ and ‘LOCAL’

Return type

Union[Dict[str, Any], bool]

Returns

JSON returned by df_connect's /urlExport/<hdfs/serverfs>/avroSchema endpoint

Raises

HTTPError – if the call to export the schema was unsuccessful

tamr_toolbox.data_io.df_connect.client.export_dataset_as_avro(connect_info, *, url, dataset_name, fs_type)[source]

Takes a connect info object and writes the specified dataset as an avro file to the specified url. The target file system is selected by fs_type: HDFS or the server's local file system.

Parameters
  • connect_info (Client) – The connect client to use

  • url (str) – the location in the relevant file system to which to write the avro file

  • dataset_name (str) – the name of the dataset

  • fs_type (Enum) – the remote filesystem type. Currently supports ‘HDFS’ and ‘LOCAL’

Return type

Union[Dict[str, Any], bool]

Returns

JSON returned by df_connect's /urlExport/<hdfs/serverfs>/avroSchema endpoint

Raises
  • ValueError – if using an unsupported type of file system

  • HTTPError – if the call to export the dataset was unsuccessful
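To illustrate how fs_type relates to the <hdfs/serverfs> part of the endpoint path and to the documented ValueError, here is a minimal sketch. FileSystemType is a hypothetical stand-in for the library's enum, and the mapping of ‘HDFS’ to hdfs and ‘LOCAL’ to serverfs is an assumption inferred from the endpoint placeholder, not confirmed behaviour.

```python
from enum import Enum


class FileSystemType(Enum):
    """Hypothetical stand-in for the fs_type enum, with the two documented values."""

    HDFS = "HDFS"
    LOCAL = "LOCAL"


# Assumed mapping from enum member to the <hdfs/serverfs> path segment.
_PATH_SEGMENT = {FileSystemType.HDFS: "hdfs", FileSystemType.LOCAL: "serverfs"}


def avro_schema_endpoint(fs_type: object) -> str:
    """Return the endpoint path for fs_type, or raise ValueError if unsupported."""
    if not isinstance(fs_type, FileSystemType):
        raise ValueError(f"unsupported file system type: {fs_type!r}")
    return f"/urlExport/{_PATH_SEGMENT[fs_type]}/avroSchema"


hdfs_path = avro_schema_endpoint(FileSystemType.HDFS)
local_path = avro_schema_endpoint(FileSystemType.LOCAL)
```

Using an enum rather than a free-form string means an unsupported file system is rejected before any request is sent.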

JdbcInfo

Tasks related to handling jdbc information for the Tamr auxiliary service DF-Connect

class tamr_toolbox.data_io.df_connect.jdbc_info.JdbcInfo(jdbc_url, db_user, db_password, fetch_size)[source]

A dataclass that ties together the jdbc connection information used by df_connect.

Parameters
  • jdbc_url (str) – The jdbc url for the database to which to connect

  • db_user (str) – The database user as whom to authenticate

  • db_password (str) – The password for the database user

  • fetch_size (int) – The number of records by which to chunk the jdbc ResultSet
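To make the idea of chunking by fetch_size concrete, the generic sketch below splits any iterable of records into lists of at most fetch_size items. It illustrates the concept only and is not how df_connect or a JDBC driver implements fetching. The name chunk_records is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_records(records: Iterable[T], fetch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most fetch_size records."""
    if fetch_size < 1:
        raise ValueError("fetch_size must be a positive integer")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, fetch_size))
        if not chunk:
            return
        yield chunk


chunks = list(chunk_records(range(5), fetch_size=2))
print(chunks)
```

A larger fetch_size means fewer round trips but more records held in memory at once.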

tamr_toolbox.data_io.df_connect.jdbc_info.from_config(config, *, config_key='df_connect', jdbc_key='ingest')[source]

Create an instance of JdbcInfo from a json object.

Parameters
  • config (Dict[str, Any]) – A json dictionary containing configuration values

  • config_key (str) – the top-level key of the config to use. Defaults to ‘df_connect’

  • jdbc_key (str) – the key to use for the jdbc block. Needs to be within config_key block. Defaults to ‘ingest’, but can be used to specify any sub-block of a config object or yaml file. See example configs and exports for more context.

Return type

JdbcInfo
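The sketch below shows one plausible shape for such a configuration and how config_key and jdbc_key select a block from it. It assumes the jdbc blocks sit under a "jdbc" key inside the config_key block and that their keys mirror the JdbcInfo parameters. The "export" block name, all values, and the helper select_jdbc_block are hypothetical.

```python
from typing import Any, Dict

# Hypothetical configuration; every value below is a placeholder.
config: Dict[str, Any] = {
    "df_connect": {
        "jdbc": {
            "ingest": {
                "jdbc_url": "jdbc:postgresql://localhost:5432/source_db",
                "db_user": "ingest_user",
                "db_password": "example-password",
                "fetch_size": 10000,
            },
            "export": {
                "jdbc_url": "jdbc:postgresql://localhost:5432/target_db",
                "db_user": "export_user",
                "db_password": "example-password",
                "fetch_size": 10000,
            },
        }
    }
}


def select_jdbc_block(
    config: Dict[str, Any], *, config_key: str = "df_connect", jdbc_key: str = "ingest"
) -> Dict[str, Any]:
    """Return the jdbc block chosen by config_key and jdbc_key."""
    return config[config_key]["jdbc"][jdbc_key]


ingest_block = select_jdbc_block(config)
export_block = select_jdbc_block(config, jdbc_key="export")
```

Keeping separate blocks for reading and writing lets one configuration file describe both a source and a target database.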

tamr_toolbox.data_io.df_connect.jdbc_info.create(*, jdbc_url, db_user, db_password, fetch_size)[source]

A simple wrapper to create an object of type JdbcInfo.

Parameters
  • jdbc_url (str) – The jdbc url for the database to which to connect

  • db_user (str) – The database user as whom to authenticate

  • db_password (str) – The password for the database user

  • fetch_size (int) – The number of records by which to chunk the jdbc ResultSet

Return type

JdbcInfo

Returns

An instance of a JdbcInfo object.