RealTime

RealTime Matching

tamr_toolbox.realtime.matching.update_realtime_match_data(*, project, do_update_clusters=True, do_use_manual_clustering=False, **options)

Updates data for RealTime match queries if needed, based on latest published clusters.

Parameters
  • project (MasteringProject) – project to be updated

  • do_update_clusters (bool) – whether to update clusters, default True

  • do_use_manual_clustering (bool) – whether to use externally managed clustering, default False

  • options – Options passed to underlying Operation

Return type

Operation

Returns

an operation object describing the update operation

Raises

RuntimeError – if update API call fails
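A minimal usage sketch, assuming tamr_toolbox is installed and project is a MasteringProject already retrieved from a Tamr instance. The wrapper name refresh_match_data and the logger are hypothetical; only the keyword arguments and the RuntimeError come from the reference above.

```python
import logging

LOGGER = logging.getLogger(__name__)


def refresh_match_data(project, *, update_clusters=True):
    """Hypothetical wrapper: refresh RealTime match data and return the Operation."""
    # Imported lazily so this module can be loaded where tamr_toolbox is absent.
    from tamr_toolbox.realtime import matching

    try:
        return matching.update_realtime_match_data(
            project=project,
            do_update_clusters=update_clusters,
            do_use_manual_clustering=False,
        )
    except RuntimeError:
        # Documented failure mode: the update API call failed.
        LOGGER.exception("RealTime match data update failed")
        raise
```

The returned Operation is passed back unchanged so the caller can decide how to inspect it.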

tamr_toolbox.realtime.matching.poll_realtime_match_status(*, project, match_client, num_tries=10, wait_sec=1)

Checks whether the match service is queryable, polling up to num_tries times at intervals of wait_sec seconds (default 1).

Parameters
  • project (MasteringProject) – the mastering project whose status to check

  • match_client (Client) – a Tamr client set to use the port of the Match API

  • num_tries (int) – max number of times to poll endpoint, default 10

  • wait_sec (int) – number of seconds to wait between tries, default 1

Return type

bool

Returns

bool indicating whether project is queryable
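The retry behaviour described above can be sketched as a generic polling loop. This is an illustration of how num_tries and wait_sec interact, not the library's implementation; the function name poll_until_queryable and the is_queryable callback are hypothetical stand-ins for the status check.

```python
import time
from typing import Callable


def poll_until_queryable(
    is_queryable: Callable[[], bool], *, num_tries: int = 10, wait_sec: float = 1
) -> bool:
    """Illustrative retry loop: True as soon as a check succeeds, else False."""
    for attempt in range(num_tries):
        if is_queryable():
            return True
        # Sleep only between attempts, not after the final one.
        if attempt < num_tries - 1:
            time.sleep(wait_sec)
    return False
```

With the defaults, a loop of this shape gives up after roughly nine seconds of waiting plus the time taken by the ten checks themselves, so a larger num_tries or wait_sec may be worth passing after a long update.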

tamr_toolbox.realtime.matching.match_query(*, project, match_client, records, type, primary_key=None, batch_size=None, min_match_prob=None, max_num_matches=None)

Find the best matching clusters or records for each supplied record. Returns a dictionary where each key corresponds to an input record and the value is a list of the RealTime match results for that record. An empty result list indicates a null response from matching (or no responses above min_match_prob, if that parameter was supplied).

Parameters
  • project (MasteringProject) – the mastering project to query for matches

  • match_client (Client) – a Tamr client set to use the port of the Match API

  • records (List[Dict[str, Any]]) – list of records to match

  • type (str) – one of “records” or “clusters” – whether to pull record or cluster matches

  • primary_key (Optional[str]) – a primary key for the data; if supplied, this must be a field in input records

  • batch_size (Optional[int]) – split input into this batch size for match query calls (e.g. to prevent network timeouts), default None sends a single query with all records

  • min_match_prob (Optional[float]) – if set, only matches with probability above minimum will be returned, default None

  • max_num_matches (Optional[int]) – if set, at most max_num_matches will be returned for each input record in records, default None

Return type

Dict[Union[int, str], List[Dict[str, Any]]]

Returns

Dict keyed by the integer index of each input record, or by its primary_key value if primary_key is supplied; each value is a list of matched data

Raises
  • ValueError – if match type is not “records” or “clusters”, or if batch_size is invalid

  • RuntimeError – if query fails
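The batch_size and primary_key semantics above can be sketched in plain Python. Both helper names are hypothetical, and treating a non-positive batch_size as the invalid case is an assumption; the reference only says that an invalid batch_size raises ValueError.

```python
from typing import Any, Dict, Iterator, List, Optional, Union


def split_into_batches(
    records: List[Dict[str, Any]], batch_size: Optional[int]
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of records; None means a single batch."""
    if batch_size is None:
        yield records
        return
    if batch_size <= 0:
        # Assumption: "invalid" is taken to mean non-positive here.
        raise ValueError(f"batch_size must be a positive integer, got {batch_size}")
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]


def result_keys(
    records: List[Dict[str, Any]], primary_key: Optional[str] = None
) -> List[Union[int, str]]:
    """Keys of the result dict: input indices, or primary_key values if given."""
    if primary_key is None:
        return list(range(len(records)))
    # Each record must contain the primary_key field.
    return [record[primary_key] for record in records]
```

For three records and batch_size=2, this splitting produces one batch of two records and one batch of one; the keys are 0, 1, 2 without a primary key, or the records' primary key values otherwise.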

tamr_toolbox.realtime.matching.transform_and_match_query(*, project, match_client, records, type, primary_key=None, batch_size=None, min_match_prob=None, max_num_matches=None, default_source_name=None)

Find the best matching clusters or records for each supplied record, running schema mapping and transformations before the RealTime match. Returns a dictionary where each key corresponds to an input record and the value is a list of the RealTime match results for that record. An empty result list indicates a null response from matching (or no responses above min_match_prob, if that parameter was supplied). If LLT is not enabled, only the default LLM query is run, with no transformations or schema mapping.

Parameters
  • project (MasteringProject) – the mastering project to query for matches

  • match_client (Client) – a Tamr client set to use the port of the Match API

  • records (List[Dict[str, Any]]) – list of records to match

  • type (str) – one of “records” or “clusters” – whether to pull record or cluster matches

  • primary_key (Optional[str]) – a primary key for the data; if supplied, this must be a field in input records

  • batch_size (Optional[int]) – split input into this batch size for match query calls (e.g. to prevent network timeouts), default None sends a single query with all records

  • min_match_prob (Optional[float]) – if set, only matches with probability above minimum will be returned, default None

  • max_num_matches (Optional[int]) – if set, at most max_num_matches will be returned for each input record in records, default None

  • default_source_name (Optional[str]) – the default source name used for schema mapping in LLT, default None

Return type

Dict[Union[int, str], List[Dict[str, Any]]]

Returns

Dict keyed by the integer index of each input record, or by its primary_key value if primary_key is supplied; each value is a list of matched data

Raises
  • ValueError – if match type is not “records” or “clusters”, or if batch_size is invalid

  • RuntimeError – if query fails
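A sketch of how the functions in this section might be combined, assuming tamr_toolbox is installed and that project and match_client have been created elsewhere. The helper name match_when_ready is hypothetical, and the values chosen for batch_size, min_match_prob and max_num_matches are illustrative only.

```python
from typing import Any, Dict, List, Optional


def match_when_ready(
    project,
    match_client,
    records: List[Dict[str, Any]],
    *,
    primary_key: Optional[str] = None,
):
    """Hypothetical helper: poll, then query; returns (results, unmatched_keys)."""
    # Imported lazily so this module can be loaded where tamr_toolbox is absent.
    from tamr_toolbox.realtime import matching

    if not matching.poll_realtime_match_status(
        project=project, match_client=match_client, num_tries=10, wait_sec=1
    ):
        raise RuntimeError("match service is not queryable")

    results = matching.transform_and_match_query(
        project=project,
        match_client=match_client,
        records=records,
        type="clusters",
        primary_key=primary_key,
        batch_size=500,  # illustrative value
        min_match_prob=0.5,  # illustrative value
        max_num_matches=3,  # illustrative value
    )
    # An empty list means no match came back for that input record.
    unmatched = [key for key, matches in results.items() if not matches]
    return results, unmatched
```

Collecting the keys with empty result lists separately makes it easy to route unmatched inputs to a follow-up step, since an empty list is the documented signal for no match.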