RealTime

RealTime Matching

tamr_toolbox.realtime.matching.update_realtime_match_data(*, project, do_update_clusters=True, do_use_manual_clustering=False, **options)

Updates the data used for RealTime match queries if needed, based on the latest published clusters.

Parameters
  • project (MasteringProject) – project to be updated

  • do_update_clusters (bool) – whether to update clusters, default True

  • do_use_manual_clustering (bool) – whether to use externally managed clustering, default False

  • options – Options passed to underlying Operation

Return type

Operation

Returns

an operation object describing the update operation

Raises

RuntimeError – if update API call fails

tamr_toolbox.realtime.matching.poll_realtime_match_status(*, project, match_client, num_tries=10, wait_sec=1)

Checks whether the match service is queryable. Polls up to num_tries times, waiting wait_sec seconds (default 1) between tries.

Parameters
  • project (MasteringProject) – the mastering project whose status to check

  • match_client (Client) – a Tamr client set to use the port of the Match API

  • num_tries (int) – max number of times to poll endpoint, default 10

  • wait_sec (int) – number of seconds to wait between tries, default 1

Return type

bool

Returns

bool indicating whether project is queryable
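The retry behaviour described above can be pictured with a small, library-independent sketch. This is not the tamr_toolbox implementation: `poll_until_ready`, `is_queryable`, and the injectable `sleep` argument are hypothetical names used only for illustration. Whether the library sleeps after the final failed try is not stated in this reference, and this sketch does not.

```python
import time
from typing import Callable


def poll_until_ready(
    is_queryable: Callable[[], bool],
    *,
    num_tries: int = 10,
    wait_sec: float = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as is_queryable() succeeds, else False after num_tries checks."""
    for attempt in range(num_tries):
        if is_queryable():
            return True
        # Assumption: no wait after the last failed check.
        if attempt < num_tries - 1:
            sleep(wait_sec)
    return False
```

With the defaults, a service that never becomes queryable makes this sketch return False after roughly nine seconds of waiting. Raising num_tries or wait_sec lengthens that budget.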

tamr_toolbox.realtime.matching.match_query(*, project, match_client, records, type, primary_key=None, batch_size=None, min_match_prob=None, max_num_matches=None)

Finds the best matching clusters or records for each supplied record. Returns a dictionary in which each key corresponds to an input record and each value is a list of the RealTime match results for that record. An empty result list indicates a null response from matching (or no responses above min_match_prob, if that parameter was supplied).

Parameters
  • project (MasteringProject) – the mastering project to query for matches

  • match_client (Client) – a Tamr client set to use the port of the Match API

  • records (List[Dict[str, Any]]) – list of records to match

  • type (str) – one of “records” or “clusters” – whether to pull record or cluster matches

  • primary_key (Optional[str]) – a primary key for the data; if supplied, this must be a field in input records

  • batch_size (Optional[int]) – split input into this batch size for match query calls (e.g. to prevent network timeouts), default None sends a single query with all records

  • min_match_prob (Optional[float]) – if set, only matches with probability above this minimum will be returned, default None

  • max_num_matches (Optional[int]) – if set, at most max_num_matches will be returned for each input record in records, default None

Return type

Dict[Union[int, str], List[Dict[str, Any]]]

Returns

Dict keyed by integers (indices of the input records), or by primary_key values if primary_key is supplied; each value is a list containing the matched data
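Because an empty list signals that an input record had no match, a caller will often want to separate matched inputs from unmatched ones. The following is a minimal sketch that relies only on the documented return shape. `split_matched` is a hypothetical helper, not part of tamr_toolbox, and it makes no assumption about the fields inside each match.

```python
from typing import Any, Dict, List, Tuple, Union

Key = Union[int, str]
MatchResults = Dict[Key, List[Dict[str, Any]]]


def split_matched(results: MatchResults) -> Tuple[MatchResults, List[Key]]:
    """Split results into inputs with at least one match and inputs with none."""
    # Keys are input indices, or primary_key values if primary_key was supplied.
    matched = {key: matches for key, matches in results.items() if matches}
    unmatched = [key for key, matches in results.items() if not matches]
    return matched, unmatched
```

The unmatched keys can then be used to look up the original input records, for example to route them to manual review.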

Raises
  • ValueError – if match type is not “records” or “clusters”, or if batch_size is invalid

  • RuntimeError – if query fails
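The three functions documented here are plausibly used in sequence: refresh the match data, wait for the service to become queryable, then query. The sketch below shows that flow under stated assumptions. The ordering is inferred from the descriptions above rather than confirmed by them, `refresh_and_match` is a hypothetical wrapper, and the three callables are passed in as arguments so the example runs without a Tamr instance. The RuntimeError raised when polling fails is this sketch's own choice, since the poll function is documented only to return a bool.

```python
from typing import Any, Callable, Dict, List, Union

MatchResults = Dict[Union[int, str], List[Dict[str, Any]]]


def refresh_and_match(
    project: Any,
    match_client: Any,
    records: List[Dict[str, Any]],
    *,
    update: Callable[..., Any],
    poll: Callable[..., bool],
    query: Callable[..., MatchResults],
    match_type: str = "clusters",
) -> MatchResults:
    """Refresh match data, wait for the service, then run one match query."""
    if match_type not in ("records", "clusters"):
        raise ValueError(f"match_type must be 'records' or 'clusters', got {match_type!r}")
    # Documented to raise RuntimeError if the update API call fails.
    update(project=project)
    # Documented to return a bool; raising here is a choice made by this sketch.
    if not poll(project=project, match_client=match_client):
        raise RuntimeError("match service did not become queryable")
    return query(
        project=project,
        match_client=match_client,
        records=records,
        type=match_type,
    )
```

In real use, update, poll, and query would be update_realtime_match_data, poll_realtime_match_status, and match_query from tamr_toolbox.realtime.matching, with project a MasteringProject and match_client a Client pointed at the Match API port.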