Translation

This module requires an optional dependency. See Installation for details.

Translate

Tasks related to efficiently translating data not present in existing translation dictionaries

tamr_toolbox.enrichment.translate.standardize_phrases(original_phrases)[source]

Standardize phrases before translation so that phrases already translated under a different formatting are not translated again

Parameters

original_phrases (List[str]) – List of phrases to standardize

Return type

List[str]

Returns

List of standardized text
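
To illustrate why standardization avoids repeat translations, the sketch below collapses differently formatted variants into one phrase. The function name `standardize_sketch` and the normalization rules (lowercasing, collapsing whitespace) are assumptions made for illustration; the rules applied by `standardize_phrases` may differ.

```python
from typing import List


def standardize_sketch(original_phrases: List[str]) -> List[str]:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    # " ".join(p.split()) trims both ends and squeezes inner runs of whitespace
    return [" ".join(phrase.split()).lower() for phrase in original_phrases]


variants = ["Cheddar Cheese", "  cheddar   cheese ", "CHEDDAR CHEESE"]
standardized = standardize_sketch(variants)
# All three variants collapse to a single phrase, so only one translation is needed
unique_phrases = set(standardized)
```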

tamr_toolbox.enrichment.translate.get_phrases_to_translate(original_phrases, translation_dictionary)[source]

Find phrases that have not been translated yet and initialize a dictionary entry for each

Parameters
Return type

List[str]

Returns

List of standardized phrases not present as keys of the translation dictionary
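
The lookup described above can be sketched as follows. `EntrySketch` is a hypothetical stand-in for `TranslationDictionary`, and creating an entry with no translation is one reading of "initialize a dictionary entry"; the toolbox implementation may handle more cases.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class EntrySketch:
    """Hypothetical stand-in for TranslationDictionary."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def phrases_to_translate_sketch(
    standardized: List[str], dictionary: Dict[str, EntrySketch]
) -> List[str]:
    """Return phrases missing from the dictionary keys, adding an empty entry for each."""
    missing: List[str] = []
    for phrase in standardized:
        if phrase not in dictionary:
            # The new entry has no translation yet (translated_phrase stays None)
            dictionary[phrase] = EntrySketch(standardized_phrase=phrase)
            missing.append(phrase)
    return missing


dictionary = {
    "cheddar cheese": EntrySketch(
        "cheddar cheese", "fromage cheddar", "en", {"cheddar cheese"}
    )
}
todo = phrases_to_translate_sketch(
    ["cheddar cheese", "ground beef", "ground beef"], dictionary
)
```

Because the entry is created on first sight, a phrase that appears twice in the input is only queued once.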

tamr_toolbox.enrichment.translate.from_list(all_phrases, client, dictionary, *, source_language='auto', target_language='en', chunk_size=100, translation_model='nmt', intermediate_save_every_n_chunks=None, intermediate_save_to_disk=False, intermediate_folder='/tmp')[source]

Translate a list of phrases from the source language to the target language. The translations are saved in a dictionary on your local file system before the main dictionary is updated.

Parameters
  • all_phrases (List[str]) – List of standardized phrases to translate.

  • client (Client) – a Google Translate API client

  • dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary

  • source_language (str) – the language of the text to translate; “auto” means the Google API will try to detect the source language automatically

  • target_language (str) – the language to translate into

  • chunk_size (int) – number of phrases to translate per API call; if set too high, you will hit API user rate limit errors

  • translation_model (str) – Google API model to use, “nmt” or “pbmt”. Choose “pbmt” if an “nmt” model doesn’t exist for your source to target language pair

  • intermediate_save_every_n_chunks (Optional[int]) – save the dictionary to disk every n chunks of phrases translated

  • intermediate_save_to_disk (bool) – whether to periodically save the dictionary to disk to avoid losing translation data if the code breaks

  • intermediate_folder (str) – path to the folder where the dictionary will be saved periodically to avoid losing translation data

Return type

Dict[str, TranslationDictionary]

Returns

The updated translation dictionary

Raises

ValueError – if the argument chunk_size is set to 0
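
The role of chunk_size can be illustrated with a small chunking helper. `chunk_phrases` is a hypothetical name and this is only a sketch of the batching idea, not the code `from_list` runs; rejecting negative sizes is an added assumption beyond the documented check for 0.

```python
from typing import Iterator, List


def chunk_phrases(phrases: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size phrases."""
    if chunk_size <= 0:
        # Mirrors the documented ValueError for chunk_size=0 (negatives: assumption)
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(phrases), chunk_size):
        yield phrases[start:start + chunk_size]


# Five phrases with chunk_size=2 give three batches, the last one shorter
chunks = list(chunk_phrases(["a", "b", "c", "d", "e"], chunk_size=2))
```

Each batch would correspond to one API call, which is why a smaller chunk_size means more calls with fewer phrases each.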

Dictionary

Toolbox translation dictionaries have the following general format:

TranslationDictionary(
    standardized_phrase="cheddar cheese",
    translated_phrase="fromage cheddar",
    detected_language="en",
    original_phrases={"cheddar cheese"},
)

When loaded in memory, toolbox translation dictionaries have the following format, so that each entry can be accessed by its standardized phrase:

{
    "cheddar cheese": TranslationDictionary(
        standardized_phrase="cheddar cheese",
        translated_phrase="fromage cheddar",
        detected_language="en",
        original_phrases={"cheddar cheese"},
    ),
    "ground beef": TranslationDictionary(
        standardized_phrase="ground beef",
        translated_phrase="boeuf haché",
        detected_language="en",
        original_phrases={"ground beef"},
    ),
}

When loaded on Tamr, translation dictionaries are source datasets with “standardized_phrase” as the primary key and the following attributes:

  • “translated_phrase”

  • “detected_language”

  • “original_phrases”


Tasks related to creating, updating, saving and moving translation dictionaries in and out of Tamr

class tamr_toolbox.enrichment.dictionary.TranslationDictionary(standardized_phrase=None, translated_phrase=None, detected_language=None, original_phrases=<factory>)[source]

A DataClass for translation dictionaries

Parameters
  • standardized_phrase (Optional[str]) – The unique common standardized version of all original_phrases

  • translated_phrase (Optional[str]) – The translation of the standardized phrase into the target language of the dictionary

  • detected_language (Optional[str]) – The detected language of the standardized phrase if the source language is set to auto

  • original_phrases (Set[str]) – A set of original phrases which all convert to the standardized phrase when standardization is applied
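
The `<factory>` default shown in the signature is how dataclasses display a `default_factory`. The sketch below uses a hypothetical stand-in class, `EntrySketch`, with the same four fields to show the effect: each instance gets its own empty set.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class EntrySketch:
    """Hypothetical stand-in with the same fields as TranslationDictionary."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    # default_factory builds a fresh set per instance (displayed as <factory>)
    original_phrases: Set[str] = field(default_factory=set)


first = EntrySketch(standardized_phrase="cheddar cheese")
second = EntrySketch(standardized_phrase="ground beef")
first.original_phrases.add("Cheddar Cheese")
# second.original_phrases is unaffected because the two sets are independent
```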

tamr_toolbox.enrichment.dictionary.filename(dictionary_folder, *, target_language='en', source_language='auto')[source]

Generate a toolbox translation dictionary file path

Parameters
Return type

str

Returns

A toolbox translation dictionary file path

tamr_toolbox.enrichment.dictionary.create(dictionary_folder, *, target_language='en', source_language='auto')[source]

Create an empty dictionary on disk

Parameters
Return type

str

Returns

A path to a dictionary

tamr_toolbox.enrichment.dictionary.to_json(dictionary)[source]

Convert toolbox translation dictionary entries to a JSON format in which set objects are converted to lists

Parameters

dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary

Return type

List[str]

Returns

A list of toolbox translation dictionary entries in json format

tamr_toolbox.enrichment.dictionary.to_dict(dictionary)[source]

Convert toolbox translation dictionary entries to a dictionary format in which set objects are converted to lists

Parameters

dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary

Return type

List[Dict[str, Union[str, List]]]

Returns

A list of toolbox translation dictionary entries in dictionary format
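
The set-to-list conversion matters because Python's `json` module cannot serialize sets. A minimal sketch of both conversions, using a hypothetical stand-in class `EntrySketch` and hypothetical helper names, could look like this; sorting the phrases is a choice made here for stable output, not something the documentation states.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Set, Union


@dataclass
class EntrySketch:
    """Hypothetical stand-in for TranslationDictionary."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def entries_as_dicts(
    dictionary: Dict[str, EntrySketch]
) -> List[Dict[str, Union[str, List]]]:
    """Turn each entry into a plain dict whose original_phrases is a list."""
    rows = []
    for entry in dictionary.values():
        row = asdict(entry)
        # Sets are not JSON serializable, so replace the set with a sorted list
        row["original_phrases"] = sorted(row["original_phrases"])
        rows.append(row)
    return rows


def entries_as_json(dictionary: Dict[str, EntrySketch]) -> List[str]:
    """Serialize each entry to one JSON string."""
    return [json.dumps(row) for row in entries_as_dicts(dictionary)]


dictionary = {
    "ground beef": EntrySketch(
        "ground beef", "boeuf haché", "en", {"ground beef", "Ground Beef"}
    )
}
rows = entries_as_dicts(dictionary)
lines = entries_as_json(dictionary)
```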

tamr_toolbox.enrichment.dictionary.save(translation_dictionary, dictionary_folder, *, target_language='en', source_language='auto')[source]

Save a toolbox translation dictionary to disk

Parameters

Return type

None

tamr_toolbox.enrichment.dictionary.load(dictionary_folder, *, target_language='en', source_language='auto')[source]

Load a toolbox translation dictionary from disk to memory

Parameters
Return type

Dict[str, TranslationDictionary]

Returns

A toolbox translation dictionary

Raises

RuntimeError – if the dictionary was found on disk but is not of a valid toolbox translation dictionary type

tamr_toolbox.enrichment.dictionary.update(main_dictionary, tmp_dictionary)[source]

Update a toolbox translation dictionary with another temporary translation dictionary

Parameters

Return type

None

tamr_toolbox.enrichment.dictionary.convert_to_mappings(dictionary)[source]

Transform a translation dictionary into a mapping of original phrases to translated phrases

Parameters

dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary

Return type

Dict[str, str]

Returns

a dictionary with the original phrase as key and the translated phrase as value
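
The mapping described here flattens the dictionary so that every original spelling points directly at its translation. A minimal sketch, assuming a hypothetical stand-in class `EntrySketch` and a hypothetical helper name, is:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class EntrySketch:
    """Hypothetical stand-in for TranslationDictionary."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def to_mappings_sketch(
    dictionary: Dict[str, EntrySketch]
) -> Dict[str, Optional[str]]:
    """Map every original phrase to the translation of its standardized form."""
    mapping: Dict[str, Optional[str]] = {}
    for entry in dictionary.values():
        for original in entry.original_phrases:
            mapping[original] = entry.translated_phrase
    return mapping


dictionary = {
    "cheddar cheese": EntrySketch(
        "cheddar cheese",
        "fromage cheddar",
        "en",
        {"Cheddar Cheese", "cheddar cheese"},
    )
}
mapping = to_mappings_sketch(dictionary)
```

Such a mapping can then be applied to raw data values without standardizing them again.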

tamr_toolbox.enrichment.dictionary.from_dataset(dataset)[source]

Stream a dictionary from Tamr

Parameters

dataset (Dataset) – Tamr Dataset object

Return type

Dict[str, TranslationDictionary]

Returns

A toolbox translation dictionary

Raises
  • ValueError – if the provided dataset is not a toolbox translation dictionary dataset

  • NameError – if the provided dataset does not contain all the attributes of a toolbox translation dictionary

  • RuntimeError – if there is any other problem while reading the dataset as a toolbox translation dictionary

tamr_toolbox.enrichment.dictionary.to_dataset(dictionary, *, dataset=None, datasets_collection=None, target_language=None, source_language=None, create_dataset=False)[source]

Ingest a toolbox dictionary into Tamr, creating the source dataset if it doesn’t exist

Parameters
  • dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary

  • dataset (Optional[Dataset]) – a Tamr client dataset

  • datasets_collection (Optional[DatasetCollection]) – a Tamr client datasets collection

  • target_language (Optional[str]) – the target language of the given dictionary

  • source_language (Optional[str]) – the source language of the given dictionary

  • create_dataset (bool) – flag to create a new translation dictionary source dataset (True) or upsert to an existing one (False)

Return type

str

Returns

The name of the created or updated Tamr Dataset

Raises
  • ValueError – if create_dataset is False and dataset is not provided or is not a toolbox translation dictionary dataset; or if create_dataset is True and datasets_collection, target_language or source_language is missing, or the Tamr dataset already exists

  • RuntimeError – if there is an error during the creation of the Tamr dataset attributes
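
The argument combinations listed under ValueError can be summarized with a small validation sketch. The function name `check_to_dataset_args_sketch` is hypothetical, and the sketch covers only the missing-argument cases; it does not check the dataset type or whether the Tamr dataset already exists.

```python
from typing import Any, Optional


def check_to_dataset_args_sketch(
    *,
    dataset: Optional[Any] = None,
    datasets_collection: Optional[Any] = None,
    target_language: Optional[str] = None,
    source_language: Optional[str] = None,
    create_dataset: bool = False,
) -> None:
    """Raise ValueError for the missing-argument cases described above."""
    if not create_dataset:
        # Upsert mode: an existing dataset must be supplied
        if dataset is None:
            raise ValueError("dataset is required when create_dataset is False")
        return
    # Create mode: the collection and both languages are needed
    required = {
        "datasets_collection": datasets_collection,
        "target_language": target_language,
        "source_language": source_language,
    }
    missing = [name for name, value in required.items() if value is None]
    if missing:
        raise ValueError(f"missing arguments for create_dataset=True: {missing}")


# Upsert mode with a placeholder dataset object passes the check
check_to_dataset_args_sketch(dataset=object())
```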