Translation
This module requires an optional dependency. See Installation for details.
Translate
Tasks related to efficiently translating data not present in existing translation dictionaries
- tamr_toolbox.enrichment.translate.standardize_phrases(original_phrases)
Standardize phrases before translation to avoid re-translating phrases that were already translated with different formatting
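The idea behind standardization can be sketched as follows. The rules that standardize_phrases actually applies are not described on this page, so the rules below (trim, collapse whitespace, lowercase) and the helper names `standardize` and `group_by_standard_form` are assumptions for illustration only.

```python
import re
from typing import Dict, List, Set


def standardize(phrase: str) -> str:
    """Hypothetical rule: trim, collapse whitespace runs, and lowercase."""
    return re.sub(r"\s+", " ", phrase.strip()).lower()


def group_by_standard_form(original_phrases: List[str]) -> Dict[str, Set[str]]:
    """Map each standardized phrase to the original spellings that produce it."""
    groups: Dict[str, Set[str]] = {}
    for phrase in original_phrases:
        groups.setdefault(standardize(phrase), set()).add(phrase)
    return groups


groups = group_by_standard_form(["Cheddar  Cheese", "cheddar cheese ", "Ground Beef"])
# Three original spellings collapse to two phrases that need translating.
print(sorted(groups))
```

Only the keys of `groups` would need to be sent for translation, which is the saving standardization is meant to provide.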
- tamr_toolbox.enrichment.translate.get_phrases_to_translate(original_phrases, translation_dictionary)
Find phrases that have not been translated previously and initialize a dictionary entry for each of them
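A minimal sketch of this lookup, assuming the phrases are already standardized and that an untranslated entry is one whose translated_phrase is None. The `Entry` class and `pending_phrases` function are hypothetical stand-ins, not the toolbox implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    """Hypothetical stand-in for a TranslationDictionary entry."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def pending_phrases(standardized: List[str], dictionary: Dict[str, Entry]) -> List[str]:
    """Return phrases lacking a translation, creating empty entries as needed."""
    pending: List[str] = []
    for phrase in standardized:
        entry = dictionary.setdefault(phrase, Entry(standardized_phrase=phrase))
        if entry.translated_phrase is None and phrase not in pending:
            pending.append(phrase)
    return pending


dictionary = {
    "cheddar cheese": Entry("cheddar cheese", "fromage cheddar", "en", {"cheddar cheese"}),
}
to_translate = pending_phrases(["cheddar cheese", "ground beef", "ground beef"], dictionary)
print(to_translate)
```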
- tamr_toolbox.enrichment.translate.from_list(all_phrases, client, dictionary, *, source_language='auto', target_language='en', chunk_size=100, translation_model='nmt', intermediate_save_every_n_chunks=None, intermediate_save_to_disk=False, intermediate_folder='/tmp')
Translate a list of phrases from the source language to the target language. Translations are saved in a dictionary on your local file system before the main dictionary is updated.
- Parameters
all_phrases (List[str]) – list of standardized phrases to translate
client (Client) – a Google Translate API client
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
source_language (str) – the language the text to translate is in; “auto” means the Google Translate API will try to detect the source language automatically
target_language (str) – the language to translate into
chunk_size (int) – number of phrases to translate per API call; if set too high, you will hit API user rate limit errors
translation_model (str) – the Google Translate API model to use, “nmt” or “pbmt”. Choose “pbmt” if an “nmt” model doesn’t exist for your source-to-target language pair
intermediate_save_every_n_chunks (Optional[int]) – save the dictionary to disk after every n chunks of phrases translated
intermediate_save_to_disk (bool) – whether to periodically save the dictionary to disk to avoid losing translation data if the code breaks
intermediate_folder (str) – path to the folder where the dictionary is saved periodically to avoid losing translation data
- Return type
- Returns
The updated translation dictionary
- Raises
ValueError – if the argument chunk_size is set to 0
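The role of chunk_size can be sketched as below. This is not the toolbox code: `translate_in_chunks` and `fake_translate` are hypothetical, the translation call is replaced by a local function, and rejecting negative values is an assumption (only 0 is documented as raising ValueError).

```python
from typing import Callable, Dict, List


def translate_in_chunks(
    phrases: List[str],
    translate_chunk: Callable[[List[str]], List[str]],
    chunk_size: int = 100,
) -> Dict[str, str]:
    """Translate phrases in batches of at most chunk_size per call."""
    if chunk_size < 1:
        # The documented failure case is 0; negatives are rejected here too.
        raise ValueError("chunk_size must be greater than 0")
    translations: Dict[str, str] = {}
    for start in range(0, len(phrases), chunk_size):
        chunk = phrases[start:start + chunk_size]
        translations.update(zip(chunk, translate_chunk(chunk)))
    return translations


calls: List[int] = []


def fake_translate(chunk: List[str]) -> List[str]:
    """Stand-in for one API call: records the batch size and upper-cases."""
    calls.append(len(chunk))
    return [phrase.upper() for phrase in chunk]


result = translate_in_chunks(["a", "b", "c"], fake_translate, chunk_size=2)
print(calls)
print(result)
```

A smaller chunk_size means more calls with fewer phrases each, which is the trade-off to tune when rate limit errors appear.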
Dictionary
Each entry in a toolbox translation dictionary has the following general format:
TranslationDictionary(
    standardized_phrase="cheddar cheese",
    translated_phrase="fromage cheddar",
    detected_language="en",
    original_phrases={"cheddar cheese"},
)
When loaded in memory, toolbox translation dictionaries are keyed by standardized phrase so that each entry can be accessed directly:
{
    "cheddar cheese": TranslationDictionary(
        standardized_phrase="cheddar cheese",
        translated_phrase="fromage cheddar",
        detected_language="en",
        original_phrases={"cheddar cheese"},
    ),
    "ground beef": TranslationDictionary(
        standardized_phrase="ground beef",
        translated_phrase="boeuf haché",
        detected_language="en",
        original_phrases={"ground beef"},
    ),
}
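As an illustration of the keyed format, the sketch below builds the lookup from a list of entries and reads one back. `Entry` is a hypothetical stand-in with the same four fields, used so that the example runs without the optional dependency.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    """Hypothetical stand-in mirroring the fields of TranslationDictionary."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


entries: List[Entry] = [
    Entry("cheddar cheese", "fromage cheddar", "en", {"cheddar cheese"}),
    Entry("ground beef", "boeuf haché", "en", {"ground beef"}),
]

# Key each entry by its standardized phrase, as in the in-memory format.
in_memory: Dict[str, Entry] = {e.standardized_phrase: e for e in entries}

print(in_memory["ground beef"].translated_phrase)
print("blue cheese" in in_memory)
```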
When loaded on Tamr, translation dictionaries are source datasets with “standardized_phrase” as the primary key and the following attributes:
“translated_phrase”
“detected_language”
“original_phrases”
Tasks related to creating, updating, saving and moving translation dictionaries in and out of Tamr
- class tamr_toolbox.enrichment.dictionary.TranslationDictionary(standardized_phrase=None, translated_phrase=None, detected_language=None, original_phrases=<factory>)
A DataClass for translation dictionaries
- Parameters
standardized_phrase (Optional[str]) – the unique, common standardized version of all original_phrases
translated_phrase (Optional[str]) – the standardized phrase translated into the target language of the dictionary
detected_language (Optional[str]) – the detected language of the standardized phrase if the source language is set to “auto”
original_phrases (Set[str]) – a set of original phrases that all convert to the standardized phrase when standardization is applied
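The `<factory>` default shown in the signature is how a dataclass field with a default factory is rendered. The sketch below uses a hypothetical `Entry` class and assumes the factory is `set`, which matches the Set[str] type; it shows that every instance gets its own empty set.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Entry:
    """Hypothetical stand-in with the same field names and defaults."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    # Assumption: the factory is `set`, giving each instance a fresh set.
    original_phrases: Set[str] = field(default_factory=set)


first = Entry(standardized_phrase="cheddar cheese")
second = Entry(standardized_phrase="ground beef")

# Adding to one entry's set does not affect the other entry.
first.original_phrases.add("Cheddar Cheese")
print(first.original_phrases)
print(second.original_phrases)
```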
- tamr_toolbox.enrichment.dictionary.filename(dictionary_folder, *, target_language='en', source_language='auto')
Generate a toolbox translation dictionary file path
- Parameters
dictionary_folder (Union[str, Path]) – base directory where dictionaries are saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language the text to translate is in; if None, assumes it is “auto”
- Return type
- Returns
A toolbox translation dictionary file path
- tamr_toolbox.enrichment.dictionary.create(dictionary_folder, *, target_language='en', source_language='auto')
Create an empty dictionary on disk
- Parameters
dictionary_folder (str) – base directory where the dictionary is saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language the text to translate is in; if None, assumes it is “auto”
- Return type
- Returns
A path to a dictionary
- tamr_toolbox.enrichment.dictionary.to_json(dictionary)
Convert toolbox translation dictionary entries to a JSON format in which set objects are converted to lists
- Parameters
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
- Return type
- Returns
A list of toolbox translation dictionary entries in json format
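The set-to-list conversion is needed because JSON has no set type. A minimal sketch of the idea, using a hypothetical `Entry` stand-in and `entries_to_json` helper; sorting the list and emitting one JSON string per entry are assumptions made here for a deterministic example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    """Hypothetical stand-in for a TranslationDictionary entry."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def entries_to_json(dictionary: Dict[str, Entry]) -> List[str]:
    """Serialize each entry to a JSON string, turning the set into a list."""
    lines: List[str] = []
    for entry in dictionary.values():
        record = asdict(entry)
        # json.dumps cannot serialize a set, so convert it to a sorted list.
        record["original_phrases"] = sorted(record["original_phrases"])
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


lines = entries_to_json(
    {"ground beef": Entry("ground beef", "boeuf haché", "en", {"ground beef", "Ground Beef"})}
)
print(lines[0])
```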
- tamr_toolbox.enrichment.dictionary.to_dict(dictionary)
Convert toolbox translation dictionary entries to a dictionary format in which set objects are converted to lists
- tamr_toolbox.enrichment.dictionary.save(translation_dictionary, dictionary_folder, *, target_language='en', source_language='auto')
Save a toolbox translation dictionary to disk
- Parameters
translation_dictionary (Dict[str, TranslationDictionary]) – dictionary object to be saved to disk
dictionary_folder (str) – base directory where the dictionary is saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language the text to translate is in; if None, assumes it is “auto”
- Return type
- tamr_toolbox.enrichment.dictionary.load(dictionary_folder, *, target_language='en', source_language='auto')
Load a toolbox translation dictionary from disk into memory
- Parameters
dictionary_folder (str) – base directory where the dictionary is saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language the text to translate is in; if None, assumes it is “auto”
- Return type
- Returns
A toolbox translation dictionary
- Raises
RuntimeError – if the dictionary was found on disk but is not of a valid toolbox translation dictionary type
- tamr_toolbox.enrichment.dictionary.update(main_dictionary, tmp_dictionary)
Update a toolbox translation dictionary with the contents of a temporary translation dictionary
- Parameters
main_dictionary (Dict[str, TranslationDictionary]) – the main toolbox translation dictionary containing past translation results
tmp_dictionary (Dict[str, TranslationDictionary]) – a temporary toolbox translation dictionary containing new translations
- Return type
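One plausible way to fold a temporary dictionary into the main one is sketched below. The merge rules are assumptions, since this page does not spell them out: new keys are added, original phrases are unioned, and a new translation overwrites the old one. `Entry` and `merge` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Entry:
    """Hypothetical stand-in for a TranslationDictionary entry."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def merge(main: Dict[str, Entry], tmp: Dict[str, Entry]) -> None:
    """Update main in place with the entries of tmp (assumed semantics)."""
    for key, new in tmp.items():
        current = main.get(key)
        if current is None:
            main[key] = new
            continue
        current.original_phrases |= new.original_phrases
        if new.translated_phrase is not None:
            current.translated_phrase = new.translated_phrase
            current.detected_language = new.detected_language


main = {"ground beef": Entry("ground beef", None, None, {"ground beef"})}
tmp = {
    "ground beef": Entry("ground beef", "boeuf haché", "en", {"Ground Beef"}),
    "cheddar cheese": Entry("cheddar cheese", "fromage cheddar", "en", {"cheddar cheese"}),
}
merge(main, tmp)
print(sorted(main))
```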
- tamr_toolbox.enrichment.dictionary.convert_to_mappings(dictionary)
Transform a translation dictionary into a mapping of original phrases to translated phrases
- Parameters
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
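The transformation can be sketched as a flattening of each entry's original_phrases onto its translated_phrase. `Entry` and `to_mappings` are hypothetical, and skipping entries without a translation is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Entry:
    """Hypothetical stand-in for a TranslationDictionary entry."""
    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def to_mappings(dictionary: Dict[str, Entry]) -> Dict[str, str]:
    """Map every original phrase to the translation of its standardized form."""
    mapping: Dict[str, str] = {}
    for entry in dictionary.values():
        if entry.translated_phrase is None:
            continue  # assumption: untranslated entries are left out
        for original in entry.original_phrases:
            mapping[original] = entry.translated_phrase
    return mapping


mapping = to_mappings(
    {
        "cheddar cheese": Entry(
            "cheddar cheese", "fromage cheddar", "en", {"Cheddar Cheese", "cheddar cheese"}
        ),
        "ground beef": Entry("ground beef", None, None, {"ground beef"}),
    }
)
print(mapping["Cheddar Cheese"])
```

A mapping of this shape lets each original value in a dataset be replaced by its translation with a single lookup.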
- tamr_toolbox.enrichment.dictionary.from_dataset(dataset)
Stream a dictionary from Tamr
- Parameters
dataset (Dataset) – Tamr Dataset object
- Return type
- Returns
A toolbox translation dictionary
- Raises
ValueError – if the provided dataset is not a toolbox translation dictionary dataset
NameError – if the provided dataset does not contain all the attributes of a toolbox translation dictionary
RuntimeError – if there is any other problem while reading the dataset as a toolbox translation dictionary
- tamr_toolbox.enrichment.dictionary.to_dataset(dictionary, *, dataset=None, datasets_collection=None, target_language=None, source_language=None, create_dataset=False)
Ingest a toolbox dictionary into Tamr, creating the source dataset if it doesn’t exist
- Parameters
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
datasets_collection (Optional[DatasetCollection]) – a Tamr client datasets collection
target_language (Optional[str]) – the target language of the given dictionary
source_language (Optional[str]) – the source language of the given dictionary
create_dataset (bool) – flag to create the translation dictionary source dataset (True) or upsert to an existing one (False)
- Return type
- Returns
The name of the created or updated Tamr Dataset
- Raises
ValueError – if create_dataset is False and dataset is not provided or is not a toolbox translation dictionary dataset; or if create_dataset is True but datasets_collection, target_language or source_language is missing, or the Tamr dataset already exists
RuntimeError – if there is an error during the creation of the Tamr dataset attributes
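The argument combinations described under Raises can be summarized in a small check. This is a sketch of the documented rules only: `check_to_dataset_arguments` is a hypothetical helper, and the checks that need a Tamr connection (dataset type, dataset already existing) are left out.

```python
from typing import Any, List, Optional


def check_to_dataset_arguments(
    *,
    dataset: Optional[Any] = None,
    datasets_collection: Optional[Any] = None,
    target_language: Optional[str] = None,
    source_language: Optional[str] = None,
    create_dataset: bool = False,
) -> None:
    """Raise ValueError for the argument combinations documented as invalid."""
    if not create_dataset:
        # Upserting requires an existing dataset to write to.
        if dataset is None:
            raise ValueError("dataset is required when create_dataset is False")
        return
    required = {
        "datasets_collection": datasets_collection,
        "target_language": target_language,
        "source_language": source_language,
    }
    missing: List[str] = [name for name, value in required.items() if value is None]
    if missing:
        raise ValueError(f"missing arguments for dataset creation: {missing}")


def is_valid(**kwargs: Any) -> bool:
    """Return True when the argument combination passes the check."""
    try:
        check_to_dataset_arguments(**kwargs)
    except ValueError:
        return False
    return True


print(is_valid(create_dataset=True, target_language="fr"))
print(is_valid(dataset=object()))
```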