Translation
This module requires an optional dependency. See Installation for details.
Translate
Tasks related to efficiently translating data not present in existing translation dictionaries
-
tamr_toolbox.enrichment.translate.
standardize_phrases
(original_phrases) Standardize phrases before translation to avoid re-translating phrases that were already translated with different formatting
-
tamr_toolbox.enrichment.translate.
get_phrases_to_translate
(original_phrases, translation_dictionary) Find phrases not previously translated and initiate a dictionary entry for each
-
tamr_toolbox.enrichment.translate.
from_list
(all_phrases, client, dictionary, *, source_language='auto', target_language='en', chunk_size=100, translation_model='nmt', intermediate_save_every_n_chunks=None, intermediate_save_to_disk=False, intermediate_folder='/tmp') Translate a list of phrases from the source language to the target language. The translation is saved in a dictionary on your local file system before the main dictionary is updated
- Parameters
all_phrases (List[str]) – list of standardized phrases to translate
client (Client) – a Google Translate API client
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
source_language (str) – the language of the text to translate; “auto” means the Google API client will try to detect the source language automatically
target_language (str) – the language to translate into
chunk_size (int) – number of phrases to translate per API client call; setting it too high will trigger API user rate limit errors
translation_model (str) – Google API model to use, “nmt” or “pbmt”. Choose “pbmt” if an “nmt” model doesn’t exist for your source-to-target language pair
intermediate_save_every_n_chunks (Optional[int]) – save the dictionary to disk every n chunks of phrases translated
intermediate_save_to_disk (bool) – whether to save the dictionary to disk periodically to avoid losing translation data if the code breaks
intermediate_folder (str) – path to the folder where the dictionary will be saved periodically to avoid losing translation data
- Return type
- Returns
The updated translation dictionary
- Raises
ValueError – if the argument chunk_size is set to 0
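The chunk_size behaviour described above can be pictured with a small, self-contained sketch. This is not the toolbox implementation: `chunk_phrases` is a hypothetical helper, and rejecting negative values in addition to 0 is an assumption of the sketch.

```python
from typing import Iterator, List


def chunk_phrases(phrases: List[str], chunk_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size phrases (hypothetical helper)."""
    if chunk_size <= 0:
        # The documented error covers chunk_size == 0; also rejecting
        # negative values is an assumption made for this sketch.
        raise ValueError(f"chunk_size must be a positive integer, got {chunk_size}")
    for start in range(0, len(phrases), chunk_size):
        yield phrases[start:start + chunk_size]


phrases = ["cheddar cheese", "ground beef", "whole milk", "rye bread", "sea salt"]
# Each chunk would correspond to one API call in a real translation run.
chunks = list(chunk_phrases(phrases, chunk_size=2))
```

Smaller chunks mean more API calls but less lost work per failed call, which is presumably why the periodic save options are expressed in numbers of chunks.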
Dictionary
Toolbox translation dictionaries have the following general format:
TranslationDictionary(
    standardized_phrase="cheddar cheese",
    translated_phrase="fromage cheddar",
    detected_language="en",
    original_phrases={"cheddar cheese"},
)
When loaded in memory, toolbox translation dictionaries are keyed by standardized phrase so that each entry can be accessed directly:
{
    "cheddar cheese": TranslationDictionary(
        standardized_phrase="cheddar cheese",
        translated_phrase="fromage cheddar",
        detected_language="en",
        original_phrases={"cheddar cheese"},
    ),
    "ground beef": TranslationDictionary(
        standardized_phrase="ground beef",
        translated_phrase="boeuf haché",
        detected_language="en",
        original_phrases={"ground beef"},
    ),
}
When loaded into Tamr, translation dictionaries are source datasets with “standardized_phrase” as the primary key and the following attributes:
“translated_phrase”
“detected_language”
“original_phrases”
Tasks related to creating, updating, saving and moving translation dictionaries in and out of Tamr
-
class
tamr_toolbox.enrichment.dictionary.
TranslationDictionary
(standardized_phrase=None, translated_phrase=None, detected_language=None, original_phrases=<factory>) A DataClass for translation dictionaries
- Parameters
standardized_phrase (Optional[str]) – the unique, common standardized version of all original_phrases
translated_phrase (Optional[str]) – the standardized phrase translated into the target language of the dictionary
detected_language (Optional[str]) – the language detected for the standardized phrase if the source language is set to auto
original_phrases (Set[str]) – the set of original phrases that all convert to the standardized phrase when standardization is applied
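To make the field layout concrete, here is a minimal stand-in that mirrors the documented signature. `TranslationEntry` is a hypothetical name used so the example runs without the toolbox installed; it is not the toolbox class, and the extra original phrase is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class TranslationEntry:
    """Hypothetical stand-in mirroring the documented TranslationDictionary fields."""

    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    # default_factory gives every entry its own set instead of a shared one,
    # which matches the "<factory>" default shown in the signature.
    original_phrases: Set[str] = field(default_factory=set)


entry = TranslationEntry(
    standardized_phrase="cheddar cheese",
    translated_phrase="fromage cheddar",
    detected_language="en",
)
# Illustrative variant assumed to standardize to the same key.
entry.original_phrases.add("Cheddar Cheese")

# In-memory form: entries keyed by their standardized phrase.
in_memory: Dict[str, TranslationEntry] = {entry.standardized_phrase: entry}
```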
-
class
tamr_toolbox.enrichment.dictionary.
SetEncoder
(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None) A class that transforms type ‘set’ to type ‘list’ when saving objects in JSON format
-
default
(python_object) Transform a set into a list if input is a set
- Parameters
python_object – the python object to be saved to a json format
- Returns
Default json encoder format of input object or List if input is a Set
-
encode
(o) Return a JSON string representation of a Python data structure.
>>> from json.encoder import JSONEncoder
>>> JSONEncoder().encode({"foo": ["bar", "baz"]})
'{"foo": ["bar", "baz"]}'
-
iterencode
(o, _one_shot=False) Encode the given object and yield each string representation as available.
For example:
for chunk in JSONEncoder().iterencode(bigobject):
    mysocket.write(chunk)
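The set-to-list conversion that SetEncoder is documented to perform follows the standard pattern of overriding `json.JSONEncoder.default`. The sketch below shows that pattern with a hypothetical class name, `SetToListEncoder`; sorting the set is a choice made here for deterministic output and is not something the toolbox documentation states.

```python
import json


class SetToListEncoder(json.JSONEncoder):
    """Illustrative encoder: serialize sets as lists, defer everything else."""

    def default(self, python_object):
        if isinstance(python_object, set):
            # sorted() is an assumption of this sketch; it requires the
            # elements to be mutually comparable (true for a set of str).
            return sorted(python_object)
        # The base class raises TypeError for unsupported types.
        return super().default(python_object)


payload = {
    "standardized_phrase": "cheddar cheese",
    "original_phrases": {"cheddar cheese", "Cheddar Cheese"},
}
encoded = json.dumps(payload, cls=SetToListEncoder)
```

Passing the encoder through the `cls` argument of `json.dumps` keeps the conversion in one place instead of converting sets by hand before every save.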
-
-
tamr_toolbox.enrichment.dictionary.
filename
(dictionary_folder, *, target_language='en', source_language='auto') Generate a toolbox translation dictionary file path
- Parameters
dictionary_folder (Union[str, Path]) – base directory where dictionaries are saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language of the text to translate; if None, “auto” is assumed
- Return type
- Returns
A toolbox translation dictionary file path
-
tamr_toolbox.enrichment.dictionary.
create
(dictionary_folder, *, target_language='en', source_language='auto') Create an empty dictionary on disk
- Parameters
dictionary_folder (str) – base directory where the dictionary is saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language of the text to translate; if None, “auto” is assumed
- Return type
- Returns
A path to a dictionary
-
tamr_toolbox.enrichment.dictionary.
to_json
(dictionary) Convert toolbox translation dictionary entries to JSON format, with set objects converted to lists
- Parameters
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
- Return type
- Returns
A list of toolbox translation dictionary entries in json format
-
tamr_toolbox.enrichment.dictionary.
to_dict
(dictionary) Convert toolbox translation dictionary entries to dictionary format, with set objects converted to lists
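As a rough illustration of this kind of conversion, the sketch below flattens entries into plain dictionaries whose sets have become lists. It is an assumption-laden stand-in: `TranslationEntry` and `entries_to_records` are hypothetical names, and the exact output shape of the toolbox function is not documented here.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional, Set


@dataclass
class TranslationEntry:
    """Hypothetical stand-in for a toolbox translation dictionary entry."""

    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def entries_to_records(dictionary: Dict[str, TranslationEntry]) -> List[Dict[str, Any]]:
    """Return one plain dict per entry, with the set field turned into a list."""
    records = []
    for entry in dictionary.values():
        record = asdict(entry)
        # Sorting is a choice of this sketch, made for reproducible output.
        record["original_phrases"] = sorted(record["original_phrases"])
        records.append(record)
    return records


dictionary = {
    "ground beef": TranslationEntry(
        standardized_phrase="ground beef",
        translated_phrase="boeuf haché",
        detected_language="en",
        original_phrases={"ground beef", "Ground Beef"},
    )
}
records = entries_to_records(dictionary)
```

Records in this shape contain only strings and lists, so they can be passed to `json.dumps` without a custom encoder.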
-
tamr_toolbox.enrichment.dictionary.
save
(translation_dictionary, dictionary_folder, *, target_language='en', source_language='auto') Save a toolbox translation dictionary to disk
- Parameters
translation_dictionary (Dict[str, TranslationDictionary]) – dictionary object to be saved to disk
dictionary_folder (str) – base directory where the dictionary is saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language of the text to translate; if None, “auto” is assumed
Returns:
- Return type
-
tamr_toolbox.enrichment.dictionary.
load
(dictionary_folder, *, target_language='en', source_language='auto') Load a toolbox translation dictionary from disk to memory
- Parameters
dictionary_folder (str) – base directory where the dictionary is saved
target_language (str) – the language to translate into; for a list of allowed inputs, see https://cloud.google.com/translate/docs/basic/discovering-supported-languages
source_language (str) – the language of the text to translate; if None, “auto” is assumed
- Return type
- Returns
A toolbox translation dictionary
- Raises
RuntimeError – if the dictionary was found on disk but is not of a valid toolbox translation dictionary type
-
tamr_toolbox.enrichment.dictionary.
update
(main_dictionary, tmp_dictionary) Update a toolbox translation dictionary with another temporary translation dictionary
- Parameters
main_dictionary (Dict[str, TranslationDictionary]) – the main toolbox translation dictionary containing past translation results
tmp_dictionary (Dict[str, TranslationDictionary]) – a temporary toolbox translation dictionary containing new translations
Returns:
- Return type
-
tamr_toolbox.enrichment.dictionary.
convert_to_mappings
(dictionary) Transform a translation dictionary into a mapping of original phrases to translated phrases
- Parameters
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
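The idea of mapping original phrases to translated phrases can be sketched in a few lines. This is an illustration under stated assumptions, not the toolbox code: `TranslationEntry` and `to_mappings` are hypothetical names, and skipping entries that have no translation yet is a choice of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class TranslationEntry:
    """Hypothetical stand-in for a toolbox translation dictionary entry."""

    standardized_phrase: Optional[str] = None
    translated_phrase: Optional[str] = None
    detected_language: Optional[str] = None
    original_phrases: Set[str] = field(default_factory=set)


def to_mappings(dictionary: Dict[str, TranslationEntry]) -> Dict[str, str]:
    """Map every original phrase to the translation of its standardized form."""
    mappings: Dict[str, str] = {}
    for entry in dictionary.values():
        if entry.translated_phrase is None:
            # Assumption of this sketch: untranslated entries are left out.
            continue
        for original in entry.original_phrases:
            mappings[original] = entry.translated_phrase
    return mappings


dictionary = {
    "cheddar cheese": TranslationEntry(
        "cheddar cheese", "fromage cheddar", "en", {"cheddar cheese", "Cheddar Cheese"}
    ),
    "sea salt": TranslationEntry("sea salt", None, None, {"Sea Salt"}),
}
mappings = to_mappings(dictionary)
```

A mapping of this form lets each raw value in a dataset be looked up directly, without re-applying standardization at lookup time.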
-
tamr_toolbox.enrichment.dictionary.
from_dataset
(dataset) Stream a dictionary from Tamr
- Parameters
dataset (Dataset) – Tamr Dataset object
- Return type
- Returns
A toolbox translation dictionary
- Raises
ValueError – if the provided dataset is not a toolbox translation dictionary dataset
NameError – if the provided dataset does not contain all the attributes of a toolbox translation dictionary
RuntimeError – if there is any other problem while reading the dataset as a toolbox translation dictionary
-
tamr_toolbox.enrichment.dictionary.
to_dataset
(dictionary, *, dataset=None, datasets_collection=None, target_language=None, source_language=None, create_dataset=False) Ingest a toolbox dictionary into Tamr, creating the source dataset if it doesn’t exist
- Parameters
dictionary (Dict[str, TranslationDictionary]) – a toolbox translation dictionary
datasets_collection (Optional[DatasetCollection]) – a Tamr client datasets collection
target_language (Optional[str]) – the target language of the given dictionary
source_language (Optional[str]) – the source language of the given dictionary
create_dataset (bool) – flag to create or upsert to an existing translation dictionary source dataset
- Return type
- Returns
The name of the created or updated Tamr Dataset
- Raises
ValueError – if create_dataset is False and dataset is not provided or is not a toolbox translation dictionary dataset; or if create_dataset is True but datasets_collection, target_language, or source_language is missing, or the Tamr dataset already exists
RuntimeError – if there is an error during the creation of the Tamr dataset attributes