CSV

Export

Tasks related to moving data into or out of Tamr using delimited files.

tamr_toolbox.data_io.csv.from_dataset(dataset, export_file_path, *, csv_delimiter=',', columns=None, column_name_dict=None, flatten_delimiter='|', quote_character='"', quoting=0, na_value='NaN', nrows=None, allow_dataset_refresh=False, buffer_size=10000, overwrite=False, encoding='utf-8')

Export a Tamr Dataset to a csv file. Records are streamed to disk and written in batches of buffer_size records. As a result, this is more memory efficient than first reading the dataset into a pandas.DataFrame and then writing it to CSV.
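The buffered write pattern described above can be sketched with the standard library alone. This is only an illustration of the idea, not the tamr_toolbox implementation: the helper name write_in_batches and the sample records are hypothetical, and an in-memory io.StringIO stands in for the file on disk.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def write_in_batches(
    records: Iterable[Dict[str, str]],
    out: TextIO,
    columns: List[str],
    buffer_size: int = 10000,
) -> int:
    """Write records as CSV, flushing the buffer every `buffer_size` rows.

    Hypothetical helper for illustration; not part of tamr_toolbox.
    Returns the total number of records written.
    """
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    buffer: List[Dict[str, str]] = []
    written = 0
    for record in records:
        buffer.append(record)
        # Only `buffer_size` records are held in memory at any time.
        if len(buffer) >= buffer_size:
            writer.writerows(buffer)
            written += len(buffer)
            buffer.clear()
    # Write whatever is left after the stream is exhausted.
    if buffer:
        writer.writerows(buffer)
        written += len(buffer)
    return written


# A generator stands in for a streamed dataset (assumed sample data).
rows = ({"id": str(i), "name": f"record {i}"} for i in range(25))
sink = io.StringIO()
total = write_in_batches(rows, sink, ["id", "name"], buffer_size=10)
```

Because the input is consumed lazily, peak memory depends on buffer_size rather than on the size of the dataset.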

Parameters

dataset (Dataset) – Tamr Dataset object
export_file_path (Union[Path, str]) – Path to the csv file where the dataset will be saved
csv_delimiter (str) – Delimiter of the csv file
columns (Optional[List[str]]) – Optional, Ordered list of columns to write. If None, write all columns in arbitrary order.
column_name_dict (Optional[Dict[str, str]]) – Optional, Dictionary in the format {<Tamr dataset column name> : <new csv column name>}, used to rename some or all columns in the output file.
flatten_delimiter (str) – Flatten list types to strings by concatenating with this delimiter
quote_character (str) – Character used to quote a value that contains the csv delimiter.
quoting (int) – The escape strategy to use according to the Python csv writer. The default, 0, corresponds to csv.QUOTE_MINIMAL. See https://docs.python.org/2/library/csv.html#csv.QUOTE_MINIMAL
na_value (str) – Value to write that represents empty or missing data. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for the na_values supported by default in pandas.read_csv
nrows (Optional[int]) – Optional, Number of rows to write. If None, then write all rows.
allow_dataset_refresh (bool) – If True, allows running a job to refresh the dataset so that it becomes streamable. Otherwise a RuntimeError is raised if the dataset is unstreamable.
buffer_size (int) – Number of records to store in memory before writing to disk
overwrite (bool) – If True and export_file_path already exists, overwrite the file. Otherwise raise an error.
encoding (str) – The encoding to use in the written file. See https://docs.python.org/3/library/functions.html#open
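As a rough illustration of how the formatting parameters interact, the sketch below uses Python's csv module directly. The helper flatten_value and the sample rows are hypothetical, and the handling of missing values (None written as na_value) is an assumption made for the example rather than a statement about how tamr_toolbox treats them.

```python
import csv
import io
from typing import List, Union

Value = Union[str, List[str], None]


def flatten_value(value: Value, flatten_delimiter: str = "|", na_value: str = "NaN") -> str:
    """Turn one cell into a string. Hypothetical helper, not part of tamr_toolbox."""
    if value is None:
        # Assumption for this sketch: missing data is written as na_value.
        return na_value
    if isinstance(value, list):
        # List types are concatenated with the flatten delimiter.
        return flatten_delimiter.join(value)
    return value


sink = io.StringIO()
# quoting=csv.QUOTE_MINIMAL (integer value 0) quotes only the fields that need it.
writer = csv.writer(sink, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(["id", "tags", "city"])
writer.writerow([flatten_value(v) for v in ["1", ["a", "b"], None]])
writer.writerow([flatten_value(v) for v in ["2", ["x,y"], "Boston, MA"]])
output_lines = sink.getvalue().splitlines()
```

With minimal quoting, only the values that contain the csv delimiter are wrapped in the quote character; the flattened list a|b is written unquoted because | is not the csv delimiter.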

Return type

int

Returns

The total number of records written

Raises

FileExistsError – if the csv file to which the dataset is to be streamed exists and overwrite is False
RuntimeError – if dataset is not streamable and allow_dataset_refresh is False
ValueError – if columns or flatten_columns contain columns that are not present in dataset, or if column renaming would yield duplicate column names
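The FileExistsError and ValueError conditions listed above can be pictured as pre-flight checks that run before any record is written. The following is a minimal sketch under that assumption; check_export_request and its dataset_columns argument are hypothetical names, and the actual order and wording of the checks in tamr_toolbox may differ.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def check_export_request(
    export_file_path: Path,
    dataset_columns: List[str],
    columns: Optional[List[str]] = None,
    column_name_dict: Optional[Dict[str, str]] = None,
    overwrite: bool = False,
) -> List[str]:
    """Validate an export request and return the output header.

    Hypothetical helper for illustration; not part of tamr_toolbox.
    """
    if export_file_path.exists() and not overwrite:
        raise FileExistsError(f"{export_file_path} already exists and overwrite is False")
    selected = list(dataset_columns) if columns is None else list(columns)
    missing = [c for c in selected if c not in dataset_columns]
    if missing:
        raise ValueError(f"Columns not present in dataset: {missing}")
    renames = column_name_dict or {}
    # Columns without an entry in the rename dictionary keep their name.
    header = [renames.get(c, c) for c in selected]
    if len(set(header)) != len(header):
        raise ValueError(f"Column renaming would yield duplicate names: {header}")
    return header


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "export.csv"
    header = check_export_request(
        target,
        ["id", "name", "city"],
        columns=["id", "name"],
        column_name_dict={"name": "full_name"},
    )
    # Renaming "a" to "b" collides with the existing column "b".
    try:
        check_export_request(target, ["a", "b"], column_name_dict={"a": "b"})
        duplicate_rejected = False
    except ValueError:
        duplicate_rejected = True
```

Running the checks up front means an invalid request fails before a partially written file is left on disk.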

tamr_toolbox.data_io.csv.from_taxonomy(project, export_file_path, *, csv_delimiter=',', flatten_delimiter='|', quote_character='"', quoting=0, overwrite=False, encoding='utf-8')

Export a Tamr taxonomy to a csv file. Records are streamed to disk and written according to a given buffer size.