CSV
Export
Tasks related to moving data in or out of Tamr using delimited files
- tamr_toolbox.data_io.csv.from_dataset(dataset, export_file_path, *, csv_delimiter=',', columns=None, column_name_dict=None, flatten_delimiter='|', quote_character='"', quoting=0, na_value='NaN', nrows=None, allow_dataset_refresh=False, buffer_size=10000, overwrite=False, encoding='utf-8')
Export a Tamr Dataset to a csv file. Records are streamed to disk and written in batches of buffer_size records. As a result, this is more memory-efficient than first reading the dataset into a pandas.DataFrame and then writing it to CSV.
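The buffered streaming described above can be illustrated with the standard library alone. The following is a minimal sketch, not the tamr_toolbox implementation: the helper name `write_records_buffered` and its record format (an iterable of flat dictionaries) are assumptions made for illustration.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List


def write_records_buffered(
    records: Iterable[Dict[str, str]],
    export_file_path: Path,
    columns: List[str],
    buffer_size: int = 10000,
) -> int:
    """Hypothetical helper: write records to a CSV file in batches.

    At most buffer_size records are held in memory at any time.
    Returns the total number of records written.
    """
    written = 0
    buffer: List[Dict[str, str]] = []
    with open(export_file_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for record in records:
            buffer.append(record)
            if len(buffer) >= buffer_size:
                # Buffer is full: write it out and start a new batch.
                writer.writerows(buffer)
                written += len(buffer)
                buffer.clear()
        # Write whatever remains once the stream is exhausted.
        if buffer:
            writer.writerows(buffer)
            written += len(buffer)
    return written
```

Because the records are consumed from an iterator, peak memory depends on buffer_size rather than on the size of the dataset, which is the trade-off the buffer_size parameter exposes.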
- Parameters
dataset (Dataset) – Tamr Dataset object
export_file_path (Union[Path, str]) – Path to the csv file where the dataset will be saved
csv_delimiter (str) – Delimiter of the csv file
columns (Optional[List[str]]) – Optional, ordered list of columns to write. If None, write all columns in arbitrary order.
column_name_dict (Optional[Dict[str, str]]) – Optional, dictionary in the format {<Tamr dataset column name> : <new csv column name>}, used to rename some or all columns in the output file.
flatten_delimiter (str) – Flatten list types to strings by concatenating with this delimiter
quote_character (str) – Character used to quote a value when the csv delimiter appears in that value.
quoting (int) – The escape strategy to use according to the Python csv writer. See https://docs.python.org/2/library/csv.html#csv.QUOTE_MINIMAL
na_value (str) – Value to write that represents empty or missing data. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for the na_values supported by default in pandas.read_csv
nrows (Optional[int]) – Optional, number of rows to write. If None, then write all rows.
allow_dataset_refresh (bool) – If True, allows running a job to refresh the dataset to make it streamable. Otherwise a RuntimeError is raised if the dataset is unstreamable.
buffer_size (int) – Number of records to store in memory before writing to disk
overwrite (bool) – If True and export_file_path already exists, overwrite the file. Otherwise raise an error.
encoding (str) – The encoding to use in the written file. See https://docs.python.org/3/library/functions.html#open
- Return type
- Returns
The total number of records written
- Raises
FileExistsError – if the csv file to which the dataset is to be streamed exists and overwrite is False
RuntimeError – if dataset is not streamable and allow_dataset_refresh is False
ValueError – if columns or flatten_columns contain columns that are not present in dataset, or if column renaming would yield duplicate column names
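To make the interaction of flatten_delimiter, na_value, quote_character and quoting more concrete, here is a small sketch built on Python's csv module. It is an illustration under stated assumptions, not the library's code: the helpers `flatten_value` and `render_row` are hypothetical, and treating both None and an empty list as missing data is an assumption. The default quoting=0 in the signature corresponds to csv.QUOTE_MINIMAL.

```python
import csv
import io
from typing import Any, List


def flatten_value(value: Any, flatten_delimiter: str = "|", na_value: str = "NaN") -> str:
    """Hypothetical helper: render one cell as text.

    Assumption: None and empty lists both count as missing data.
    """
    if value is None or value == []:
        return na_value
    if isinstance(value, list):
        # Join list items into a single string.
        return flatten_delimiter.join(str(item) for item in value)
    return str(value)


def render_row(
    values: List[Any],
    csv_delimiter: str = ",",
    flatten_delimiter: str = "|",
    quote_character: str = '"',
    quoting: int = csv.QUOTE_MINIMAL,
    na_value: str = "NaN",
) -> str:
    """Hypothetical helper: return one record as a line of CSV text."""
    out = io.StringIO()
    writer = csv.writer(
        out, delimiter=csv_delimiter, quotechar=quote_character, quoting=quoting
    )
    writer.writerow([flatten_value(v, flatten_delimiter, na_value) for v in values])
    return out.getvalue()


# Only the value containing the csv delimiter is quoted under QUOTE_MINIMAL.
print(render_row(["Acme, Inc.", ["red", "blue"], None]))
```

With the defaults, a value is quoted only when it contains the csv delimiter, the quote character, or a line break. Choosing a flatten_delimiter that differs from csv_delimiter keeps flattened lists from being quoted unnecessarily.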
- tamr_toolbox.data_io.csv.from_taxonomy(project, export_file_path, *, csv_delimiter=',', flatten_delimiter='|', quote_character='"', quoting=0, overwrite=False, encoding='utf-8')
Export a Tamr taxonomy to a csv file. Records are streamed to disk and written according to a given buffer size.
- Parameters
project (Project) – Tamr Project object
export_file_path (Union[Path, str]) – Path to the csv file where the taxonomy will be saved
csv_delimiter (str) – Delimiter of the csv file
flatten_delimiter (str) – Flatten list types to strings by concatenating with this delimiter
quote_character (str) – Character used to quote a value when the csv delimiter appears in that value.
quoting (int) – The escape strategy to use according to the Python csv writer. See https://docs.python.org/2/library/csv.html#csv.QUOTE_MINIMAL
overwrite (bool) – If True and export_file_path already exists, overwrite the file. Otherwise raise an error.
encoding (str) – The encoding to use in the written file. See https://docs.python.org/3/library/functions.html#open
- Return type
- Returns
The total number of records written
- Raises
FileExistsError – if export_file_path exists and overwrite is set to False
IOError – if the specified filepath does not exist or cannot be accessed
RuntimeError – if the classification project is not yet associated with a taxonomy or taxonomy cannot be written to a csv file
TypeError – if the project type is not classification
ValueError – if csv_delimiter and flatten_delimiter are identical values
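The file-related errors listed above (FileExistsError and IOError) can be anticipated on the caller's side before an export is started. The following is a rough sketch of such a pre-check. The helper `check_export_path` is hypothetical and is not part of tamr_toolbox, and the exact conditions under which the library raises these errors may differ.

```python
from pathlib import Path
from typing import Union


def check_export_path(export_file_path: Union[Path, str], overwrite: bool = False) -> Path:
    """Hypothetical pre-check mirroring the documented file errors."""
    path = Path(export_file_path)
    if path.exists() and not overwrite:
        # Documented behaviour: refuse to replace an existing file
        # unless overwrite is True.
        raise FileExistsError(f"{path} already exists and overwrite is False")
    if not path.parent.is_dir():
        # Assumption: a missing parent directory is one way the target
        # path can be inaccessible.
        raise IOError(f"Directory {path.parent} does not exist")
    return path
```

Running a check like this first lets a script fail fast, before any request is made to Tamr.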