CSV

Export

Tasks related to moving data in or out of Tamr using delimited files

tamr_toolbox.data_io.csv.from_dataset(dataset, export_file_path, *, csv_delimiter=',', columns=None, flatten_delimiter='|', quote_character='"', quoting=0, na_value='NaN', nrows=None, allow_dataset_refresh=False, buffer_size=10000, overwrite=False)

Export a Tamr Dataset to a csv file. Records are streamed to disk and written in batches of buffer_size records. As a result, this is more memory efficient than first reading the dataset into a pandas.DataFrame and then writing it to CSV.
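The buffered streaming described above can be pictured with a minimal standard-library sketch. This is not the toolbox's implementation: the helper names flatten and stream_to_csv are hypothetical, records are assumed to be plain dicts, and treating None as the missing value is an assumption made for illustration.

```python
import csv
from itertools import islice


def flatten(value, flatten_delimiter="|", na_value="NaN"):
    """Turn one cell into a string: join lists, substitute na_value for None."""
    if value is None:  # assumption: None stands for missing data
        return na_value
    if isinstance(value, list):
        return flatten_delimiter.join(str(item) for item in value)
    return str(value)


def stream_to_csv(records, path, columns, buffer_size=10000):
    """Write dict records to path, holding at most buffer_size rows in memory."""
    records = iter(records)
    written = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(columns)  # header row, not counted as a record here
        while True:
            # Pull the next batch of at most buffer_size records.
            buffer = list(islice(records, buffer_size))
            if not buffer:
                break
            writer.writerows([flatten(rec.get(col)) for col in columns] for rec in buffer)
            written += len(buffer)
    return written
```

Because only one batch is materialized at a time, memory use is bounded by buffer_size rather than by the size of the dataset.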

Parameters
  • dataset (Dataset) – Tamr Dataset object

  • export_file_path (Union[Path, str]) – Path to the csv file where the dataset will be saved

  • csv_delimiter (str) – Delimiter of the csv file

  • columns (Optional[List[str]]) – Optional, ordered list of columns to write. If None, all columns are written in arbitrary order.

  • flatten_delimiter (str) – Flatten list types to strings by concatenating with this delimiter

  • quote_character (str) – Character used to quote a value that contains the csv delimiter.

  • quoting (int) – The quoting strategy passed to the Python csv writer. The default, 0, is csv.QUOTE_MINIMAL. See https://docs.python.org/2/library/csv.html#csv.QUOTE_MINIMAL

  • na_value (str) – Value to write that represents empty or missing data. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for the na_values supported by default in pandas.read_csv

  • nrows (Optional[int]) – Optional, number of rows to write. If None, all rows are written.

  • allow_dataset_refresh (bool) – If True, allow a job to run that refreshes the dataset so that it becomes streamable. Otherwise, a RuntimeError is raised if the dataset is unstreamable.

  • buffer_size (int) – Number of records to store in memory before writing to disk

  • overwrite (bool) – If True and export_file_path already exists, overwrite the file. Otherwise, raise an error.

Return type

int

Returns

The total number of records written

Raises
  • FileExistsError – if the csv file to which the dataset is to be streamed exists and overwrite is False

  • RuntimeError – if dataset is not streamable and allow_dataset_refresh is False

  • ValueError – if columns or flatten_columns contain columns that are not present in dataset
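A call using the parameters documented above might look like the following sketch. The wrapper name export_dataset is hypothetical, dataset is assumed to be a Tamr Dataset object obtained elsewhere, and the tbox alias with the tbox.data_io.csv access path is an assumption based on the module path shown in the signature.

```python
import csv
from pathlib import Path


def export_dataset(dataset, export_file_path, columns=None):
    """Export dataset to CSV and return the number of records written."""
    # Imported here so the sketch can be loaded without tamr_toolbox installed.
    import tamr_toolbox as tbox

    export_file_path = Path(export_file_path)
    try:
        return tbox.data_io.csv.from_dataset(
            dataset,
            export_file_path,
            csv_delimiter=",",
            columns=columns,            # None writes all columns in arbitrary order
            flatten_delimiter="|",      # list values are joined with this delimiter
            quoting=csv.QUOTE_MINIMAL,  # same as the default value 0
            na_value="",                # write missing values as empty fields
            allow_dataset_refresh=False,
            buffer_size=10000,
            overwrite=False,
        )
    except FileExistsError:
        # Documented behavior when the file exists and overwrite is False.
        print(f"{export_file_path} already exists; pass overwrite=True to replace it")
        raise
```

Leaving overwrite=False is the conservative choice: an existing export is never replaced silently, and the caller decides how to handle the FileExistsError.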

tamr_toolbox.data_io.csv.from_taxonomy(project, export_file_path, *, csv_delimiter=',', flatten_delimiter='|', quote_character='"', quoting=0, overwrite=False)

Export a Tamr taxonomy to a csv file. Records are streamed to disk and written according to a given buffer size.

Parameters
  • project (Project) – Tamr Project object

  • export_file_path (Union[Path, str]) – Path to the csv file where the taxonomy will be saved

  • csv_delimiter (str) – Delimiter of the csv file

  • flatten_delimiter (str) – Flatten list types to strings by concatenating with this delimiter

  • quote_character (str) – Character used to quote a value that contains the csv delimiter.

  • quoting (int) – The quoting strategy passed to the Python csv writer. The default, 0, is csv.QUOTE_MINIMAL. See https://docs.python.org/2/library/csv.html#csv.QUOTE_MINIMAL

  • overwrite (bool) – If True and export_file_path already exists, overwrite the file. Otherwise, raise an error.

Return type

int

Returns

The total number of records written

Raises
  • FileExistsError – if export_file_path exists and overwrite is set to False

  • IOError – if the specified filepath does not exist or cannot be accessed

  • RuntimeError – if the classification project is not yet associated with a taxonomy or taxonomy cannot be written to a csv file

  • TypeError – if the project type is not classification

  • ValueError – if csv_delimiter and flatten_delimiter are identical values
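The documented exceptions suggest one way to call from_taxonomy defensively, sketched below. The wrapper name export_taxonomy is hypothetical, project is assumed to be a Tamr Project object obtained elsewhere, and the tbox alias with the tbox.data_io.csv access path is an assumption based on the module path shown in the signature.

```python
def export_taxonomy(project, export_file_path):
    """Export the taxonomy of project to CSV; return the record count, or None on failure."""
    # Imported here so the sketch can be loaded without tamr_toolbox installed.
    import tamr_toolbox as tbox

    try:
        return tbox.data_io.csv.from_taxonomy(
            project,
            export_file_path,
            csv_delimiter=",",
            flatten_delimiter="|",  # must differ from csv_delimiter
            overwrite=True,         # replace an earlier export at the same path
        )
    except TypeError:
        # Documented: raised when the project type is not classification.
        print("Not a classification project; nothing exported")
        return None
    except RuntimeError as exc:
        # Documented: no taxonomy is associated yet, or it could not be written.
        print(f"Taxonomy export failed: {exc}")
        return None
```

Returning None instead of re-raising keeps a batch job running over many projects from stopping at the first non-classification project; re-raise instead if a failed export should halt the job.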