carmenproxy

Tool for proxying programmatic HTTP requests.


Keywords
data, fastexport, fastteradata, python, teradata
License
MIT
Install
pip install carmenproxy==0.1.0

Documentation

fastteradata


Tools for faster and optimized interaction with Teradata and large datasets. This was initially made to make using Teradata's fast export utility easier and to get around various spool space issues that can arise with large tables.

Installation

fastteradata can easily be installed straight from GitHub via the following:

pip install git+https://github.com/mark-hoffmann/fastteradata.git

Usage

The main task this package accomplishes right now is exporting large tables faster and more easily than trying to load them directly. This is accomplished by connecting to a Teradata account and reading metadata about the table of interest to autogenerate and execute the appropriate fastexport script. You then have the option to read the data into memory to clean it and then serialize it as a pickle file with correct data types and columns.

Update 11/10/2017: Basic fastloading capabilities have been added for doing a fastload from a pandas DataFrame.

Additional components will continue to be worked on, such as:
  • Better optimization of auto parallelization
  • The ability to join different tables for an export
  • Plus popular suggestions from others in the issues

Setup

Before you can use fastteradata, you must create a file in your home directory called ".fastteradata", such that the path is "~/.fastteradata". This file follows JSON formatting rules and an example can be found here. All fields surrounded by underscores need to be updated. You don't necessarily need two different environments.
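For illustration only, here is a minimal sketch of writing such a credential file from Python. Every field name and value below is a hypothetical placeholder; the real field names come from the linked example.

import json
import os

# Hypothetical credential layout -- the real field names come from the example
# linked above. Anything surrounded by underscores is a placeholder to replace.
credentials = {
    "test_env": {
        "server": "_TEST_SERVER_ADDRESS_",
        "username": "_YOUR_USERNAME_",
        "password": "_YOUR_PASSWORD_",
    },
    "prod_env": {
        "server": "_PROD_SERVER_ADDRESS_",
        "username": "_YOUR_USERNAME_",
        "password": "_YOUR_PASSWORD_",
    },
}

with open(os.path.expanduser("~/.fastteradata"), "w") as f:
    json.dump(credentials, f, indent=2)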


Once you have your credential file set up, you are ready to go.

Extracting a table

An example of using the module is as follows:

import fastteradata as ftd

ftd.extract_table("/absolute/path/to/output", "TABLE_NAME", "ENV_NAME", "DB_NAME", nrows=50, connector="pyodbc")
This particular call will:
  • Create data/ and pickled/ folders in the directory of your absolute path argument
  • Connect to Teradata via the connector "pyodbc"
  • Read metadata about the table
  • Generate a fast export script in your absolute path directory
  • Execute the script and populate a data file with the first 50 rows of the table into the data/ subdirectory
  • Read in the data file and attempt to clean it, inserting np.nans where appropriate and applying the appropriate typecasting
  • Save the resulting pandas dataframe into the pickled/ subdirectory
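Because the cleaned result is serialized as a standard pandas pickle, it can be reloaded later with plain pandas. The file name below is a placeholder; the actual name is generated by fastteradata inside the pickled/ subdirectory.

import pandas as pd

# Placeholder file name -- check the pickled/ subdirectory for the actual one.
df = pd.read_pickle("/absolute/path/to/output/pickled/TABLE_NAME.pkl")
print(df.dtypes)  # columns arrive already typecast
print(df.head())  # first rows of the extracted table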


Method Signatures

While this project is small, the method signatures can be found below. If it grows significantly, I will generate Sphinx docs.


extract_table(abs_path, table_name, env, db, nrows=-1, connector="teradata", columns=[], clean_and_pickle=True, partition_key="", partition_type="year")

Summary

Extracts table information from Teradata and saves / executes the appropriate files

Args

abs_path (str): Absolute path where you want your scripts to reside and where the data and pickled subdirectories will be created

table_name (str): Teradata table name you wish to query

env (str): Environment that you want to connect to as specified in your .fastteradata file. (People usually have a testing and production environment)

db (str): Database name to connect to

nrows (int): default = -1 The default of -1 means ALL rows. Otherwise, you can specify a subset of rows, such as 20

connector (str): default = 'teradata' The default uses the teradata Python module to connect to the cluster. Valid options include 'teradata' and 'pyodbc'

columns (list(str)): default = [] Subset of columns to use; the default is all of the columns found in the metadata, but a subset can be selected by passing in ['col1','col2','col3']

clean_and_pickle (bool): default = True Whether to read the resulting data file into memory, clean it, and then serialize it in your pickled subdirectory

partition_key (str): default = '' There is no partitioning by default. When you define a partition key, it MUST BE A DATE COLUMN AS DEFINED IN TERADATA. This breaks up the export into partitions by the partition_type argument. This generates multiple fexp scripts and executes them in parallel using the available cores, which helps to break up extremely large datasets or increase speed. When a partition key is defined, after all of the partition files are finished loading from Teradata, the resulting data is COMBINED into a SINGLE DATA FILE and finishes processing through the subsequent cleaning, data type specification, and serializing.

partition_type (str): default = 'year' Default is to partition the partition_key by distinct year. Valid options include "year" or "month"

Returns

Column list received from the metadata if clean_and_pickle is set to False, else nothing. Column names are returned in this case so you can save them and use them later to read the raw data file with the appropriate columns.
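To make the partitioning arguments concrete, a call like the following would split the export into one fastexport script per year. The table, environment, database, and column names are placeholders, and ORDER_DATE stands in for a DATE column as defined in Teradata.

import fastteradata as ftd

# Placeholder names; ORDER_DATE must be a DATE column as defined in Teradata.
ftd.extract_table("/absolute/path/to/output", "ORDERS_TABLE", "ENV_NAME", "DB_NAME",
                  connector="pyodbc",
                  partition_key="ORDER_DATE",
                  partition_type="year")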



load_table(abs_path, df, table_name, env, db, connector="teradata", clear_table=True)

Summary

Loads a pandas DataFrame from memory into Teradata via the optimized fastload functionality.

Args

abs_path (str): Absolute path where you want your scripts to reside and where the data and pickled subdirectories will be created (most of the time this should be the same absolute path as the extract_table abs_path)

df (pandas DataFrame): The pandas DataFrame that you want to load into Teradata

table_name (str): The desired table name

env (str): Environment that you want to connect to as specified in your .fastteradata file. (People usually have a testing and production environment)

db (str): Database name to connect to

connector (str): default = 'teradata' The default uses the teradata Python module to connect to the cluster. Valid options include 'teradata' and 'pyodbc'

clear_table (bool): default = True This specifies whether you want the table you name in your db to be dropped before loading. Right now, this is the recommended way of using this function. Otherwise, you have to be sure to have the columns exactly right, and error handling has not been robustly built out for that case yet.

Returns

Nothing
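A minimal usage sketch with placeholder names, following the signature above:

import pandas as pd
import fastteradata as ftd

# Placeholder DataFrame and names for illustration.
df = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]})

# clear_table=True drops any existing NEW_TABLE first (the recommended mode).
ftd.load_table("/absolute/path/to/output", df, "NEW_TABLE", "ENV_NAME", "DB_NAME",
               connector="teradata", clear_table=True)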

Compatibility

fastteradata currently supports Python 3.6

Licence

MIT

Authors

fastteradata was written by Mark Hoffmann.