rsdf

Redshift interface to Pandas DataFrames


Keywords: redshift, pandas, upsert, aws, data-frame
License: MIT
Install: pip install rsdf==0.9.0



A set of utilities for connecting Pandas DataFrames and Redshift. Importing this module adds a new function to the DataFrame object. Inspired by josepablog's gist.

Installation

To install rsdf, simply use pip:

$ pip install rsdf

If you still depend on the older version, you can install it with pip from a pinned commit:

$ pip install git+https://github.com/bufferapp/rsdf.git@d1a5feca220cef9ba7da16da57a746dfb24ee8d7

Usage

Once rsdf is imported, DataFrame objects will have new functions:

import pandas as pd
import rsdf


engine_string = 'redshift://user:password@endpoint:port/db'

users = pd.read_sql_query('select * from users limit 10', engine_string)

users['money'] = users['money'] * 42

# Write it back to Redshift
users.to_redshift(
    table_name='users',
    schema='public',
    engine=engine_string,
    s3_bucket='users-data',
    s3_key='rich_users.gzip',
    if_exists='update',
    primary_key='id'
)
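
The if_exists='update' and primary_key='id' arguments above describe an upsert. This README does not say how rsdf performs it. A common Redshift pattern is to load the new rows into a staging table, delete the matching rows from the target, and insert the staged rows. The sketch below only builds the SQL for that pattern; the function name build_upsert_statements and the staging table name are hypothetical and not part of rsdf.

```python
def build_upsert_statements(table_name, schema, primary_key, staging_table):
    """Hypothetical helper: SQL for a staging-table upsert (not rsdf's API)."""
    target = "{}.{}".format(schema, table_name)
    return [
        # Temporary table with the same columns as the target.
        "CREATE TEMP TABLE {} (LIKE {});".format(staging_table, target),
        # (A COPY from S3 into the staging table would run here.)
        # Remove target rows that are about to be replaced.
        "DELETE FROM {t} USING {s} WHERE {t}.{pk} = {s}.{pk};".format(
            t=target, s=staging_table, pk=primary_key
        ),
        # Insert both the updated rows and the brand-new rows.
        "INSERT INTO {} SELECT * FROM {};".format(target, staging_table),
        "DROP TABLE {};".format(staging_table),
    ]


statements = build_upsert_statements("users", "public", "id", "users_staging")
for statement in statements:
    print(statement)
```

Running the delete and the insert inside one transaction keeps readers from seeing the table with the old rows removed and the new ones not yet inserted.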

Alternatively, if no engine is provided, the rsdf module will try to figure out the engine string from the following environment variables:

  • REDSHIFT_USER
  • REDSHIFT_PASSWORD
  • REDSHIFT_ENDPOINT
  • REDSHIFT_DB_NAME
  • REDSHIFT_DB_PORT
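
As a rough illustration of how these variables could be combined into the engine string format shown earlier, here is a minimal sketch. The helper name engine_string_from_env is hypothetical, and rsdf's own logic may differ (for example in how it escapes the password).

```python
import os
from urllib.parse import quote_plus


def engine_string_from_env(env=None):
    """Hypothetical helper: build a Redshift engine string from env variables."""
    env = os.environ if env is None else env
    return "redshift://{user}:{password}@{endpoint}:{port}/{db}".format(
        user=env["REDSHIFT_USER"],
        # Escape characters such as '@' that would otherwise break the URL.
        password=quote_plus(env["REDSHIFT_PASSWORD"]),
        endpoint=env["REDSHIFT_ENDPOINT"],
        port=env["REDSHIFT_DB_PORT"],
        db=env["REDSHIFT_DB_NAME"],
    )


# Example with made-up values; pass nothing to read from os.environ instead.
sample_env = {
    "REDSHIFT_USER": "alice",
    "REDSHIFT_PASSWORD": "p@ss",
    "REDSHIFT_ENDPOINT": "example.com",
    "REDSHIFT_DB_NAME": "analytics",
    "REDSHIFT_DB_PORT": "5439",
}
print(engine_string_from_env(sample_env))
```

A missing variable raises a KeyError here, which makes an incomplete environment fail early rather than produce a malformed connection string.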

Because rsdf uploads the data to S3 and then runs a COPY command to load it into Redshift, you also need to provide these two AWS variables (or have them set in the environment):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
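
To show where these credentials end up, the sketch below builds a Redshift COPY statement for a gzipped CSV file in S3. It is an illustration only: build_copy_statement is a hypothetical name, and the file-format options (CSV, GZIP, IGNOREHEADER 1) are assumptions rather than the exact options rsdf uses.

```python
import os


def build_copy_statement(table_name, schema, s3_bucket, s3_key, env=None):
    """Hypothetical helper: COPY statement loading an S3 file into Redshift."""
    env = os.environ if env is None else env
    credentials = "aws_access_key_id={};aws_secret_access_key={}".format(
        env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    )
    return (
        "COPY {schema}.{table} FROM 's3://{bucket}/{key}' "
        "CREDENTIALS '{credentials}' "
        # Assumed format: gzipped CSV with one header row.
        "CSV GZIP IGNOREHEADER 1;"
    ).format(
        schema=schema,
        table=table_name,
        bucket=s3_bucket,
        key=s3_key,
        credentials=credentials,
    )


# Example with placeholder credentials; never log real keys.
sample_env = {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "secret"}
print(build_copy_statement("users", "public", "users-data", "rich_users.gzip", sample_env))
```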

License

MIT © Buffer