metta-data
Train Matrix and Test Matrix Storage
Description
Python library for storing and recalling meta data, and DataFrames of training and testing sets.
Installation
To get the latest stable version:
pip install metta-data
To get the current master branch:
pip install git+git://github.com/dssg/metta-data.git
How-to
metta
expects you to hand it a dictionary for each dataframe with the following keys:
-
beginning_of_time
(date.datetime): The earliest time that enters your covariate calculations. -
end_time
(date.dateime): The last time that enters your covariate calculations. -
label_window
(str): The length of the labeling window you are using in this matrix eg: '1y', '6m' -
label_name
(str): The outcome variable's column name. This column must be in the last position in your dataframe. -
matrix_id
(str): Human readable id for the dataset
Storing a train and test pair
import metta
train_config = {'beginning_of_time': datetime.date(2012, 12, 20),
'end_time': datetime.date(2016, 12, 20),
'label_window': '3m',
'label_name': 'inspection_1yr',
'label_type': 'binary',
'matrix_id': 'CDPH_2012',
'feature_names': ['break_last_3yr', 'soil', 'pressure_zone'],
'indices': ['entity_id', 'as_of_date'] }
test_config = {'beginning_of_time': datetime.date(2015, 12, 20),
'end_time': datetime.date(2016, 12, 21),
'label_window': '3m',
'label_name': 'inspection_1yr',
'label_type': 'binary'
'matrix_id': 'CDPH_2015',
'feature_names': ['break_last_3yr', 'soil', 'pressure_zone'],
'inidces': ['entity_id', 'as_of_date'] }
metta.archive_train_test(train_config,
X_train,
test_config,
X_test,
directory='./old_matrices',
format='hd5',
overwrite=False)
Storing a train and multiple test sets
import metta
train_config = {'beginning_of_time': datetime.date(2012, 12, 20),
'end_time': datetime.date(2016, 12, 20),
'label_window': '3m',
'label_name': 'inspection_1yr',
'label_type': 'binary',
'matrix_id': 'CDPH_2012',
'feature_names': ['break_last_3yr', 'soil', 'pressure_zone'],
'indices': ['entity_id', 'as_of_date'] }
base_test_config = {'beginning_of_time': datetime.date(2015, 12, 20),
'end_time': datetime.date(2016, 12, 21),
'label_window': '3m',
'label_name': 'inspection_1yr',
'label_type': 'binary',
'matrix_id': 'CDPH_2015',
'feature_names': ['break_last_3yr', 'soil', 'pressure_zone'],
'indices': ['entity_id', 'as_of_date']}
train_uuid = metta.archive_matrix(train_config, X_train, directory='./matrices')
test_uuids = []
for years in range(1, 5):
test_config = base_test_config.copy()
test_config['beginning_of_time'] += relativedelta(years=years)
test_config['end_time'] += relativedelta(years=years)
test_config['matrix_id'] = 'CDPH_{}'.format(test_config['end_time'].year)
test_uuids.append(metta.archive_matrix(
test_config,
df_data,
directory='./matrices',
overwrite=False,
format='csv',
train_uuid=train_uuid
))
Uploading to S3
dict_config = yaml.load(open('aws_keys.yaml'))
metta.upload_to_s3(access_key_id=dict_config['AWSAccessKey'],
secret_access_key=dict_config['AWSSecretKey'],
bucket=dict_config['Bucket'],
folder=dict_config['Folder'],
directory='./old_matrices')