Hemlock
Hemlock is an open-source project exploring ways to create a common data access layer that eliminates the need to understand underlying data topologies while still preserving the requirements of each data source, such as access control, performance, and formats.
Install instructions
Option A, install using pip:
sudo pip install hemlock
Option B, build from source:
git clone https://github.com/Lab41/Hemlock.git
cd Hemlock
sudo python setup.py install
Required Dependencies
Python modules:
- MySQLdb
- texttable
- couchbase >= 1.0
- APScheduler
Build a server running MySQL to store user accounts, tenants, and registered systems.
Build a Couchbase 2.0 cluster to store metadata and data of registered systems.
Build an ElasticSearch 0.90.2 cluster to store the index of Couchbase.
Add XDCR one-way replication from Couchbase to ElasticSearch using the Couchbase transport plugin (transport-couchbase; note: use version 1.1.0).
Once the plugin is installed, be sure to update couchbase_template.json under plugins/transport-couchbase/ so that it contains the following:
{
    "template" : "*",
    "order" : 10,
    "mappings" : {
        "couchbaseCheckpoint" : {
            "_source" : {
                "includes" : ["doc.*"]
            },
            "date_detection" : false,
            "dynamic_templates": [
                {
                    "store_no_index": {
                        "match": "*",
                        "mapping": {
                            "store" : "no",
                            "index" : "no",
                            "include_in_all" : false
                        }
                    }
                }
            ]
        },
        "_default_" : {
            "_source" : {
                "includes" : ["meta.*"]
            },
            "date_detection" : false,
            "properties" : {
                "meta" : {
                    "type" : "object",
                    "include_in_all" : false
                }
            }
        }
    }
}
Once that is added, start up ElasticSearch with bin/elasticsearch and then run the following the first time:
curl -XPUT http://localhost:9200/_template/couchbase -d @plugins/transport-couchbase/couchbase_template.json
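Where curl is not available, the same template registration could be sketched with the Python 3 standard library. This is only an illustration of the HTTP call above and is not part of Hemlock: the function name build_template_request is hypothetical, and the host and path are simply copied from the curl command.

```python
import json
import urllib.request


def build_template_request(template_path, host="http://localhost:9200"):
    """Build (but do not send) the PUT request that registers the template.

    Hypothetical helper for illustration; the URL mirrors the curl command.
    """
    with open(template_path, "rb") as handle:
        body = handle.read()
    # Fail early with ValueError if the file is not valid JSON.
    json.loads(body.decode("utf-8"))
    return urllib.request.Request(
        url=host + "/_template/couchbase",
        data=body,
        method="PUT",
    )
```

Passing the returned request to urllib.request.urlopen would send it to a running ElasticSearch node; building the request separately makes it possible to inspect it first.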
Installing required databases
- Create database hemlock in MySQL.
- Create bucket hemlock in Couchbase.
- Create index hemlock in ElasticSearch.
Getting started
- Create Hemlock credentials (see 'Credential files')
    HEMLOCK_MYSQL_SERVER=192.168.1.10
    HEMLOCK_MYSQL_USERNAME=user
    HEMLOCK_MYSQL_DB=hemlock
    HEMLOCK_MYSQL_PW=pass
    HEMLOCK_COUCHBASE_SERVER=192.168.1.20
    HEMLOCK_COUCHBASE_BUCKET=hemlock
    HEMLOCK_COUCHBASE_USERNAME=hemlock
    HEMLOCK_COUCHBASE_PW=pass
    HEMLOCK_ELASTICSEARCH_ENDPOINT=192.168.1.30
  (if you'd like these to persist, consider adding export before each line and running source on the file)
- Create a tenant, role, user, and data source system
    hemlock tenant-create --name Project1
    hemlock tenant-list
    hemlock role-create --name User
    hemlock role-list
    hemlock user-create --name User1 \
        --username Username1 \
        --email user1@email.com \
        --role_id 42ba73f9-0ab6-4a50-908c-1585955754f4 \
        --tenant_id 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6
    hemlock user-list
    hemlock register-local-system --name System1 \
        --data_type csv \
        --description "description" \
        --tenant_id 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6 \
        --hostname system1.fqdn \
        --endpoint http://hemlock.server/ \
        --poc_name user1 \
        --poc_email user1@email.com
    hemlock system-list
- Add credentials for the data source system, for example mysql_creds:
    MYSQL_SERVER=192.168.1.30
    MYSQL_DB=db1
    #MYSQL_TABLE=table1
    MYSQL_USERNAME=user
    MYSQL_PW=pass
- Store a client
    hemlock client-store --name mysql_client_1 \
        --type mysql \
        --system_id 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6 \
        --credential_file /path/to/mysql_creds
    hemlock client-list
- Add credentials for Hemlock
    hemlock hemlock-server-store --credential_file /path/to/hemlock_creds
- Create a schedule server (optional)
    hemlock schedule-server-create --name schedule_server_1
    hemlock schedule-server-list
- Add a schedule for the data source system to run (optional)
    hemlock client-schedule --name schedule1 \
        --minute "54" \
        --hour "12" \
        --day_of_month "*" \
        --month "*" \
        --day_of_week "*" \
        --client_id 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6 \
        --schedule_server_id 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6
    hemlock schedule-list
- Perform a test run for pulling data from the data source system
    hemlock client-run --uuid 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6
- Search for data that has been loaded into Hemlock
    hemlock query-data --user 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6 --query foo
  or query ElasticSearch directly:
    http://elasticsearch.fqdn:9200/hemlock/_search?q=foo
  which returns something like the following:
    {
        "took": 14,
        "timed_out": false,
        "_shards": {
            "total": 20,
            "successful": 20,
            "failed": 0
        },
        "hits": {
            "total": 1,
            "max_score": 3.6582048,
            "hits": [
                {
                    "_index": "hemlock",
                    "_type": "couchbaseDocument",
                    "_id": "865f458b4421ae5fd758e3c81aca9f8d8b4696b6",
                    "_score": 3.6582048,
                    "_source": {
                        "meta": {
                            "id": "865f458b4421ae5fd758e3c81aca9f8d8b4696b6",
                            "rev": "1-0010f1ac6045ccf40000000000000000",
                            "flags": 0,
                            "expiration": 0
                        }
                    }
                }
            ]
        }
    }
  Now we can feed the 'id' into Couchbase to return the full document:
    http://couchbase.fqdn:8092/hemlock/865f458b4421ae5fd758e3c81aca9f8d8b4696b6
  which returns something like the following:
    {
        "hemlock-system": "a50b86c2-59f7-42a3-aa67-3367579189fe",
        "hemlock-date": "2013-09-03 16:10:20",
        "stream": "DOYLIE"
    }
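The two-step lookup in the last item (search ElasticSearch, then fetch each hit from Couchbase by its 'id') could be scripted roughly as follows. This is a minimal Python 3 sketch, not Hemlock's own query code: all function names are hypothetical, and the ports, index name, and bucket name are assumptions taken from the example URLs above.

```python
import json
import urllib.parse
import urllib.request


def search_url(es_host, query, index="hemlock"):
    """Build the ElasticSearch search URL (port 9200 as in the example)."""
    params = urllib.parse.urlencode({"q": query})
    return "http://%s:9200/%s/_search?%s" % (es_host, index, params)


def document_ids(search_response):
    """Pull the Couchbase document ids out of a parsed search response."""
    return [hit["_source"]["meta"]["id"]
            for hit in search_response["hits"]["hits"]]


def document_url(couchbase_host, doc_id, bucket="hemlock"):
    """Build the Couchbase document URL (port 8092 as in the example)."""
    quoted = urllib.parse.quote(doc_id, safe="")
    return "http://%s:8092/%s/%s" % (couchbase_host, bucket, quoted)


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_documents(es_host, couchbase_host, query):
    """Search the index, then fetch every matching document from Couchbase."""
    hits = fetch_json(search_url(es_host, query))
    return [fetch_json(document_url(couchbase_host, doc_id))
            for doc_id in document_ids(hits)]
```

Under these assumptions, fetch_documents("elasticsearch.fqdn", "couchbase.fqdn", "foo") would return a list of full documents like the one shown above. Note that this path talks to the clusters directly, unlike hemlock query-data, which takes a --user argument.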
Credential files
- Create a hemlock_creds file (see hemlock_creds_sample for an example):
    HEMLOCK_MYSQL_SERVER=192.168.1.10
    HEMLOCK_MYSQL_USERNAME=user
    HEMLOCK_MYSQL_DB=hemlock
    HEMLOCK_MYSQL_PW=pass
    HEMLOCK_COUCHBASE_SERVER=192.168.1.20
    HEMLOCK_COUCHBASE_BUCKET=hemlock
    HEMLOCK_COUCHBASE_USERNAME=hemlock
    HEMLOCK_COUCHBASE_PW=pass
- Create credential files for each client you intend to use (examples).
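To check a credential file before handing it to Hemlock, a small parser along these lines may help. It is a sketch only and does not reproduce how Hemlock itself reads these files: it assumes one KEY=VALUE pair per line, treats lines starting with # as comments (as in the mysql_creds example), and tolerates an optional export prefix. The name parse_credentials is hypothetical.

```python
def parse_credentials(text):
    """Parse KEY=VALUE lines into a dict (assumed file format, see lead-in)."""
    creds = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and commented-out entries such as #MYSQL_TABLE=table1.
        if not line or line.startswith("#"):
            continue
        # Allow the "export KEY=VALUE" form used when sourcing the file.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a KEY=VALUE line; ignore it
        creds[key.strip()] = value.strip()
    return creds
```

A quick sanity check could then compare the parsed keys against the ones listed above, for example confirming that HEMLOCK_MYSQL_SERVER and HEMLOCK_COUCHBASE_BUCKET are present.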
Currently supported data sources
Technology | Parameter | Python Module Dependencies |
---|---|---|
MySQL | mysql | MySQLdb |
MongoDB | mongo | pymongo |
Redis | redis | redis |
Local FileSystem | fs | magic, pdfminer, xmltodict |
RESTful API | rest | |
Streams | stream_odd | |
Adding a new data source type
Create a new class under the clients folder for each new data source type. Most classes will need two methods defined: connect_client and get_data.
The following is a template that can be used to work from:
class HMyclient:
    def connect_client(self, client_dict):
        # Build and return a handle that can be used to get data from the
        # data source. None is a placeholder; replace it with a real connection.
        c_server = None
        return c_server

    def get_data(self, client_dict, c_server, h_server, client_uuid):
        # data_list is an array of arrays that contains the data
        data_list = [[]]
        # desc_list is an array that contains the schema (if it exists or is known)
        desc_list = []
        return data_list, desc_list
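As an illustration of filling in the template, the following sketch reads rows from a CSV file. It is a hypothetical example rather than one of Hemlock's bundled clients: the class name HCsvExample and the CSV_PATH key in client_dict are assumptions made for this sketch, and it is written for Python 3.

```python
import csv


class HCsvExample:
    """Hypothetical client that follows the template; not part of Hemlock."""

    def connect_client(self, client_dict):
        # "CSV_PATH" is an assumed credential key used only in this sketch.
        # The open file object serves as the handle to the data source.
        return open(client_dict["CSV_PATH"], newline="")

    def get_data(self, client_dict, c_server, h_server, client_uuid):
        # h_server and client_uuid are accepted to match the template
        # signature but are not used in this sketch.
        with c_server:
            rows = list(csv.reader(c_server))
        if not rows:
            return [[]], []
        desc_list = rows[0]   # header row doubles as the schema
        data_list = rows[1:]  # remaining rows are the data
        return data_list, desc_list
```

Treating the header row as desc_list is a design choice for this example; a source without a known schema could return an empty desc_list, as the template does.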
Usage examples
- Create a tenant
    hemlock tenant-create --name Project1
- Create a role
    hemlock role-create --name User
- Create a user
    hemlock user-create --name User1 \
        --username Username1 \
        --email user1@email.com \
        --role_id 42ba73f9-0ab6-4a50-908c-1585955754f4 \
        --tenant_id 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6
- Register a local system
    hemlock register-local-system --name System1 \
        --data_type csv \
        --description "description" \
        --tenant_id 7d0f6b0d-334a-4d89-bd1a-70e8e1c04aa6 \
        --hostname system1.fqdn \
        --endpoint http://hemlock.server/ \
        --poc_name user1 \
        --poc_email user1@email.com
- List registered systems
    hemlock system-list
- List created users
    hemlock user-list
- List created tenants
    hemlock tenant-list
- Connecting to a client
- Full CLI API list
Related repositories
Documentation
Tests
The tests for this project use py.test.
Contributing to Hemlock
Want to contribute? Awesome! Issue a pull request or see more details here.