embulk-input-azure_blob_storage

Reads files stored on Microsoft Azure Blob Storage.


Keywords
azure, azure-storage, embulk, embulk-input-plugin, embulk-plugin
License
Apache-2.0
Install
gem install embulk-input-azure_blob_storage -v 0.3.1

Documentation

Azure Blob Storage file input plugin for Embulk

An Embulk file input plugin that reads files stored on Microsoft Azure Blob Storage.

embulk-input-azure_blob_storage v0.2.0+ requires Embulk v0.9.12+

Overview

  • Plugin type: file input
  • Resume supported: no
  • Cleanup supported: yes

Configuration

First, create an Azure Storage account.

  • account_name: storage account name (string, required)
  • account_key: primary access key (string, required)
  • container: name of the container where the data is stored (string, required)
  • path_prefix: prefix of target keys (string, required)
  • incremental: enables incremental loading (boolean, optional, default: true). If incremental loading is enabled, the config diff for the next execution includes a last_path parameter so that the next execution skips files before that path. Otherwise, last_path is not included.
  • path_match_pattern: regexp to match file paths. If a file path doesn't match this pattern, the file is skipped (regexp string, optional)
  • total_file_count_limit: maximum number of files to read (integer, optional)
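
The optional parameters above can be combined with the required ones. The following is a minimal sketch, not a recommended setup: the account, container, and prefix are the placeholder values used in the examples in this README, and the limit of 100 is an arbitrary value chosen for illustration.

```yaml
in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  incremental: false            # do not include last_path in the config diff
  path_match_pattern: \.csv$    # skip files whose path does not match
  total_file_count_limit: 100   # arbitrary illustrative limit: read at most 100 files
```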

Proxy configuration

  • proxy:
    • type: (string, required, default: null)
      • http: use HTTP Proxy
    • host: (string, required)
    • port: (int, required, default: 8080)
    • user: (string, optional)
    • password: (string, optional)

Example

in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
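
With incremental loading left at its default (enabled), the config diff for the next execution carries a last_path parameter. As a rough sketch, assuming the run is started with Embulk's config diff option (for example `embulk run config.yml -c diff.yml`), the saved diff might look like the fragment below. The file name diff.yml and the path value are hypothetical, and the exact layout may vary between Embulk versions.

```yaml
# diff.yml (illustrative only; the path value is hypothetical)
in:
  last_path: logs/csv-20160101.csv   # the next run skips files before this path
out: {}
```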

Example for "sample_01.csv.gz", generated by embulk example:

in:
  type: azure_blob_storage
  account_name: myaccount
  account_key: myaccount_key
  container: my-container
  path_prefix: logs/csv-
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    header_line: true
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}

To filter files using regexp:

in:
  type: azure_blob_storage
  path_prefix: logs/csv-
  ...
  path_match_pattern: \.csv$   # a file will be skipped if its path doesn't match with this pattern

  ## some examples of regexp:
  #path_match_pattern: /archive/         # match files in .../archive/... directory
  #path_match_pattern: /data1/|/data2/   # match files in .../data1/... or .../data2/... directory
  #path_match_pattern: \.csv$|\.csv\.gz$    # match files whose suffix is .csv or .csv.gz

With proxy

in:
  type: azure_blob_storage
  ...
  proxy:
      type: http
      host: proxy_host
      port: 8080
      user: proxy_user
      password: proxy_secret_pass

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously

Test

$ ./gradlew test  # -t to watch change of files and rebuild continuously

To run the unit tests, configure the environment variables listed below.

Additionally, the test files need to be uploaded to an existing Azure Blob Storage container.

When the environment variables are not set, some test cases are skipped.

AZURE_ACCOUNT_NAME
AZURE_ACCOUNT_KEY
AZURE_CONTAINER
AZURE_CONTAINER_IMPORT_DIRECTORY (optional, if needed)

If you're using Mac OS X El Capitan and GUI applications (such as an IDE), set the environment variables as follows.

$ vi ~/Library/LaunchAgents/environment.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>my.startup</string>
  <key>ProgramArguments</key>
  <array>
    <string>sh</string>
    <string>-c</string>
    <string>
      launchctl setenv AZURE_ACCOUNT_NAME my-account-name
      launchctl setenv AZURE_ACCOUNT_KEY my-account-key
      launchctl setenv AZURE_CONTAINER my-container
      launchctl setenv AZURE_CONTAINER_IMPORT_DIRECTORY unittests
    </string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>

$ launchctl load ~/Library/LaunchAgents/environment.plist
$ launchctl getenv AZURE_ACCOUNT_NAME  # check that the value is set

Then start your applications.