Easily generate Apache Sqoop commands based on YAML config file




Cornet is a command-line tool on top of Apache Sqoop (v1) that simplifies the ingestion of data from RDBMSs into Hive.

Cornet is a generator of Sqoop commands. If you are a data engineer using Sqoop and you find yourself writing hundreds of Sqoop commands which are hard to maintain and update, Cornet might be of great help to you.


Ingest tables from an RDBMS to Hive using a single YAML config file, with support for:

  • type-based column mapping via java-type-mapping and hive-type-mapping
  • skipping selected tables
  • importing only selected tables

Cornet: First steps

Cornet needs a configuration file in YAML format. Here's the simplest possible config file, called example.yaml:

- source:
    host: localhost
    port: 5432
    db: portal
    user: marcel
    password: my-secret-password
    driver: postgres
  target:
    db: source

Assume the PostgreSQL database contains the tables customers and products. Running cornet example.yaml then prints the following to stdout:

sqoop import \
    --table customers \
    --hive-table source.customers \
    --hive-import \
    --connect jdbc:postgresql://localhost:5432/portal \
    --username marcel \
    --password my-secret-password

sqoop import \
    --table products \
    --hive-table source.products \
    --hive-import \
    --connect jdbc:postgresql://localhost:5432/portal \
    --username marcel \
    --password my-secret-password

This output can then be run on the Hadoop cluster, for instance by saving it into a script:

cornet example.yaml > ingest.sh
chmod +x ingest.sh
./ingest.sh

This example by itself does not solve any pain point. However, check out the following section for the cool features we have built in.

Cool features

Type-based column mapping

Out of the box, Sqoop supports only the standard JDBC column types. For other types, e.g. VARBINARY in MySQL or UUID in Postgres, you need to add the --map-column-hive and --map-column-java parameters to the Sqoop command. Unfortunately, Sqoop expects the mapping to be specified per column rather than per type. This has two consequences:

  • as of Sqoop 1.4.5, the only way to ingest such data is to write a separate Sqoop command for each table, with the column mapping spelled out. For large databases, this might mean thousands of Sqoop commands.
  • if a new column with a non-standard type is added to a source table, the Sqoop ingestion fails

Cornet solves this problem by providing type-based mappings. Use the java-type-mapping and hive-type-mapping sections as follows:

- source:
    db: portal
    java-type-mapping:
      VARBINARY: String
      UUID: String
    hive-type-mapping:
      VARBINARY: String
      UUID: String

Using this configuration, if there are any VARBINARY or UUID columns present in the portal database, Cornet will add --map-column-java and --map-column-hive to the corresponding Sqoop commands. For example:

sqoop import \
    --table products \
    --map-column-java customer_id=String,some_blob_column=String \
    --map-column-hive customer_id=String,some_blob_column=String \
    ...

Powerful selection of tables, including regexp-based filtering

If you don't need to import all tables from a database, use the skip_tables section. For example:

- source:
    db: portal
    skip_tables:
    - schema_version
    - log

The table names are actually regular expressions. For instance, you can exclude all tables starting with the prefix QUARTZ_ as follows:

    skip_tables:
    - QUARTZ_.*

Similarly, there's an import_tables section which allows you to import only a selected list of tables.
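For example, to import only the customers and products tables (a minimal sketch; import_tables mirrors skip_tables, and the table names here are illustrative):

- source:
    db: portal
    import_tables:
    - customers
    - products

As with skip_tables, the entries should work as regular expressions, so e.g. cust.* would select every table starting with cust.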

Add a prefix to Hive tables
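A minimal sketch using the table_prefix option (the exact resulting Hive table name is an assumption): with the config below, a source table customers would end up in a Hive table named portal_customers:

- source:
    db: portal
    table_prefix: 'portal_'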

Add arbitrary Sqoop parameters

Override Sqoop parameters on a per-table basis

Support for password-file
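Sqoop itself can read the database password from a file via its --password-file parameter, which avoids putting plain-text passwords into the generated commands. A sketch of how this could look in the config (the password-file key name and the path are assumptions):

- source:
    db: portal
    user: marcel
    password-file: /user/marcel/.sqoop-password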

Don't repeat yourself: Meet the global section
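A sketch of the global section (the exact set of keys it accepts is an assumption): values declared there apply to every source, so shared settings such as credentials need to be written only once:

- global:
    user: marcel
    password: my-secret-password
- source:
    db: portal
- source:
    db: crm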

It's all Jinja!

The YAML config files are actually Jinja2 templates! This can come in very handy.

For example, you can ingest several databases with similar names using a for loop:

{% for country in ['us', 'gb', 'fr'] %}
- source:
    db: portal_{{country}}
    table_prefix: 'portal_{{country}}_'
{% endfor %}

To make the YAML config file even more DRY, check out Jinja2 variables and blocks, and the many other features that Jinja2 provides.
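For instance, a plain Jinja2 {% set %} variable lets you define the country list once and reuse it:

{% set countries = ['us', 'gb', 'fr'] %}
{% for country in countries %}
- source:
    db: portal_{{country}}
    table_prefix: 'portal_{{country}}_'
{% endfor %}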

Explanatory syntax errors

We have tried our best to provide good explanatory messages if something is missing or not quite right in the config file.


Installation

Cornet currently supports Postgres and MySQL. Please create an issue if you need support for other databases.

First, install the development headers for the source database you plan to ingest into Hive:


Postgres:

  • Ubuntu/Debian: apt-get install libpq-dev
  • RedHat/CentOS: yum install postgresql-libs
  • Mac OS: should be installed by default


MySQL:

  • Ubuntu/Debian: apt-get install python-dev libmysqlclient-dev
  • RedHat/CentOS: yum install python-devel mysql-devel
  • Mac OS: brew install mysql

Then, install Cornet with pip, specifying the desired connectors in brackets:

  • Cornet with Postgres: pip install cornet[postgres]
  • Cornet with MySQL: pip install cornet[mysql]
  • Cornet with MySQL and Postgres: pip install cornet[mysql,postgres]

Issues and contributions

Please create an issue if you encounter any problem or need an additional feature. We'll try to get back to you as soon as possible.


Created with passion by Marcel and Daan.