nldsl

A DSL for data science with a syntax similar to a natural language.


Licenses
AFL-3.0/NCGL-UK-2.0
Install
pip install nldsl==0.15.4

Documentation

NLDSL

NLDSL is a tool to create domain specific languages (DSLs) for data science, which can be translated into executable code. A new DSL is created by deriving from the CodeGenerator class and rules are added to it via simple python functions. Besides providing code generation NLDSL allows the user to define DSL-level function, which are then treated as first-class rules. Currently we provide extensions for Pandas and PySpark.

Documentation and Tutorial

The complete documentation as well as a tutorial can be found on the project web page.

Installing

Open a terminal and run:

pip install nldsl

Example Usage

  1. Instanciate a code generator for Pandas.
from nldsl import PandasCodeGenerator, grammar
code_gen = PandasCodeGenerator()
  1. Define a DSL statement and create executable code.
dsl_stmt = "## x = on df | rename columns old1 to new1, old2 to new2"
python_code = code_gen(dsl_stmt)
  1. Add a new rule to the DSL and use it to create executable code.
new_rule = "#$ my rule $x = rename columns old to $x | drop duplicates"
dsl_stmt = "## x = on df | my rule new"
python_code = code_gen(new_rule + dsl_stmt)
  1. Implement a new rule via python.
def rename_columns(code, args):
    return code + ".rename(columns={})".format({args[i]:args[i+2] for i in range(0, len(args), 3)})
  1. Implement the "rename columns" rule using the grammar decorator, which parses the arguments for you.
@grammar
def rename_columns(code, args):
    """ Desc...
    Grammar:
        rename columns $my_var[$old to $new]
    """
    # The 'Grammar' section in the docstring of this function is mandatory.
    return code + ".rename(columns={})".format(dict(args["my_var"]))
  1. A new rule can either be added to a certain instance of a CodeGenerator or to the class itself.
code_gen["rename columns"] = rename_columns
PandasCodeGenerator.register_function("rename colummns", rename_columns)
  1. Create a custom CodeGenerator.
from nldsl import CodeGenerator

class MyCodeGenerator(CodeGenerator):
    # optionally you can specify an __init__, extract_environment and postprocessing method.
    pass
  1. By default a python expression parser is provided for every grammar rule, but you may also create your own.
from nldsl.core import ExpressionRule

class MyExpressionRule(ExpressionRule):
    def __init__(self, expr_name, next_keyword):
        super().__init__(expr_name, next_keyword)
        # The symbol of an expression is modified by specifing it in the operator_map.
        # Note that spaces need to be added explicitly.
        self.operator_map["and"] = " & "
        self.operator_map["not"] = "~"

        # You can also change the type of an operator,
        # e.g. from binary to being a member function of the left argument.
        self.operator_map["in"] = ".isin"
        self.operator_type["in"] = OperatorType.UNARY_FUNCTION

# Tell the grammar rule 'my_rule' to use MyExpressionRule instead of the default.
@grammar(expr=MyExpressionRule)
def my_rule(code, args):
    # ...
  1. CodeGenerators may be organized in a hierarchy and combined or specialized via polymorphism.
class CodeGenA(CodeGenerator): pass
CodeGenA.register_function("my_rule_a", my_rule_a)

class CodeGenB(CodeGenerator): pass
CodeGenB.register_function("my_rule_b", my_rule_b)

# Combine the rules in CodeGenA und CodeGenB
class CodeGenC(A, B): pass

# Specialize a rule from CodeGenA
class CodeGenD(A): pass
CodeGenD.register_function("my_rule_a", new_rule_a)
  1. The prefix of evaluation and definition DSL statements may be overwritten by a code generator.
class MyCodeGen(CodeGenerator): pass

MyCodeGen.init_meta_model(eval_prefix="//#", define_prefix="//$")

Python Versions

Tested and supported for Python 3.4+