Data transformation and analytics library - Excel connection and reader.


Keywords
Business, Intelligence, BI, ETL, Data, Integration, Management, DbDataReader, Processing, Analytics, sqlserver, sqlite, sql, ftp, sftp, oracle, postgres, postgressql, mysql, azure, excel
License
Apache-2.0
Install
Install-Package dexih.connections.excel -Version 1.1.1066

Documentation

Data Transformation Libraries

Nuget Packages

Base Libraries

Package Nuget Link
Transforms - (Start here) The base transforms library, table caching, and a basic connection.
Functions - Standard functions, data type conversions and other support libraries for the transforms.

SQL Connections

Package Nuget Link
Sql Server
Postgres Sql
Sqlite
MySql
Oracle

Flat File Connections

Package Nuget Link
Local flat files
FTP flat files
SFTP flat files
Azure Storage flat files

Other Connections

Package Nuget Link
Restful Web Services
Excel
Integration Hub

What is this?

This library provides .NET developers with the ability to implement data reading, transformation, analysis and delivery capabilities within their applications.

The key features are:

  • Built on .NET Core, and cross-platform tested on Windows, OSX and Linux.
  • Provides a uniform view over a set of heterogeneous data sources. These include:
    • Sql Databases
    • NoSql Databases - Azure Storage Tables.
    • Rest based web services.
    • Text files of varying formats (locally and via SFTP or FTP)
    • Excel files.
    • Directly from in-memory POCOs (Plain Old CLR Objects).
  • Data functions that enable on the fly field encryption, json/xml parsing, text parsing, and geographical calculations.
  • Analytical functions allowing standard deviation, moving averages and other functions operating across multiple data rows.
  • Transforms which can be chained together to enable complex data shaping. These include capabilities such as group, sort, row pivot and lookups.
  • CDC (change data capture) capabilities that can:
    • Detect updates, deletes in target data and apply changes.
    • Preserve change history by maintaining row versions on target tables.
  • Apply data validation/rejection through column-level validation and rejection rules.
  • Automatically capture data profiling statistics and column distribution analysis.

This library can be used as a foundation for applications that need to process small or large volumes of data, such as:

  • Business Intelligence and reporting.
  • Data Integration and Batch Processing.
  • Real-time analytics and alerting.

How does it work?

The library consists of the following primary classes:

  • Readers - Connects to, and retrieves underlying data.
  • Transforms - Prepares and processes data as needed. Transforms can be chained together as many times as needed to perform more complex data processing.
  • Writer - Writes the data out to the final destination.
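
For example, the three roles can be chained together as in the minimal sketch below (written to run inside an async method). It uses the Table, ReaderMemory and TransformSort classes described later in this document; the table and column names are illustrative only.

//reader role - create a small in-memory table and a reader over it
var table = new Table("source", 0,
    new TableColumn("Name", DataType.ETypeCode.String, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("Value", DataType.ETypeCode.Int32, TableColumn.EDeltaType.TrackingField)
);
table.AddRow("b", 2);
table.AddRow("a", 1);
var reader = new ReaderMemory(table);

//transform role - prepare the data, here a simple sort by the Name column
var sorted = new TransformSort(reader, new List<Sort> { new Sort("Name", Sort.EDirection.Ascending) });

//writer role - deliver the data, here simply written to the console
while (await sorted.ReadAsync())
{
    Console.WriteLine("{0}: {1}", sorted[0], sorted[1]);
}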

Why not just use SQL / Linq and Entity Framework?

The transformations in this library work best when used in conjunction with optimised SQL queries; however, SQL has limits in many areas of data processing. The transform library can provide the following benefits:

  • Analytic calculations that are difficult to carry out using SQL. For example, statistical calculations such as median and moving average are very difficult to accomplish with SQL.
  • Row pivoting functions can be used to easily translate structures such as Xml and Json into tabular data.
  • Join data from multiple data sources. For example you can join datasets on the fly from different databases, or connect to csv files, web services, or no-sql databases.
  • Reduced workloads on databases, which can avoid database locking and performance problems on operational databases.
  • Building reusable functions and logic that can be applied across multiple databases and database types.

Using the Readers

The readers are used to start (or join into) the data chain.

Reading from a class (POCO Reader)

The following statement takes a populated class and converts it to a streamable reader.

var reader = new ReaderPoco<BookClass>(books);

Here is a full sample that loads and displays the books class. Note that the Field attribute can be used to define properties such as an alternate name and the delta type (such as natural key, tracking field, etc.).

public class Program
{
    public static void Main(string[] args)
    {
        var poco = new CreatePocoReader();
        poco.Create();
    }
}

public class BookClass
{
    [Field("code", DeltaType = TableColumn.EDeltaType.NaturalKey)]
    public string Code { get; set; }

    [Field("name")]
    public string Name { get; set; }

    [Field("name")]
    public int Cost { get; set; }

    [Field("date_published")]
    public DateTime Published { get; set; }

}

public class CreatePocoReader
{
    public void Create()
    {
        var books = CreateBooksData();
        var reader = new ReaderPoco<BookClass>(books);

        DisplayReader(reader);
    }
    
    public List<BookClass> CreateBooksData()
    {
        var books = new List<BookClass>();
        
        books.Add(new BookClass() {Code = "001", Name = "Lord of the rings", Cost = 15, Published = new DateTime(1954, 07,29)});
        books.Add(new BookClass() {Code = "002", Name = "Harry Potter and the Philosopher's Stone", Cost = 12, Published = new DateTime(1997, 06,26)});
        books.Add(new BookClass() {Code = "003", Name = "A Game of Thrones", Cost = 16, Published = new DateTime(1996, 07,01)});

        return books;
    }

    public void DisplayReader(DbDataReader reader)
    {
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                Console.Write(reader.GetName(i) + ":" + reader[i].ToString() + (i < reader.FieldCount-1 ? ", " : ""));
            }
            Console.WriteLine();
        }
    }
}

Memory Reader

A memory reader uses a table object populated with data. The following sample creates a memory reader:

//Create the table class
var table = new Table("test", 0,
    new TableColumn("StringColumn", DataType.ETypeCode.String, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("IntColumn", DataType.ETypeCode.Int32, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("DecimalColumn", DataType.ETypeCode.Decimal, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("DateColumn", DataType.ETypeCode.DateTime, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("SortColumn", DataType.ETypeCode.Int32, TableColumn.EDeltaType.TrackingField)
            );

//Populate the table with some data
    table.AddRow("value01", 1, 1.1, Convert.ToDateTime("2015/01/01"), 10 );
    table.AddRow("value02", 2, 2.1, Convert.ToDateTime("2015/01/02"), 9 );
    table.AddRow("value03", 3, 3.1, Convert.ToDateTime("2015/01/03"), 8 );
    table.AddRow("value04", 4, 4.1, Convert.ToDateTime("2015/01/04"), 7 );
    table.AddRow("value05", 5, 5.1, Convert.ToDateTime("2015/01/05"), 6 );

//Initialize the ReaderMemory, with an indicator that the "StringColumn" is sorted
var reader = new ReaderMemory(table, new List<Sort>() { new Sort("StringColumn") });

Row Creator Reader

A row creator is a simple reader that generates a sequence of numbers. This can be used to generate test data, or dimensions such as a date dimension that requires a fixed sequence of values. The following sample creates a row generator reader:

//Create a sequence of rows from 1 to 1000, incrementing by 2
var rowCreator = new ReaderRowCreator(1, 1000, 2);
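
The generated rows can then be read like any other reader. A minimal sketch (assuming the reader can be iterated directly, and that the generated sequence is exposed as the first column):

//read and print the generated sequence
while (await rowCreator.ReadAsync())
{
    Console.WriteLine(rowCreator[0]);
}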

Using Connection Readers

The dexih.connections libraries can be used to generate a reader from various types of sources, depending on the connection being used. See the documentation for each individual connection for specifics.

The following example uses the ConnectionSqlite to read from a local sqlite database table.

//Create a connection to sqlite
var connection = new ConnectionSqlite()
{
    Name = "Test Connection",
    NtAuthentication = true,
    ServerName = "c:\\data\database.sqlite"
};

//get the table structure.
var tableResult = connection.GetSourceTableInfo("SalesTable");
if(!tableResult.Success)
    throw new Exception("Could not retrieve table information");
var table = tableResult.Value;

//create a new connection to the database.
var newConnection = connection.NewConnection();

var readerResult = connection.GetDatabaseReader(table, newConnection);

if(!readerResult.Success)
   throw new Exception("Reader issue: " + readerResult.Message);

//new instance of the reader
var reader = readerResult.Value;

Reading using an existing DbDataReader

If you already have a DbDataReader object, the following will translate it into a transform object:

var reader = new ReaderDbDataReader(sourceReader);

Here is an example that reads data from a SQL Server table, applies a filter, performs some analytics, and then writes the data to a SQL Server table.

public void FirstTransform(SqlConnection sourceConnection, SqlConnection targetConnection)
{
    // Retrieve the data from the database
    var cmd = new SqlCommand("select * from Sales.SalesOrderHeader ", sourceConnection);
    var sourceReader = cmd.ExecuteReader();

    // Load the reader into transform source, which will start the transform chain.
    var transformSource = new ReaderDbDataReader(sourceReader);

    // Create a custom filter that removes records where PurchaseOrderNumber is null
    var transformFilter = new TransformFilter(
        transformSource,
        new List<Function>()
        {
            new Function(
                new Func<string, bool>((value) => value != null), //function code
                new[] { "PurchaseOrderNumber" },  //input column
                null, null )
        }
    );

    // Add median, and sum calculation
    var transformGroup = new TransformGroup(
        transformFilter,
        new List<ColumnPair>() //The fields to groupby
        {
            new ColumnPair("OrderDate")
        },
        new List<Function>()
        {
            StandardFunctions.GetFunctionReference("Median", new[] { "TotalDue" }, "DailyMedian", null),
            StandardFunctions.GetFunctionReference("Sum", new[] { "TotalDue" }, "DailyTotal", null)
        },
        true //Pass through columns = true will pass through the original fields/rows and merge in the aggregates
    );

    using (var bulkCopy = new SqlBulkCopy(targetConnection))
    {
        bulkCopy.DestinationTableName = "SalesOrderDaily";
        bulkCopy.WriteToServer(transformGroup);
    }
}

The Transform Class

The transform class is an implementation of the DbDataReader class that leverages all the benefits of the DbDataReader, whilst extending the functionality to allow for more advanced data integration.

Basics

The transform class inherits the DbDataReader class and can be used in the same way to read data.

The following is a simple example of reading records from a Transform object.

while(await transform.ReadAsync())
{
    for(int i = 0; i < transform.FieldCount; i++)
    {
        Console.WriteLine("Field: {0}, Value: {1}", transform.GetName(i), transform[i].ToString());
    }
}

Caching

The transform class enhances the DbDataReader class by providing a built-in caching mechanism. The caching can be switched off, or set to full or partial caching. When caching is enabled, you can:

  • Navigate backwards through previous records.
  • Reset the reader to the start, without re-accessing the source database again.
  • Peek at previous records (without resetting the current row).
  • Search / lookup against previously loaded records.
  • Perform an on-demand lookup, where the cache is referenced first, and the data source is queried if the row is not found.

Caching Settings

The transform can store a cache of records already read. By default the cache is off. There are two caching options:

  • OnDemandCache - Will retain records that have been read by the ReadAsync() or LookupRow() function. Whenever the LookupRow function is called the transform will first check the cache for the row, and if not found ask the underlying connection for the row.
  • PreLoadCache - Will load all records into memory the first time the LookupRow function is called. It will then only refer to the cache when retrieving additional records.

The transform can also set a maximum cache size, which causes the transform to store the last n rows. This can be useful to manage memory usage on large tables, and store rows when a commit fails.

The following example sets the caching to OnDemandCache and stores a maximum of 1000 rows.

transform.SetCacheMethod(ECacheMethod.OnDemandCache, 1000);

Navigating through the cache

The following methods are available to navigate through cached records:

  • SetRowNumber(int rowNumber) - Sets a specific row number. When the ReadAsync() is called, it will start reading from this number. An exception will be raised if the row number exceeds the number of cached rows.
  • RowPeek(int rowNumber, object[] values) - Populates the values array with the row's values at the specified rowNumber.
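
For example, the following sketch peeks at a cached row and then rewinds the reader (this assumes caching has been enabled with SetCacheMethod, and that cached row numbers start at zero):

//peek at the values of an earlier cached row without moving the current position
var values = new object[transform.FieldCount];
transform.RowPeek(0, values);

//rewind to the start of the cache and re-read without hitting the source again
transform.SetRowNumber(0);
while (await transform.ReadAsync())
{
    Console.WriteLine(transform[0]);
}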

Lookup Function

The lookup function can be used to retrieve values from the cache or directly through a supported connection.

The syntax for the lookup function is: Task<ReturnValue<object[]>> LookupRow(List<Filter> filters)

The lookup function applies the following logic to retrieve a record:

  1. Looks in the cache for the record.
  2. Executes a direct lookup if supported by the Connection.
  3. Scans through each row until the lookup is found.

The following connections support direct lookups:

  • ConnectionSqlite
  • ConnectionSql
  • Connection
  • ConnectionAzure
  • ConnectionWebServiceRestful
  • ConnectionWebServiceSoap
  • ConnectionMemory

The following example shows how to use the lookup function:

//gets the transform reader.
var reader = connection.GetTransformReader(table, null);

//open the reader
var openResult = await reader.Open();
if(!openResult.Success)
    throw new Exception("Open Reader failed:" + openResult.Message);

//set the caching
reader.SetCacheMethod(ECacheMethod.PreLoadCache);

//set a filter for the lookup
var filters = new List<Filter> { new Filter("IntColumn", Filter.ECompare.IsEqual, 5) };

//call the lookup
var returnLookup = await reader.LookupRow(filters);

if(returnLookup.Success)
    Console.WriteLine("The record was found.");
else
    Console.WriteLine("The record was not found.");

Note: Lookups are best used where there is a low number of lookup values, or when calling a function (such as a web service) where the lookup has specific parameters (such as the current datetime). Where there are two (or more) large datasets that need to be joined together, the JoinTransform will generally perform faster.

Chaining Transforms

Transforms can be chained together to create a complete data transformation.

A transform chain should follow this pattern:

  1. Start with a reader
  2. Chain together the general transforms required to produce the desired data transformation.
  3. End the chain with an optional validation transform, followed by a delta transform.
  4. The end point can either be read directly as a DbDataReader (i.e. iterating through the ReadAsync() function), or the TransformWriter class can be used to write the result set to a connection.

Creating the Transforms

After creating a reader to start the transformation, the next step is to chain the transforms together to perform the desired data processing.

The following example assumes a reader has already been created, and then chains a sort and mapping transform together.

//use the sort transform to sort by the "StringColumn" field (descending)
var sortFields = new List<Sort> { new Sort("StringColumn", Sort.EDirection.Descending) };
var sortTransform = new TransformSort(reader, sortFields);

//maps the "StringColumn" field to the name "OutputColumn"
var mappingFields = new List<ColumnPair>() { new ColumnPair("StringColumn", "OutputColumn") };
var mappingTransform = new TransformMapping(sortTransform, mappingFields, null);

//read the data using mappingTransform.ReadAsync();

Delivering the data

Saving or publishing data can be done using any function that supports a DbDataReader (such as SqlBulkCopy) or using the TransformWriter class.

The following is a simple example that uses the SqlBulkCopy class to write a transform result to a Sql Server table.

using (var bulkCopy = new SqlBulkCopy(targetConnection))
{
    bulkCopy.DestinationTableName = "SalesOrderDaily";
    bulkCopy.WriteToServer(mappingTransform);
}

The library also includes a TransformWriter class. This class can be used to save data to any of the available Connection libraries. It also includes logic to apply operations such as update, delete, preserve, and reject, which are produced by the TransformValidation and TransformDelta transforms.

The following code shows how to use this class:

TransformWriter writer = new TransformWriter();
TransformWriterResult writerResult = new TransformWriterResult();
var returnResult = await writer.WriteAllRecords(writerResult, transform, targetTable, targetConnection, rejectTable, rejectConnection, CancellationToken.None);

if(!returnResult.Success)
    throw new Exception("The writer failed to run with message: " + returnResult.Message);

if(writerResult.RunStatus == Abended)
    throw new Exception("The writer failed with message: " + writerResult.Message);

Console.WriteLine("Finished with status {0}, and processed {1} rows.", writerResult.RunStatus, writerResult.RowsTotal.ToString());

Table Class

The table class is required by every transform, and it provides metadata and caching capabilities.

A table can either be created manually, or by calling the GetSourceTableInfo function in a connection class.

To retrieve from a connection:

var getTableResult = connection.GetSourceTableInfo("MyTable", null);
if(!getTableResult.Success)
    throw new Exception("There was an issue getting the table: " + getTableResult.Message);

Table table = getTableResult.Value;
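
Alternatively, a table can be defined manually using the Table and TableColumn constructors, following the same pattern as the memory reader example earlier:

//define the table structure manually
var table = new Table("MyTable", 0,
    new TableColumn("Key", DataType.ETypeCode.Int32, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("Description", DataType.ETypeCode.String, TableColumn.EDeltaType.TrackingField)
);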

Encryption & Hashing

Sensitive fields can easily be encrypted, decrypted or hashed through the transform classes.

Built in Transforms

The built-in transforms are implementations of the base transform class. These can be chained together to allow almost any type of data transformation.

The following is a short description of the built-in transforms:

  • Filter - Filters the rows based on boolean conditions specified by the filter functions.
  • Group - Allows rows to be grouped, and analytic and aggregate functions to be applied to the result. Using the "Pass Through Columns" setting, the group will keep the original number of rows and join in the analytic calculation.
  • Join - Allows an extra input data stream to be joined to the main table.
  • Lookup - Allows a row lookup to be performed against another data source or external function.
  • Mapping - Maps source fields to target fields using simple source-target mappings or advanced calculations using mapping functions.
  • Row - Using a row function, translates values into rows. This can be used to pivot values, or to parse JSON/XML fields into data rows.
  • Sort - Sorts the dataset by one or more columns in ascending or descending order.
  • Concatenate - Concatenates the rows of two data streams together (similar to a SQL UNION).
  • Profile - Generates statistics for a dataset, without impacting the dataset (meaning it can be inserted anywhere statistics are required). The built-in profile functions allow for detection of string patterns, data types, uniqueness, etc.
  • Validation - Automatically checks the fitness of incoming data against configurable validation rules and the expected data types. Where data does not pass the validation, it can be cleaned or marked for rejection.
  • Delta - Compares the incoming dataset against another dataset, produces a delta, and generates a set of audit data. The delta has a number of update strategies that can be configured, including update/delete detection and history preservation. In a data warehouse this can be used to generate type 2 slowly changing dimensions.

To run a complete transformation, these transforms can be chained together in any logical sequence.

Some tips:

  • Filter early, to reduce the number of rows other transforms need to process.
  • The Group, Row and Join transforms run faster with a sorted dataset. If you can sort the data through SQL or other means, set the SortField property on the feeding transforms.

Using Functions

Functions are used across all transforms to map, aggregate and filter data. Functions accept values from the incoming data rows, and return values which are mapped to output columns.

There are two types of functions that can be defined: stateless and state functions.

Stateless Functions

Stateless functions are used by the mapping and filter transforms, and do not maintain state between rows.

The following example shows how to create a new function that adds two values (i + j). If this function is added to a mapping transform, the transform will look for the input columns value1 and value2 and map the result to the target column AddResult. Additional output columns can be defined by specifying fields in the final (output) parameter.

using dexih.functions;

Function function1 = 
	new Function(
		new Func<int, int, int>((i, j) => i + j), 
		new string[] { "value1", "value2" }, 
		"AddResult", 
		null //specify more output fields here.
	);
	
//test the function - writes the value 8
Console.WriteLine((Int32)function1.RunFunction(new object[] { 6, 2 }).Value);

Functions can also be created by referencing an existing class and method. The following example is equivalent to the previous example, however it references a standard function.

using dexih.functions;

Function function2 = 
	new Function(
		typeof(StandardFunctions), 
		"Add", 
		new string[] { "value1", "value2" }, 
		"AddResult", 
		null
	);

State functions

Functions containing a state are used for aggregations, analytics, and row pivoting where multiple rows of data are required to run the function. To implement a state function, three discrete functions must be defined:

  1. The primary function - Called for each row in the grouping. This should be a void function, with the input parameters specifying the column values to be processed.
  2. The result function - Called to retrieve a result when the grouping has completed. This should return the result, and specify out parameters for any additional values to be returned.
  3. The reset function - Called to reset variables and start receiving data for the next group.

Here is a simple example that implements the sum function:

class CalculateSum
{
	int total = 0;
	
	public void Sum(int value)
	{
		total = total + value;
	}
	
	public int SumResult()
	{
		return total;
	}
	
	public void SumReset()
	{
		total =  0;
	}
}
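
To make the lifecycle concrete, the sketch below drives CalculateSum by hand in the same way a grouping transform would conceptually call it: the primary function once per row in the group, the result function when the group completes, and the reset function before the next group. This illustrates the calling pattern only; it is not the transform's internal code.

var calc = new CalculateSum();

//primary function - called once per row in the group
calc.Sum(1);
calc.Sum(2);
calc.Sum(3);

//result function - called when the group is complete, writes 6
Console.WriteLine(calc.SumResult());

//reset function - called before the next group starts
calc.SumReset();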

Using the Transforms

Filter Transform

Filters the rows based on boolean conditions specified by the filter functions.

The filter transform consists of one or more functions that return a boolean result. A record is filtered out when any of the functions return false.

The following example uses the IsEqual function (with the NotCondition property) to remove any rows where the JunkColumn is equal to the value "junk".

//set a Conditions list
List<Function> Conditions = new List<Function>();

//use the built in IsEqual function.
Function Function = StandardFunctions.GetFunctionReference("IsEqual");
Function.Inputs = new dexih.functions.Parameter[] {
        new dexih.functions.Parameter("JunkColumn", ETypeCode.String, true, null, "StringColumn" ),
        new dexih.functions.Parameter("Compare", ETypeCode.String, false, "junk") };

//use the NotCondition property to change the function to `not equal`.
Function.NotCondition = true;

//add the function to the conditions list
Conditions.Add(Function);

//create the new filter transform with the conditions applied.
TransformFilter TransformFilter = new TransformFilter(InTransform, Conditions);

Sort Transform

Sorts the dataset by one or more columns and an ascending or descending order.

The following example sorts the incoming data by Column1 (ascending) and Column2 (descending).

var SortFields = new List<Sort> { 
    new Sort("Column1", Sort.EDirection.Ascending) 
    new Sort("Column2", Sort.EDirection.Descending) 
};
TransformSort TransformSort = new TransformSort(Source, SortFields);    

Note: The sort transform sets the OutputSortFields property. This property informs downstream transforms that the dataset is sorted, and is used by the Group, Row, Join and Delta transforms to determine whether the data is already sorted as they require. If the inbound transform already has OutputSortFields set to the same sort order as the sort transform, the sort will do nothing.

Mapping Transform

Maps source fields to target fields using simple source-target mappings or advanced calculations using mapping functions.

The following example uses the ColumnPair class to perform some column mappings, and a function to perform a substring.

List<ColumnPair> MappingColumns = new List<ColumnPair>();

//map the sourceFieldName to the targetFieldName
MappingColumns.Add(new ColumnPair("sourceFieldName", "targetFieldName"));

//map some other fields without any name change.
MappingColumns.Add(new ColumnPair("createDate"));
MappingColumns.Add(new ColumnPair("updateDate"));

List<Function> MappingFunctions = new List<Function>();

//use the substring function to limit the 'bigField' to 20 characters
Function = StandardFunctions.GetFunctionReference("Substring");
Function.TargetColumn = "trimmedField";
Function.Inputs = new dexih.functions.Parameter[] {
        new dexih.functions.Parameter("name", ETypeCode.String, true, null, "bigField" ),
        new dexih.functions.Parameter("start", ETypeCode.Int32, false, 0),
        new dexih.functions.Parameter("length", ETypeCode.Int32, false, 20) 
};
MappingFunctions.Add(Function);

//create the mapping transform
var transformMapping = new TransformMapping(InTransform, MappingColumns, MappingFunctions);

The mapping transform can also use the property PassThroughColumns. If this is set to true, the mapping transform will map any source fields that haven't already been used by a mapping function or mapping ColumnPair.
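
For example (a brief sketch, assuming the property can be set directly on the transform instance before it is read):

//pass through any source columns that haven't been explicitly mapped
transformMapping.PassThroughColumns = true;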

Join Transform

The join allows an extra input data stream to be joined to the primary table, and is similar to a SQL left outer join. The rows from the primary table will be maintained, with the join fields being added based on the join conditions.

The join is optimized to work with sorted data or non-sorted data:

  • Merge Join - If the incoming data from the primary and join tables are sorted in the same order as the join field, the join transform will perform a merge join. The merge join is fast and can operate effectively over high volumes of data with very low memory usage.
  • Hash Join - If the incoming datasets are not sorted, a hash join will be used. The Hash join loads the join table into memory, and streams the primary table. The primary table should be the larger of the two tables being joined for optimal performance. This performs quickly with smaller join tables, however if the join table is in the millions of records, memory constraints can cause failures.

The following example shows a merge join operation. If the sort operations were not specified, the transform would default to a hash join.

//create primary table
Table primaryTable = new Table("Sales", 0,
    new TableColumn("SaleDate", DataType.ETypeCode.DateTime, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("ProductId", DataType.ETypeCode.Int32, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("SalesValue", DataType.ETypeCode.Decimal, TableColumn.EDeltaType.TrackingField);

//add data, note the productid column is sorted.
primaryTable.AddRow( Convert.ToDateTime("2015/01/01"), 10, 123 );
primaryTable.AddRow( Convert.ToDateTime("2015/01/01"), 10, 124 );
primaryTable.AddRow( Convert.ToDateTime("2015/01/02"), 20, 111 );
primaryTable.AddRow( Convert.ToDateTime("2015/01/02"), 20, 112 );

//create a reader, and indicate data is sorted by the productid
ReaderMemory primaryReader = new ReaderMemory(primaryTable, new List<Sort>() { new Sort("ProductId") } );

//create join table
Table joinTable = new Table("Products", 0,
    new TableColumn("ProductId", DataType.ETypeCode.Int32, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("ProductName", DataType.ETypeCode.String, TableColumn.EDeltaType.TrackingField));

//add data, note the productid column is sorted.
joinTable.AddRow( 10, "Product Number 10" );
joinTable.AddRow( 20, "Product Number 20" );

//create a reader, and indicate data is sorted by the productid
ReaderMemory joinReader = new ReaderMemory(joinTable, new List<Sort>() { new Sort("ProductId") } );

//create the join reader which can now be streamed by calling transformJoin.ReadAsync()
TransformJoin transformJoin = new TransformJoin(
    primaryReader, 
    joinReader, 
    new List<JoinPair>() { new JoinPair("ProductId", "ProductId") }
);

Lookup Transform

The lookup transform is similar to the join transform in that it performs a left outer join type of operation on a primary and secondary dataset. The differences are:

  • The lookup transform can perform direct lookups to the underlying connection. For example if the underlying connection is a database, the lookup will execute a database lookup repeatedly for each primary row. If the connection is a web service it will call the web service function repeatedly for each primary row.
  • The lookup transform can utilize a cache on demand process. This means that after a single lookup is completed, the value can stay in cache for future lookups.
  • The lookup can use more complex operations through functions, whereas the join can only perform equal operations.
  • The lookup cannot be used against connections where the secondary table does not support a direct lookup (e.g. flat files).

The lookup is best used when calling functions, or when the reference table is very large. For example, calling a function to retrieve the stock price at a precise time.

Group Transform

The group transform can be used to perform a SQL-like group by process, or to run analytical/aggregate functions against a dataset.

This example shows how to use the group transform to sum the Value column, grouped by the Product column of the dataset.

//create primary table
Table primaryTable = new Table("Sales", 0,
    new TableColumn("Product", DataType.ETypeCode.Int32, TableColumn.EDeltaType.NaturalKey),
    new TableColumn("Value", DataType.ETypeCode.Decimal, TableColumn.EDeltaType.TrackingField);

//add data, note the productid column is sorted.
primaryTable.AddRow( 'Product1', 1 );
primaryTable.AddRow( 'Product1', 2 );
primaryTable.AddRow( 'Product2', 3 );
primaryTable.AddRow( 'Product2', 4 );

//create a reader, and indicate data is sorted by the productid
ReaderMemory primaryReader = new ReaderMemory(primaryTable, new List<Sort>() { new Sort("ProductId") } );

List<Function> Aggregates = new List<Function>() {
    StandardFunctions.GetFunctionReference("Sum", new[] { "Value" }, "SumValue", null)
};

TransformGroup transformGroup = new TransformGroup(primaryReader, null, Aggregates, true);

The transform group supports the PassThroughColumns setting. When this is set to true, the Group transform will run the aggregation functions, but will leave the original rows and columns intact.

For example, take the following dataset:

| Product | Value |
|---------|-------|
| Apples  | 1     |
| Apples  | 2     |
| Oranges | 3     |
| Oranges | 4     |

If PassThroughColumns is set to false, this will aggregate to:

| Product | Sum |
|---------|-----|
| Apples  | 3   |
| Oranges | 7   |

If PassThroughColumns is set to true, this will keep the base records, and add the aggregates as follows:

| Product | Value | Sum |
|---------|-------|-----|
| Apples  | 1     | 3   |
| Apples  | 2     | 3   |
| Oranges | 3     | 7   |
| Oranges | 4     | 7   |

Row Transform

Using a row function, the row transform translates values into rows. This can be used to pivot values, or to parse JSON/XML fields into data rows.

Profile Transform

The profile transform generates statistics for a dataset, without impacting the primary dataset (meaning it can be inserted anywhere statistics are required). The built-in profile functions allow for detection of string patterns, data types, uniqueness, etc.

Validation Transform

The validation transform automatically checks the fitness of incoming data, testing it against configurable validation rules and the expected data types. Where data does not pass the validation, it can be cleaned, or marked for rejection.

Delta Transform

The delta transform compares the incoming dataset against another dataset, produces a delta, and generates a set of audit data. The delta has a number of update strategies that can be configured, including update/delete detection and history preservation. In a data warehouse this can be used to generate type 2 slowly changing dimensions.