Noggle.TikaOnDotNet.Parser

Wrapper and helper functions for running Apache Tika text extraction and parsing on .NET through Noggle.TikaOnDotNet.


Keywords
Tika, extraction-api, text-analysis, text-parser
License
Apache-2.0
Install
Install-Package Noggle.TikaOnDotNet.Parser -Version 1.20.4

Documentation

Tika on .NET

This project provides a .NET wrapper, with simple helper functions, around the Apache Tika text extraction library. It lets you use the Tika Java libraries in your .NET application via IKVM.

Getting Started

Usage

using System;
using System.IO;
using Noggle.TikaOnDotNet.Parser;

var tika = new Tika();

// Simple text extraction from a file path, stream, or byte array
string textFromFile = tika.ParseToString(stringToFile);
string textFromStream = tika.ParseToString(streamObject);
string textFromByteArray = tika.ParseToString(byteArrayObject);

// Parse a document and extract its text and metadata from a file path, URL, or stream
var localFileContents = tika.Parse(stringToFile);
var webPageContents = tika.Parse(new Uri("https://google.com"));
var streamDocResults = tika.Parse(new FileStream(file, FileMode.Open, FileAccess.Read));

// Detect the language of a string
var lang = tika.GetLanguage(textString);

NuGet

This project produces two NuGet packages:

How To Update

Start out by taking a look at the Developer Guide.

Source Reference

This project is an independent fork and extension of TikaOnDotNet. It upgrades the .NET, Visual Studio, FAKE, and NUnit 3 framework dependencies and uses a newer version of the Tika Java library. The wrapper package Noggle.TikaOnDotNet.Parser also adds features of its own. The original version was written by KevM as the TikaOnDotNet.Text package.