
Read and write Excel, Word and PDF documents in Julia




Taro is a utility belt of functions for working with document files in Julia. It uses Apache Tika and Apache POI (via JavaCall) to process the files. Current functionality includes reading an Excel sheet into a DataFrame and extracting text and metadata from a wide variety of document formats. It also uses Apache FOP to generate PDF files from XSL-FO files.

Build Status





On installation, the tika-app-1.4.jar file will be downloaded from Maven Central.


using Taro


Read Excel files

Taro.readxl(filename::String, sheet, region::String; 
        header::Bool = true, nastrings::Vector = ASCIIString["", "NA"], 
        truestrings::Vector = ASCIIString["T", "t", "TRUE", "true"],
        falsestrings::Vector = ASCIIString["F", "f", "FALSE", "false"], colnames::Vector = UTF8String[])

The sheet parameter can be a String, in which case it is interpreted as the sheet name. Alternatively, it can be an Integer, in which case it is interpreted as a 0-based sheet number.

The sheet parameter can be omitted, in which case the first sheet (index 0) in the workbook is selected.
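The three ways of selecting a sheet can be pictured with a small dispatch sketch. This is only an illustration of the convention described above, written for current Julia; `sheetindex` and the `names` list are hypothetical and are not part of Taro.

```julia
# Hypothetical helper, not Taro's implementation: resolve a sheet argument
# to a 0-based sheet index, given the workbook's sheet names in order.

# A String is treated as a sheet name.
function sheetindex(names, sheet::AbstractString)
    i = findfirst(isequal(sheet), names)
    i === nothing && error("no sheet named $sheet")
    return i - 1                      # Julia arrays are 1-based, sheets are 0-based
end

# An Integer is treated as a 0-based sheet number.
function sheetindex(names, sheet::Integer)
    0 <= sheet < length(names) || error("sheet index $sheet out of range")
    return Int(sheet)
end

# No sheet argument: the first sheet (index 0) is selected.
sheetindex(names) = 0

names = ["Sheet1", "Sheet2"]          # placeholder sheet names
byname  = sheetindex(names, "Sheet2")
bynum   = sheetindex(names, 0)
default = sheetindex(names)
```

Using one method per argument type mirrors how the two readxl signatures differ only in whether the sheet argument is present.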

Taro.readxl(filename::String, region::String; optional_config...)

The readxl function returns a DataFrame built from the contents of an MS Excel file. The sheet and region containing the data should be specified. By default, a header row is expected, and it must consist only of strings. Set the header keyword argument to false if the data has no header row.
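To see how a region relates to the size of the result, the sketch below computes the number of rows and columns covered by a region string such as "B2:F10". It assumes regions use the usual Excel A1-style notation and is written for current Julia; `colnum` and `regionsize` are hypothetical helpers, not Taro functions.

```julia
# Hypothetical helpers, not part of Taro: map an A1-style region to its shape.

# Convert column letters ("A", "B", ..., "AA") to a 1-based column number.
function colnum(letters::AbstractString)
    n = 0
    for c in uppercase(letters)
        n = n * 26 + (Int(c) - Int('A') + 1)
    end
    return n
end

# Return (rows, cols) covered by a region such as "B2:F10".
function regionsize(region::AbstractString)
    m = match(r"^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$", region)
    m === nothing && error("not an A1-style region: $region")
    c1, r1 = colnum(m.captures[1]), parse(Int, m.captures[2])
    c2, r2 = colnum(m.captures[3]), parse(Int, m.captures[4])
    return (r2 - r1 + 1, c2 - c1 + 1)
end

rows, cols = regionsize("B2:F10")   # rows 2 to 10, columns B to F
datarows = rows - 1                 # with header=true, one row holds the column names
```

With a header row, "B2:F10" therefore leaves 8 data rows and 5 columns, which matches the 8x5 DataFrame in the examples in this README.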

Extract raw content from document files


The extract function retrieves the metadata and the body text of a document. It returns a Dict of metadata name-value pairs and a String containing the text of the document. Supported formats include MS Office, Open Office and PDF documents.
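Once the metadata Dict and the body String are in hand, they can be processed with ordinary Julia code. The sketch below uses stand-in values in place of a real call to extract, so it runs without Taro; the metadata key, the text, and the `summarizedoc` helper are placeholders chosen for illustration, not actual Taro output or API. It is written for current Julia.

```julia
# Stand-in values shaped like the (Dict, String) pair that extract is described
# as returning. These are placeholders, not real Taro output.
meta = Dict("Content-Type" => "application/example")
body = "first line\nsecond line\n"

# Hypothetical helper: list the metadata names in sorted order and
# count the whitespace-separated words in the body text.
function summarizedoc(meta::AbstractDict, body::AbstractString)
    names = sort(collect(keys(meta)))
    nwords = length(split(body))
    return names, nwords
end

names, nwords = summarizedoc(meta, body)
```

With a real document, `meta` and `body` would instead come from `meta, body = Taro.extract(filename)`.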

Generate PDF files using FOP

Taro has an interface to the Apache FOP project. This allows you to generate professional-quality PDF files from XSL-FO layout definition templates. Please see the FOP Documentation for details.

Taro.fo(inputFoFileName::String, outputPDFFileName::String)

Converts the input XSL-FO file to a PDF file.
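As a rough sketch of an input file, the code below writes a minimal XSL-FO document to a temporary directory from Julia. The file names and page dimensions are arbitrary choices for illustration, and the call to Taro.fo is left as a comment so that the snippet runs without Taro installed. It is written for current Julia.

```julia
# A minimal XSL-FO document: one A4 page master and a single block of text.
fo_source = """
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="A4" page-height="29.7cm"
                           page-width="21cm" margin="2cm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="A4">
    <fo:flow flow-name="xsl-region-body">
      <fo:block>Hello from Taro</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
"""

# Placeholder file names in a fresh temporary directory.
workdir = mktempdir()
fofile  = joinpath(workdir, "hello.fo")
pdffile = joinpath(workdir, "hello.pdf")

# Write the XSL-FO source to disk.
write(fofile, fo_source)

# With Taro loaded, the conversion described above would then be:
# Taro.fo(fofile, pdffile)
```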


julia> testfile = joinpath(Pkg.dir(),"Taro","test","df-test.xlsx");

julia> Taro.readxl(testfile, "Sheet1", "B2:F10")
8x5 DataFrame:
            H1     H2     H3     H4    H5
[1,]       "a"    1.0    1.0    1.0 "a a"
[2,]       "b"    2.0    2.0    1.0 "b b"
[3,]       "c"     NA    3.0    0.0 "c c"
[4,]       "d"    4.0     NA     NA "d d"
[5,]       "e"    5.0    5.0    1.0 "e e"
[6,]        NA    6.0    6.0    1.0   " "
[7,]       "g"    7.0    7.0    1.0 "g g"
[8,]       "h"    8.0    8.0    1.0 "h h"

julia> Taro.readxl(testfile, "Sheet1", "B3:F10"; header=false)
8x5 DataFrame:
            x1     x2     x3     x4    x5
[1,]       "a"    1.0    1.0    1.0 "a a"
[2,]       "b"    2.0    2.0    1.0 "b b"
[3,]       "c"     NA    3.0    0.0 "c c"
[4,]       "d"    4.0     NA     NA "d d"
[5,]       "e"    5.0    5.0    1.0 "e e"
[6,]        NA    6.0    6.0    1.0   " "
[7,]       "g"    7.0    7.0    1.0 "g g"
[8,]       "h"    8.0    8.0    1.0 "h h"

julia> Taro.readxl(testfile, "Sheet1", "B3:F10"; header=false, nastrings=[" "])
8x5 DataFrame:
            x1     x2     x3     x4     x5
[1,]       "a"    1.0    1.0    1.0  "a a"
[2,]       "b"    2.0    2.0    1.0  "b b"
[3,]       "c"     NA    3.0    0.0  "c c"
[4,]       "d"    4.0     NA     NA  "d d"
[5,]       "e"    5.0    5.0    1.0  "e e"
[6,]        NA    6.0    6.0    1.0     NA
[7,]       "g"    7.0    7.0    1.0  "g g"
[8,]       "h"    8.0    8.0    1.0  "h h"

julia> testfile = joinpath(Pkg.dir(),"Taro","test","WhyJulia.docx")

julia> meta, body = Taro.extract(testfile);

julia> meta["Last-Save-Date"]

julia> typeof(body)
UTF8String (constructor with 1 method)

julia> length(body)
