github.com/danieldk/conllx-utils

CoNLL-X utilities


Keywords
cycle, conll, corpora, partitioning, treebanks
Install
go get github.com/danieldk/conllx-utils

Documentation

CoNLL-X Utilities

Introduction

This is a set of utilities to modify files in the CoNLL-X tabular files. The focus of this package is interoperability with TCF (Text Corpus Format). However, the majority of the utilities are also useful outside TCF. The package contains the following programs:

  • conllx-cleanup: replace most Unicode punctuation characters by by ASCII equivalents.
  • conllx-compare: compare sentences on particular columns.
  • conllx-cycle: find dependency trees with (non-self) cycles.
  • conllx-tdz-expandmorph: expand morphology features in TüBa-D/Z.
  • conllx-grep: print sentences that have a token matching a pattern.
  • conllx-merge: merge CoNLL-X files.
  • conllx-partition: partition a CoNLL-X file in N files.
  • conllx-sample: take a random sample from a CoNLL-X file.
  • conllx-text: convert CoNLL-X file to plain text.

Download

Downloads are available on the release page.

Usage

Executing a command gives usage information when --help is given as an argument.

Todo

A lot, including:

  • Partitioning is currently interleaving. Also support chunked partitioning.
  • Test with problematic inputs.
  • Merge specific columns from two CoNLL files.