org.bridgei2i:word2vec

A Clojure wrapper for the Medallia word2vecJava implementation


License
EPL-2.0

Documentation

clojure-word2vec

The word2vec tool by Mikolov et al. lets us create word vectors from a dataset of text. Unlike the binary present/absent representation used by a bag-of-words model, these word vectors can be compared with one another to see whether two words are related.
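
To make "comparing two words" concrete, here is a minimal sketch of cosine similarity, a common way to measure how close two word vectors are. It does not use this library: the three-dimensional vectors and the names `toy-vectors`, `dot`, `norm` and `cosine-similarity` are made up for illustration, and real word2vec vectors are learned from a corpus and have many more dimensions.

```clojure
;; Toy vectors, invented for illustration only (not produced by word2vec).
(def toy-vectors
  {"king"  [0.9 0.8 0.1]
   "queen" [0.8 0.9 0.1]
   "apple" [0.1 0.0 0.9]})

(defn dot
  "Sum of the element-wise products of vectors a and b."
  [a b]
  (reduce + (map * a b)))

(defn norm
  "Euclidean length of vector a."
  [a]
  (Math/sqrt (dot a a)))

(defn cosine-similarity
  "Cosine of the angle between a and b. Values near 1.0 mean the
  vectors point in similar directions; values near 0.0 mean they do not."
  [a b]
  (/ (dot a b) (* (norm a) (norm b))))

;; The related pair should score higher than the unrelated pair.
(def sim-king-queen
  (cosine-similarity (toy-vectors "king") (toy-vectors "queen")))

(def sim-king-apple
  (cosine-similarity (toy-vectors "king") (toy-vectors "apple")))
```

A bag-of-words representation could only say whether "king" and "queen" both occur in a document; with dense vectors, the score itself expresses how related the two words are.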

This is a Clojure wrapper around the Java implementation of word2vec, [available here](https://github.com/medallia/Word2VecJava).

Installation

To include word2vec, add the org.bridgei2i/word2vec dependency to the :dependencies section of your project.clj. The current coordinates are listed on the Clojars project page.

Usage

First, require clojure-word2vec.core in your namespace:

(ns clojure-word2vec.examples
  (:require [clojure-word2vec.core :refer :all]
            [clojure.java.io :as io]))

Download a text corpus and place it in the resources folder. Here we'll download James Joyce's Ulysses from Project Gutenberg.
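
If you would rather fetch the corpus from the REPL than by hand, the following is one possible sketch using only clojure.java.io. The helper name `download-corpus!` is hypothetical and not part of this library, and the Project Gutenberg URL and the destination path in the comment are assumptions that should be checked before use.

```clojure
(require '[clojure.java.io :as io])

(defn download-corpus!
  "Copies the contents of `url` into the file at `dest`, creating
  parent directories if needed. Returns `dest`."
  [url dest]
  (io/make-parents dest)
  (with-open [in (io/input-stream url)]
    (io/copy in (io/file dest)))
  dest)

(comment
  ;; Assumed URL and path, shown for illustration only.
  (download-corpus! "https://www.gutenberg.org/files/4300/4300-0.txt"
                    "resources/ulysses.txt"))
```

Once the file is on disk, its path is what gets passed to create-input-format: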

(def data
  (create-input-format "path/to/ulysses.txt"))

Create the model and train it using the default hyperparameters:

(def model (word2vec data))

Hyperparameters can be specified as keyword arguments to word2vec:

(def model (word2vec data :window-size 15))

Find the closest words to a given word:

(get-matches model "woman")

A longer introduction is available in the docs.

License

Copyright © 2015 Bridgei2i

Distributed under the Eclipse Public License version 1.0.