Ruby class for text analysis and tokenizing that uses regular expressions to parse webpage inner text, HTML, files and strings.


Keywords
class, regular-expression, ruby, ruby-gem, tag-cloud, text-analysis, tokenizer
License
GPL-3.0
Install
gem install texty -v 1.0.8

Documentation


Texty

A collection of text tokenizing tools for scanning files and webpages with custom regular expressions. Frequency hashes can be exported to JSON files and text files, or rendered as tag clouds using neopoly/tagcloud. (Requires Java if you want to export tag clouds.)

Installation

Add this line to your application's Gemfile:

gem 'texty'

And then execute:

$ bundle

Or install it yourself as:

$ gem install texty

Usage and documentation

Documentation is at http://www.rubydoc.info/gems/texty.

EXAMPLES:

test = Texty::TokenizeFile.new.file('index.html').to_json
  
# Tokenizes the file at the specified file path within the Core::DIR directory.
# Exports the resulting frequency hash to a JSON file.

test = Texty::TokenizeString.new.string('testing texty').to_json
  
# Tokenizes the given string.
# Exports the resulting frequency hash to a JSON file.

test = Texty::TokenizeFile.new
test.files.by_frequency_descending.top(5).display
  
# Tokenizes all files in the Core::DIR directory.
# Sorts the resulting hash by descending frequency order.
# Cuts the hash size to the top 5 words.
# Pretty prints the resulting frequency hash to the console.

test = Texty::TokenizeWeb.new
test.html('https://en.wikipedia.org/wiki/Iceland').to_json
  
# Tokenizes all of the HTML code on the Iceland Wikipedia page.
# Exports the resulting frequency hash to a JSON file.

test = Texty::TokenizeFile.new.ignore_docs.files.to_json
  
# Tokenizes all files in the Core::DIR directory except those in the Ignore::DOCUMENTS array.
# Exports the resulting frequency hash to a JSON file.

test = Texty::TokenizeWeb.new
test.html('https://www.reddit.com').char_count
  
# Returns the total number of characters in the HTML code of the Reddit homepage.

test = Texty::TokenizeWeb.new
test.html('https://www.reddit.com').char_frequency('e')
  
# Returns the total number of occurrences of the character 'e' in the HTML code of the Reddit homepage.

test = Texty::TokenizeWeb.new
test.html('https://www.reddit.com').char_frequency
  
# Returns a hash consisting of the total occurrences of each letter of the alphabet in the HTML
# code of the Reddit homepage.

test = Texty::TokenizeFile.new
test.only_extensions('.html','.css').ignore_chars
test.files.most_frequent(50).tag_cloud
  
# Tokenizes all .html and .css files in the Core::DIR directory.
# Ignores single characters and cuts the hash size to 50, then exports to a tag cloud.

test = Texty::TokenizeString.new
test.regex(/\w+/)
test.string('this is an example of a custom regex').to_h
  
# Uses the specified regex to tokenize the string, then returns the resulting hash.
# frequency = {'this' => 1, 'is' => 1, ...}

You can ignore words, files and extensions either by using the attr_writers ignore_words, ignore_files and ignore_extensions, or by calling the methods of the same name and passing the words/files/extensions as parameters:

test = Texty::TokenizeString.new
test.ignore_words = %w(the fat at)
# test.ignore_words('the','fat','at') is equivalent.
test.string('The fat cat spat at the fat bat at last - bad cat!')
test.display

This would display the following to the console:

~------
cat: 2
spat: 1
bat: 1
last: 1
bad: 1
~------
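
Both attr_writers and methods exist for files and extensions too. A minimal sketch, assuming ignore_files and ignore_extensions accept lists the same way ignore_words does (the file names and extensions here are illustrative):

test = Texty::TokenizeFile.new
test.ignore_files = %w(Gemfile Rakefile)
# test.ignore_files('Gemfile','Rakefile') is equivalent.
test.ignore_extensions = %w(.log .tmp)
test.files.to_h

# Tokenizes all files in the Core::DIR directory except the ignored files
# and any files with the ignored extensions, then returns the resulting hash.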

The same can be done to restrict tokenizing to specific files or extensions, by using the attr_writers only_files and only_extensions, or the methods of the same name.

test = Texty::TokenizeFile.new
test.only_files = %w(index.erb canvas.js index_helper.rb)
# test.only_files('index.erb','canvas.js','index_helper.rb') is equivalent.
test.files.to_h

For sorting by frequencies according to numerical order, you can do the following:

asc = Texty::TokenizeFile.new.file('index.html')
asc.by_frequency(:ascending)
asc.to_h

# Or
des = Texty::TokenizeFile.new.file('index.html')
des.by_frequency(:descending)
des.to_h

For sorting tokens according to alphabetical order, you can do the following:

asc = Texty::TokenizeFile.new.file('index.html')
asc.by_alpha(:ascending)
asc.to_h

# Or
des = Texty::TokenizeFile.new.file('index.html')
des.by_alpha(:descending)
des.to_h

Note that the innertext_deep and html_deep methods are recursive and can take a very long time when used on large websites.
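
For reference, a minimal sketch of the deep variants, assuming they take a starting URL the same way html does (the URL here is illustrative):

test = Texty::TokenizeWeb.new
test.innertext_deep('https://example.com').to_json

# Recursively tokenizes the inner text of the starting page and the pages it links to,
# then exports the resulting frequency hash to a JSON file.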

Tag clouds with neopoly/tagcloud

To create tag clouds, you will have to download the gem and the required files from this project's GitHub repository, as tag clouds are created by piping word frequencies and settings to Neopoly's tag cloud Java application.

An example of a tag cloud of the 1000 most frequent words (ignoring common words like 'the', 'and' etc.), generated from 20+ PHP GitHub repositories, parsing 6,977 PHP files.

Depending on the size of the frequency hash being displayed and the size of the image, you'll most likely have to experiment with the font size settings in the config.properties file.
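
For example, using the same property names shown in the configuration listing further down (the values here are illustrative), a dense cloud might need a wider font range:

minSize = 20
maxSize = 140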


Another example of a tag cloud of the 1000 most frequent words (ignoring common words and single characters like 'a','x' etc.), generated from 21 C# GitHub repositories, parsing 7,523 files.

Note that this example demonstrates the problem of using tag clouds as an accurate representation of data: 'ar' had 1,003,459 occurrences and 'public' had nearly half that amount at 599,993, yet 'ar' is only slightly larger. This is an issue for most tag cloud generation algorithms.

The ignore_pairs method would also have been useful when running this tokenization: it would have skipped all of the two-character words like 'ar', which are quite common in this example. More 'ignore' methods are explained in the documentation.


This also works with different languages. Note that this tag cloud was created with an older version of the gem, which had trouble with the casing of non-English Unicode characters: 'Ég' and 'ég' appear as two different words in the tag cloud rather than one. This is now fixed by using the 'unicode' gem.
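
A minimal sketch of the case folding involved, assuming the unicode gem's Unicode.downcase is used to normalize tokens before counting:

require 'unicode'

Unicode.downcase('Ég') == Unicode.downcase('ég')
# => true ('Ég' and 'ég' now count towards the same token)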

This example cloud is in Icelandic, consisting of the 900 most common words from the lyrics of approximately 25-30 Sigur Rós songs from various albums:

icelandic = Texty::TokenizeFile.new
icelandic.ignore_common.ignore_pairs
icelandic.file('lyrics.txt').most_frequent(900)
icelandic.tag_cloud(10,100,'#ffffff','#042230','#71a4bc','#2a7069','#115a87')
  
# Tokenizes the specified 'lyrics.txt' file and ignores common words and two-letter character pairs.
# Cuts the hash size down to the 900 most frequent words.
# Exports the resulting hash to a tag cloud with configuration properties:
minSize = 10                                 (minimum font size)
maxSize = 100                                (maximum font size)
background = #ffffff                         (background colour)
colors = #042230,#71a4bc,#2a7069,#115a87     (font colours)

As you can see, the ignore_common and ignore_pairs methods only ignore English common words and character pairs, although this can be configured by changing the constants CHAR_PAIRS and CHARS in the Ruby class.
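
A minimal sketch of that kind of override for Icelandic (the Texty::Ignore namespace is assumed from the Ignore::DOCUMENTS constant mentioned above, and the words are illustrative):

# Hypothetical: extend the ignore lists before tokenizing Icelandic text.
Texty::Ignore::CHARS.concat(%w(í á))
Texty::Ignore::CHAR_PAIRS.concat(%w(að er))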

Contributing

Bug reports and pull requests are welcome on GitHub at https://www.github.com/eonuonga/texty.