newloki/text-analysing-library

Little library for analysing text


Keywords
php, linguistic
License
XFree86-1.1

Documentation

#Analyse texts To compare different texts it is necessary to get key figures to compare the texts based on the given key figures. For example you can compare texts by the frequency of verbs and adjectives. There are many different possibilities for key figures to compare texts. This library should implement some algorithm to get different key figures. The main objective is to make each algorithm independend from the text and the language of the text and give an toolset to automatically compare texts.

#Why PHP for such a task? PHP is an interpreted language, this means it is not the fastest. But for analysing texts it is necessary to have power, especially for analysing many texts at once. So PHP is in reality not the best choice for performance critical tasks.

But then, why PHP?

I chose PHP for this library, because i'm not the worst in programming PHP. I'am faster in writing PHP and so I have more time to design a clean structure. But it is planned for the future to port this library to C++, but actually I'am not a professional in this language, so I wrote this library in PHP until she is finished and look after if it is useful to port it to another language. But you are free to begin to adapt concepts and implementation from this project and build your own library in another language.

##Unittests Each project is easier if there are unit test availeable. Therefore this library has tests too and this from the beginning. To check if all tests are still running, even if there where are some changes, you can run cd /path/to/project/Tests/ phpunit

The only dependence for tests is a runnable PHPUnit installation.

##todo's

  • Add tests for text class
  • Add tests for wordclass
  • Create a class representing each object in a text they are not a word (such as punctuations)
  • Add tests for interpunction class
  • Create a class representing each sentences in text
  • Add tests for sentences class
  • Implement cologne phonetics algorithm (word algorithm)
  • Add tests for cologne phonetics algorithm
  • Add mood parser, based on SentiWS
  • Add tests for mood parser

##Implemented algorithm ###Stemmer Stemmer are parsers they use the wordbase to get some key figures for each word. ####Porter The porter algorithm get the combination of consonants and vocals for each word. For example: The word linguistic is based on 6 consonants (C) and 4 vocals (V), his format is CVCCVVCCVC, that can be summerized to CVCVCV. This is one of the simplest phonetic analyses. It is usefull to compare the sound of words.

#Special thanks There are some peoples and institutions they have done great work before. Many of this work is only adapted and processed by me to bring it to work with this library. But nevertheless i would like to thank all these people they have published her work under a free licence to give other peoples the ability to work with there results. The following is a partial list of some projects they are used by this library and their licenses: *SentiWS (v 1.8b) under Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License