ml-nlp-tk

Tools for NLP: handling files and text


Keywords
nlp, file, io, string, text, mining
License
MIT
Install
pip install ml-nlp-tk==1.1.0

Documentation

🔨 ml-nlp-tk 🔧

Tools for NLP using Python

This repository is used to handle file I/O and string cleaning/parsing.

Usage

Install:

pip install ml_nlp_tk

Before using:

from ml_nlp_tk import *

Features

File Handling

get_folders_form_dir(path)

Arguments

  • path(String) : the directory path to list folders under

Returns

  • path(String)(generator) : generator of folder paths under the given path
    Examples
for i in get_folders_form_dir('./corpus/'):
    print(i)

'./corpus/kdd'
'./corpus/nycd'

get_files_from_dir(path)

Arguments

  • path(String) : the directory path to list files under

Returns

  • path(String)(generator) : generator of file paths under the given path
    Examples
for i in get_files_from_dir('./data/'):
    print(i)

'./data/kdd.txt'
'./data/nycd.txt'

read_dir_files_yield_lines(path)

Arguments

  • path(String) : the directory path; all files under it are read line by line

Returns

  • line(String)(generator) : generator of lines from the files under the given path
    Examples
for i in read_dir_files_yield_lines('./data/'):
    print(i)

'file1 sent1'
'file1 sent2'
...
'file2 sent1'
...

read_dir_files_into_lines(path)

Arguments

  • path(String) : the directory path; all files under it are read line by line

Returns

  • lines(String List) : all lines from the files under the given path, as a list
    Examples
i = read_dir_files_into_lines('./data/')
print(i)

['file1 sent1','file1 sent2'...'file2 sent1'...]

read_files_yield_lines(path)

Arguments

  • path(String) : the input file path

Returns

  • line(String)(generator) : generator of lines in the file at the given path
    Examples
for i in read_files_yield_lines('./data/kdd.txt'):
    print(i)

'sent1'
'sent2'
...

read_files_into_lines(path)

Arguments

  • path(String) : the input file path

Returns

  • lines(String List) : all lines of the file at the given path, as a list
    Examples
i = read_files_into_lines('./data/kdd.txt')
print(i)

['sent1','sent2'...]

create_new_dir_always(dirPath)

Replaces the old directory if it exists, otherwise creates a new one.
Arguments

  • dirPath(String) : dir location
    Examples
create_new_dir_always('./data/')

get_dir_with_notexist_create(dirPath)

Creates the directory if it does not exist.
Arguments

  • dirPath(String) : directory location you want to ensure exists

Returns

  • path(String) : the directory location, guaranteed to exist
    Examples
i = get_dir_with_notexist_create('./data/kdd')
print(i)

'./data/kdd'

write_json_to_file(json_str, loc)

Arguments

  • json_str(String) : JSON content as a string
  • loc(String) : output file location

Returns

  • path(String) : output file path
    Examples
i = write_json_to_file('{"sent":"hi"}', './data/kdd.json')
print(i)

"'./data/kdd.json'"

is_file_exist(path)

Arguments

  • path(String) : file location

Returns

  • result(Boolean) : whether the file exists; True if it does
    Examples
i = is_file_exist('./data/kdd.txt')
print(i)

True

is_dir_exist(file_dir)

Arguments

  • file_dir(String) : dir location

Returns

  • result(Boolean) : whether the directory exists; True if it does
    Examples
i = is_dir_exist('./data/kdd')
print(i)

False

download_file(url,save_dir)

Arguments

  • url(String) : download link
  • save_dir(String) : save location
    Returns
  • result(String) : location of the downloaded file
    Examples
i = download_file('https://raw.githubusercontent.com/voidful/voidful_blog/master/assets/post_src/nninmath_3/img1','./data/')
print(i)

'./data/img1'

Text cleaning/parsing

remove_httplink(string)

Removes HTTP links from the text.
Arguments

  • string(String) : a string that may contain HTTP links

Returns

  • result(String) : the string without any HTTP links

Examples

y = remove_httplink("http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗")
print(y)

今天天氣 晴朗
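
A minimal sketch of how this kind of URL stripping can be done with a regular expression; strip_http_links and its pattern are illustrative assumptions, not the library's actual implementation.

import re

def strip_http_links(text):
    # Illustrative: replace anything that looks like an http/https URL with a
    # single space, then trim. remove_httplink may use a different pattern.
    return re.sub(r'https?://[A-Za-z0-9./_-]+', ' ', text).strip()

print(strip_http_links("http://news.IN1802020028.htm 今天天氣http://news.we028.晴朗"))
# 今天天氣 晴朗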

split_lines_by_punc(lines)

Turns an array of lines into an array of sentences.
It splits each line on any punctuation.
Arguments

  • lines(String Array) : lines array

Returns

  • sentences(String Array) : all lines split on punctuation
    Examples
y = split_lines_by_punc(["你好啊.hello,me"])
print(y)

['你好啊', 'hello', 'me']

split_sentence_to_ngram(sentence)

Splits the sentence into as many n-grams as possible.

Be careful with sentence length; long sentences perform noticeably worse.

Arguments

  • sentence(String) : a string with no punctuation

Returns

  • ngrams(String Array) : ngrams array

Examples

split_sentence_to_ngram("加州旅館")

['加', '加州', '加州旅', '加州旅館', '州', '州旅', '州旅館', '旅', '旅館', '館']
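
Under the hood this amounts to enumerating every contiguous substring, so the number of n-grams grows quadratically with sentence length. A rough sketch of that enumeration (all_ngrams is a hypothetical helper, not part of the package):

def all_ngrams(sentence):
    # Every contiguous substring of the sentence: n*(n+1)/2 of them,
    # which is why long sentences get slow.
    ngrams = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            ngrams.append(sentence[start:end])
    return ngrams

print(all_ngrams("加州旅館"))
# ['加', '加州', '加州旅', '加州旅館', '州', '州旅', '州旅館', '旅', '旅館', '館']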

split_sentence_to_ngram_in_part(sentence)

Splits the sentence into n-grams grouped by each starting position.

Be careful with sentence length; long sentences perform noticeably worse.

Arguments

  • sentence(String) : a string with no punctuation

Returns

  • ngrams(Array) : 2D array with diff start in ngram

Examples

split_sentence_to_ngram_in_part("加州旅館")

[['加', '加州', '加州旅', '加州旅館'], ['州', '州旅', '州旅館'], ['旅', '旅館'], ['館']]

spilt_text_in_all_ways(sentence)

Tries to find every possible way of segmenting the sentence.
Arguments

  • sentence(String) : input sentence

Returns

  • seg list(String Array) : all segmentations in an array

Examples

spilt_text_in_all_ways("加州旅館")

['加 州 旅 館', '加 州 旅館', '加 州旅 館', '加 州旅館', '加州 旅館', '加州旅 館', '加州旅館']
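
Conceptually, each gap between characters is either a split point or not, so the number of candidate segmentations grows exponentially with sentence length. A rough recursive sketch of that idea (all_segmentations is illustrative; the package's own enumeration, ordering and filtering may differ):

def all_segmentations(sentence):
    # Each gap is either a split or not, so a sentence of length n
    # has up to 2**(n-1) segmentations.
    if len(sentence) <= 1:
        return [sentence]
    results = []
    for i in range(1, len(sentence)):
        head = sentence[:i]
        results.extend(head + ' ' + rest for rest in all_segmentations(sentence[i:]))
    results.append(sentence)  # the unsplit sentence itself
    return results

print(all_segmentations("加州旅館"))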

spilt_sentence_to_array(sentence,splitText=False)

Splits a sentence that mixes different languages into words.
Arguments

  • sentence(String) : input sentence
  • splitText(Boolean, optional) : if True, split Chinese into individual characters

Returns

  • segment array(String Array) : word array
    Examples
spilt_sentence_to_array('你好 are  u 可以')

['你好', 'are', 'u', '可以']

spilt_sentence_to_array('你好 are  u 可以',True)

['你', '好', 'are', 'u', '可', '以']

join_words_array_to_sentence(words_array)

Arguments

  • words_array(String Array) : input array

Returns

  • sentence(String) : output sentence
    Examples
join_words_array_to_sentence(['你好', 'are', "可以"])

你好are可以

passage_into_chunk(passage, chunk_size)

Splits a passage into chunks of a given size.
If part of a sentence would exceed the chunk size, the whole sentence is still kept in that chunk.
Arguments

  • passage(String) : input passage
  • chunk_size(int) : number of characters in one chunk

Returns

  • chunk array(String Array) : the passage split into chunks
    Examples
passage_into_chunk("xxxxxxxx\noo\nyyzz\ngggggg\nkkkk\n",10)

['xxxxxxxx\noo\n', 'yyzz\ngggggg\n']
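
A rough sketch of this kind of line-preserving chunking; chunk_by_lines is illustrative, and the library's handling of a trailing partial chunk may differ.

def chunk_by_lines(passage, chunk_size):
    # Accumulate whole lines; once a chunk reaches chunk_size characters,
    # start a new one, so no line is ever cut in the middle.
    chunks, current = [], ''
    for line in passage.splitlines(keepends=True):
        current += line
        if len(current) >= chunk_size:
            chunks.append(current)
            current = ''
    if current:
        chunks.append(current)  # trailing partial chunk kept in this sketch
    return chunks

print(chunk_by_lines("xxxxxxxx\noo\nyyzz\ngggggg\nkkkk\n", 10))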

is_all_english(text)

Arguments

  • text(String) : input text
    Returns
  • result(Boolean) : whether the text is all English
    Examples
is_all_english("1SGD")
is_all_english("1SG哦")

True
False

is_contain_number(text)

Arguments

  • text(String) : input text

Returns

  • result(Boolean) : whether the text contains a number
    Examples
is_contain_number("1SGD")
is_contain_number("SG哦")

True
False

is_contain_english(text)

Arguments

  • text(String) : input text
    Returns
  • result(Boolean) : whether the text contains English
    Examples
is_contain_english("1SGD")
is_contain_english("123哦")

True
False

is_list_contain_string(str, list)

Arguments

  • str(String) : the string to look for
  • list(String List) : the list of strings to search
    Returns
  • result(Boolean) : whether the string is a substring of any list item
    Examples
is_list_contain_string("a", ['a', 'dcd'])
is_list_contain_string("a", ['abcd', 'dcd'])
is_list_contain_string("a", ['bdc', 'dcd'])

True
True
False

full2half(text)

Arguments

  • text(String) : input string to convert to half-width

Returns

  • (String) : the half-width string

Examples

full2half(",,")

,,

half2full(text)

Arguments

  • text(String) : input string to convert to full-width

Returns

  • (String) : the full-width string
    Examples
half2full(",,")

，，
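
Both conversions rely on the standard Unicode relationship: full-width ASCII variants (U+FF01 through U+FF5E) sit exactly 0xFEE0 above their half-width counterparts, and the ideographic space U+3000 corresponds to an ordinary space. A minimal sketch of the half-width direction (to_halfwidth is illustrative, not the package function):

def to_halfwidth(text):
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                  # ideographic space -> ordinary space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:      # full-width ASCII variants
            code -= 0xFEE0
        out.append(chr(code))
    return ''.join(out)

print(to_halfwidth("，，"))
# ,,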

Vectorize

The vectorize functions are implemented following the paper:
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
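
For orientation, the four pooling mechanisms described in that paper can be sketched with numpy as below; this is only an illustration of the idea, not the package's implementation, and the window size of 3 is an arbitrary example value.

import numpy as np

def swem_pool(word_vectors, window=3):
    # word_vectors: (num_words, emb_size) matrix of pre-trained embeddings
    vecs = np.asarray(word_vectors)
    aver = vecs.mean(axis=0)               # SWEM-aver: average pooling
    maxp = vecs.max(axis=0)                # SWEM-max: max pooling per dimension
    concat = np.concatenate([aver, maxp])  # SWEM-concat: both, concatenated
    # SWEM-hier: average over each sliding window, then max over the windows
    window_means = [vecs[i:i + window].mean(axis=0)
                    for i in range(max(1, len(vecs) - window + 1))]
    hier = np.max(window_means, axis=0)
    return aver, maxp, concat, hier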

doc2vec_aver(pretrained_emb, emb_size, context)

average pooling
Arguments

  • pretrained_emb(object) : pre-trained word embeddings that can be indexed as pretrained_emb['word']
  • emb_size(int) : size of the pre-trained word embedding
  • context(list) : input document as a list; each item must be retrievable from pretrained_emb, e.g. pretrained_emb[context[0]]

Returns

  • document vector(list) : vectorized context

Examples

import gensim
import jieba
import ml_nlp_tk

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
ml_nlp_tk.doc2vec_aver(pretrain_wordvec, size, jieba.lcut(context))

doc2vec_max(pretrained_emb, emb_size, context)

max pooling in each dim
Arguments

  • pretrained_emb(object) : pre-trained word embeddings that can be indexed as pretrained_emb['word']
  • emb_size(int) : size of the pre-trained word embedding
  • context(list) : input document as a list; each item must be retrievable from pretrained_emb, e.g. pretrained_emb[context[0]]

Returns

  • document vector(list) : vectorized context
    Examples
import gensim
import jieba
import ml_nlp_tk

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
ml_nlp_tk.doc2vec_max(pretrain_wordvec, size, jieba.lcut(context))

doc2vec_concat(pretrained_emb, emb_size, context)

Concatenates the average-pooling and max-pooling results.
Arguments

  • pretrained_emb(object) : pre-trained word embeddings that can be indexed as pretrained_emb['word']
  • emb_size(int) : size of the pre-trained word embedding
  • context(list) : input document as a list; each item must be retrievable from pretrained_emb, e.g. pretrained_emb[context[0]]

Returns

  • document vector(list) : vectorized context
    Examples
import gensim
import jieba
import ml_nlp_tk

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
ml_nlp_tk.doc2vec_concat(pretrain_wordvec, size, jieba.lcut(context))

doc2vec_hier(pretrained_emb, emb_size, context, windows)

Average pooling over sliding windows, then max pooling over the window results.
Arguments

  • pretrained_emb(object) : pre-trained word embeddings that can be indexed as pretrained_emb['word']
  • emb_size(int) : size of the pre-trained word embedding
  • context(list) : input document as a list; each item must be retrievable from pretrained_emb, e.g. pretrained_emb[context[0]]
  • windows(int) : size of the sliding window

Returns

  • document vector(list) : vectorized context
    Examples
import gensim
import jieba
import ml_nlp_tk

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size
context = "測試文本哈哈哈"
ml_nlp_tk.doc2vec_hier(pretrain_wordvec, size, jieba.lcut(context), 3)

cosine_similarity(vector1, vector2)

Calculates the cosine similarity between two vectors.
Arguments

  • vector1, vector2(list) : the two vectors to compare

Returns

  • cos similarity(float) : similarity of the two vectors
    Examples
import gensim
import ml_nlp_tk

pretrain_wordvec = gensim.models.KeyedVectors.load_word2vec_format('wiki.vec', encoding='utf-8')
size = pretrain_wordvec.vector_size

input1 = ml_nlp_tk.doc2vec_concat(pretrain_wordvec, size, "DC")
input2 = ml_nlp_tk.doc2vec_concat(pretrain_wordvec, size, "漫威")
ml_nlp_tk.cosine_similarity(input1,input2)
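
The underlying formula is the usual cosine of the angle between the two vectors, (v1 · v2) / (|v1| * |v2|). A minimal numpy sketch (cos_sim is illustrative, not the package function):

import numpy as np

def cos_sim(v1, v2):
    # cos(theta) = dot product divided by the product of the norms
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(cos_sim([1, 0, 1], [1, 1, 0]))
# 0.5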

Random Utility

random_string(length)

Arguments

  • length(int) : length of the random string

Returns

  • randstr(String) : a random string of the given length, drawn from the characters "0123456789ABCDEF"
    Examples
random_string(10)

D6857CE0F4

random_string_with_timestamp(length)

Arguments

  • length(int) : length of the random string

Returns

  • randstr(String) : a random string whose length is length plus the timestamp length (10)
    Examples
random_string_with_timestamp(1)

1435474326D

random_value_in_array_form(array)

Returns a random value from a range or candidate list given in array form:
int, float : [min, max]
string : [candidate1, candidate2, ...]

Arguments

  • array(Array) : range or candidate list in array form

Returns

  • random result(depends on input) : a random value drawn according to the input
    Examples
# for string
y = random_value_in_array_form(["SGD","ADAM","XDA"])
print(y)

'ADAM'

# for int
y = random_value_in_array_form([1,12])
print(y)

4

# for float
y = random_value_in_array_form([0.01,1.00])
print(y)

0.34
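
A rough sketch of the dispatch this description implies; random_from_spec is illustrative, and the package may distinguish the cases differently.

import random

def random_from_spec(spec):
    # [min, max] of ints   -> random integer in that range
    # [min, max] of floats -> random float in that range
    # list of strings      -> one random candidate
    if len(spec) == 2 and all(isinstance(v, int) for v in spec):
        return random.randint(spec[0], spec[1])
    if len(spec) == 2 and all(isinstance(v, (int, float)) for v in spec):
        return random.uniform(spec[0], spec[1])
    return random.choice(spec)

print(random_from_spec(["SGD", "ADAM", "XDA"]))
print(random_from_spec([1, 12]))
print(random_from_spec([0.01, 1.00]))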