DeLune Python Object Storage and Search Engine
pip install delune==0.4b20
DeLune (formerly Wissen) is a simple full-text search engine and Python object storage (similar to the NoSQL document concept), written in Python for the logic and in C for the core index/search module.
I had studied an earlier version of Lucene through Lupy and CLucene, and I made my own search engine as an exercise.
Its file format, numeric compression algorithm and indexing process are quite similar to those of early Lucene versions (I don't know about recent versions at all). But the querying and result-fetching parts are built from my imagination. As a result it's entirely unorthodox and possibly inefficient (I am a typical nerd and work-alone programmer ;-)
DeLune is a kind of hybrid of a search engine and a NoSQL document database.
DeLune stores Python objects serialized with pickle and compressed, so if you use DeLune as a Python module, you can store and get documents directly.
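As a rough illustration of what pickle-compressed serializing means, the sketch below round-trips a document through pickle and zlib using only the standard library. It is not DeLune's actual storage code: the choice of zlib and the helper names `pack` and `unpack` are assumptions made for illustration.

```python
import pickle
import zlib

def pack(obj):
    # Serialize with pickle, then compress the bytes (zlib is an assumption).
    return zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

def unpack(blob):
    # Reverse the two steps: decompress, then unpickle.
    return pickle.loads(zlib.decompress(blob))

doc = ["violin sonata in c k.301", {"composer": "mozart", "birth": 1756}]
blob = pack(doc)
restored = unpack(blob)
```

Any picklable object goes through the same two steps, which is why the stored document does not need a fixed schema.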
DeLune may be useful when a delay of a few minutes is acceptable for insert, update and delete operations. For example, it is a good fit for legacy content or content generated by you rather than by your customers.
Like most full-text search engines, DeLune always and only appends data; it never modifies existing files. So insert, update and delete operations have a high disk-writing cost. Sometimes one small deletion may trigger massive disk writing for optimization (even though the cost of the deletion itself is very low).
Anyway, if you need real-time changes to your data, DO NOT USE DeLune, or complement it with another type of NoSQL store or an RDBMS.
DeLune supports storing multiple documents for polymorphic use cases like listing and detail views. This is inefficient in storage usage, but helps reading performance.
DeLune's search mechanism is similar to the DNA-RNA-Protein working model, which can be translated into 'Index File - Temporary Small Replication Buffer - Query Result'.
It also provides a RESTful API for storing, indexing and searching through Skitai App Engine.
DeLune contains a C extension, so it needs a C compiler.
pip install delune
On POSIX, some packages might be required,
apt-get install build-essential zlib1g-dev
All text field values should be of str type; otherwise the encoding should be specified.
Here's an example that indexes only one document.
import delune
from delune.bin import indexer
dln = delune.connect ("/home/delune")
col = dln.create ("mycol", ["mycol"], 1)
with col.documents as D:
  song = "violin sonata in c k.301"
  birth = 1756
  d = D.new (100) # document ID
  d.content ([song, {'composer': 'mozart', 'birth': birth}])
  d.field ("default", song, delune.TEXT)
  d.field ("birth", birth, delune.INT16)
  d.snippet (song)
  D.add (d)
  D.commit ()
indexer.index (dln)
result = col.documents.query ("violin")
Result will be like this:
{
  'code': 200,
  'time': 0,
  'total': 1,
  'result': [
    [
      ['violin sonata in c k.301', {'composer': 'mozart', 'birth': 1756}], # content
      '<b>violin</b> sonata in c k.301', # auto snippet
      14, 0, 0, 0 # additional info
    ]
  ],
  'sorted': [None, 0],
  'regex': 'violin|violins',
}
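Reading such a result is plain list and dictionary handling. The sketch below works on a hand-written copy of a result shaped like the one above instead of a live query, and the variable names are only illustrative. The meaning of the trailing numbers is not specified here, so they are kept as an opaque list.

```python
result = {
    "code": 200,
    "time": 0,
    "total": 1,
    "result": [
        [
            ["violin sonata in c k.301", {"composer": "mozart", "birth": 1756}],
            "<b>violin</b> sonata in c k.301",
            14, 0, 0, 0,
        ]
    ],
    "sorted": [None, 0],
    "regex": "violin|violins",
}

rows = []
for content, snippet, *extra in result["result"]:
    title, meta = content  # the stored content was a [song, dict] list
    rows.append((title, meta["composer"], snippet, extra))
```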
A DeLune document can be any picklable Python object; DeLune stores documents in a compressed (zipped) pickle format. But if you want to fetch partial documents by key or index, the document skeleton should be a list or dictionary, while the inner data can still be any picklable objects. I think if your data needs many more read operations than writes/updates, DeLune can serve as both a simple schemaless data storage and a full-text search engine. DeLune's RESTful API and replication are described later in this document.
Indexing needs no configuration, but searching must be configured first. The reason is that DeLune allocates memory per thread for searching and classifying at initialization.
delune.configure (
  numthread,
  logger,
  io_buf_size = 4096,
  mem_limit = 256
)
Finally, when your app terminates, call shutdown.
delune.shutdown ()
Although the quick start used the indexer.index method to index documents, delune also provides the indexer as a backend service.
# one-time indexing in console
delune index -v /home/delune
# indexing every 5 minutes in console
delune index -v /home/delune -i 300
# indexing every 5 minutes as daemon
delune index -dv /home/delune -i 300
# restart the indexing daemon, indexing every 5 minutes
delune index -v /home/delune -i 300 restart
# stop indexing daemon
delune index stop
# status of indexing daemon
delune index status
import delune
dln = delune.connect ("/home/delune")
As a result, delune checks and creates directories.
/home/delune/delune/config
/home/delune/delune/collections
col = dln.create ("mycol", ["mycol"], 1)
col.save ()
As a result, the collection is created like this.
/home/delune/delune/config/mycol : JSON file containing configuration options
/home/delune/delune/collections/mycol
If you use multiple disks to increase the speed or capacity of a collection,
first mount your disks under /home/delune/delune/collections,
/home/delune/delune/collections/hdd0
/home/delune/delune/collections/hdd1
Then create collection.
col = dln.create ("mycol", ["hdd0/mycol", "hdd1/mycol"], 1)
col.save ()
As a result, collection will be created like this.
/home/delune/delune/collections/hdd0/mycol
/home/delune/delune/collections/hdd1/mycol
The segment files of your collection will be created in these directories randomly (taking the free space of each disk into account).
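One way to picture a random choice that considers free space is a weighted random pick where each directory's weight is its free space. The sketch below only illustrates that idea and is not DeLune's actual placement code; `pick_data_dir` and its `free_space` parameter are hypothetical.

```python
import random
import shutil

def pick_data_dir(dirs, free_space=None, rng=random):
    # free_space maps a directory to its free bytes; defaults to the real disk.
    if free_space is None:
        free_space = lambda d: shutil.disk_usage(d).free
    weights = [max(free_space(d), 0) for d in dirs]
    if sum(weights) == 0:
        raise OSError("no free space left in any data directory")
    # Directories with more free space are proportionally more likely.
    return rng.choices(dirs, weights=weights, k=1)[0]

# Usage with made-up free-space figures instead of real disks.
fake_free = {"hdd0/mycol": 0, "hdd1/mycol": 500_000_000}
chosen = pick_data_dir(list(fake_free), free_space=fake_free.get)
```

With these figures the full disk has weight zero, so the second directory is always chosen.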
There are 2 ways to configure your collections.
First, use the col.config dictionary.
col = dln.create ("mycol", ["mycol"], version = 1)
col.config
>> {
  'name': 'mycol',
  'data_dir': ["mycol"],
  "version": 1,
  'analyzer': {
    "max_terms": 3000,
    "stem_level": 1,
    "strip_html": 0,
    "make_lower_case": 1,
    "ngram": 1,
    "biword": 0,
    "stopwords_case_sensitive": 1,
    "ngram_no_space": 0,
    "contains_alpha_only": 0,
    "stopwords": [],
    "endwords": [],
  },
  'indexer': {
    'optimize': 1,
    'force_merge': 0,
    'max_memory': 10000000,
    'max_segments': 10,
    'lazy_merge': (0.3, 0.5),
  },
  'searcher': {
    'max_result': 2000,
    'num_query_cache': 1000
  }
}
Just change the values as you want.
Another way is to set options when creating the collection.
col = dln.create (
  "mycol",
  ["mycol"],
  version = 1,
  max_terms = 5000,
  strip_html = 1,
  force_merge = 1,
  max_result = 10000
)
For more details on analyzer, indexer and searcher options, see the Low Level API section.
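Both styles end up in the same nested dictionary. The sketch below shows how flat keyword options could be routed into the analyzer, indexer and searcher sections. `apply_options` is a hypothetical helper written for illustration (DeLune's own code may work differently), and the config is a trimmed copy of the one shown above.

```python
import copy

def apply_options(config, **options):
    # Return a copy of config with each option written into the section
    # (analyzer, indexer or searcher) that already defines that key.
    updated = copy.deepcopy(config)
    sections = ("analyzer", "indexer", "searcher")
    for key, value in options.items():
        for section in sections:
            if key in updated[section]:
                updated[section][key] = value
                break
        else:
            raise KeyError(f"unknown option: {key}")
    return updated

config = {
    "name": "mycol",
    "data_dir": ["mycol"],
    "version": 1,
    "analyzer": {"max_terms": 3000, "strip_html": 0},
    "indexer": {"force_merge": 0, "max_segments": 10},
    "searcher": {"max_result": 2000, "num_query_cache": 1000},
}
new_config = apply_options(
    config, max_terms=5000, strip_html=1, force_merge=1, max_result=10000
)
```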
with col.documents as D:
  for code, title in my_codes:
    d = D.new (code) # code is used as document ID
    d.content ([code, title])
    d.field ("code", code, delune.STRING)
    d.field ("default", title, delune.TEXT)
    D.add (d)
  D.commit ()
It is important to understand that the above operation doesn't actually make any change to your collection. It just saves your documents at:
/home/delune/delune/collections/mycol/.que/
If you commit multiple times, a queue file will be created for each commit.
d = D.new ()
Note that in this case, where the document has no ID, you cannot update/modify your documents.
If your document has an ID,
with col.documents as D:
  for code, title in my_codes:
    D.delete (code)
  D.commit ()
Else,
with col.documents as D:
  D.qdelete ("milk")
  D.commit ()
All documents containing 'milk' will be deleted.
If you run the delune indexer, these saved documents will be indexed automatically. Or you can index manually,
delune index -v /home/delune
dln = delune.connect ("/home/delune")
col = dln.load ("mycol")
with col.documents as D:
  D.search ("violin")
col.documents.truncate ("mycol")
col.documents.commit ()
col.drop (include_data = True)
You can set up a remote delune resource.
New in version 0.12.14
You can use RESTful API with Skitai App Engine for your remote machine.
First of all, you need to install skitai by,
pip3 install -U skitai
Then copy the code below and save it as app.py.
import os
import delune
import skitai
if __name__ == "__main__":
  pref = skitai.pref ()
  pref.use_reloader = 1
  pref.debug = 1
  config = pref.config
  config.resource_dir = "/home/delune"
  skitai.trackers ('delune:collection')
  skitai.mount ("/", delune, "app", pref)
  skitai.run (
    workers = 2,
    threads = 4,
    port = 5000
  )
And run,
python3 app.py
Then you can access http://<your IP address>:5000/v1
For more detail about API, see app.py.
And as with local use, you should run the indexer,
delune index -dv /home/delune -i 300
This will index committed documents every 5 minutes.
It is exactly the same as the local API except for the connect parameter. The parameter should start with "http://" or "https://" and end with a version string like "v1".
dln = delune.connect ("http://192.168.0.200:5000/v1")
col = dln.create ("mycol", ["mycol"], 1)
col.save ()
...
Note that you no longer need to run the indexer in the background on your local machine.
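The rule for the connect parameter (starts with "http://" or "https://" and ends with a version string) can be checked before connecting. The sketch below is a hypothetical helper, not part of DeLune, and it assumes the version string always has the form "v" followed by digits, as in "v1".

```python
import re
from urllib.parse import urlparse

def is_remote_target(target):
    # True if target looks like "http(s)://host[:port]/.../v<N>".
    parsed = urlparse(target)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return re.fullmatch(r"v\d+", last_segment) is not None

remote = is_remote_target("http://192.168.0.200:5000/v1")
local = is_remote_target("/home/delune")
```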
You can run replica server for distributed search or backup.
# replicate every 5 minutes from http://192.168.0.200/v1
delune replicate -o http://192.168.0.200/v1 -i 300
As a result, all remote delune resources will be replicated with exactly the same directory structure.
Before you test DeLune, you should know some limitations.
from delune.lib import logger
logger.screen_logger ()
# it will create the file '/var/log/delune.log', rotated on a daily basis
logger.rotate_logger ("/var/log", "delune", "daily")
An analyzer is needed by the TEXT and TERM types.
Basic usage is:
analyzer = delune.standard_analyzer (
  max_term = 8,
  numthread = 1,
  ngram = True, # True or False
  stem_level = 1, # 0, 1 or 2 (2 is only applied to English)
  make_lower_case = True, # True or False
  stopwords_case_sensitive = True, # True or False
  ngram_no_space = False, # True or False
  strip_html = False, # True or False
  contains_alpha_only = False, # True or False
  stopwords = [] # [word, ...]
)
DeLune has several stemmers and n-gram methods for international languages, which can be used this way:
analyzer = delune.standard_analyzer (ngram = True, stem_level = 1)
col = delune.collection ("./col", delune.CREATE, analyzer)
indexer = col.get_indexer ()
document.field ("default", song, delune.TEXT, lang = "en")
All stemmers except the English one can be obtained from IR Multilingual Resources at UniNE.
- ar: Arabic
- de: German
- en: English
- es: Spanish
- fi: Finnish
- fr: French
- hu: Hungarian
- it: Italian
- pt: Portuguese
- sv: Swedish
If ngram is set to True, these languages will be indexed with bi-grams.
- cn: Chinese
- ja: Japanese
- ko: Korean
Also note that if a word contains only alphabetic characters, the English stemmer will be used.
For other languages, the English stemmer will be used if every character is alphabetic. And if ngram is set to True, words containing multibyte characters will be indexed with tri-grams.
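To make the bi-gram and tri-gram terminology concrete, the sketch below splits a word into overlapping character n-grams. It is a generic illustration in pure Python, not DeLune's C tokenizer. The function name `char_ngrams` and the handling of short words are assumptions.

```python
def char_ngrams(word, n=2):
    # Overlapping character n-grams; a word no longer than n is kept whole.
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

korean = "바이올린"   # "violin" in Korean, a bi-gram language in the list above
russian = "скрипка"  # "violin" in Russian, a multibyte word in an unlisted language
bigrams = char_ngrams(korean, 2)
trigrams = char_ngrams(russian, 3)
```

Each n-gram becomes an index term, so a query can match any substring of at least n characters without needing word boundaries.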
Methods Spec
- analyzer.index (document, lang)
- analyzer.freq (document, lang)
- analyzer.stem (document, lang)
- analyzer.count_stopwords (document, lang)
A collection manages index files, segments and properties.
col = delune.collection (
  indexdir = [dirs],
  mode = [ CREATE | READ | APPEND ],
  analyzer = None,
  logger = None
)
A collection has 2 major classes: indexer and searcher.
To search documents, it's necessary to index the text and build an inverted index for fast term queries.
indexer = col.get_indexer (
  max_segments = 10, # int
  force_merge = False, # True or False
  max_memory = 10000000, # 10Mb
  optimize = True # True or False
)
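For readers new to the term, an inverted index maps each term to the documents that contain it, so a term query becomes a dictionary lookup instead of a scan. The sketch below is a generic in-memory illustration with whitespace tokenizing. It says nothing about DeLune's on-disk segment format, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the sorted list of document IDs that contain it.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    100: "violin sonata in c k.301",
    101: "piano sonata in a k.331",
}
index = build_inverted_index(docs)
hits = index.get("sonata", [])
```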
To add a document to the indexer, create a document object:
document = delune.document ()
DeLune handles three objects (the returned content, the snippet and the searchable fields) as completely separate things with no relationship between them.
DeLune serializes the returned content with pickle, so you can set any pickle-serializable object.
document.content ({"userid": "hansroh", "preference": {"notification": "email", ...}})
or
document.content ([32768, "This is sample ..."])
For saving multiple contents,
document.content ({"userid": "hansroh", "preference": {"notification": "email", ...}})
document.content ([32768, "This is sample ..."])
You can select one of these at query time using the nthdoc=0 or 1 parameter.
This field should be unicode/utf8 encoded bytes.
document.snippet ("This is sample...")
A document also receives searchable fields:
document.field (name, value, ftype = delune.TEXT, lang = "un", encoding = None)
document.field ("default", "violin sonata in c k.301", delune.TEXT, "en")
document.field ("composer", "wolfgang amadeus mozart", delune.TEXT, "en")
document.field ("lastname", "mozart", delune.STRING)
document.field ("birth", 1756, delune.INT16)
document.field ("genre", "01011111", delune.BIT8)
document.field ("home", "50.665629/8.048906", delune.COORD6)
Available field types are:
- TEXT: analyzable full-text, result-not-sortable
- TERM: analyzable full-text, but position data is not indexed, so phrases can't be searched, result-not-sortable
- STRING: exact string match, like nation codes, result-not-sortable
- LIST: comma-separated STRING, result-not-sortable
- FNUM: formatted number; the value should be int or float and the format parameter is required. The format is "digit.digit": the number of digits of the integer part (zero padded) and the number of digits of the fractional part. It makes range search efficient.
- COORDn, n=4,6,8 decimal precision: comma-separated string 'latitude,longitude'; latitude and longitude should be floats in the ranges -90 ~ 90 and -180 ~ 180. n is the precision of the coordinates: n=4 is 10m radius precision, 6 is 1m and 8 is 10cm. result-sortable
- BITn, n=8,16,24,32,40,48,56,64: bitwise operation, a bit-marked string of n bits is required, result-sortable
- INTn, n=8,16,24,32,40,48,56,64: range, int required, result-sortable
Note1: Make sure COORD, INT and BIT fields are present in every document even if they have no value, because these types depend on the sequence ID of the indexed document. If a field has no value, set its value to None; do NOT omit the field.
Note2: FNUM 100.12345 with format="5.3" is internally converted into "00100.123", and a negative value becomes "-00100.123". MAKE SURE your values are within -99999.999 and 99999.999.
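The FNUM conversion in Note2 can be reproduced with ordinary string formatting. The sketch below is only an illustration of the "digit.digit" format: `format_fnum` is a hypothetical helper, and rounding (rather than truncating) the fractional part is an assumption.

```python
def format_fnum(value, fmt):
    # fmt is "I.F": I integer digits (zero padded) and F fractional digits.
    int_digits, frac_digits = (int(part) for part in fmt.split("."))
    width = int_digits + (1 + frac_digits if frac_digits else 0)
    text = f"{abs(value):0{width}.{frac_digits}f}"
    if len(text) > width:
        raise ValueError(f"{value} does not fit format {fmt!r}")
    return ("-" if value < 0 else "") + text

a = format_fnum(100.12345, "5.3")
b = format_fnum(-100.12345, "5.3")
```

Because every non-negative value has the same width, comparing the strings gives the same order as comparing the numbers, which is what makes a range search over string terms possible.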
Repeat add_document as you need and close indexer.
for ...:
  document = delune.document ()
  ...
  indexer.add_document (document)
indexer.close ()
If searchers using this collection run in another process or thread, they are automatically reloaded within a few seconds to apply the changed index.
To run a searcher, you should call delune.configure () first and then create the searcher.
searcher = col.get_searcher (
  max_result = 2000,
  num_query_cache = 200
)
Query is simple:
searcher.query (
  qs,
  offset = 0,
  fetch = 10,
  sort = "tfidf",
  summary = 30,
  lang = "un"
)
For deleting indexed document:
searcher.delete (qs)
All matching documents will be deleted immediately. And if searchers using this collection run in another process or thread, these searchers are automatically reloaded within a few seconds.
Finally, close searcher.
searcher.close ()
violin composer:mozart birth:1700~1800
search 'violin' in default field, 'mozart' in composer field and search range between 1700, 1800 in birth field
violin allcomposer:wolfgang mozart
search 'violin' in default field and any terms after allcomposer will be searched in composer field
violin -sonata birth2:1700~1800
birth2 is between '1700' and '1800'
violin -sonata birth:~1800
does not contain sonata in default field
violin -composer:mozart
does not contain mozart in composer field
violin or piano genre:00001101/all
matches when the 5th, 6th and 8th bits are all 1; /any and /none are also available
violin or ((piano composer:mozart) genre:00001101/any)
supports unlimited nesting of '()' and 'or' operators
(violin or ((allcomposer:mozart wolfgang) -amadeus)) sonata (genre:00001101/none home:50.6656,8.0489~10000)
search home location coordinate (50.6656, 8.0489) within 10 Km
"violin sonata" genre:00001101/none home:50.6656/8.0489~10
search exact phrase "violin sonata"
"violin^3 piano" -composer:"ludwig van beethoven"
search loose phrase "violin piano" within 3 terms
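The /all, /any and /none modifiers on BIT fields behave like bitmask tests. The sketch below shows the three tests on bit strings in pure Python. It illustrates the semantics described above rather than DeLune's C implementation, and `bits_match` is a hypothetical name.

```python
def bits_match(value, mask, mode="all"):
    # value and mask are equal-length strings of '0' and '1'.
    if len(value) != len(mask):
        raise ValueError("value and mask must have the same length")
    v, m = int(value, 2), int(mask, 2)
    if mode == "all":
        return (v & m) == m   # every bit set in mask is set in value
    if mode == "any":
        return (v & m) != 0   # at least one masked bit is set
    if mode == "none":
        return (v & m) == 0   # no masked bit is set
    raise ValueError(f"unknown mode: {mode}")

genre = "01011111"  # the genre value used in the field example
matches_all = bits_match(genre, "00001101", "all")
matches_none = bits_match(genre, "00001101", "none")
```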
Upgrade libraries
pip3 install -U skitai quests delune
Then restructure directories
DELUNE_ROOT="/home/delune"
mkdir "$DELUNE_ROOT/delune"
mv "$DELUNE_ROOT/models/.config" "$DELUNE_ROOT/delune/config"
mv "$DELUNE_ROOT/models" "$DELUNE_ROOT/delune/collections"
Edit all your configs and remove models/ from your data_dir option.
"data_dir": ["models/mycols"]
=> "data_dir": ["mycols"]
If you use the RESTful API service, remove index or mirror related code lines from your app launch script.
Finally, run indexer.
delune index -dv /home/delune -i 300
0.4 (June 2, 2018)
- officially ceased developing naivebayes classifier & learner
- integrated local and remote indexing and searching APIs
- directory structure is NOT compatible with version 0.3x
0.3 (Sep 15, 2017)
- fix wildcard & range search
- fix snippet thing
- add stem API
- add index field aliasing to document
- add string range searching, add new field type: ZFn
- add multiple documents storing feature. as a result, DeLune can only read (not write) Wissen collections
0.2 (Sep 14, 2017)
- fix minor bugs
0.1 (Sep 13, 2017)
- change package name from Wissen to DeLune
0.13
- fix using lock
- add truncate collection API
- fix updating document
- change replicating way to use sticky session connection with origin server
- fix file creation mode on posix
- fix using lock with multiple workers
- change wissen.document method names
- fix index queue file locking
0.12
- add biword arg to standard_analyzer
- change export package name from appack to package
- add Skito-Saddle app
- fix analyzer.count_stopwords return value
- change development status to Alpha
- add wissen.assign(alias, searcher/classifier) and query(alias), guess(alias)
- fix threads count and memory allocation
- add example for Skitai App Engine app to manual
0.11
- fix HTML strip and segment merging etc.
- add MULTIPATH classifier
- add learner.optimize ()
- make learner.build & learner.train efficient
0.10 - change version format, remove all str*_s ()
0.9 - support Python 3.x
0.8 - change license from BSD to GPL V3