Landmark Machine Learning
Unsupervised Learning
import json
from landmark_ml.learning import RuleLearnerAllSlots
page_dir = '~/tmp/html_pages/'
rules = RuleLearnerAllSlots.run(page_dir)
# Pretty-print the learned rules with sorted keys and 2-space indentation
print(json.dumps(json.loads(rules.toJson()), sort_keys=True, indent=2, separators=(',', ': ')))
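The pretty-printing step above can be tried on its own, without the library installed. This is a minimal sketch: `sample_rules_json` is a made-up stand-in for the string returned by `rules.toJson()`, and its field names are assumptions, not the actual rule schema.

```python
import json

# Hypothetical stand-in for rules.toJson(); the real rule schema may differ.
sample_rules_json = '[{"name": "slot_1", "begin_regex": "<b>", "end_regex": "</b>"}]'


def pretty(json_text):
    """Re-serialize a JSON string with sorted keys and 2-space indentation."""
    return json.dumps(json.loads(json_text), sort_keys=True, indent=2,
                      separators=(',', ': '))


formatted = pretty(sample_rules_json)
print(formatted)
```

Sorting the keys keeps the output stable between runs, which makes it easier to diff the rules learned from two different page sets.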
Clustering
On HTML directory alone
python -m landmark_ml.learning.PageClusterer [HTML_DIRECTORY]
On HTML with CDR directory and apply extractions
python -m landmark_ml.runclustering -d directory_above_html [OPTIONAL_SINGLE_SITE]
On HTML with CDR directory and apply extractions and copy to landmark-ui
python -m landmark_ml.runclustering -d directory_above_html -o optional_webapp_projects_dir [OPTIONAL_SINGLE_SITE]
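When the clustering step has to be scripted, the command lines above can be assembled in Python. The sketch below only builds the argument list. The helper name `clustering_command` and its parameter names are hypothetical; the module name and the `-d` and `-o` flags are taken from the commands above.

```python
import sys


def clustering_command(directory, webapp_projects_dir=None, single_site=None):
    """Build the argv list for `python -m landmark_ml.runclustering`.

    Hypothetical helper: only the module name and the -d / -o flags
    come from the README; everything else is illustrative.
    """
    cmd = [sys.executable, "-m", "landmark_ml.runclustering", "-d", directory]
    if webapp_projects_dir is not None:
        # Optional output directory for the web app projects
        cmd += ["-o", webapp_projects_dir]
    if single_site is not None:
        # Optional positional argument restricting the run to one site
        cmd.append(single_site)
    return cmd


cmd = clustering_command("directory_above_html", single_site="example_site")
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.call(cmd)` from the standard library. Using `sys.executable` runs the module with the same interpreter that has `landmark_ml` installed.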