page_clustering
Release 0.0.1

Online k-means clustering of web pages

Keywords: crawler, scrapy, scrapely, web, data-science
License: Other
Install: pip install page_clustering==0.0.1

Description

A simple algorithm for clustering web pages, implemented as a wrapper around KMeans. Each web page is converted to a vector in which every entry is the count of a given tag and class attribute. The dimension of the vectors changes as new pages with previously unseen tags or class attributes arrive. Simple outlier detection is also available and enabled by default; it rejects web pages that are highly unlikely to belong to any cluster.
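To make the vector representation concrete, here is a minimal sketch of counting tag and class pairs with a vocabulary that grows as new pairs appear. It uses only the Python standard library; `TagClassCounter`, `Vocabulary`, and `pad` are hypothetical names chosen for illustration and are not part of page_clustering, whose actual feature extraction may differ.

```python
from collections import Counter
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Counts every (tag, class) pair seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # A tag without a class attribute is counted under the empty class.
        for cls in classes or [""]:
            self.counts[(tag, cls)] += 1


class Vocabulary:
    """Maps each (tag, class) pair to a vector index, growing on demand."""

    def __init__(self):
        self.index = {}

    def vectorize(self, html):
        parser = TagClassCounter()
        parser.feed(html)
        parser.close()
        # Unseen pairs get the next free index, so the dimension grows.
        for key in parser.counts:
            self.index.setdefault(key, len(self.index))
        vec = [0] * len(self.index)
        for key, n in parser.counts.items():
            vec[self.index[key]] = n
        return vec


def pad(vec, dim):
    """Zero-pads an older, shorter vector to the current dimension."""
    return vec + [0] * (dim - len(vec))
```

Vectors built before the vocabulary grew are shorter than later ones, so in this sketch they are zero-padded with `pad` before being compared.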

Usage

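import page_clustering

# Cluster pages into 5 groups
clt = page_clustering.OnlineKMeans(n_clusters=5)

# Feed an initial batch; `pages` must have been obtained beforehand
for page in pages:
    clt.add_page(page)

# Classify a page that was not part of the batch
y = clt.classify(new_page)

# The model is online: more pages can be added after classifying
for page in more_pages:
    clt.add_page(page)
y = clt.classify(yet_another_page)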
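For intuition about the add-then-classify pattern above, here is a minimal sketch of a generic sequential (online) k-means update on fixed-length vectors. It is not the package's implementation: `TinyOnlineKMeans`, the `max_distance` threshold, and the `-1` outlier label are all assumptions made for illustration.

```python
import math


class TinyOnlineKMeans:
    """Sequential k-means: each new point nudges its nearest centroid."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centroids = []  # one list of floats per cluster
        self.counts = []     # number of points assigned to each cluster

    @staticmethod
    def _distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def _nearest(self, vec):
        distances = [self._distance(vec, c) for c in self.centroids]
        best = min(range(len(distances)), key=distances.__getitem__)
        return best, distances[best]

    def add(self, vec):
        vec = [float(x) for x in vec]
        # The first n_clusters points seed the centroids.
        if len(self.centroids) < self.n_clusters:
            self.centroids.append(vec)
            self.counts.append(1)
            return
        k, _ = self._nearest(vec)
        self.counts[k] += 1
        rate = 1.0 / self.counts[k]
        # Move the centroid toward the point; this keeps a running mean.
        self.centroids[k] = [
            c + rate * (x - c) for c, x in zip(self.centroids[k], vec)
        ]

    def classify(self, vec, max_distance=None):
        k, d = self._nearest(vec)
        # Assumed outlier rule: reject points far from every centroid.
        if max_distance is not None and d > max_distance:
            return -1
        return k
```

Because each update touches only one centroid, the model can alternate freely between learning from new points and classifying, which is the behaviour the usage example relies on.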

Demo

wget -r --quota=5M https://news.ycombinator.com
python demo.py news.ycombinator.com

Dependencies: 0
Dependent packages: 0
Dependent repositories: 19
Total releases: 1
Latest release: May 31, 2016
First release: May 31, 2016
Stars: 29
Forks: 6
Watchers: 4
Contributors: 1
Repository size: 468 KB
SourceRank: 8

Releases

0.0.1: May 31, 2016

Last synced: 2021-02-19 05:34:27 UTC