This module detects which links on a page are pagination links. You manually mark at least one link on the page as a pagination link; the algorithm then uses label propagation, with a Gaussian kernel over Levenshtein edit distance as the similarity measure, to determine which other links are also pagination links. A small demo is included to show how to use and test it.
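To make the idea concrete, here is a minimal sketch of how such a scheme could work. It is not the module's actual code: the function names (`levenshtein`, `propagate_scores`) and the parameters `sigma`, `alpha` and `n_iter` are hypothetical, and the update rule is a damped variant of label propagation (closer to what is often called label spreading), chosen so that a single positive seed still produces a meaningful ranking. Only numpy is assumed.

```python
import numpy as np


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def propagate_scores(urls, seed_indices, sigma=2.0, alpha=0.9, n_iter=100):
    """Score each URL by its similarity to the manually marked seeds.

    sigma, alpha and n_iter are illustrative defaults (assumptions),
    not values taken from the module.
    """
    # Pairwise edit distances between all URLs.
    dist = np.array([[levenshtein(u, v) for v in urls] for u in urls],
                    dtype=float)
    # Gaussian kernel: small edit distance gives a weight close to 1.
    weights = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)  # no self-loops
    # Row-normalise so each link averages over its neighbours.
    row_sums = weights.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0
    transition = weights / row_sums
    # Seeds carry label 1; everything else starts at 0.
    seeds = np.zeros(len(urls))
    seeds[list(seed_indices)] = 1.0
    scores = seeds.copy()
    for _ in range(n_iter):
        # Spread scores to neighbours, then pull back toward the seeds.
        scores = alpha * (transition @ scores) + (1.0 - alpha) * seeds
    return scores


urls = [
    "https://news.ycombinator.com/news?p=2",  # marked by hand
    "https://news.ycombinator.com/news?p=3",
    "https://news.ycombinator.com/newest",
    "https://news.ycombinator.com/jobs",
    "https://news.ycombinator.com/ask",
]
scores = propagate_scores(urls, seed_indices=[0])
for url, score in sorted(zip(urls, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {url}")
```

In this sketch the link that differs from the seed by a single character ranks above the unrelated navigation links; turning the scores into a yes/no decision would still require a threshold, which is left out here.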
python setup.py develop
Dependencies: numpy and scrapely
pip install -r requirements.txt
cd tests
python demo.py https://news.ycombinator.com
Enter link to follow (tab autocompletes): news?<TAB>
Enter link to follow (tab autocompletes): https://news.ycombinator.com/news?p=2 <RET>
0) Quit
1) Enter link directly
2) https://news.ycombinator.com/news?p=3
3) https://news.ycombinator.com/news
4) https://news.ycombinator.com/newest
5) https://news.ycombinator.com/jobs
6) https://news.ycombinator.com/ask
Select link to follow: 2 <RET>