ptt.py

a ptt crawler

Keywords: ptt, crawler
License: MIT

Installation

$ pip install ptt.py

Or you can clone this repo directly:

$ git clone https://github.com/dokelung/ptt.py.git

Quick Start

Command Line Interface

If you have already installed ptt.py with pip, just use the ptt command to start crawling:

# $ ptt -b <board name> -i <start_page> <end_page> -f <output format>  

# dump article jsons in the latest page of board "gossiping"
# -1 means last page and -2 means second last page...
$ ptt -b gossiping -i -1 -1 -f json

Or you can run the library module as a script:

$ python -m ptt -b gossiping -i -1 -1 -f json 

If you want to use it directly from the source, just cd to the repo directory and run the script ptt.py:

$ cd ptt.py
$ python ptt.py -b gossiping -i -1 -1 -f json

Use the -h option to get more details:

$ ptt -h
usage: ptt [-h] -b BOARD [-d DIR] [-f {json,csv}]
           (-a ID | -i START END START END)

ptt.py

optional arguments:
  -h, --help            show this help message and exit
  -b BOARD, --board BOARD
                        board name
  -d DIR, --destination DIR
                        destination
  -f {json,csv}, --format {json,csv}
                        output format (default: json)
  -a ID, --aid ID       article id
  -i START END START END, --index START END START END
                        start/end index

Use as Library

from ptt import Board

# fetch the latest article list page of the board "gossiping"
latest_page = Board('gossiping')

# iterate the article summaries on the page, skip removed articles,
# then read each article and print it as a JSON string
for summary in latest_page:
    if summary.isremoved:
        continue
    article = summary.read()
    print(article.dump_json())

Reference

class ArticleSummary: alias to Summary

example

# iterate all article summaries from specified board
for summary in Board('gossiping'):
    print(summary.title, summary.url)
    
# read article from summary
if not summary.isremoved:
    article = summary.read()

attribute

Attr Name | Type | Note | Example
title | str | title of Article | '[協尋] 12月18日 晚上9點前後 高雄市明誠路 鼎'
category | str | string between '[' and ']' in title | '協尋'
url | str | url of the Article without ptt domain name | '/bbs/Gossiping/M.1513683634.A.2F5.html'
board | str | board name of Article | 'Gossiping'
aid | str | Article ID | 'M.1513683634.A.2F5'
date | str | string of Article date | '7/24'
author | str | string of Article author (only author id) | 'jokerndmc'
score | str | string of score, or '爆' for score > 99, or 'X' for score < 0 | '20'
mark | str | Article mark | 'M'
removeinfo | str | remove information written in title | '(本文已被刪除) [SamuraiJack]'
isreply | bool | True if 'Re:' in title else False | True
isforward | bool | True if 'Fw:' in title else False | False
isremoved | bool | True if Article has been removed else False | True

API

API Name | Return Type | Note
read() | ArticlePage | return the corresponding ArticlePage if it is not removed
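
As a quick sketch of how these attributes and read() fit together (the board name and the score check below are only illustrative assumptions, not part of the library):

from ptt import Board

# print basic info of every non-removed summary and read the full
# article only for highly pushed ('爆') ones
for summary in Board('gossiping'):
    if summary.isremoved:
        continue
    print(summary.aid, summary.date, summary.author, summary.title)
    if summary.score == '爆':
        article = summary.read()   # returns the corresponding ArticlePage
        print(article.title)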

class ArticleListPage

example

# get page-20 of specified board
lst_page = ArticleListPage.from_board('gossiping', 20)

# you can also use the alias "Board" instead
lst_page = Board('gossiping', 20)

# get the newest page of the specified board by giving no page index
lst_page = Board('gossiping')

# iterate all article summaries of an article list page
for summary in lst_page:
    print(summary)
    
# get first article summary
summary = lst_page.get_article_summary(0)

attribute

Attr Name | Type | Note | Example
board | str | board name of this ArticleListPage | 'Gossiping'
idx | int | index of ArticleListPage | 29585
related_urls['board'] | str | latest article list page url of the board | '/bbs/Gossiping/index.html'
related_urls['man'] | str | digest area (精華區) url of the board | '/man/Gossiping/index.html'
related_urls['previous'] | str | previous article list page url (None if not exists) | '/bbs/Gossiping/index29584.html'
related_urls['next'] | str | next article list page url (None if not exists) | None
related_urls['oldest'] | str | oldest article list page url | '/bbs/Gossiping/index1.html'
related_urls['newest'] | str | newest article list page url | '/bbs/Gossiping/index.html'
previous | ArticleListPage | ArticleListPage of related_urls['previous'] |
next | ArticleListPage | ArticleListPage of related_urls['next'] |
oldest | ArticleListPage | ArticleListPage of related_urls['oldest'] |
newest | ArticleListPage | ArticleListPage of related_urls['newest'] |
article_summaries | generator of ArticleSummary | ArticleSummary generator of this ArticleListPage |

API

API Name | Return Type | Note
get_article_summary(index) | ArticleSummary | get ArticleSummary by given index
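
The related page attributes make it easy to walk back through a board's history. A minimal sketch, assuming an arbitrary board and an arbitrary number of pages to visit:

from ptt import Board

# start from the newest article list page and walk 3 pages back,
# collecting every article title along the way
page = Board('gossiping')
titles = []
for _ in range(3):
    for summary in page.article_summaries:
        titles.append(summary.title)
    if page.related_urls['previous'] is None:
        break                      # already at the oldest page
    page = page.previous           # ArticleListPage of the previous index

print(len(titles))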

class ArticlePage: alias to Article

example

# get article by board name and aid
article = ArticlePage.from_board_aid('gossiping', 'M.1513683634.A.2F5')

# you can also use the alias "Article" instead
article = Article.from_board_aid('gossiping', 'M.1513683634.A.2F5')

# dump json string with aid and author
string = article.dump_json('aid', 'author')
print(string)

attribute

Attr Name | Type | Note | Example
title | str | title of Article | '[協尋] 12月19日 晚上9點前後 高雄市明誠路 鼎'
category | str | string between '[' and ']' in title | '協尋'
board | str | board name of Article | 'Gossiping'
aid | str | Article ID | 'M.1513683634.A.2F5'
date | str | string of Article date | 'Tue Feb 16 20:15:23 2016'
datetime | datetime | datetime format of date | datetime.datetime(2017, 12, 19, 19, 40, 31)
author | str | string of Article author | 'jokerndmc (小人物)'
ip | str | author's ip | '115.82.209.7'
signature | str | signature string of the author |
pushes | Pushes | Pushes is a class which collects all pushes in the article |
content | str | main content of the article in HTML format |
isreply | bool | True if 'Re:' in title else False | False
isforward | bool | True if 'Fw:' in title else False | False

API

API Name | Return Type | Note
dump_json(*attrs, flat=False) | str | dump json string with specified attrs
dump_csv(*attrs, delimiter=',') | str | dump csv string with specified attrs
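
A small sketch of the two dump methods (the selected attributes and the delimiter below are arbitrary choices for illustration):

from ptt import Article

article = Article.from_board_aid('gossiping', 'M.1513683634.A.2F5')

# dump only the chosen attributes as a JSON string
print(article.dump_json('board', 'aid', 'title', 'author'))

# dump the same attributes as CSV with a custom delimiter
print(article.dump_csv('board', 'aid', 'title', 'author', delimiter='|'))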

class Pushes

example

# get simple expression (list of dictionary) of a Pushes
>>> pushes.simple_expression
[...
 {'content': '幫高調,雖然機會不高但還是希望可以找到!',
  'ipdatetime': '12/19 22:22',
  'type': '',
  'user': 'aquami'},
 {'content': '住附近 突然發現有鼎吉路',
  'ipdatetime': '12/21 01:34',
  'type': '',
  'user': 'sh981215'},
  ...
]

attribute

Attr Name | Type | Note | Example
article | ArticlePage | ArticlePage of these pushes |
msgs | list | list of Msg (self-defined namedtuple) |
count['all'] | int | total msg count in Pushes | 38
count['score'] | int | positive msg count - negative msg count | 23
count['like'] | int | positive msg count | 26
count['boo'] | int | negative msg count | 3
count['neutral'] | int | neutral msg count | 9
simple_expression | list | list of dictionaries which are used to model every Msg |

namedtuple Msg

collections.namedtuple('Msg', ['type', 'user', 'content', 'ipdatetime'])
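
A minimal sketch of walking through the pushes of an article (reusing the board and aid from the earlier example, purely for illustration):

from ptt import Article

article = Article.from_board_aid('gossiping', 'M.1513683634.A.2F5')
pushes = article.pushes

# the count dict summarizes the push statistics
print(pushes.count['all'], pushes.count['like'], pushes.count['boo'])

# each element of msgs is a Msg namedtuple
for msg in pushes.msgs:
    print(msg.type, msg.user, msg.content, msg.ipdatetime)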

Self-defined Exceptions

  • Error: Base class for all exceptions raised by this module
  • InValidBeautifulSoupTag: Cannot create an ArticleSummary because of an invalid bs tag
  • NoGivenURLForPage: Given None or an empty url when building a page
  • PageNotFound: Cannot fetch the page by the given url
  • ArtitcleIsRemove: Cannot read a removed article from an ArticleSummary (see the sketch below)
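
A minimal sketch of defensive crawling with these exceptions, assuming the classes listed above are importable directly from the ptt module:

from ptt import Board, PageNotFound, ArtitcleIsRemove

try:
    for summary in Board('gossiping'):
        try:
            article = summary.read()
        except ArtitcleIsRemove:
            # the article has been deleted; skip it instead of crashing
            continue
        print(article.title)
except PageNotFound:
    # the board index page could not be fetched
    print('failed to fetch the article list page')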

Utility Functions

  • parse_std_url: Parse a standard ptt url
  • parse_title: Parse an article title to get more info
  • parse_username: Parse a user name string to get the user account and nickname

example

>>> parse_std_url('https://www.ptt.cc/bbs/Gossiping/M.1512057611.A.16B.html')
('https://www.ptt.cc/bbs', 'Gossiping', 'M.1512057611.A.16B')

>>> parse_title('Re: [問卦] 睡覺到底可不可以穿襪子')
('問卦', True, False)

>>> parse_username('seabox (歐陽盒盒)')
('seabox', '歐陽盒盒')