PDF Client
This is a Python client library that provides a more pleasant experience with pdf-server.
Install
The latest release is pdf-client-2.0.1, released on 10 Sep 2016.
To install the package using pip:
$ pip install pdf-client
Quickstart
Config & API Wrappers
First, create a configuration file config.json in your project directory:
{
"base_url": "http://<YOUR_DOMAIN>/api/v1/",
"auth_class": "HTTPBasicAuth",
"auth_args": ["<MY_USERNAME>", "<MY_PASSWORD>"]
}
Then, create a main.py and try the following:
from pdf_client import config
from pdf_client.api import book
config.load_from_file('config.json') # load configuration
book_list = book.List().execute() # send HTTP request to RESTful API
print(book_list)
The result will be a list of dicts, for example:
[{'title': 'Sample Book', 'root_section': 779, 'id': 2}]
Multithreaded Text Processing
First, make sure you have a configuration file as described in the previous section.
Then, create a class that extends TextProcessor:
from pdf_client.multithread.processor import TextProcessor
class ExampleProcessor(TextProcessor):
    def process(self, text, section_id):
        # do some stuff here
        # ...
        return text
Finally, create a worker and start:
from pdf_client import config
from pdf_client.multithread.worker import MultiThreadWorker
from demo import ExampleProcessor # the one we just created
config.load_from_file('config.json') # load configuration
processor = ExampleProcessor() # the text processor we just created
worker = MultiThreadWorker(processor=processor,
                           book=3, # book id
                           threads=10, # total no. of threads
                           create=True, # create a new version
                           name='New Version') # new version name
completed = worker.start() # start the worker and return an iterator
for future in completed: # the iterator will loop in the order of completion
    section_id, text = future.result()
    print('Completed section ID: {id}'.format(id=section_id))
Module pdf_client.config
load_from_file()
This method loads a JSON file as the global configuration.
There are three fields in the JSON file:
- base_url: the base URL of the pdf server (e.g. "http://127.0.0.1:8000/api/v1/" if you are running the Django server on localhost)
- auth_class (optional): one of the classes in the requests.auth package: HTTPBasicAuth, HTTPDigestAuth or HTTPProxyAuth
- auth_args (optional): arguments to be passed into the constructor of auth_class
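As a rough illustration of these three fields, the sketch below checks a configuration dictionary before it is saved as config.json. The helper check_config is hypothetical and is not part of pdf_client; it only assumes that base_url is required (since the other two fields are marked optional) and that auth_class must be one of the three names listed above.

```python
import json

# Names listed in this README as valid values for "auth_class".
ALLOWED_AUTH_CLASSES = {"HTTPBasicAuth", "HTTPDigestAuth", "HTTPProxyAuth"}


def check_config(cfg):
    """Hypothetical helper (not part of pdf_client): return a list of problems."""
    problems = []
    if not cfg.get("base_url"):
        problems.append("base_url is required")
    auth_class = cfg.get("auth_class")
    if auth_class is not None and auth_class not in ALLOWED_AUTH_CLASSES:
        problems.append("unknown auth_class: {0}".format(auth_class))
    return problems


# Placeholder credentials, for illustration only.
example = json.loads("""
{
    "base_url": "http://127.0.0.1:8000/api/v1/",
    "auth_class": "HTTPBasicAuth",
    "auth_args": ["alice", "secret"]
}
""")
print(check_config(example))
```

An empty list means no problems were found under these assumptions.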
Update Configuration at Runtime
The following methods can be called at runtime to load or update the global configuration:
- set_base_url(base_url)
- set_auth(auth): see "Authentication" in the documentation of the requests package
- set_basic_auth(username, password), as a shortcut
Class MultiThreadWorker
Workflow
Text processing jobs are all quite similar. They always do the following:
- Specify a root node to start with, then recursively,
- Get the immediate text from a section with a specific "source" version ID
- Process/digest the text
- Optionally, post the processed text back to the server, with a specific "target" version ID.
Hence, the MultiThreadWorker in the pdf_client.multithread.worker module implements the typical workflow and handles all the details for you.
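The steps above can be sketched with the standard library alone. This is not the MultiThreadWorker implementation: the dictionaries CHILDREN, SOURCE_TEXT and TARGET_TEXT are in-memory stand-ins for the server, and walk, job and run are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# In-memory stand-ins for the server (assumptions for illustration only).
CHILDREN = {1: [2, 3], 2: [4], 3: [], 4: []}   # section id -> child ids
SOURCE_TEXT = {1: "root", 2: "intro", 3: "body", 4: "notes"}
TARGET_TEXT = {}                                # texts "posted" back


def walk(section_id):
    """Yield section ids in pre-order, starting from the root node."""
    yield section_id
    for child in CHILDREN[section_id]:
        for descendant in walk(child):
            yield descendant


def job(section_id, process, post):
    text = SOURCE_TEXT[section_id]        # get the immediate text
    result = process(text, section_id)    # process/digest the text
    if post:
        TARGET_TEXT[section_id] = result  # optionally post it back
    return section_id, result


def run(root, process, threads=4, post=True):
    """Submit one job per section and collect results as they complete."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(job, sid, process, post) for sid in walk(root)]
        return dict(future.result() for future in as_completed(futures))


results = run(1, lambda text, section_id: text.upper())
print(sorted(results.items()))
```

Passing post=False to run corresponds to the read-only case, where nothing is written to the target.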
Parameters in Constructor
Parameter | Type | Explanation |
---|---|---|
processor | TextProcessor | An object of any subclass of TextProcessor |
threads | int | Total number of threads used in parallel. The default value is 10. |
book | int | The book id to start as the root section. If this parameter is left blank, the section parameter must be present. |
section | int | The section id to start as the root section. If book is present, this parameter will be ignored. |
source | int | The source version id, used by the worker to get the text content from the server. If blank, the first version id ("Raw" by default) returned in the /version/list/ API will be used. |
target | int | The target version id, used by the worker to post the processed texts back to the server. If left blank, the worker will check the create parameter. If both this parameter and create are blank, the worker is in read-only mode, and no text will be posted to the server. |
create | bool | Set it to True to create a version on the server as target. If target is present, this parameter will be ignored, and no version will be created on the server. |
name | string | The name of the version to be created. This parameter must be present together with create. |
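The precedence rules in the table (book over section, target over create) can be restated as a small function. This is a hypothetical restatement for illustration, not the library's own code; the name resolve and the returned labels are assumptions. It raises ValueError for invalid combinations purely to keep the sketch short, which is not how the worker itself reports them.

```python
def resolve(book=None, section=None, target=None, create=False, name=None):
    """Hypothetical restatement of the parameter table; not pdf_client code."""
    if book is None and section is None:
        raise ValueError("either book or section must be given")
    # book takes precedence: section is ignored when book is present
    root = ("book", book) if book is not None else ("section", section)

    if target is not None:
        mode = "write to existing version"  # create is ignored
    elif create:
        if not name:
            raise ValueError("name must be given together with create")
        mode = "create new version"
    else:
        mode = "read-only"                  # nothing is posted to the server
    return root, mode


print(resolve(book=3))
print(resolve(section=680, create=True, name="My New Version"))
```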
start()
Call this method to start the worker immediately.
It will return an iterator over concurrent.futures.Future objects, in the order of completion. Under the hood, it returns the result of concurrent.futures.as_completed(). Let's consider this example:
...
completed = worker.start()
for future in completed:
    section_id, text = future.result()
    print("Completed {id}".format(id=section_id))
Although the text processing jobs are submitted to the worker threads by pre-order tree traversal, they may complete in a different order, since network IO may take different amounts of time. So the for loop here yields whichever Future completes first, and blocks while waiting for the threads until one has completed.
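The difference between submission order and completion order can be seen with the standard library alone. In this sketch, slow_job is a hypothetical stand-in for a text processing job, and time.sleep stands in for network IO of varying duration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_job(section_id, delay):
    time.sleep(delay)  # stands in for network IO of varying duration
    return section_id, "text of section {0}".format(section_id)


# Submitted in the order 1, 2, 3, but section 1 takes the longest.
delays = {1: 0.3, 2: 0.1, 3: 0.2}
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(slow_job, sid, delay)
               for sid, delay in sorted(delays.items())]
    # as_completed() yields each Future as soon as it finishes
    completion_order = [future.result()[0] for future in as_completed(futures)]

print(completion_order)
```

The printed order depends on timing, so it will usually differ from the submission order here, with the shortest job first.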
However, if you pass incorrect values or invalid combinations of parameters to the constructor, an empty list will be returned. To check for this, you can either enable logging (see the section below), or do this:
completed = worker.start()
if not completed:
    # do something to handle error
    pass
else:
    # do things as expected
    pass
Logging
Since the text processing may take a really long time, you can use logging to monitor the progress or record anything that went wrong.
DON'T use print() to show the progress while looping through the Futures. Instead, enable INFO-level logging of this module:
import logging
import pdf_client.multithread.worker
logging.basicConfig()
logging.getLogger(pdf_client.multithread.worker.__name__).setLevel(logging.INFO)
The logs generated in this module are pretty comprehensive. They contain the progress and any exception that occurred.
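For long runs it can help to add timestamps to the log lines. The sketch below is one way to do that with the standard logging module, used instead of logging.basicConfig(). It assumes the worker logs under its module name, which is what the __name__ lookup above implies; the format string is only a suggestion.

```python
import logging

# A module's __name__ is its dotted import path, so the logger can be
# looked up by that string without importing the module first.
logger = logging.getLogger("pdf_client.multithread.worker")
logger.setLevel(logging.INFO)

# Attach a handler that prefixes each message with a timestamp.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)
```

If logging.basicConfig() is also called, each message will be printed twice, once by this handler and once by the root handler.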
Package pdf_client.api
This package provides wrapper modules for the APIs in the pdf-server project.
For any parameters that appear in the URL, pass them to the constructor in the same order. For example,
from pdf_client.api import version, book, content
version.Detail(3) # ==> /version/detail/3/
book.Toc(2) # ==> /book/toc/2/
content.Immediate(3260, 3) # ==> /content/immediate/3260/3/
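To make the mapping concrete, the sketch below joins a module name, an endpoint name and the positional arguments into a path. The helper build_path is hypothetical and only mirrors the pattern visible in the examples above; it is not how pdf_client builds its URLs internally.

```python
def build_path(module, endpoint, *args):
    """Hypothetical helper: join positional arguments into a URL path."""
    parts = [module, endpoint] + [str(arg) for arg in args]
    return "/" + "/".join(parts) + "/"


print(build_path("version", "detail", 3))
print(build_path("book", "toc", 2))
print(build_path("content", "immediate", 3260, 3))
```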
For JSON data in the request message body, use keyword arguments in the constructor. For example,
version.Update(5, name="Another Name")
Please refer to the Appendix for more examples of the keyword arguments.
Then, by calling execute() on the object you created, the library will send the HTTP request to the RESTful server. It will return a Python list or dict object (or just a string for the content module) if the RESTful API returns anything.
If the operation is successful and the API does not return anything (e.g. deleting a version), it will return True. If anything goes wrong within execute() (any exception, error, or the API returning a different status code than expected), it will return False.
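Because a successful call may legitimately return an empty list or an empty string, a plain truthiness test can mistake success for failure. The sketch below compares against False by identity instead; interpret is a hypothetical helper, not part of pdf_client.

```python
def interpret(result):
    """Hypothetical helper: classify the value returned by execute()."""
    if result is False:
        return "failed"
    if result is True:
        return "succeeded, no data returned"
    return "succeeded with data"


# An empty list is falsy, but it is still a successful response.
print(interpret([]))
print(interpret(False))
```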
Appendix
More Examples on pdf_client.api Modules
Create a version:
from pdf_client import config
from pdf_client.api import version
config.load_from_file('config.json')
version.Create(name="My New Version").execute()
Post text to the server:
from pdf_client import config
from pdf_client.api import content
config.load_from_file('config.json')
content.Post(3235, 20, text="my new text").execute()
More Examples on MultiThreadWorker Constructor
Get the entire book in the default version:
worker = MultiThreadWorker(processor=ExampleProcessor(), book=3)
Or start from a specific section and then all its descendants:
worker = MultiThreadWorker(processor=ExampleProcessor(), section=680)
Specify how many threads to use:
worker = MultiThreadWorker(processor=ExampleProcessor(),
                           book=4,
                           threads=20)
Read a specific version:
worker = MultiThreadWorker(processor=ExampleProcessor(),
                           book=4,
                           source=18)
Write the processed texts to a specific version:
worker = MultiThreadWorker(processor=ExampleProcessor(),
                           book=4,
                           target=18)
Create a version to save the processed texts:
worker = MultiThreadWorker(processor=ExampleProcessor(),
                           book=4,
                           create=True,
                           name="My New Version")
Put everything together:
worker = MultiThreadWorker(processor=ExampleProcessor(),
                           threads=20,
                           book=4,
                           source=6,
                           create=True,
                           name="My New Version")
Or maybe:
worker = MultiThreadWorker(processor=ExampleProcessor(),
                           threads=20,
                           section=576,
                           source=12,
                           target=20)
Template: Multithreaded Text Processing
import re
import logging
import pdf_client
from pdf_client import config
from pdf_client.multithread.worker import MultiThreadWorker
from pdf_client.multithread.processor import TextProcessor
class MyProcessor(TextProcessor):
    def process(self, text, section_id):
        # do something here
        return text
def main():
    # enable INFO level logging
    logging.basicConfig()
    logging.getLogger(pdf_client.multithread.worker.__name__).setLevel(logging.INFO)
    # load global config
    config.load_from_file('config.json')
    worker = MultiThreadWorker(processor=MyProcessor(), book=3, create=True, name="Another version!")
    completed = worker.start()
    for future in completed:
        section_id, text = future.result()
        # handle the results
if __name__ == '__main__':
    main()
Template: Download a Whole Book
from pdf_client import config
from pdf_client.api import book
from pdf_client.api import content
def main():
    # load global config
    config.load_from_file('config.json')
    # specify what to download
    book_id = 1
    version_id = 4
    # get book details
    book_data = book.Detail(book_id).execute()
    root_section = book_data['root_section']
    # get aggregate text
    text = content.Aggregate(root_section, version_id).execute()
    # save to a file named after the book title
    with open(book_data['title'] + '.txt', 'w+') as file:
        file.write(text)
if __name__ == '__main__':
    main()
Template: Download All Sections of a Book in Individual Files
import logging
import pdf_client
from pdf_client import config
from pdf_client.multithread.processor import TextProcessor
from pdf_client.multithread.worker import MultiThreadWorker
class SectionDownloader(TextProcessor):
    def process(self, text, section_id):
        with open("{id}.txt".format(id=section_id), 'w+') as file:
            file.write(text)
        return text
def main():
    # enable INFO level logging
    logging.basicConfig()
    logging.getLogger(pdf_client.multithread.worker.__name__).setLevel(logging.INFO)
    # load global config
    config.load_from_file('config.json')
    # create a worker and start
    worker = MultiThreadWorker(processor=SectionDownloader(), book=3, threads=20)
    worker.start()
if __name__ == '__main__':
    main()