A microframework for parsing websites. Its simple interface and flexibility help you get started quickly. It can be used as a Django application or on its own.


Keywords
django, parsing, webdevelopment, website
License
BSD-3-Clause
Install
pip install djparsing==0.3.7

Documentation

djparsing

Convenient parsing; can be used as a Django application or independently.

Parses the first data block (the newest by date) and saves it to the specified table.

Requirements

  • python (3.4, 3.5, 3.6, 3.7)
  • django (1.8, 1.9, 1.10, 1.11)
  • lxml (4.1.1)
  • cssselect (1.0.1)
  • Pillow

Quick start

Install:
    pip install djparsing
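Usage:
from django.db import models
# HTMLField is not part of Django; the source does not say where it comes from
# (a rich-text package such as django-tinymce provides one).

class MyModel(models.Model):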
    title = models.CharField(max_length=256)
    text = HTMLField(blank=True)
    source = models.URLField(max_length=255, blank=True)
    create_date = models.DateTimeField(auto_now_add=True)
    img = models.ImageField(blank=True, null=True)
    flag = models.BooleanField(default=False)
    
from djparsing.core import Parser, init
from djparsing import data

@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()
    source = data.AttrCSSSelect(attr_data='href')  # or pass it positionally: AttrCSSSelect('href')
    title = data.TextCSSSelect()
    img = data.ImgCSSSelect('src')  # 'src' is the default, so ImgCSSSelect() also works
    
    class Meta:
        coincidence = ['Python', 'Django', 'Питон', 'ML']  # words that the data must match to be accepted
        field_coincidence = 'title'  # field that is checked against the word list
    
pars_obj = MyParserClass(
        body='.content-list__item',
        text='.post__body_crop > .post__text',
        source='a.post__title_link',
        title='a.post__title_link',
        img='.post__body_crop > .post__text img',
        url='http://site/'
        )
pars_obj.run()
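
The coincidence and field_coincidence options in Meta describe a keyword filter. The source does not show how djparsing applies it, so the following is only a minimal sketch of one plausible reading: an item is kept when the chosen field contains at least one of the words. The function name matches_coincidence, the dict-shaped items, and the case-insensitive substring match are assumptions made for illustration, not the library's API.

```python
def matches_coincidence(item, words, field):
    """Return True if item[field] contains at least one of the words.

    Hypothetical helper: case-insensitive substring matching is an
    assumption, not documented djparsing behaviour.
    """
    value = str(item.get(field, "")).lower()
    return any(word.lower() in value for word in words)


# Sample data invented for the sketch.
items = [
    {"title": "Django 2.0 released"},
    {"title": "Gardening tips"},
]
words = ["Python", "Django", "Питон", "ML"]

# Keep only the items whose title mentions one of the words.
kept = [item for item in items if matches_coincidence(item, words, "title")]
```

With plain substring matching, a short word such as 'ML' also matches inside 'HTML'; a word-boundary match would be stricter.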

Note: the model used for saving data can also be specified in the Meta class.

class Meta:
    model = MyModel  # the @init decorator is then not needed

Inheritance:

class MyChildParserClass(MyParserClass):
    my_field = data.TextCSSSelect()

Note: the fields of the base class and its Meta class are inherited, and both can be overridden.

If you need to set an additional field value in the database:

pars_obj.add_field['flag'] = True
pars_obj.run()  # to print the data to the log instead of saving it to the database,
                # pass log=True -> run(log=True) and override the method log_output(self, result)
Example:
@init(model='MyModel', app='my_app')
class MyParserClass(Parser):
    body = data.BodyCSSSelect()
    text = data.TextContentCSSSelect()
    
    def log_output(self, result):  # if you do not override this method, the result is printed to the terminal
        pass  # process the result further here

If you do not want to write data to the database or output to the log, use:

data = pars_obj.run(create=False)

Note: create=False is also required when you are not working with Django and a database.

Attributes

start_url
# Sets the selector for the link that leads to the page holding the data block.
# This is needed when the list of objects is on one page and the data for each object is on another.
BodyCSSSelect(start_url='div.description.float-right > a')

Note: the element matched by start_url must carry the URL in its href attribute.

add_domain
# If the URL in the attribute does not include a domain,
# set add_domain=True (the default is False).

BodyCSSSelect(start_url='div.description.float-right > a', add_domain=True)
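
The effect of add_domain can be pictured with the standard library. This is a sketch of the general technique of resolving a relative href against the page URL, not a description of djparsing's internals; the helper name absolutize and the example URLs are made up.

```python
from urllib.parse import urljoin


def absolutize(page_url, href):
    """Resolve href against page_url (hypothetical helper for illustration).

    Relative links get the scheme and domain of the page;
    links that are already absolute are returned unchanged.
    """
    return urljoin(page_url, href)


relative = absolutize("http://site/list/", "/post/42/")
absolute = absolutize("http://site/list/", "http://other/post/")
```

Here relative becomes 'http://site/post/42/', while absolute stays 'http://other/post/'.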
save_start_url

When you need to save additional data in a field, such as the start URLs of the objects, add an ExtraDataField field with save_start_url=True.

body_count

The number of objects to parse.

Example:
class MyParserClass(Parser):
    start = BodyCSSSelect(start_url='ul.quest-tiles > li.quest-tile-1 > div.item-box > div.item-box-desc h4  a',
                          add_domain=True,
                          body_count=4)
    source = ExtraDataField(save_start_url=True)
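
Taken together, start_url, add_domain and body_count describe a two-step crawl: read the links from a listing page, make them absolute, and follow only the first few. The sketch below illustrates that flow with the standard library only. It collects every <a> tag rather than applying a CSS selector, and the names LinkCollector and first_detail_urls are hypothetical, so treat it as an illustration of the idea rather than the library's implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_detail_urls(listing_html, page_url, body_count):
    """Return absolute URLs for the first body_count links on the listing."""
    collector = LinkCollector()
    collector.feed(listing_html)
    return [urljoin(page_url, href) for href in collector.hrefs[:body_count]]


# Listing markup invented for the sketch.
listing = (
    '<ul>'
    '<li><a href="/q/1">One</a></li>'
    '<li><a href="/q/2">Two</a></li>'
    '<li><a href="/q/3">Three</a></li>'
    '</ul>'
)
urls = first_detail_urls(listing, "http://site/", 2)
```

Each URL in urls is a candidate for save_start_url: it is the address of the page from which the object's data would then be parsed.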

It works on this site, all this on the channel