seleniumprocessor
A simple library to set up Selenium processes
Description
This library makes it easy to set up a process based on Selenium. A process is described in a specific format, as a list of actions, which the library then executes through Selenium.
Installation
pip install seleniumprocessor
Install a Selenium web driver, e.g., the Chrome WebDriver
Available methods
initiate_connection(webdriverfile, url, to, loginrequired=True, headless=False), returning a selenium.webdriver.chrome.webdriver.WebDriver object allowing browser control
- webdriverfile is the path of the Selenium web driver file
- url is the URL to open
- to is the timeout to wait for the page to load
- loginrequired specifies whether a manual login by the user is required (True) or not (False)
- headless specifies whether the browser runs in headless mode (True) or not (False)
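As a minimal sketch of the signature above, the helper below opens a page in headless mode without a manual login, passing the optional arguments by keyword. The helper name open_headless and its default values are hypothetical; only initiate_connection and its parameter names come from this README.

```python
def open_headless(url, driver_path='./chromedriver', timeout=3):
    """Open url in a headless browser that does not wait for a manual login."""
    # Imported here so that defining the helper does not require the library
    import seleniumprocessor
    return seleniumprocessor.initiate_connection(
        driver_path,          # webdriverfile: path of the web driver file
        url,                  # url: page to open
        timeout,              # to: timeout for page loading
        loginrequired=False,  # no manual login step
        headless=True,        # no visible browser window
    )
```

The returned object is a regular Selenium WebDriver, so it can be closed with brw.quit() once the process has finished.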
run_process(brw, url_home, to, p, backtohome_begin=True, backtohome_end=True, checkfilterpassed_callback=None), returning an object, as specified in the process p
- brw is the selenium.webdriver.chrome.webdriver.WebDriver object used to control the browser
- url_home is the home page URL
- to is the timeout used to wait for the home page to load
- p is the list of actions in the current process
- backtohome_begin specifies whether the browser is redirected to the home page at the beginning of the method (True) or not (False)
- backtohome_end specifies whether the browser is redirected to the home page at the end of the method (True) or not (False)
- checkfilterpassed_callback is a callback function used to check the filters defined in the process p, returning a boolean value (True if the filter is passed, False otherwise)
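The filter callback could look like the sketch below. It assumes the callback receives the filter string of the action as its only argument, which this README does not state explicitly. The names ENABLED_FILTERS and check_filter_passed, as well as the class names in the sample process, are hypothetical.

```python
# Hypothetical set of filter labels that are currently enabled
ENABLED_FILTERS = {'logged_in', 'desktop'}

def check_filter_passed(filter_value):
    """Return True if an action tagged with filter_value should be executed.

    Assumption: the library passes the 'filter' string of the action.
    """
    # In this sketch, actions without a filter are always allowed
    if not filter_value:
        return True
    return filter_value in ENABLED_FILTERS

# A process whose second action is tagged with a filter
p = [
    {'class_name': 'menu-button', 'action': 'click'},
    {'class_name': 'admin-link', 'action': 'click', 'filter': 'admin'},
]

# Preview which actions pass the filter, without opening a browser
passed = [a for a in p if check_filter_passed(a.get('filter'))]
print(len(passed))
```

Under the same assumption, the function would then be passed as run_process(..., checkfilterpassed_callback=check_filter_passed).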
Objects structure
The main process object is a list of actions to execute sequentially. Each action is represented by a dictionary with the following fields:
- name: the name identifying the DOM objects to find
- class_name: the class name identifying the DOM objects to find
- index (optional): when several DOM objects share the same class (or when a DOM object other than the first one has to be considered), the index of the DOM object in the list of DOM objects using that class
- sleep (optional): the sleep timeout applied after the action is performed
- filter: a string passed to the checkfilterpassed_callback for filtering actions
- action_parameters (optional): its definition depends on the action field
- action: the action to execute:
  - click: performs a click on the DOM object
  - click-repeated: clicks the DOM object repeatedly, as long as the object is present (useful with sleep, e.g., for pages that load a list in portions, with a final button to load additional results); the optional action_parameters parameter is the class name of the objects to count: when the count is unchanged, the repeated clicks stop
  - navigate: navigates by clicking a specific sequence of objects, identified by their text value; the action_parameters parameter is the navigation path, with elements separated by >
  - scroll_to: scrolls to the specific element
  - empty_value: empties the value property of the DOM object
  - store_text: stores data on the object returned by the run_process method; the action_parameters parameter is the name of the property on that object
  - send_keys: sends a key input to a specific DOM object
  - select: selects a specific value of a specific combo-box DOM object, where the value is specified in the action_parameters parameter
  - foreach: loops on all the DOM objects retrieved, to execute repeated actions
- context (optional): when the foreach action is used, the context of all sub-items to be found refers to the parent DOM object used in the loop; to consider the whole page instead, specify whole_page as context
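Since a process is plain Python data, it can be checked before a browser is opened. The sketch below validates action names against the list above and recurses into foreach sub-actions. The helper validate_process and the class names in the sample process are hypothetical and not part of the library.

```python
# Action names listed in this README
KNOWN_ACTIONS = {
    'click', 'click-repeated', 'navigate', 'scroll_to', 'empty_value',
    'store_text', 'send_keys', 'select', 'foreach',
}

def validate_process(p):
    """Return a list of human-readable problems found in process p."""
    problems = []
    for i, step in enumerate(p):
        action = step.get('action')
        if action not in KNOWN_ACTIONS:
            problems.append('step {}: unknown action {!r}'.format(i, action))
        # foreach carries its sub-actions in action_parameters
        if action == 'foreach':
            subs = step.get('action_parameters', [])
            problems += ['step {} > {}'.format(i, m) for m in validate_process(subs)]
    return problems

p = [
    {'class_name': 'search-box', 'action': 'empty_value'},
    {'class_name': 'result', 'action': 'foreach', 'action_parameters': [
        {'class_name': 'title', 'action': 'store_text', 'action_parameters': 'title'},
        {'class_name': 'more', 'action': 'clik'},  # deliberate typo to trigger a report
    ]},
]
print(validate_process(p))
```

A check like this only catches misspelled action names; whether the class names match real DOM objects can only be verified against the page itself.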
Sample usage
Get all repositories of @auino
# import the library
import seleniumprocessor
# define initial variables
URL_HOME = 'https://github.com/auino'
SLEEP_TO = 3
# initiate a connection to the auino GitHub page (not requiring a login)
brw = seleniumprocessor.initiate_connection('./chromedriver', URL_HOME, SLEEP_TO, False)
# define the process to be executed
p = [
    {'class_name':'UnderlineNav-item', 'index':1, 'action':'click', 'sleep':SLEEP_TO}, # clicking on the Repositories tab, the second one, at the top of the page
    {'class_name':'source', 'action':'foreach', 'action_parameters':[ # looping on all repositories
        {'class_name':'wb-break-all', 'action':'store_text', 'action_parameters':'name'}, # storing the repository name
        {'class_name':'color-text-secondary', 'action':'store_text', 'action_parameters':'description'} # storing the repository description
    ]}
]
# run the process
data = seleniumprocessor.run_process(brw, URL_HOME, SLEEP_TO, p, backtohome_begin=False)
# showing resulting data
print(data)
Get all publications of a given user from Google Scholar
import seleniumprocessor
# define initial variables
USERPROFILE = 'UlbGEQwAAAAJ'
URL_HOME = 'https://scholar.google.com/citations?user={}'.format(USERPROFILE)
SLEEP_TO = 3
# initiate a connection to the Google Scholar profile page (not requiring a login)
brw = seleniumprocessor.initiate_connection('./chromedriver', URL_HOME, SLEEP_TO, False)
# define the process to be executed
p = [
    {'id':'gsc_prf_in', 'action':'store_text', 'action_parameters':'name'}, # storing the researcher's name
    {'class_name':'gs_lbl', 'index':-1, 'action':'click-repeated', 'action_parameters':'gsc_a_tr', 'sleep':SLEEP_TO}, # clicking the button at the end of the page, to extend the list of publications
    {'class_name':'gsc_a_tr', 'action':'foreach', 'action_parameters':[ # looping on all publications
        {'class_name':'gsc_a_at', 'action':'store_text', 'action_parameters':'title'}, # storing the publication title
        {'class_name':'gs_gray', 'index':0, 'action':'store_text', 'action_parameters':'authors'}, # storing the authors of the publication
        {'class_name':'gs_gray', 'index':1, 'action':'store_text', 'action_parameters':'venue'}, # storing the venue of the publication
        {'class_name':'gsc_a_ac', 'action':'store_text', 'action_parameters':'citations'}, # storing the number of citations of the publication
        {'class_name':'gsc_a_h', 'action':'store_text', 'action_parameters':'year'}, # storing the year of the publication
    ]}
]
# run the process
data = seleniumprocessor.run_process(brw, URL_HOME, SLEEP_TO, p, backtohome_begin=False)
# showing resulting data
print(data)
TODO
- Improve code readability
- Extend supported objects structure