
An arXiv scraper that retrieves records from selected research areas in mathematics and detects trends in hyper-specialization and in the growth rate of scientific production in those fields.

hyper-specialization, scraper, api, arXiv
pip install Arxivtrends==0.0.2





Install with the package manager pip (or pip3 for Python 3):

$ pip install arxivtrends

Alternatively, download the source and use

$ python setup.py install

To update the module using pip:

$ pip install arxivtrends --upgrade


Let's import arxivtrends and create a scraper to fetch all preprints in partial differential equations of elliptic type (for other fields see below):

import arxivtrends
scraper = arxivtrends.Scraper(macro_field='Partial differential equations of elliptic type')

Instantiating the Scraper class with the parameter macro_field set to 'Partial differential equations of elliptic type' returns a dictionary-like object containing all the information (authors, title, submission date, etc.) about the arXiv preprints whose Mathematics Subject Classification (MSC) code falls under the category Partial differential equations of elliptic type.

Once scraper is built, we can start the parsing process and extract the information we want for each preprint: submission date, list of authors and number of pages.

output_df = scraper.scrape()

While scrape() is running, it prints its status:

Total number of papers scraped: 100
Total number of papers scraped: 200

Finally, the extracted information is saved both into the pandas DataFrame output_df and into a .csv file. The .csv file is useful for overnight runs, where the kernel may shut down after a period of inactivity, since the parsing process can take up to a few hours (see the script).
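If the kernel shuts down before output_df can be used, the saved .csv file can be read back with pandas. The following is a minimal sketch only: the column names (submission_date, authors, n_pages) and the sample rows are assumptions made for illustration, and the file actually written by scrape() may use different names and layout.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the .csv written by scrape().
# The column names are assumptions, not the package's documented schema.
SAMPLE_CSV = """submission_date,authors,n_pages
2019-03-14,"A. Author; B. Author",23
2020-11-02,"C. Author",12
"""


def load_scraped_csv(source):
    """Read a scraped-records CSV and parse the date column as datetimes."""
    return pd.read_csv(source, parse_dates=["submission_date"])


# In practice `source` would be the path of the saved .csv file.
df = load_scraped_csv(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

Parsing the dates at load time keeps later year-by-year grouping simple.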

Once the parsing is complete, we can call the data visualization methods (see the script) and see what the data can tell us. For example, the call to plot_N_authors_papers() below shows, year by year, the number of uploaded arXiv preprints with at least 3 authors:

plot_N_authors_papers(output_df, 3)
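To make the quantity being plotted concrete, here is a rough sketch of how such yearly counts could be computed with pandas. It is not the package's implementation: the helper count_papers_with_min_authors, the column names (submission_date, authors) and the sample data are all hypothetical, and the plotting step is left out.

```python
import pandas as pd


def count_papers_with_min_authors(df, min_authors):
    """Return the number of papers per year having at least `min_authors` authors."""
    # Assumes each entry of the 'authors' column is a list of author names.
    has_enough = df["authors"].apply(len) >= min_authors
    years = df.loc[has_enough, "submission_date"].dt.year
    return years.value_counts().sort_index()


# Hypothetical records standing in for output_df.
sample = pd.DataFrame(
    {
        "submission_date": pd.to_datetime(
            ["2018-01-10", "2018-06-01", "2019-02-20", "2019-09-09"]
        ),
        "authors": [["A", "B", "C"], ["A"], ["A", "B", "C", "D"], ["A", "B", "C"]],
    }
)

counts = count_papers_with_min_authors(sample, 3)
print(counts)
```

The resulting Series is indexed by year, so it could be passed directly to a bar or line plot.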


Research Areas

Currently available options for the parameter macro_field: Harmonic analysis on Euclidean spaces (MSC codes: 42A05 - 42C40), Abstract harmonic analysis (MSC codes: 43A05 - 43A90), Partial differential equations of elliptic type (MSC codes: 35J05 - 35J85), Partial differential equations of fluid mechanics (MSC codes: 76A02 - 76S05).
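The MSC code ranges above can be written down as a small lookup table. The sketch below is illustrative and not taken from the package: the names MSC_RANGES and macro_field_for are hypothetical, and the range check relies on the assumption that five-character MSC codes can be compared as plain strings.

```python
# Code ranges copied from the list of available macro_field options.
MSC_RANGES = {
    "Harmonic analysis on Euclidean spaces": ("42A05", "42C40"),
    "Abstract harmonic analysis": ("43A05", "43A90"),
    "Partial differential equations of elliptic type": ("35J05", "35J85"),
    "Partial differential equations of fluid mechanics": ("76A02", "76S05"),
}


def macro_field_for(msc_code):
    """Return the macro field whose range contains `msc_code`, or None."""
    code = msc_code.strip().upper()
    if len(code) != 5:
        return None
    for field, (low, high) in MSC_RANGES.items():
        # Fixed-length codes such as '42B20' order correctly as strings.
        if low <= code <= high:
            return field
    return None


print(macro_field_for("42B20"))
print(macro_field_for("11A41"))
```

A code outside every listed range, such as a number theory code, maps to None.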
