Web Robots

Web robots that automate the process of obtaining full-text papers (and perform other interactions with the web)

These functions spawn a web browser to search external websites and retrieve papers and files for collation into an underlying document store. Developers using Alhazen must abide by data licensing requirements and the terms and conditions of third-party websites, and users of this code should ensure that they do not infringe on third-party privacy or intellectual property rights.


source

retrieve_pdf_from_doidotorg

 retrieve_pdf_from_doidotorg (doi, base_dir, headless=False)
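
A minimal usage sketch follows; the DOI and output directory are illustrative, and it assumes the function resolves the DOI through doi.org in a spawned browser session and saves the resulting PDF under `base_dir`.

```python
# Assumes retrieve_pdf_from_doidotorg has been imported from Alhazen's web robots module.
from pathlib import Path

base_dir = Path("./papers")                  # hypothetical local document store
base_dir.mkdir(parents=True, exist_ok=True)

# headless=True keeps the spawned browser window hidden (assumed behavior);
# the DOI below is only an example value.
retrieve_pdf_from_doidotorg("10.1101/2023.01.01.000001", str(base_dir), headless=True)
```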

source

execute_search_on_biorxiv

 execute_search_on_biorxiv (search_term)
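
A hedged example of a bioRxiv search; the search term is illustrative, and the structure of the returned results (e.g., a list of records or DOIs) is an assumption rather than documented behavior.

```python
# Assumes execute_search_on_biorxiv has been imported from Alhazen's web robots module.
results = execute_search_on_biorxiv("single-cell RNA sequencing atlas")

# Inspect whatever the search returns; the iteration below assumes an iterable result.
for record in results:
    print(record)
```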

source

extract_reconstructed_nxml

 extract_reconstructed_nxml (html)
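
A sketch of converting saved PMC page HTML into NXML; the local file paths are hypothetical, and it assumes the function accepts the raw HTML string of a PMC article page and returns the reconstructed NXML as a string.

```python
# Assumes extract_reconstructed_nxml has been imported from Alhazen's web robots module.
# "pmc_article.html" is a hypothetical previously saved PMC article page.
with open("pmc_article.html", encoding="utf-8") as f:
    html = f.read()

nxml = extract_reconstructed_nxml(html)

with open("pmc_article.nxml", "w", encoding="utf-8") as f:
    f.write(nxml)
```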

source

clean_and_convert_tags

 clean_and_convert_tags (soup, tag)
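
A sketch assuming `soup` is a BeautifulSoup parse of the PMC page and `tag` names the HTML tag to clean and convert during NXML reconstruction; the chosen tag is illustrative.

```python
# Assumes clean_and_convert_tags has been imported from Alhazen's web robots module.
from bs4 import BeautifulSoup

html = "<p>Calcium signalling<sup>1</sup> in neurons.</p>"   # illustrative snippet
soup = BeautifulSoup(html, "html.parser")

# Clean and convert all occurrences of the named tag in place (assumed behavior).
clean_and_convert_tags(soup, "sup")
```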

source

get_html_from_pmc_doi

 get_html_from_pmc_doi (doi, base_file_path)

Given a DOI, navigate to the corresponding PMC HTML page and reconstruct NXML from it.
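
An end-to-end sketch based on the description above; the DOI and `base_file_path` are illustrative, and whether the reconstructed NXML is returned or written under `base_file_path` is an assumption.

```python
# Assumes get_html_from_pmc_doi has been imported from Alhazen's web robots module.
# Both arguments below are example values only.
get_html_from_pmc_doi("10.1371/journal.pcbi.1010000", "./papers")
```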