PDF Text Extractor Utility
Extracts unstructured text from scientific papers published as PDF files.
LAPDFBlockParser
LAPDFBlockParser (text_kwargs:Optional[Mapping[str,Any]]=None)
Parse PDF using PyMuPDF.
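As an illustration of block-level parsing with PyMuPDF, the following is a minimal sketch and not the implementation of `LAPDFBlockParser`. The function names `blocks_to_records` and `extract_blocks` and the record layout are hypothetical. It relies only on PyMuPDF's `Page.get_text("blocks")`, which returns one tuple per block in the form `(x0, y0, x1, y1, text, block_no, block_type)`.

```python
from typing import Any, Dict, Iterable, List, Mapping, Optional, Tuple


def blocks_to_records(page_number: int, raw_blocks: Iterable[Tuple]) -> List[Dict[str, Any]]:
    """Convert PyMuPDF block tuples into plain dictionaries (hypothetical helper)."""
    records = []
    for x0, y0, x1, y1, text, block_no, block_type in raw_blocks:
        if block_type != 0:  # 0 marks a text block, 1 marks an image block
            continue
        text = text.strip()
        if not text:  # skip blocks that contain only whitespace
            continue
        records.append(
            {"p": page_number, "x0": x0, "y0": y0, "x1": x1, "y1": y1, "text": text}
        )
    return records


def extract_blocks(
    file_path: str, text_kwargs: Optional[Mapping[str, Any]] = None
) -> List[Dict[str, Any]]:
    """Read every page of a PDF and return its text blocks (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so the pure helper above has no dependency

    kwargs = dict(text_kwargs or {})
    records: List[Dict[str, Any]] = []
    with fitz.open(file_path) as doc:
        for page_number, page in enumerate(doc):
            # Extra keyword arguments (for example flags=...) are passed through
            # to get_text, on the assumption that text_kwargs plays that role.
            raw_blocks = page.get_text("blocks", **kwargs)
            records.extend(blocks_to_records(page_number, raw_blocks))
    return records
```

Calling `extract_blocks("paper.pdf")` on a local file would return one dictionary per non-empty text block, with zero-based page numbers.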
LAPDFBlockLoader
LAPDFBlockLoader (file_path:str)
Load PDF files using PyMuPDF into representative text blocks.
LAPDF_FeatureBlock
LAPDF_FeatureBlock (p:int, x0:float, y0:float, x1:float, y1:float, text:str, nlines:int, sizes:dict, fonts:dict, pos_err:float=0.05)
A block of text with spatial features occurring in a PDF full-text article.
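To make the constructor arguments concrete, here is a minimal sketch of a data structure with the same fields. It is not the library's class: the name `FeatureBlockSketch`, the `width` and `height` properties, and the `left_aligned_with` helper are hypothetical. In particular, treating `pos_err` as a relative tolerance on horizontal position is an assumption, since the source does not say how the parameter is used.

```python
from dataclasses import dataclass


@dataclass
class FeatureBlockSketch:
    """Hypothetical stand-in mirroring the LAPDF_FeatureBlock signature."""

    p: int            # page number
    x0: float         # left edge of the bounding box
    y0: float         # top edge of the bounding box
    x1: float         # right edge of the bounding box
    y1: float         # bottom edge of the bounding box
    text: str         # text content of the block
    nlines: int       # number of lines in the block
    sizes: dict       # assumed here: font size -> count within the block
    fonts: dict       # assumed here: font name -> count within the block
    pos_err: float = 0.05

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def left_aligned_with(self, other: "FeatureBlockSketch") -> bool:
        # Assumption: pos_err is a fraction of the wider block's width.
        tolerance = self.pos_err * max(self.width, other.width)
        return abs(self.x0 - other.x0) <= tolerance


heading = FeatureBlockSketch(0, 50.0, 100.0, 250.0, 130.0, "Introduction", 1, {12.0: 12}, {"Times-Bold": 12})
body = FeatureBlockSketch(0, 55.0, 140.0, 255.0, 170.0, "PDF parsing is ...", 2, {10.0: 18}, {"Times": 18})
print(heading.left_aligned_with(body))
```

Under this assumed reading of `pos_err`, two blocks in the same column count as aligned even when their left edges differ by a few points.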
CumulativeTextFeature
CumulativeTextFeature (name:str)
HuridocsPDFParser
HuridocsPDFParser (text_kwargs:Optional[Mapping[str,Any]]=None, host='localhost')
Parse PDF using Huridocs (https://github.com/huridocs/pdf_paragraphs_extraction).
HuridocsPDFLoader
HuridocsPDFLoader (file_path:str, host='localhost')
Load PDF files using Huridocs (https://github.com/huridocs/pdf_paragraphs_extraction).