PDF Text Extractor Utility

Extracts unstructured text from scientific papers published as PDF files.

LAPDFBlockParser

 LAPDFBlockParser (text_kwargs:Optional[Mapping[str,Any]]=None)

Parse PDF using PyMuPDF.
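
As a rough illustration of block-level parsing with PyMuPDF, the sketch below reads each page with `page.get_text("blocks")` and turns the returned tuples into plain dictionaries. It is not the code behind `LAPDFBlockParser`. The helper names `blocks_to_records` and `extract_blocks`, the dictionary keys, and the line-count heuristic are all assumptions made for this example.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Shape of one entry returned by PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
RawBlock = Tuple[float, float, float, float, str, int, int]


def blocks_to_records(page_number: int,
                      raw_blocks: Iterable[RawBlock]) -> List[Dict[str, Any]]:
    """Hypothetical helper: convert raw block tuples into dictionaries."""
    records = []
    for x0, y0, x1, y1, text, _block_no, block_type in raw_blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        text = text.strip()
        if not text:
            continue
        records.append({
            "p": page_number,
            "x0": x0, "y0": y0, "x1": x1, "y1": y1,
            "text": text,
            # Approximation: count newline-separated lines in the block text.
            "nlines": text.count("\n") + 1,
        })
    return records


def extract_blocks(file_path: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: collect text blocks from every page of a PDF."""
    import fitz  # PyMuPDF; imported lazily so blocks_to_records works without it

    records: List[Dict[str, Any]] = []
    with fitz.open(file_path) as doc:
        for page_number, page in enumerate(doc):
            records.extend(
                blocks_to_records(page_number, page.get_text("blocks"))
            )
    return records
```

Calling `extract_blocks("paper.pdf")` (with a real file path) would return one dictionary per non-empty text block, in page order.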

LAPDFBlockLoader

 LAPDFBlockLoader (file_path:str)

Load PDF files using PyMuPDF into representative .

LAPDF_FeatureBlock

 LAPDF_FeatureBlock (p:int, x0:float, y0:float, x1:float, y1:float,
                     text:str, nlines:int, sizes:dict, fonts:dict,
                     pos_err:float=0.05)

A block of text with spatial features occurring in a PDF full-text article.
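
To make the constructor arguments concrete, here is a minimal stand-in dataclass with the same field names. It is only a sketch and not the library's class. Several things are assumptions made for illustration: that `sizes` and `fonts` map a font size or font name to a character count, that `pos_err` acts as a relative tolerance when comparing positions, and the `dominant_size` and `left_aligned_with` methods themselves.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FeatureBlockSketch:
    """Hypothetical stand-in mirroring the LAPDF_FeatureBlock signature."""
    p: int          # page number
    x0: float       # left edge
    y0: float       # top edge
    x1: float       # right edge
    y1: float       # bottom edge
    text: str
    nlines: int
    sizes: Dict[float, int]   # assumed: font size -> character count
    fonts: Dict[str, int]     # assumed: font name -> character count
    pos_err: float = 0.05     # assumed: relative positional tolerance

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def dominant_size(self) -> Optional[float]:
        """Font size covering the most characters, or None if unknown."""
        if not self.sizes:
            return None
        return max(self.sizes, key=self.sizes.get)

    def left_aligned_with(self, other: "FeatureBlockSketch") -> bool:
        """True if both blocks share a page and roughly the same left edge."""
        tolerance = self.pos_err * max(self.width, other.width)
        return self.p == other.p and abs(self.x0 - other.x0) <= tolerance


heading = FeatureBlockSketch(0, 50.0, 100.0, 250.0, 130.0, "Introduction",
                             1, {12.0: 12}, {"Times-Bold": 12})
body = FeatureBlockSketch(0, 52.0, 140.0, 250.0, 300.0, "Body text",
                          10, {10.0: 400, 12.0: 3}, {"Times": 403})
aligned = heading.left_aligned_with(body)  # left edges differ by 2 units
```

Spatial features like these are what let a downstream step tell headings, body paragraphs, and captions apart without relying on the text alone.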

CumulativeTextFeature

 CumulativeTextFeature (name:str)

HuridocsPDFParser

 HuridocsPDFParser (text_kwargs:Optional[Mapping[str,Any]]=None,
                    host='localhost')

Parse PDF using Huridocs (https://github.com/huridocs/pdf_paragraphs_extraction).

HuridocsPDFLoader

 HuridocsPDFLoader (file_path:str, host='localhost')

Load PDF files using Huridocs (https://github.com/huridocs/pdf_paragraphs_extraction).