Search results
Results from the WOW.Com Content Network
poppler-utils is a collection of command-line utilities built on Poppler's library API, to manage PDF and extract contents: pdfattach – add a new embedded file (attachment) to an existing PDF; pdfdetach – extract embedded documents from a PDF; pdffonts – lists the fonts used in a PDF
reStructuredText (RST, ReST, or reST) is a file format for textual data used primarily in the Python programming language community for technical documentation.. It is part of the Docutils project of the Python Doc-SIG (Documentation Special Interest Group), aimed at creating a set of tools for Python similar to Javadoc for Java or Plain Old Documentation (POD) for Perl.
Desktop application to split, merge, extract pages, rotate and mix PDF documents. PDF Studio: Proprietary: Yes Yes Yes Yes Full feature PDF editor. Poppler-utils: GNU GPL: Yes Yes Unix Yes Converts PDF to other file format (text, images, html). pstoedit: GNU GPL: Yes Yes Unix Yes Converts PostScript to (other) vector graphics file format. QPDF ...
Sigil is free, open-source editing software for e-books in the EPUB format. As a cross-platform application, Sigil is distributed for the Windows, macOS, Haiku and Linux platforms under the GNU GPL license. Sigil supports code-based editing of EPUB files, as well as the import of HTML and plain text files.
The hOCR format is most commonly used in order to make searchable PDF files or as an extracted metadata of the PDF file. In order to create searchable PDF files we can use a scanned document image and a .hocr file of the particular image. We can use the following open source tools in order to achieve that.
Layout analysis software, that divide scanned documents into zones suitable for OCR Graphical interfaces to one or more OCR engines Software development kits that are used to add OCR capabilities to other software (e.g. forms processing applications, document imaging management systems, e-discovery systems, records management solutions)
Dictionary Builder is a Rust program that can parse XML dumps and extract entries in files; Scripts for parsing Wikipedia dumps – Python based scripts for parsing sql.gz files from wikipedia dumps. parse-mediawiki-sql – a Rust library for quickly parsing the SQL dump files with minimal memory allocation
Template filling: Extracting a fixed set of fields from a document, e.g. extract perpetrators, victims, time, etc. from a newspaper article about a terrorist attack. Event extraction: Given an input document, output zero or more event templates. For instance, a newspaper article might describe multiple terrorist attacks.