PDF Entity Annotation Tool (PEAT)
Published in | Journal of open source software Vol. 10; no. 108; p. 5336 |
---|---|
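Main Authors | Stahl, Christopher G.; Markey, Kristan J.; Jewell, Brian C.; Shams, Dahnish; Taylor, Michele M.; Wilkins, A. Amina; Watford, Sean; Shapiro, Andy; Angrish, Michelle |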
Format | Journal Article |
Language | English |
Published | United States: Open Source Initiative - NumFOCUS, 08.04.2025 |
Subjects | annotation; Python; text extraction |
Abstract | While text mining approaches - including the use of Artificial Intelligence (AI) and other machine-based methods - continue to expand at a rapid pace, the tools used by researchers to create the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish between images of a car and a truck would generally require a set of images with a quantitative description of the underlying features of each vehicle type. Development of labeled textual data that can be used to build natural language machine learning (ML) models for scientific literature is not currently integrated into the manual workflows used by domain experts. Published literature is rich with important information, such as embedded text, plots, and tables, all of which can be used as inputs to train ML/natural language processing (NLP) models once extracted and prepared in machine-readable formats. Currently, both normalized data extraction of use to domain experts and extraction to support development of ML/NLP models are labor-intensive and cumbersome manual processes. Automatic extraction of data and information is difficult for formats such as PDF, which are optimized for layout and human readability, not machine readability. The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed with the goal of allowing users to annotate publications within their current print format, while also allowing those annotations to be captured in a machine-readable format. One of the main issues with traditional annotation tools is that they require transforming the PDF into plain text to facilitate the annotation process. While doing so lessens the technical challenges of annotating data, the user loses all structure and provenance that was inherent in the underlying PDF.
Also, textual data extraction from PDFs can be an error-prone process. Challenges include identifying sequential blocks of text and handling a multitude of document formats (multiple columns, font encodings, etc.). As a result of these challenges, using existing tools for development of NLP/ML models directly from PDFs is difficult because the generated outputs are not interoperable. We created a system that allows annotations to be completed on the original PDF document structure, with no plain text extraction. The result is an application that allows for easier and more accurate annotations. In addition, by including a feature that lets the user easily create a schema, we have developed a system that can be used to annotate text for different domain-centric schemas of relevance to subject matter experts. Different knowledge domains require distinct schemas and annotation tags to support machine learning. |
---|---|
Author | Stahl, Christopher G.; Markey, Kristan J.; Jewell, Brian C.; Shams, Dahnish; Taylor, Michele M.; Wilkins, A. Amina; Watford, Sean; Shapiro, Andy; Angrish, Michelle |
Author_xml | 1. Stahl, Christopher G. (ORCID 0000-0002-2070-1555); 2. Markey, Kristan J. (ORCID 0000-0003-3911-2969); 3. Jewell, Brian C. (ORCID 0000-0003-3712-6523); 4. Shams, Dahnish (ORCID 0000-0002-9859-0859); 5. Taylor, Michele M. (ORCID 0000-0001-5049-0499); 6. Wilkins, A. Amina (ORCID 0000-0001-9292-833X); 7. Watford, Sean (ORCID 0000-0003-0888-5029); 8. Shapiro, Andy (ORCID 0000-0002-5233-8092); 9. Angrish, Michelle (ORCID 0000-0002-4956-4806) |
Cites_doi | 10.1117/12.2005608 10.1145/3620665.3640366 10.1016/j.envint.2021.107025 10.1093/nar/gkaa333 10.5281/zenodo.1212303 |
ContentType | Journal Article |
CorporateAuthor | Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States) |
DOI | 10.21105/joss.05336 |
Discipline | Computer Science |
EISSN | 2475-9066 |
ExternalDocumentID | 2573694 40547228 10_21105_joss_05336 |
Genre | Journal Article |
GrantInformation_xml | funder: Intramural EPA; grant ID: EPA999999 |
ISSN | 2475-9066 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Issue | 108 |
Language | English |
License | http://creativecommons.org/licenses/by/4.0 |
Notes | USDOE AC05-00OR22725 |
ORCID | 0000-0001-5049-0499; 0000-0002-5233-8092; 0000-0002-4956-4806; 0000-0003-3712-6523; 0000-0003-3911-2969; 0000-0002-9859-0859; 0000-0003-0888-5029; 0000-0002-2070-1555; 0000-0001-9292-833X |
OpenAccessLink | http://dx.doi.org/10.21105/joss.05336 |
PMID | 40547228 |
PublicationDate | 2025-04-08 |
PublicationPlace | United States |
PublicationTitle | Journal of open source software |
PublicationTitleAlternate | J Open Source Softw |
PublicationYear | 2025 |
Publisher | Open Source Initiative - NumFOCUS |
StartPage | 5336 |
SubjectTerms | annotation; Python; text extraction |
Title | PDF Entity Annotation Tool (PEAT) |
URI | https://www.ncbi.nlm.nih.gov/pubmed/40547228 https://www.proquest.com/docview/3223630459 https://www.osti.gov/servlets/purl/2573694 |
Volume | 10 |