The objective of this project is to locate and identify reference numbers in a PDF document.
Sponsored By: Westinghouse Electric Corporation
Team Members
Alex Chen | Gavin Fisher | William Gault | Alex Talbot | Chester Cai | Aditi Gupta
Project Poster
Project Summary
Overview
Westinghouse needs to identify and extract object reference information from many different documents. A given document may reference other documents, and these references need to be recorded in a standard data format. However, some of the documents are not simple text files; they may be images or handwritten documents instead. The information must also be captured, processed, and stored efficiently in a standard data format such as an Excel spreadsheet.
Objectives
The team’s objective is to iterate through folders of documents, convert image-formatted documents into text, extract all necessary reference information, and store this information in an Excel file. The team aims to do this quickly and with perfect accuracy.
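The first two steps of this objective (walking the folders and converting image-formatted documents into text) could be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function names, the list of file extensions, and the `ocr` callable are all assumptions, and the OCR step is passed in as a plain function because the OCR software is not named here.

```python
from pathlib import Path
from typing import Callable, Iterator

# ASSUMPTION: which extensions count as image-formatted (and so need OCR)
# and which are already plain text. The real project may differ.
OCR_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pdf"}
TEXT_SUFFIXES = {".txt"}


def iter_documents(root: Path) -> Iterator[Path]:
    """Yield every supported document under root, in sorted order."""
    supported = OCR_SUFFIXES | TEXT_SUFFIXES
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in supported:
            yield path


def document_text(path: Path, ocr: Callable[[Path], str]) -> str:
    """Return a document's text, calling the OCR step only when needed."""
    if path.suffix.lower() in OCR_SUFFIXES:
        return ocr(path)  # hypothetical OCR hook supplied by the caller
    return path.read_text(encoding="utf-8", errors="replace")


def collect_text(root: Path, ocr: Callable[[Path], str]) -> dict[str, str]:
    """Map each document's path (relative to root) to its extracted text."""
    return {
        path.relative_to(root).as_posix(): document_text(path, ocr)
        for path in iter_documents(root)
    }
```

Keeping the OCR engine behind a single callable means the folder traversal can be tested with a stub, and the engine can be swapped without touching the rest of the pipeline.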
Approach
– Establish weekly team meetings to discuss deliverables, delegate tasks, and evaluate results.
– Set up weekly meetings with the sponsor to relay approach and discuss any changes to the task.
– Decide which programming language to write the software in.
– Establish a secure host for sensitive documents.
– Analyse test documents provided by the sponsor to understand layout and necessary information.
– Outline the process flow for passing documents between the different tasks within the code.
– Decide on the OCR software to use and confirm the choice with the sponsor.
– Write code for iterating through test image files and applying OCR software to each.
– Write code for extracting reference data from post-OCR text and converting it to Excel.
– Analyse the program's run time and check its accuracy against the original documents.
– Optimize run-time speed and accuracy.
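The extraction and accuracy-checking steps above might look something like the sketch below. It rests on several assumptions: the reference-number pattern is purely hypothetical (the real numbering format is not described here), the helper names are invented for illustration, and the output is written as CSV with the standard library, which Excel can open, rather than as a native Excel workbook.

```python
import csv
import re
from pathlib import Path

# HYPOTHETICAL pattern: a document number such as "ABC-1234" or "WXY-AB-0012",
# optionally followed by "Rev 2", "Rev. 2" or "Revision B".
REFERENCE_PATTERN = re.compile(
    r"\b(?P<ref>[A-Z]{2,5}(?:-[A-Z0-9]+)*-\d{3,6})\b"
    r"(?:,?\s+Rev(?:ision)?\.?\s*(?P<rev>[A-Z0-9]+))?"
)


def extract_references(text: str) -> list[tuple[str, str]]:
    """Return unique (reference, revision) pairs in order of first appearance."""
    seen = set()
    found = []
    for match in REFERENCE_PATTERN.finditer(text):
        pair = (match.group("ref"), match.group("rev") or "")
        if pair not in seen:
            seen.add(pair)
            found.append(pair)
    return found


def write_rows(rows, out_path: Path) -> None:
    """Write (source, reference, revision) rows to a CSV file Excel can open."""
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source_document", "referenced_document", "revision"])
        writer.writerows(rows)


def recall(found, expected) -> float:
    """Fraction of hand-checked expected pairs that the extractor recovered."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(found)) / len(expected)
```

Comparing the extracted pairs against a small hand-checked set, as `recall` does, gives a repeatable accuracy figure to track while the pattern and OCR settings are tuned.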
Outcomes
– The sponsor will save many hours of tedious work.
– The project reduced time spent on document analysis, transfer work, and revision confirmation.
– This project introduced powerful OCR technology to a process that was previously laborious and completely manual.
– The team developed full-stack software capable of autonomously processing document references as well as revision numbers.
– The software runs in a reasonable amount of time.
– The product accurately captures the referenced documents and revision numbers.




