DocHive saves journalists time spent on data extraction from PDFs | opensource.com

abernard102@gmail.com 2013-02-10

Summary:

A group of journalists is announcing the launch of an open source solution to a problem many writers and journalists face: how to take data locked in PDFs or images and easily convert it to a spreadsheet or other usable format. Editor Charles Duncan Pardo and his team of reporters at the Raleigh Public Record are like many small newsrooms: they have neither the staff to do data entry for hundreds of pages of information nor the budget to hire some unfortunate college student to do it for them. He says: 'This is a problem we at the Record have been trying to overcome for more than two years. The story started with Wake County campaign finance returns. The returns are filed as paper, and staff at the Wake County Board of Elections scan them in and put the images online. The problem is, the only way to view the data is to look at it page by page, and the only way to analyze it is to go through by hand and enter the data into a spreadsheet one row at a time.' Duncan created DocHive with his brother, full-time programmer Edward Duncan. It uses XML to break a page into smaller sections, saves each section as its own image file, then uses optical character recognition (OCR) to read the few words or numbers in each section and insert them into a text file. DocHive will be officially released on February 28 at the annual Computer Assisted Reporting conference organized by Investigative Reporters & Editors and the National Institute for Computer-Assisted Reporting. The code will live on GitHub, and the Record is setting up a wiki on its server for documentation and for sharing templates. The license has not yet been chosen.
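
The template-then-OCR approach described in the summary can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the XML schema (`<template>`, `<field>` and its attributes), the function names, and the sample data are hypothetical and are not DocHive's actual template format or code. To stay self-contained, the "page" is a list of text rows and the OCR step is a stand-in function; a real pipeline would presumably crop a scanned image and pass each crop to an OCR engine.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

# Hypothetical template: each <field> names one rectangle on the page.
# This schema is invented for the example, not taken from DocHive.
TEMPLATE = """
<template>
  <field name="donor"  x="0"  y="0" width="10" height="1"/>
  <field name="amount" x="12" y="0" width="6"  height="1"/>
</template>
"""

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def load_fields(template_xml: str) -> Dict[str, Box]:
    """Parse the XML template into a mapping of field name -> crop box."""
    root = ET.fromstring(template_xml)
    fields: Dict[str, Box] = {}
    for field in root.iter("field"):
        x, y = int(field.get("x")), int(field.get("y"))
        w, h = int(field.get("width")), int(field.get("height"))
        fields[field.get("name")] = (x, y, x + w, y + h)
    return fields


def crop(page: List[str], box: Box) -> List[str]:
    """Cut one rectangular section out of the page (rows of characters here)."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def extract(page: List[str], template_xml: str,
            ocr: Callable[[List[str]], str]) -> Dict[str, str]:
    """Split the page by template, then run the recognizer on each section."""
    return {name: ocr(crop(page, box))
            for name, box in load_fields(template_xml).items()}


def fake_ocr(region: List[str]) -> str:
    """Stand-in for a real OCR engine: just reads the characters back."""
    return " ".join(row.strip() for row in region).strip()


# One made-up row of a form, laid out to match the template above.
page = ["Jane Doe    250.00"]
row = extract(page, TEMPLATE, fake_ocr)
print(row)
```

Splitting the page before recognition means each OCR call sees only a few words or numbers, and the field name from the template tells the program which spreadsheet column the result belongs in.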

Link:

http://opensource.com/business/13/2/open-source-app-for-journalists

From feeds:

Open Access Tracking Project (OATP) » abernard102@gmail.com

Tags:

oa.new oa.data oa.comment oa.tools oa.github oa.dochive oa.journalism

Date tagged:

02/10/2013, 09:08

Date published:

02/10/2013, 04:08