Current Research
Digital Text
Retrodigitization is the process of digitizing
text that originated in print.
The larger mathematical publishing
houses have had retrodigitization programs
under way for several years, and
many famous older journals will be
available as electronic files in the near
future. For smaller publishers and for
scientific societies, the situation is
much less clear.
With mathematical text, the retrodigitization
process proceeds through several
steps:
(1) scanning to obtain an image;
(2) compensating for faults in the original
document or those introduced by the
scanning process;
(3) using OCR software to recover text
and mathematics;
(4) repackaging the results into logical
units.
The aim of this project is to improve on
steps (2) and (3). A suite of more than
50,000 scanned images, consisting of
all pages from the 1949-1996 issues of the
Canadian Journal of Mathematics, has
been made available by the Canadian
Mathematical Society for this purpose. In
the short term the objective will be to improve
the software available to enhance
images (taking into account the peculiarities
of mathematical text). This includes
deskewing, despeckling, and balancing the
optical properties of bitonal images containing
mathematics. Comparative runs on
large samples are needed.
This task is particularly suited for an HPC
environment such as the one made available
by MRnet. In the long term, any software
developed will be made available
publicly. This will allow smaller societies
and publishers to add their material to the
growing body of retrodigitized material.
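
As a rough illustration of two of the enhancement steps named above, the following Python sketch shows one possible form of despeckling and skew estimation on a bitonal image. It is not the project's software. The image representation (a list of rows of 0/1 values, with 1 meaning ink), the function names `despeckle` and `estimate_skew`, and the parameters `min_size`, `max_deg`, and `step_deg` are all assumptions made for this example. Despeckling is done here by discarding small connected components, and skew is estimated with a projection-profile score; both are common textbook techniques, chosen only for brevity.

```python
import math
from collections import deque


def despeckle(img, min_size=3):
    """Return a copy of a bitonal image (rows of 0/1, 1 = ink) in which
    8-connected ink components smaller than min_size pixels are erased."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one component, collecting its pixels.
            comp, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Hypothetical threshold: anything smaller counts as a speck.
            if len(comp) < min_size:
                for cy, cx in comp:
                    out[cy][cx] = 0
    return out


def estimate_skew(img, max_deg=5.0, step_deg=0.5):
    """Estimate the skew angle in degrees (slope measured in image
    coordinates, y increasing downward) by trying candidate angles and
    keeping the one whose sheared row profile is most sharply peaked."""
    h, w = len(img), len(img[0])
    ink = [(y, x) for y in range(h) for x in range(w) if img[y][x] == 1]
    steps = int(round(max_deg / step_deg))
    best_angle, best_score = 0.0, -1.0
    for k in range(-steps, steps + 1):
        angle = k * step_deg
        t = math.tan(math.radians(angle))
        profile = {}
        for y, x in ink:
            row = round(y - x * t)  # row index after undoing the shear
            profile[row] = profile.get(row, 0) + 1
        # Sum of squared row counts: large when ink piles into few rows.
        score = sum(c * c for c in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The size threshold is where the peculiarities of mathematical text matter: dots, primes, and small accents are legitimate components only a few pixels in size, so a value of `min_size` that suits ordinary prose may erase meaningful symbols. Tuning such parameters against a large set of pages is the kind of comparative run described above.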