The Digital Warehouse

Book publishing is being transformed by new technologies for creating, distributing and consuming content. With the spread of the internet, the availability of broadband connectivity, the explosion of mobile devices and the coming of age of the download generation, consumers have come to expect content of any kind, in any form, whenever they want it, with no restrictions. Content is king in this new age, and the package that delivers it is no longer important to the consumer. Publishers need to understand that consumers want content that is easy to access, easy to move from one device to another and easy to share (if that is what they want to do). To remain relevant and competitive in today’s ever-changing marketplace, publishers must find innovative ways to meet these challenges. This article introduces one of those innovations, the digital warehouse.

Key Functions
The digital warehouse stores and distributes content prepared specifically for the internet and the ebook. It should provide control of copyright, quality, access, and distribution. The key functions of a digital warehouse include:
• Receive and convert content,
• Store content,
• Display content in a secure fashion,
• Allow access to content according to the publisher’s intent,
• Deliver content according to the publisher’s intent,
• Track and report on content activity.

The following illustration gives an overview of the digital warehouse system.


Conversion
Placing and storing publishers’ content in the digital warehouse requires that the content be prepared, digitized and tagged with XML. A tag is a keyword or term assigned to a piece of information. XML (Extensible Markup Language) is the markup technology; markup identifies the specific type of data content and acts as a container for that data. XML tags are not predefined: you define your own tags, which is what allows XML to be self-descriptive. Strong XML tagging is the key to using the digital distribution models available now and those not yet developed. Content should be converted into as many formats as possible so that it can be easily repurposed for any device available now or in the future. Generally, the same PDF files that are sent to the printer can serve as the source for converting a title into the various formats for digital distribution. If physical titles are submitted, a high-resolution scan can be performed, followed by OCR (Optical Character Recognition), proofing and post-scan processing.
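As a minimal sketch of what self-defined XML tags look like in practice, the Python example below builds a small tagged title and then reads the chapter headings back out. The tag names (book, title, chapter, heading, paragraph) and the sample values are assumptions made for illustration; a publisher or warehouse would define its own vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names: XML does not predefine them, so a publisher
# would choose its own vocabulary.
book = ET.Element("book", attrib={"isbn13": "9780306406157"})
ET.SubElement(book, "title").text = "An Example Title"
chapter = ET.SubElement(book, "chapter", attrib={"number": "1"})
ET.SubElement(chapter, "heading").text = "Getting Started"
ET.SubElement(chapter, "paragraph").text = "First paragraph of the chapter."

# Serialize the tagged content to a plain XML string.
xml_text = ET.tostring(book, encoding="unicode")

# Because each tag names the data it contains, another program can pull
# out only the pieces it needs, for example every chapter heading.
parsed = ET.fromstring(xml_text)
headings = [h.text for h in parsed.iter("heading")]

print(xml_text)
print(headings)
```

The same tagged source could feed different output formats, since each consumer selects the elements it needs by name rather than by position on a printed page.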

Controls
All digital titles should be run through a system of checks to make sure that the content is valid. LibreDigital, for example, has a system that accepts files in any format, sorts them, converts other formats to PDF and then places them into an ISBN-13-specific file structure. The system also notes any anomalies it cannot correct and stores them in a review table. A process follows in which each page of a title is verified as uncorrupted and containing the correct fonts, among other checks, and unnecessary information is eliminated. Once that is done, the content is cropped and centered properly. The newly formatted titles are then checked once again for any irregularities before being tagged. Once the book is digitized, it will exist in three forms: image, text and metadata (data that describes the characteristics of the content).
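One check that lends itself to automation is confirming that a title's ISBN-13 is well formed before its files are filed under it. The sketch below uses the standard ISBN-13 check-digit rule (weights alternating 1 and 3 over the first twelve digits). The folder layout and the function names are assumptions for illustration only and do not describe LibreDigital's actual system.

```python
from pathlib import PurePosixPath


def is_valid_isbn13(isbn: str) -> bool:
    """Return True if the string is a 13-digit ISBN with a correct check digit."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def title_path(root: str, isbn: str, filename: str) -> PurePosixPath:
    """Build a per-title path such as <root>/<isbn13>/<filename> (assumed layout)."""
    digits = isbn.replace("-", "").replace(" ", "")
    if not is_valid_isbn13(digits):
        raise ValueError(f"not a valid ISBN-13: {isbn!r}")
    return PurePosixPath(root) / digits / filename


print(title_path("/warehouse", "978-0-306-40615-7", "interior.pdf"))
```

A file whose ISBN fails the check would be a natural candidate for the review table of anomalies rather than for automatic filing.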

Scanning
Scanning of physical titles can be either destructive or non-destructive. Destructive scanning (removing the binding) yields a better result at low cost. OCR software should produce scanned content that is at least 95% accurate, preferably higher. The better the character recognition accuracy, the less time is needed to correct the text. After scanning, extensive proofing should be done to ensure the highest quality reproduction. Once this step is completed, the book can go through post-scan image processing, which would do the following:
• Determine average page size and standardize all pages to that size
• Clear the background
• Shift gray fonts to black
• Replace any color or grayscale imagery on the page
• Center the content with fixed headspace
• Compress the image for internet display
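The 95% OCR accuracy threshold mentioned above can be made concrete once a proofed reference text exists for a sample page. The article does not say how accuracy is measured; the sketch below assumes character-level accuracy, computed as one minus the edit distance between the OCR output and the reference, divided by the reference length. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly (assumed definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# One wrong character in a 17-character sample falls below a 95% threshold.
accuracy = character_accuracy("digital warehouse", "digita1 warehouse")
print(f"{accuracy:.3f}", "pass" if accuracy >= 0.95 else "needs correction")
```

Measured this way, a 300-page book at exactly 95% accuracy would still contain roughly one error in every twenty characters, which is why the proofing step matters.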

Once the scanning process has been completed, the file needs to be tagged. If the publisher provides tagged XML for their titles, that information can be used to create the initial tag representation. A good system performs an analysis of the book and tags each page. This should include covers, the jacket (if present), the copyright page, table of contents, chapter starts and all interior pages marked with page numbers. A publisher can determine how and where the tags should be located within each document and tag content in a manner that gives the most opportunity to leverage that content for a variety of revenue streams. As digitization becomes more the norm, tagging can and will occur during the editorial process to best enable digital delivery methods for new content. After tagging, the title should go through a quality assurance check before being sent back to the publisher for review. The publisher should have the ability to make comments and corrections and submit the title back for rework if necessary.

Tracking
Keeping track of titles sent to the digital warehouse is extremely important for the publisher. The warehouse should provide a tracking and reporting system so the publisher always knows what stage of the digitization process each title is in. At a minimum, the publisher should know:
• When the titles are received and accepted for scanning or processing
• What type of scan was performed
• Phase of scanning or digitization the title is in
• Phase of tagging the title is in
• Phase of quality assurance the title is in
• If rework needed, the rework schedule and when completed
• After publisher approval, when title accepted into system
• What to do with physical books (if used in scanning) when the process is complete
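The tracking points listed above amount to a status history for each title. The sketch below records each stage with a timestamp so that the current phase can always be reported. The stage names follow the list above, but the class and field names are assumptions for illustration, not a description of any particular warehouse's system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class Stage(Enum):
    RECEIVED = "received"
    SCANNING = "scanning"
    TAGGING = "tagging"
    QUALITY_ASSURANCE = "quality assurance"
    REWORK = "rework"
    ACCEPTED = "accepted"


@dataclass
class TitleRecord:
    isbn13: str
    scan_type: str  # e.g. "destructive", "non-destructive", or "none" for PDF input
    history: List[Tuple[Stage, datetime]] = field(default_factory=list)

    def move_to(self, stage: Stage, when: Optional[datetime] = None) -> None:
        """Record that the title entered a stage, stamped with the current UTC time by default."""
        self.history.append((stage, when or datetime.now(timezone.utc)))

    @property
    def current_stage(self) -> Optional[Stage]:
        """The most recent stage, or None if the title has not been received."""
        return self.history[-1][0] if self.history else None


record = TitleRecord(isbn13="9780306406157", scan_type="destructive")
record.move_to(Stage.RECEIVED)
record.move_to(Stage.SCANNING)
print(record.isbn13, "is currently in:", record.current_stage.value)
```

Keeping the full history rather than only the latest stage lets a report show how long each phase took and whether a title passed through rework.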

Storage and Distribution
Once the publisher’s titles have been prepared and accepted into the warehouse, they are ready for storage and distribution. Content should be stored in as many formats as possible so that it can be distributed to as many outlets as are available. This could include PDF, all ebook formats, POD (print on demand), audio, image formats, Flash, and video. Updating and maintaining the content is important for three reasons: it keeps the content manageable as new functionality is added to the system, it allows information to be pulled into reports, and it ensures each format has the appropriate rights attached so that the content can be used in a variety of delivery methods.

Digital Rights Management
Currently, an integral part of distribution is rights management. Digital rights management (DRM) defines what content can be viewed via the web, how it is displayed, what the viewer can do with the content, who can receive content, how much of the content they can receive and what they can do with that content once they receive it. A DRM system monitors incoming requests for content, authenticates those requests, and distributes the content according to the rules the publisher has set up for content distribution: blocked, controlled or complete access. The system should be set up so that the rules can be applied easily to each title coming into the warehouse, and the publisher should have the ability to edit and modify the degree of control granted.
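The three access levels described above (blocked, controlled, complete) can be sketched as a per-title rule that is checked on each request. The percentage-based preview limit for controlled access, and all class and function names, are assumptions for illustration; real DRM systems are considerably more involved.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    BLOCKED = "blocked"
    CONTROLLED = "controlled"
    COMPLETE = "complete"


@dataclass(frozen=True)
class TitleRule:
    access: Access
    preview_percent: int = 0  # share of pages viewable under CONTROLLED access


def pages_allowed(rule: TitleRule, total_pages: int, authenticated: bool) -> int:
    """Return how many pages a request may receive under the publisher's rule."""
    if not authenticated or rule.access is Access.BLOCKED:
        return 0
    if rule.access is Access.COMPLETE:
        return total_pages
    # CONTROLLED: integer arithmetic avoids floating-point rounding surprises.
    return total_pages * rule.preview_percent // 100


rule = TitleRule(Access.CONTROLLED, preview_percent=10)
print(pages_allowed(rule, total_pages=200, authenticated=True))
```

Because the rule is a small value attached to each title, a publisher could change the degree of control by replacing the rule without touching the stored content.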

Reporting
Finally, a reporting and tracking capability is essential. Files must be managed and adjusted over time as needs change or new situations arise, and this can only be accomplished with a complete reporting and tracking system. The publisher should be able to customize the reports to improve their usefulness and effectiveness.

Publisher Considerations
What are the advantages of a digital warehouse?
• All files in one place
• Can reduce costs
• Confidence that file formats meet third-party requirements
• Publisher can control how files are accessed, displayed and distributed
• Uniform product quality
• Easily change what content is available in what manner
• Maximize return on investment
• Expand markets for both backlist and frontlist – take advantage of the long tail
• Expand customer base

What are the disadvantages of a digital warehouse?
• All files in one place
• Similar to having one distribution model for physical books
• Could be expensive to implement for smaller publishers
• Could be expensive to maintain for smaller publishers
• The wrong mix of DRM could actually hinder digital sales
• Poor tracking and reporting will leave the publisher in the dark as to how a title is actually doing in the marketplace

A digital warehouse gives the publisher the ability to meet the demands of the new digital marketplace. It enables the publisher to deliver content in any form, at any time, to anyone on any device. This delivery is limited only by the rules the publisher has established for content display and delivery.

For more detailed information see the following:

BiblioVault
Book Industry Study Group
CodeMantra
Ingram Digital Group
LibreDigital
MPS Technologies


(732) 892-1140 | ©2006 Smerillo Associates | All Rights Reserved