Digital Archiving at the Bahá'í World Centre Library
Scriptum: Newsletter for Bahá'í Librarians & Information Professionals, 8
Abstract: The Bahá’í World Centre has moved into a graphical computing environment based on Microsoft products. This has allowed the Library to experiment with methods of fulfilling its mandate of being “the central depository of all literature published on the Faith…” with regard to collecting electronic or digital materials. These are items received by email, saved from the Internet, or copied from other formats. The Library attempts to work as much as possible with the tools provided as part of the Microsoft suite so as to lessen problems of incompatibility for its users. While strict archiving of the original with no changes would logically seem the goal, in reality some changes have to be made to many of the documents archived to ensure that the “experience” is emulated correctly. Potential problems of access caused by future changes of software and technology are not being addressed at this time.
In its letter of 31 August 1987, the Universal House of Justice spoke of the Bahá’í World Centre Library, saying that it is “the central depository of all literature published on the Faith…”
Since its creation as a professionally run entity in 1977, the Library has been assiduously fulfilling its mandate to be an international depository library by actively encouraging all Bahá’í publishers to send the required number of published items to the World Centre. The collection increasingly included sound and video recordings in electronic formats such as magnetic tape or disks, and computer programs on magnetic disks. As new digital publishing formats became available, such as Compact Discs (CDs), these were also collected.
During the 1980s the World Centre gained access to email, and Library items started to arrive in the form of emails, either as attachments or ASCII text. Such items were printed if possible, and added to the collection, or simply saved to the hard drive for future attention.
A direct, full-time connection to the Internet was established by 1994, and access to remote locations on the World Wide Web became available through such programs as File Transfer Protocol (FTP), Gopher and Lynx. Lynx rapidly became the program of choice, and Web server software was added to the World Centre’s computing environment, allowing departments and individuals to create internal Web pages. At the same time, the first networked graphical monitors, in the form of X-terminals and a few networked PCs, started to make an appearance at the World Centre.
The Library’s first Web pages were created in July of 1995, primarily with Lynx access in mind. These Web pages were pages of links to external sites.
During 1998, the Bahá’í World Centre commenced the transition from a character-based system to a Graphical User Interface (GUI) environment. The infrastructure and hardware were installed during late 1998 and early 1999, putting PCs on each workstation in place of the Visual Display Units (VDUs) previously used. At first these were simply used to access the Unix environment through terminal emulation software.
The next stage of the Computer Technology Transition (CTT) was the roll-out of the basic Microsoft Suite - Word for word processing, Excel for spreadsheets, Outlook for mail, Internet Explorer (IE) for Internet browsing, PowerPoint for presentations, and the like. By late 1999, Microsoft FrontPage for Web page editing was released to “Webmasters” in each Department so that the Intranet could be further developed.
The Library’s existing Web was moved into the new environment and its new potential quickly exploited. However, the limitations were also quickly realised - one being the propensity for Microsoft FrontPage “themes” to suddenly run rampant over all archived Web pages, thus truly destroying their original look and feel. This was solved by creating a second Library web called "Libarchive", in which no “themes” would be used. Once this was in place, the experimental archiving of electronic or digital items began in earnest.
A sub-directory of the Library web has been created called “Electronic Collections”. This is available from the Library’s home page (fig. 1).
This contains a number of sub-directories for the different types of material in the collection (fig. 2).
Documents are saved temporarily to the C: drive of the staff member’s workstation (fig. 3), then copied to the “Libarchive” Web server, and later deleted from the temporary location. Links to the items are placed in the Library Web pages pointing to the actual location of the item in the Libarchive Web.
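The staging routine described above (save locally, copy to the archive, delete the temporary copy, then link) can be illustrated with a short Python sketch. This is not the Library's actual tooling, which relied on the Microsoft suite; the function name, the directory names and the "newsletters" collection are all hypothetical, and temporary directories stand in for the C: drive and the "Libarchive" server.

```python
import shutil
import tempfile
from pathlib import Path


def archive_document(staged: Path, archive_root: Path, collection: str) -> str:
    """Copy a staged file into the archive, remove the staged copy,
    and return the relative path a Library page could link to."""
    dest_dir = archive_root / collection
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / staged.name
    shutil.copy2(staged, dest)   # copy first, so nothing is lost if this fails
    staged.unlink()              # only then delete the temporary copy
    return dest.relative_to(archive_root).as_posix()


# Demonstration with throwaway directories (hypothetical names throughout).
with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "staging"
    archive = Path(tmp) / "libarchive"
    staging.mkdir()
    staged_file = staging / "newsletter.htm"
    staged_file.write_text("<html><body>Sample</body></html>", encoding="utf-8")

    link = archive_document(staged_file, archive, "newsletters")
    demo_result = (link, staged_file.exists(), (archive / link).exists())
```

Copying before deleting mirrors the order given in the text: the temporary copy is removed only once the archive copy exists.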
The items are collected in four major ways.
1. Email or email attachments.
Many Bahá'í newsletters are being received by email. The formats vary from simple ASCII text, to HTML, to attached word-processing documents in Word, WordPerfect or similar programs, to Portable Document Format (PDF) files and Microsoft Publisher files. Working within the Microsoft Suite has allowed ASCII newsletters, and HTML newsletters containing images, to be opened through Microsoft Outlook Web Access and saved as HTML files. When such a file is copied to the "Libarchive" Web server and linked to from the Library Web, the user clicks on the link and sees the document as though viewing it through the Outlook Web Access mail system.
Attached word-processed documents are saved in their native format to the temporary sub-directory on the C: drive. If their native format is not Word, an attempt is made to open them, convert them to the current version of Word and re-save them. The converted file is then copied to the Web server so that the local user will have easy access, as Microsoft’s Internet Explorer will simply open Microsoft Word documents. The original format is not retained unless the document cannot be converted, in which case it is saved to the Web server “as-is” awaiting future developments.
A number of back issues of newsletters saved in the years prior to the advent of the GUI environment remain. These need to be copied to the Web server and linked appropriately.
2. Saved individually from the Web.
Many Web-based items, such as online versions of newspapers, are saved using IE’s inbuilt “Save file as” function. This function saves the web page in one file, and any images visible on the page (which exist on the same server as the page being saved) in a sub-directory with the same name plus the suffix “_files” (see fig. 4).
The links to images embedded within the main file being saved are all re-written during the “save as” procedure to point to the images located in the sub-directory. Both the file and the sub-directory are then copied to the “Libarchive” web server, and linked to from the Library web page. Users then see the web page largely as it was seen on the day it was saved, except for any advertising windows that point to locations outside of the original server. These continue to work within the archived environment, but point to whatever advertisement is being shown there on the day the user is viewing the archived document. If the company or server offering the advertisement disappears or changes, these links degrade and are replaced with blank boxes. Similarly, links within the main page that point to other locations continue to point to the external location and degrade over time.
Figure 4: Web pages saved using IE's save file as function.
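To illustrate which parts of a saved page remain stable and which are liable to degrade, the following Python sketch separates references that point into the local sub-directory from those that still point to outside servers. It is only an illustration under stated assumptions: the sample HTML, the file names and the example.com and example.org addresses are invented, and the Library did not necessarily carry out any such check.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ReferenceCollector(HTMLParser):
    """Collect src/href values, split into local and external references."""

    def __init__(self):
        super().__init__()
        self.local = []
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                # An absolute http(s) address still depends on an outside server.
                if urlparse(value).scheme in ("http", "https"):
                    self.external.append(value)
                else:
                    self.local.append(value)


def classify_references(html_text):
    collector = ReferenceCollector()
    collector.feed(html_text)
    return collector.local, collector.external


# Invented sample resembling a page saved with its "_files" sub-directory.
sample = (
    '<html><body>'
    '<img src="article_files/photo.jpg">'
    '<img src="http://ads.example.com/banner.gif">'
    '<a href="http://www.example.org/other.htm">Related</a>'
    '</body></html>'
)
local_refs, external_refs = classify_references(sample)
```

Here the photograph is held in the archive, while the banner and the outbound link are the kind of references that change or vanish over time.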
Files are organised by date within a subdirectory for each online resource, by virtue of a naming convention such as:
If an item is withdrawn from the collection, it and its associated “*_files” sub-directory are deleted from the “Libarchive”, as is the link from the Library’s web page.
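A withdrawal of this kind might be sketched in Python as follows. The helper name and the file names are hypothetical, a temporary directory stands in for the “Libarchive” server, and removing the link from the Library’s web page is left as a separate manual step.

```python
import shutil
import tempfile
from pathlib import Path


def withdraw_item(page: Path) -> list:
    """Delete a saved page and its companion "<name>_files" sub-directory.
    Returns the names of whatever was actually removed."""
    removed = []
    companion = page.with_name(page.stem + "_files")
    if companion.is_dir():
        shutil.rmtree(companion)
        removed.append(companion.name)
    if page.is_file():
        page.unlink()
        removed.append(page.name)
    return removed


# Demonstration in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    saved_page = root / "article.htm"
    saved_page.write_text("<html></html>", encoding="utf-8")
    (root / "article_files").mkdir()
    (root / "article_files" / "photo.jpg").write_bytes(b"")

    removed_names = withdraw_item(saved_page)
    leftovers = sorted(p.name for p in root.iterdir())
```

Deleting the page and its sub-directory together avoids leaving orphaned image folders behind in the archive.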
3. Entire Web sites or sub-sites saved from the Web.
One desirable outcome is to save “snapshots” of Web pages produced by Bahá’ís, especially those produced by institutions of the Faith, on a regular basis. To this end a free program called WebStripper was downloaded. This program allows the user to enter the Uniform Resource Locator (URL) of the desired web site, set some desired parameters, then “strip” the entire site (within the pre-set parameters) to a pre-determined sub-directory on a local or networked hard drive. The saving procedure can take a number of hours depending on the speed of the connection.
The architecture of the original Web site is maintained but not all items are saved correctly, e.g. Java applets cause some problems.
Figure 5: WebStripper being used to archive the Australian Bahá'í Community's web site
Saving snapshots of web-sites on an ongoing basis will provide future researchers with information about the level of expertise of communities, and the technology at their disposal, in much the same way that the quality of paper, binding, inks and printing methods speak volumes about the capacity and ingenuity of publishers in the traditional media.
This will naturally take large amounts of computer storage, just as saving items published in physical formats takes large amounts of shelving space.
4. Materials digitised from other formats.
When computer diskettes are received containing published items, the diskette is copied to the network drive as a form of long-term conservation. The diskettes or CDs themselves degrade over time, and are likely to be “orphaned” as the technology required to read them is abandoned. When saved to the network drive they become part of the current system, which needs to be migrated forward with each update in computing software. There is no guarantee that this will succeed, but it is felt that there is a greater chance of future access being possible by following this path than by attempting to remember to retrieve and migrate the documents on diskettes.
Programs on diskettes such as early versions of Multiple Author Refer System (MARS), which rely on certain hardware and software configurations, have not yet been copied to the network. Even if they are copied, it is highly unlikely that they will work, so all that will remain for future historians are “fingerprints” of that earlier publishing endeavour.
No systematic process is in place to digitise existing published items, but if a copy of a printed item is scanned during the course of daily work, a copy is placed on the Web server so that it will be available to all. The Library will initiate such action when particular topics are “hot” (such as that of “Epochs” in January 2001). By providing access to items in digital form, the Library hopes to head off multiple requests for photocopies.
Web radio broadcasts are a particular challenge. Due to constrained resources it is necessary to block access to streaming audio and video sites during World Centre work hours. Software capable of saving such files has not yet been brought into the institutional environment. To date, three BBC programs mentioning the Faith have been saved by individuals, who taped them from the air or from the BBC web site on home PCs. These tapes have then been digitised by an individual using free software and saved on the “Libarchive” web. “Screen shots” of the BBC pages on the days of the broadcasts were taken and saved as image files on the web. FrontPage’s picture editing function was then used to create a “hot-spot” on the image that would link to the digital sound file.
In this way the experience of users listening to the BBC Web-broadcast has been emulated to a certain extent.
Access to the electronic collection is currently limited to users at the World Centre. However, if the experiment becomes standard Library procedure, a decision will be made as to whether or not to open the collection (or parts of it) to the world through the Library’s existing public web page (note 1).
Other National Library experiments
According to a session given at the 66th Annual General Conference of the International Federation of Library Associations and Institutions (Jerusalem, August 2000), there are two major ways that other National Libraries are attempting to save Web based publications.
1. Download and archive all web pages ending with the national country code [e.g. *.se for Sweden], thus saving a "snapshot" of the country's Web at stated intervals for future researchers to "mine". This entails the hope that future software will still be able to read it when it is needed. [Cf. Sweden - The Kulturarw3 Project - The Royal Swedish Web Archiw3e - An example of "complete" collection of web pages (note 2)]
2. Use standard library collection development guidelines to analyse the Web and selectively collect items deemed to be of lasting value. This entails both the commitment of future resources to migrate items through future software generations so that they are always available to researchers [cf. Australia - the Pandora Project (note 3)], and the commitment to continue to analyse the collection and weed those items whose value has diminished, per standard library practice.
Deposit libraries (e.g. Library of Congress, the British Library, Bibliothèque nationale de France) require huge amounts of shelving [LC - 850 km (note 4)] and will continue to do so. Now, in addition, they require huge amounts of computer storage to fulfil their mandates:
"To retain in perpetuity a copy of all material published ... in order to ensure that [the people] will have access to the accumulated knowledge, activities and achievements ... in all forms of human endeavour" [cf. Pandora Project]
This conference session was a source of both hope and despair. Hope, that the pressure of researchers using the world's major libraries will ensure that backwards compatibility in software will be a demand supplied by commercial interests. Despair, that every day published items about the Faith are being lost because the limited resources of the world's only Bahá'í deposit library curtail our ability to search, find, analyse, save and catalogue Web based publications.
The future International Bahá'í Library will be the world's leading Library institution. In order to build the electronic collection for that future library, the Bahá'í World Centre Library needs to move towards a combination of these two philosophies:
1. All domains including the word "bahai" to be downloaded at stated intervals.
2. All Web pages mentioning the Bahá'í Faith or individual Bahá'ís to be analysed and those deemed of lasting value to be saved.
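As a rough illustration of how the two philosophies could be combined, the Python sketch below sorts candidate addresses into those downloaded automatically, because the domain name contains "bahai", and those passed to a librarian for selective review. The function name and the sample addresses are hypothetical, and the rule is a deliberate simplification of what real collection development would involve.

```python
from urllib.parse import urlparse


def triage(url):
    """Return "download" when the host name contains "bahai" (philosophy 1),
    otherwise "review" for selective appraisal (philosophy 2)."""
    host = (urlparse(url).hostname or "").lower()
    if "bahai" in host:
        return "download"
    return "review"


# Invented addresses using reserved example domains.
candidates = [
    "http://www.bahai.example/news/index.htm",
    "http://news.example.com/2001/01/bahai-story.htm",
]
decisions = [triage(u) for u in candidates]
```

The second address mentions the Faith only in its path, not in its domain, so it falls to human appraisal rather than automatic download.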
Achieving these ends will require increased resources in the years to come - technical, financial, and especially human - in the form of intelligent, imaginative, experienced and devoted graduates of Library and Information Schools from around the world.