Metadata

Here's a simple description of library metadata in PowerPoint form:

documents/metadata-demystified.pdf

Here's an example of managing a metadata project:

documents/metadata-strategy.pdf

Metadata Standards

From Jewel (AISTI consultant who assisted SFI in Feb. 2004)

comm.nsdl.org/download.php/196/PC3Draft1b.pdf

(If you have problems accessing the paper, let me know.)

I'll see if I can get a hold of the final version, but in the meantime, you may wish to look at section 1.2.13.1, which provides a definition of native metadata, and other types of metadata for a collection.

This document may be useful as you redesign your systems/db in the next phase. ;-) I will also read through the entire document; at this point, I've skimmed through it. At that point, if you don't mind I may make some further recommendations for SFI, based on what we have discussed.

SFI Ideas for an Information System (chronological order)

Early Discussion of Open Archives at SFI-2002

Date: Mon, 4 Nov 2002 13:29:19 -0700 (MST)
From: Margaret Alexander <mba@santafe.edu>
Subject: eprint server
To: ellen@pele.santafe.edu, gumerman@pele.santafe.edu, rkbv@pele.santafe.edu, grr@pele.santafe.edu
Cc: cmachado@aisti.org, mba@pele.santafe.edu

Last week George, Ronda, Tim, Ginger, and I had a preliminary meeting about putting SFI's working papers on a new e-print server that is being developed by our consortium (called AISTI) of sci-tech libraries. During the course of a D.C. meeting with other libraries about e-print servers, most speakers concurred that policy issues took longer than technical ones.

MIT Libraries developed a repository for the electronic scholarship of its university called DSpace. In the process of creating DSpace, they compiled a list of institutional questions/policies that had to be answered or written before the electronic repository went into operation. Here they are:

1. Who is qualified to submit electronic content? At SFI this question is taken care of by our policies for working paper submission. (In the future, SFI may not want to limit its submissions to just working papers. However, our "first step" in this endeavor should be working papers.)

Ronda: I agree. We can, however, readdress the working paper policies as needed.

2. What is the character of submissions? At SFI this question is taken care of by our policies for working paper submissions. (At other repositories learning objects such as course materials, theses, and dissertations may be submitted.)

3. What forms of submissions are accepted? At SFI this question is taken care of by our policies for working paper submissions.

4. What are our expectations for retention? This question probably applies to learning objects which are obsolete when a course is completed. At SFI we might ask what would happen if we decide to withdraw a working paper--is this covered in our current policy? At UCLA the policy is that the citation (metadata) cannot be removed but the content can be.

Ronda: I agree that we may want to delete the electronic versions of the papers but not the actual citation from the archive. (Currently, we leave the number but delete the title and author to avoid confusion.) I think there needs to be a comments field or option which specifies that this paper has been withdrawn, and maybe by whom, such as "withdrawn by author."

5. How do repository policies interact with other policies? At SFI we may be simple enough administratively to ignore this concern.

Ronda: One possible conflict is that student and nonfaculty members must have their working papers reviewed by an SFI faculty member or Science Board member. If the archive is truly open (anyone can submit), then the designation as an SFI working paper would have to be delayed until this review took place and the individual's connection to SFI verified.

6. Are there privacy issues? At SFI, this question might entail asking each individual working paper author if it's o.k. to submit SFI back files to the new e-print server.

Ronda: Since the authors have already consented to listing the paper as an SFI working paper, and we are simply changing how our papers are presented, I believe we can make the transition to another server by notifying the authors of our intent rather than getting permission in writing from each one. They can contact us if they object. Note that some past authors may be difficult to contact.

If you think about it, these papers have been in the public domain for some time now so they are not proprietary. And as long as the server remains open, we are not violating the original "agreement" with the authors. We just need to remember that we don't own the copyright for any of the papers.

7. What metadata will we require? Is the metadata at SFI consistent, complete, and accurate? If not, who will do the work?

Ronda: For SFI's working papers, we have a great deal of information in a FileMaker database and this data has been checked. We do not have electronic copies of all papers back to 1989, nor do we monitor the quality of these papers per se (we simply check to see if they print correctly). We will have to decide later whether we want to recreate key historical papers, particularly those that are frequently requested, and what to do about quality control.

The conversion of our database to the metadata is an issue.

The following need to be asked of the AISTI eprint server:

8. What if the eprint server fails? At SFI, since we already have a working paper site, we are protected.

Ronda: Not if we defer to the server instead. I thought this is what George said at our meeting--it would replace what we have. We will continue to track key information about each paper and the abstract, but will have to decide whether to, and how to, archive electronic copies. And we would need a plan for how to make this information available should the eprint server fail.

9. What if revenue comes from the eprint server? We also need to make certain that the eprint server remains open and does not require licensing.

10. Who incurs the costs of running the eprint server over the long-term? Initially, the eprint server has been funded by LANL with a grant from AISTI.

11. How will the eprint server be used? For example, can outside researchers use the repository as a "sand box" for their own research?

Ronda: I think having a repository that accepts a variety of papers is fine as long as we protect those papers classified as SFI working papers. Hence, I think it would be helpful to have some classification of articles by broad topics, such as complexity, as well as key words and specific paper series. For example, if an author submits a paper, it is not listed as an SFI wp until someone designated at SFI agrees or verifies this.

I am very excited about this project. I hope the researchers are as enthusiastic. Good luck at your presentation.

Let me know your thoughts on how these questions should be answered. On Nov. 13 I will introduce the eprint server at the noon time faculty meeting.

Margaret

Complexity Portal-2003

Date: Sun, 16 Nov 2003 14:19:37 -0700
From: "Marcus G. Daniels" <mgd@santafe.edu>
To: Brent Jones <brent@santafe.edu>
CC: "James P. Crutchfield" <chaos@santafe.edu>, Kevin Drennan <kevin@santafe.edu>, Supriya Krishnamurthy <supriya@santafe.edu>, John Miller <miller@santafe.edu>, Sarah Knutson <srk@santafe.edu>, walter@santafe.edu, Ginger Richardson <grr@santafe.edu>, mba@santafe.edu
Subject: Re: Updated Report

Brent Jones wrote:

> I would also suggest that Margaret be there, not for technical input,
> but because she has ideas and concerns of how SFI should fit in with
> the rest of the electronic library world. We need to make sure we're
> all thinking along the same lines. I think we need to meet as a
> group, and not just Kevin, Marcus and myself, as this is pretty much
> the core of SFI as the Complexity Portal.

FYI, I installed DSpace on an Econophysics machine at SFI. (I understand that Laura, coordinating with Margaret, is learning how to upload Working Papers into an offsite DSpace server.)

DSpace is a Java server package / web app that maintains digital library collections. The main web page is http://www.dspace.org, but the DSpace-in-use page is https://dspace.mit.edu/index.jsp. Roughly, DSpace is to digital collections (e.g. PDFs, Quicktime, MP3s) as Millennium (the software hosted at UNM for tracking SFI's library collection) is to books and journals. Millennium is a large and complex commercial software package used by libraries around the world. It maintains the database of the library collection, tracks circulation (including over the web), and provides optional modules for cataloging as well as features for ordering, receipt, payment, and budgeting of library material. The company that makes Millennium, Innovative Interfaces Inc., is sort of the Microsoft of the field, and the software is not cheap.

The analogy is not perfect, as the overlap between journals and digital journals is not a terribly important one. Some things change, e.g. circulation changes to some extent, due to the possible no-cost redistributability of digital works. Other things don't change, e.g. the determination of `subject' categorizations. It's quite common to use conventional library software like Millennium to provide library patrons with a web interface to collections that subsume digital works.

One practical problem for SFI is that we don't have direct control over the Millennium software configuration. UNM runs that. If you go to the Libros web site and find some digital file reference, many of these references are not `links', they are just ad-hoc embeddings of URLs that aren't clickable. Perhaps at UNM there are software clients in the library configured such that these references are clickable, but at least the web interface isn't perfect in this regard. So one question is the extent to which the digital works are cataloged, and another question is whether the Millennium web module is configured in a way that can be integrated into the SFI Complexity Portal.

At SFI, we have uncataloged works, and the cataloging that is done is outsourced. By that, I mean that Working Papers are basically invisible to the library world. Google will pick things up based on keywords as the abstracts of the Working Papers are on the SFI website, but a normal literature search (say as performed by a reference librarian doing his everyday job for an assistant professor) might not. An ISI citation search (a commercial indexer of journals) won't find Working Papers, and neither will a canonical OCLC search. OCLC is a consortium of worldwide libraries (including Library of Congress) that manage shared cataloging records and tools.

We also don't list all of SFI's digital subscriptions with Libros, so that's another obstacle to using the existing UNM Millennium server as a tool to query SFI's digital holdings. With regard to the Complexity Portal web site, the way it would work would be that a user would submit a request via the Z39.50 protocol to Millennium, and it would send back USMARC records that could be taken apart and formatted as HTML. This approach has two advantages. First, it integrates SFI into the library community as a good citizen, and second, it provides a single interface to query the physical and digital holdings of SFI. Z39.50 is a standard protocol libraries can use to query each other, and there are open-source packages for using it. The larger computing committee report has details on this.
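To make the "taken apart and formatted as HTML" step concrete, here is a minimal Python sketch. It assumes the Z39.50 query and the decoding of the raw MARC bytes have already happened, so it starts from a hypothetical, already-decoded record held as (tag, subfields) pairs. The tags 100, 245 and 650 are standard MARC 21 tags for main author, title and topical subject; the sample values, the `LABELS` table and the function name are made up for illustration and are not part of any portal design.

```python
from html import escape

# Hypothetical, already-decoded record: (MARC tag, {subfield code: value}).
# Fetching it over Z39.50 and decoding the raw record are not shown here.
record = [
    ("100", {"a": "Kauffman, Stuart A."}),
    ("245", {"a": "The origins of order"}),
    ("650", {"a": "Biology"}),
]

# Display labels for the few tags this sketch understands.
LABELS = {"100": "Author", "245": "Title", "650": "Subject"}


def record_to_html(fields):
    """Render the labelled fields of one record as an HTML definition list."""
    rows = []
    for tag, subfields in fields:
        label = LABELS.get(tag)
        if label is None:
            continue  # skip fields this sketch does not know how to label
        value = " ".join(subfields.values())
        rows.append(f"<dt>{escape(label)}</dt><dd>{escape(value)}</dd>")
    return "<dl>" + "".join(rows) + "</dl>"


html = record_to_html(record)
```

A real portal page would cover many more tags and handle repeated subfields; the point is only that once a record is decoded, turning it into HTML is a small formatting job.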

Cataloging has three purposes. The first is to put identifiers on a work, e.g. a physical home, authors, publishers, dates, page ranges, etc. Secondly, it adds classification information so that patrons of a library know where to go looking for a work in the library. The call number encodes information about what subfield the work relates to and then additional numbers to discriminate it from other works in that subfield within that library. Thirdly, the cataloging records can be extended with additional information, such as detailed subject headings. A work can have multiple abstract subject attributes, and systematically encoding this information can help people find works of interest. Finally, and most importantly, the collective process of cataloging by librarians across different libraries and the sharing of this information in a systematic way leads to deeper insights into how works relate to one another.

The systematic way this is done is, for the most part, with MARC records. MARC records encode a lot of information, and creating them is a labor intensive process. It's labor intensive because even if it is obvious what the subject is (What subject is a book with a title like "Landscape Terrain Variation" -- unless you skim the book, is it about how to design your flowerbeds or is it about the problems a population might encounter in dealing with changes in their habitat?), there is still a need to look up the subjects in tables and assign the right codes. A typical MARC record is about a page of text. To give an idea of the resolution of the subject headings, Stuart Kauffman's _Origins of Order_ is just listed under Biology, but Melanie Mitchell's _Introduction to Genetic Algorithms_ is listed under various Genetics headings, Mathematical Models, Computer Simulation, and Algorithms. Stephen Wolfram's _New Kind Of Science_ (gasp) is listed under Cellular automata and Computational Complexity. It's possible to introduce new subject headings, and there are various committees at MARC, OCLC and the Library of Congress that evolve the standard set, but one would have to contact these organizations and make an argument. (Say, if you wanted to introduce "Econophysics" as a subject heading.) The current Library of Congress guide to subject headings has 270,000 entries.

Basically, the upside of coding MARC records is that it is the bread and butter of librarians; it is well understood. The generation of MARC records potentially can be useful for search, due to the thought given to the cataloging. I say "generation" to discriminate from "use". If SFI cataloged the Working Papers, `deep' subject headings could be given that would be meaningful and useful to SFI-affiliated researchers (or anyone else for that matter). However, if a work exists and anyone else has it, one can consume other libraries' MARC records for local use (most frequently, the Library of Congress MARC records). The downside for _generation_ of new records (i.e. labor, not the submission itself) is that the cost of entry is high enough that less complex alternatives can be attractive.

One alternative in use by DSpace is the Dublin Core. Dublin Core doesn't require all of the detail of MARC records in order to be correct, but it can encode and use it if the information is provided. With DSpace, contributors enter some basic information about a document (author, title, e-mail), and then upload a paper. Then, as an optional step, there is a tool in the package to do cataloging (a different person can do this). The point is, with DSpace what you get may or may not be cataloged and may or may not be cataloged completely (relative to what would be done in a conventional package like Millennium by librarians). This characteristic could be an advantage or a disadvantage. Note that existing DSpace servers (e.g. see the MIT DSpace URL above) don't replace the main library catalog, they just augment it -- presumably with (digital) material that doesn't warrant the level of cataloging attention given to material in the main library. Material like doctoral theses, working papers, and internal documents.
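As a rough illustration of "may or may not be cataloged completely", the Python sketch below builds two Dublin Core descriptions of the same made-up paper: a minimal one with only title and creator, and a fuller one with subject headings added later. The namespace URI and the element names (`title`, `creator`, `subject`) are the standard Dublin Core ones; the `record` wrapper element, the helper name and the sample values are assumptions for the example, not what DSpace itself stores.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core(fields):
    """Build a simple Dublin Core record from (element, value) pairs."""
    root = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return root


# What a contributor might enter at upload time (illustrative values).
basic_fields = [
    ("title", "An example working paper"),
    ("creator", "Doe, Jane"),
]

# The same record after an optional cataloging pass adds subject headings.
cataloged_fields = basic_fields + [
    ("subject", "Cellular automata"),
    ("subject", "Computational complexity"),
]

minimal = dublin_core(basic_fields)
cataloged = dublin_core(cataloged_fields)
xml_text = ET.tostring(cataloged, encoding="unicode")
```

Both records are usable; the second simply carries more for a search to work with.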

DSpace has two potential advantages, but neither has really been realized yet in the current version. The first, more general idea is that it can be told about different kinds of digital content. In principle, DSpace could be taught how to take apart an archive of a simulation and run it in some web context. This could be difficult or impossible to do with a package like Millennium because of the server-side customizations that could be needed. There doesn't seem to be any code in DSpace as of the current version to facilitate this. It's just an idea. Secondly, DSpace aims to provide searching of full text contents in the next version. This is not the domain of the main Millennium package, but there seem to be some products from the company that may provide this to some extent.

In terms of exactly what DSpace is and how it is set up, here is some background. DSpace is an open source Java project being led by MIT Libraries and Hewlett Packard. It runs inside of the Tomcat JSP server. JSP, or Java Server Pages, is a way to write dynamic web pages in Java. Its primary competitors are PHP (kind of low-tech in comparison) and ASP (from Microsoft). Tomcat is free from Apache. DSpace uses JDBC (Java -> SQL) interfaces for maintaining its databases. In other words, DSpace stores stuff in an RDBMS. They suggest Postgres (which is how I set it up), but MySQL should work too. So, it ought to be possible to query the DSpace database from outside of DSpace and collect or install information.

Specifically, it should be possible with a little bit of technical work to import the whole of the cleaned working paper bibliography into DSpace in batch (rather than full re-entry), and keep the Endnote version in sync through a procedure similar to the retrieval of citations from ISI (rather than co-entry into Endnote & Filemaker & DSpace).
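A batch import along these lines could be prepared with a short script. The Python sketch below writes one directory per paper, each holding a `dublin_core.xml` file and a `contents` file naming the PDF. That layout is an assumption based on my understanding of DSpace's batch import format and should be checked against the documentation for the installed version. The sample row, the field choices and the function names are hypothetical, and the PDF itself is not copied.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical rows exported from the working paper bibliography.
papers = [
    {
        "number": "00-00-001",
        "title": "An example working paper",
        "authors": ["Doe, Jane", "Roe, Richard"],
        "pdf": "00-00-001.pdf",
    },
]


def write_item(base_dir, index, paper):
    """Write one item directory (assumed layout) and return its path."""
    item_dir = os.path.join(base_dir, f"item_{index:03d}")
    os.makedirs(item_dir)

    root = ET.Element("dublin_core")

    def add(element, qualifier, text):
        node = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        node.text = text

    add("title", "none", paper["title"])
    for author in paper["authors"]:
        add("contributor", "author", author)
    add("identifier", "other", paper["number"])

    ET.ElementTree(root).write(
        os.path.join(item_dir, "dublin_core.xml"),
        encoding="utf-8",
        xml_declaration=True,
    )
    # The contents file lists the files that belong to the item, one per line.
    with open(os.path.join(item_dir, "contents"), "w", encoding="utf-8") as handle:
        handle.write(paper["pdf"] + "\n")
    return item_dir


def write_batch(base_dir, rows):
    """Write every row as its own item directory under base_dir."""
    return [write_item(base_dir, i, row) for i, row in enumerate(rows)]


batch_dir = tempfile.mkdtemp()
item_dirs = write_batch(batch_dir, papers)
```

The same rows could feed the Endnote side, which is what would keep the copies in sync without retyping.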

The use of Millennium and/or DSpace as a tool for the Complexity Portal implies certain priorities in terms of how SFI thinks about the accessibility of resources needed by scientists at SFI and how the work done by SFI scientists is made available to the world. Some of these tools introduce technical constraints the committee should be thinking about.

SIS Overview-Feb. 2006

SFI SCIENTIFIC INFORMATION SYSTEM by Sperling Martin

DETAILED REQUIREMENTS DEFINITION

HIGHLIGHTS

February 6, 2006

Table of Contents

SIS DETAILED REQUIREMENTS DEFINITION OVERVIEW
  Background
SUMMARY OF CURRENT FUNCTIONAL REQUIREMENTS
  Research Content Processes
  Content Creation
  Content Dissemination
  Collaboration
  Calendar
  Workflow
  Workflow - Visitors
  Workflow - Events
  Workflow - Content Dissemination
  Business Network and International Programs
CONTENT ARCHITECTURE REQUIREMENTS
  Repository Content Architecture Overview
  Metadata
CONTENT SPECIFICATIONS
  People Content
  Research Content
  Institutional Content
  Educational Content
  Business Network Content
  International Program Content
ACCESS SUPPORT SERVICES
  Search Engine
SIS IMPLEMENTATION AND PLANNING CONSIDERATIONS
TECHNOLOGY OVERVIEW
  SIS Tool Set
  Process Integration Services to be Implemented
NEXT STEPS
  Provisional SIS Project Implementation Plan
OTHER REQUIREMENTS FACTORS
  OAI
  DRM
OTHER MATERIAL IN THE DRD REPORT

SIS DETAILED REQUIREMENTS DEFINITION OVERVIEW

Background

“The need for Scientific Information System (SIS) is compelling; to truly function as an “institute without walls,” SFI must be able to connect to and leverage the knowledge of its far-flung research community.” (SFI Proposal For A Scientific Information System)

The Detailed Requirements Definitions phase of the SFI Scientific Information System project was launched in August 2005 to develop system requirements at a next level of detail beyond the functional objectives described in the May 2005 SFI Proposal For A Scientific Information System. The main goal of the Detailed Requirements Definition project is to explain the concepts and required features of the prospective SIS so that system integration organizations can determine how they will bid on such a project.

The primary focus of the SIS must be on the science and research work of SFI, not on any particular technology or the Institute's operation. The guiding theme is about SFI doing good science -- that is the number one priority – all else is secondary. But, communicating the SFI mission effectively will ensure that the Institute maintains the resources and support to continue as a cutting-edge research institute.

One of the most immediate needs for SFI is to be able to use information tools to operate efficiently. The administrative data management processes and procedures currently in place to support internal operations and the overall SFI mission do not accomplish that as effectively as they should. The Institute is both an academic endeavor and a business, so the requirements elucidated in the Detailed Requirements Definition cover science information needs as well as operational components of the Santa Fe Institute.

SUMMARY OF CURRENT FUNCTIONAL REQUIREMENTS

The requirements described in the following sections identify the desired system functionality and content specifications that will enable SIS users to accomplish their tasks. Functional requirements are specifications of the actions that the system must perform. The requirements listed here are described in greater detail in the full Detailed Requirements Definition document, draft 3.0.1. How the functions will be implemented is part of the subsequent software design phase. Explicit software specifications will come as part of the system design process leading to the implementation of the SIS. The Detailed Requirements Definition document is a step between the functional overview of the SIS Proposal and the detailed software specifications to be developed next.

Research Content Processes

The following tabulates the essential requirements for the processes and technology that will connect to the SIS repository of Research Content.

RR-1: Access to SFI hosted research documents and related material in full-text form structured under the SIS XML architecture.

RR-2: Provide access to SFI archived or work-in-progress content in unit object formats as a master information component for dissemination.

RR-3: Provide robust content integrity maintenance procedures. Updates and backup processes that follow rigorous operational procedures.

RR-4: Support content treatment for content dissemination with associated applications such as editorial tools that enforce adherence to the XML structures and SFI editorial standards.

RR-5: Access to content according to links and relationships to existing SFI content sub-structures – driven by the SIS metadata.

RR-6: Access to the repository directly navigable from the SFI home page through the SIS search engine.

RR-7: Support means to extract digital objects or elements by type and user class according to defined workflow criteria.

RR-8: Support virtual library processes including serving as a repository for primary source material used in research, by affiliated institutions.

RR-9: Allow for flexible packaging of content for dissemination in defined digital and print forms as well as for re-use and repurposing into new content formats and new media.

RR-10: Permit dynamically integrated static and real-time data forms into content accessed for dissemination processes.

RR-11: The SIS repository must be designed to accommodate other internal and external digital assets that are accessible through standard XML-based content exchange protocols.

RR-12: Allow access to all existing SIS content and work products – work in progress version tracking to include the current version and at least the most recent past version (current -1).

RR-13: Provide transparent access to MOU-based external content repositories at other sites – external content source and ownership being obvious.

RR-14: Repository management, directory, and content controls should include audit trails for all edit and content treatment transaction cycles – integrated with workflow processes.
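As one reading of RR-12 (keep the current version and at least current -1), the Python sketch below retains only the two most recent versions of a work unit. The class and method names are hypothetical, and a real repository would also store the author, timestamp and audit data that RR-14 calls for.

```python
from collections import deque


class VersionedUnit:
    """Keep the newest `keep` versions of a work unit (default: current and current -1)."""

    def __init__(self, keep=2):
        # A bounded deque silently drops the oldest version once it is full.
        self._versions = deque(maxlen=keep)

    def save(self, text):
        self._versions.append(text)

    @property
    def current(self):
        return self._versions[-1] if self._versions else None

    @property
    def previous(self):
        return self._versions[-2] if len(self._versions) > 1 else None


unit = VersionedUnit()
for draft in ("draft 1", "draft 2", "draft 3"):
    unit.save(draft)
```

After the three saves, `unit.current` is the third draft and `unit.previous` the second; the first draft is gone. Raising `keep` retains a longer history.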

Content Creation

Content creation processes are aimed toward the preparation and packaging of SIS content for any SFI information products or services.

CC-1: Utilize structured authoring/editing tools that enforce XML compliance to facilitate editorial interaction with the repository content that may be disseminated in some structured, presentational method.

CC-2: Any authoring tools should provide viewing, navigation, and editing functions that are specific to structure, as well as content while offering the user a predefined view of the data.

CC-3: Facilitate the communication between SFI and external authors for working on in-process content objects -- providing authors with content in their preferred formats from the master publication repository for working on content object updates is a secondary requirement.

CC-4: Spelling checker based on standard and SFI customized dictionaries – these are usually built in functions that allow special terms to be added to a supplemental spelling dictionary – including all defined SFI Metadata terms.

CC-5: Access to SFI Metadata for content indexing: identify/mark index terms or referents for link value association with the SFI metadata; maximize use of editorial aids and indexing tools; consider evaluation and deployment of automatic abstracting tools.

CC-6: Capability to insert comments in the work-unit-in-process. Inserting comments that do not affect the content of the text of a work-unit. Support author/writer/editor electronic sticky notes.

CC-7: Highlight additions and deletions for content updates, including associating those changes as part of developing and validating hyperlinks and classification under SFI’s metadata. Allow references and cross-publication links to be created and validated during editing sessions.

CC-8: Support remote authoring and editing to manage and coordinate collaborative content creation processes with the SIS workflow tool.

CC-9: With appropriate authoring tools, prepare content with or without document structure tags visible – function as a tag display toggle switch.

CC-10: Allow full integration of materials and content by specified content units – chapters, sections, sidebars, etc. as defined in the XML architecture and support the integration of all content forms including images, video, audio, etc. that can be packaged for dissemination processes.

CC-11: Support extensive footnotes and citation references with format rules and consistency guidelines enforced through editing templates. Allow integrated use of tools such as EndNote.

CC-12: Access to a content work-unit for internal editorial review under collaborative workflow control with proper check off and validation/confirmation of each electronic hand-off.

CC-13: Displayed and hidden (non-displayed) comments (e.g., questions, routing directions, etc.) in the work-units as part of the review process.

CC-14: Templates should be developed that enforce consistent metadata treatment of the content and be integrated or accessible by the authoring and editing tools.

Content Dissemination

These requirements cover the content treatment processes geared to dissemination of SFI information products and services.

CD-1: All content packaging and dissemination processes must work with an XML content architecture.

CD-2: Facility to set up style sheets and product content packaging templates for print and electronic products - manage diverse application instances of rendering rules for content presentation.

CD-3: Capability to handle complex information structures including still and moving images, mathematical expressions, chemical notations, extensive footnotes and citation treatment.

CD-4: Must work with compound documents -- including integrated rich media.

CD-5: Editing and content treatment capabilities must be available to handle rich media forms – digital video and image editing.

CD-6: Basic audio editing should be part of the dissemination content preparation and packaging capabilities to allow content editors to prepare conference session recording sets, extracts for sound bites and other audio material for dissemination.

CD-7: Publishing process editing tools should include functions to validate links and to flag erroneous links/circular references, etc.

CD-8: Full design control for content display formats for web-delivered content. Design control features would directly link to dissemination template development functions.

CD-9: The publication processes of the SIS must not preclude the use of outside composition vendor services as necessary for print product. Any change to content made by outside vendors must be reflected in the SIS master version of that content.

CD-10: Based on calendar or workflow tool triggers, feed content to dissemination targets using templates that are presentation form optimized – content in XML wrapper for composition vendors or to publishers in compatible delivery forms.

CD-11: Support webcasting and “webinars” consisting of rich media content (seminars, workshops, etc.) – also support links to SIS e-Learning tools.

CD-12: Disseminated content objects will be maintained in the repository with editorial history data as part of their metadata values.
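The link check in CD-7 can be sketched in Python as follows, assuming the links are available as a plain mapping from each content id to the ids it references. The function name, the data shape and the sample ids are all hypothetical. A link is reported as broken when its target is not in the mapping, and an id is reported as circular when following links from it eventually leads back to it.

```python
def check_links(links):
    """Return (broken, circular) for a mapping of content id -> referenced ids.

    broken:   sorted list of (source, target) pairs whose target is unknown.
    circular: sorted list of ids that can reach themselves by following links.
    """
    broken = sorted(
        (src, dst)
        for src, targets in links.items()
        for dst in targets
        if dst not in links
    )

    def reachable(start):
        # Every known id reachable from `start` in one or more steps.
        seen = set()
        stack = list(links[start])
        while stack:
            node = stack.pop()
            if node in seen or node not in links:
                continue
            seen.add(node)
            stack.extend(links[node])
        return seen

    circular = sorted(node for node in links if node in reachable(node))
    return broken, circular


sample_links = {
    "a": ["b"],
    "b": ["a", "missing"],  # points back to "a" and to an unknown id
    "c": ["a"],             # links into the loop but is not part of it
}
broken, circular = check_links(sample_links)
```

Repeating the reachability search for every id is slow on a large repository but easy to verify; a production tool would probably use a single-pass graph algorithm.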

Collaboration

The following highlights the main requirements for the collaboration processes of the SIS.

CS-1: Allow a user to grant permissions to other users on the system and share related information with them using permission-level features.

CS-2: Create personal, public and project workgroup folders.

CS-3: Collaboration system integration with the email system to move messages and attachments from an inbox to project folders.

CS-4: Create contact lists, including importing from other Institute repository sources.

CS-5: Export contact lists to other formats and provide shareable distribution lists, and calendar entries.

CS-6: Create reminders and tasks that link to the Calendar and Workflow systems.

CS-7: Support moderated discussion groups and weblogs that are tied to project folders.

Calendar

The requirements for the calendar processes that are core to the day-to-day operation of the SFI are covered here.

AC-1: Multiple calendar display options.

AC-2: Efficiently support scheduling of resources, events and appointments.

AC-3: Easy creation of appointments by user class levels of access.

AC-4: Direct links to SFI email system for event and notice postings and feedback.

AC-5: Display free/busy periods.

AC-6: Quick overview displays of schedules of events, participants and resources.

AC-7: Prevent booking conflicts of SFI resources – with e-mail notification for resolution to Calendar administrator and designated back up.

AC-8: Easy means to create personal and integrated calendars. Integrated calendars means dynamically combining two or more personal calendars by project or research endeavor.

AC-9: Automatically notify users – individuals, workgroups, defined user list members, user classes, etc. about events as well as trigger “tickle” reminders on set schedules or priority notification cycles that are calendar system definable.

AC-10: Publish select calendar displays to the web site or make them available on request from designated user classes.

AC-11: Tightly coupled to the workflow system for content creation, dissemination and production processes.
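The conflict prevention in AC-7 comes down to an interval overlap test per resource. The Python sketch below is one minimal way to express it; the class name, method names and sample bookings are hypothetical, and the e-mail notification to the Calendar administrator is left out. A booking request either succeeds (returns `None`) or returns the existing booking it collides with.

```python
from datetime import datetime


class ResourceCalendar:
    """Track bookings per resource and refuse overlapping ones."""

    def __init__(self):
        self._bookings = {}  # resource name -> list of (start, end, who)

    def book(self, resource, start, end, who):
        """Add a booking, or return the conflicting (start, end, who) tuple."""
        if end <= start:
            raise ValueError("end must be after start")
        for other in self._bookings.get(resource, []):
            other_start, other_end, _ = other
            # Two half-open intervals overlap when each starts before the other ends.
            if start < other_end and other_start < end:
                return other
        self._bookings.setdefault(resource, []).append((start, end, who))
        return None


calendar = ResourceCalendar()
first = calendar.book(
    "Conference Room", datetime(2006, 2, 6, 9), datetime(2006, 2, 6, 11), "workshop"
)
clash = calendar.book(
    "Conference Room", datetime(2006, 2, 6, 10), datetime(2006, 2, 6, 12), "seminar"
)
back_to_back = calendar.book(
    "Conference Room", datetime(2006, 2, 6, 11), datetime(2006, 2, 6, 12), "seminar"
)
```

Treating bookings as half-open intervals lets one meeting end at 11:00 and the next start at 11:00 without being flagged.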

Workflow

The main SFI administrative and operational support workflow requirements are covered here. During SIS implementation, additional workflow components may be incorporated as they are agreed upon.

Workflow - Visitors

EV-1: For visitors, track requests and approvals including funding.

EV-2: Create invitations and follow up communications with defined templates.

EV-3: Resources allocated and support services established – housing, computational services, travel, etc.

EV-4: Arrival services invoked including welcoming and final infrastructure coordination. Web site inclusion confirmed at the time of arrival.

EV-5: Special service needs would be noted in the original data postings for the visitor such as visas and transportation.

EV-6: Departure services and follow-up initiated by the calendar system.

Workflow - Events

EV-7: Means to determine availability and allocate SFI resources for an event.

EV-8: Request and process SFI Management approval and budget (coordinate with SFI Finance from the SIS repository).

EV-9: Set provisional dates and a schedule for the type of event: meeting, seminar, workshop, work group, SFI sponsored conference, non-SFI sponsored event. After the dates are confirmed, bolt down the event.

EV-10: Post event notice on website with link to online registration via PayPal or equivalent.

EV-11: Secure external resources such as rooms, transportation, catering, receptions and restaurants, etc. This will be driven by system-generated checklists for event management and on-site administrators.

EV-12: Participant packages developed – from invitation letters to follow-up requests for feedback and delivery of event content collateral.

EV-13: Confirm on-site services – A/V services coordination, support for webcasting and other remote presentation services, name tags, local site logistics, post-event clean-up, etc.

Workflow - Content Dissemination

WD-1: Editorial process access control for read/write/review status -- also to work-in-progress publication and workflow status.

WD-2: Control retention of previous versions for edit history, including date/time stamp for work-in-process editing sessions.

WD-3: Functional and procedural check-off workflow control features for reporting purposes and for ad hoc inquiries.

WD-4: Access tracking controls for content version history and editorial revisions.

WD-5: Database management, directory, and content controls. Provision should be made for audit trails for all edit and content management transaction cycles.

WD-6: Centralized management of data related to each work unit or sub section. Track what is "checked-out" of the editorial database, who has it, and when it is due to be checked-in.

WD-7: Create snapshots and reports that show the status of projects/work units-in-progress.

WD-8: Allow for simple status updates by all qualified production process or workgroup users.

WD-9: Content management for products including automatic extraction of sections to be edited externally and included in an electronic "work unit folder" including any required editorial style template.

WD-10: Ability to retain the current and one previous version, including a transaction and date audit trail for each production unit.

WD-11: Maintain any externally supplied original electronic work-units in the "as received condition."

WD-12: Allow for retention of any number of interim working copies, if desired, with appropriate identification as to what changes were made and by whom.

WD-13: Easy-to-use tool for task definition and indications of workflow step and process interdependencies. The workflow definition tool can be used to link specific people and tasks to the calendar system that will define the various production cycle timelines.
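To make WD-6 concrete, the following Python sketch shows one way to track what is checked out, who has it, and when it is due back. It is only an in-memory illustration; the CheckoutRegistry class, its method names and the sample work-unit identifiers are hypothetical and not part of the requirements.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Checkout:
    user: str   # who holds the work unit
    due: date   # when it is due to be checked in


class CheckoutRegistry:
    """Hypothetical in-memory register of checked-out work units."""

    def __init__(self) -> None:
        self._out: dict[str, Checkout] = {}

    def check_out(self, unit_id: str, user: str, due: date) -> None:
        # A unit can be held by only one person at a time.
        if unit_id in self._out:
            holder = self._out[unit_id].user
            raise ValueError(f"{unit_id} is already checked out by {holder}")
        self._out[unit_id] = Checkout(user, due)

    def check_in(self, unit_id: str) -> None:
        # Raises KeyError if the unit was not checked out.
        del self._out[unit_id]

    def overdue(self, today: date) -> list[str]:
        """Identifiers of units whose due date has passed, in sorted order."""
        return sorted(u for u, c in self._out.items() if c.due < today)


# Sample usage with invented identifiers and dates.
registry = CheckoutRegistry()
registry.check_out("bulletin-ch1", "editor_a", date(2005, 4, 1))
registry.check_out("wp-abstract", "editor_b", date(2005, 4, 15))
late = registry.overdue(date(2005, 4, 10))
```

In a real system the register would live in the editorial database and each call would also write to the audit trail called for in WD-5.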

Business Network and International Programs

This section summarizes the functional requirements for those two SFI operations. They are incorporated here due to their similar requirement sets. But program-specific requirements can be added as necessary.

BI-1: Capability to maintain external networks of contacts unique to mission specific interests. These focused contact services should include various means to stay connected including email alerts triggered by an SFI event.

BI-2: Access to the enhanced SIS people information components in the repository particularly relevant digital images, video and audio clips.

BI-3: Means to manage event participant information post-event to maintain follow up contact and allow such follow up to include ongoing connections that encourage participants to share highlights about accomplishments post-event.

BI-4: Access to content packaging and dissemination tools for mission-specific outreach purposes.

BI-5: Enhanced and easily identified parts of the SFI master Website – access may be open or available only to registered users or event participants.

BI-6: Allow special external networks under the master SFI network tools that can be used to support mission specific collaboration and other contact facilitating processes including weblogs.

CONTENT ARCHITECTURE REQUIREMENTS

A key to the success of an operational SIS is a repository content architecture that supports the variety of functions envisioned for SIS users. The SIS content will include material ranging from conventional office communications to complex research documents that can be comprised of information objects of textual material, mathematical and chemical expressions, and various forms of graphical images as well as rich media. Data models will need to be developed as part of the implementation phase to reflect the SIS XML architecture.

Repository Content Architecture Overview

To accomplish the goal of supporting an extensive array of mixed content objects, it is recommended that the SIS repository architecture be based on XML. A well-designed XML content architecture will:

• Facilitate the creation, dissemination, storage, and re-use of discrete content units.
• Support dynamic connections between content units or objects.
• Associate workflow and collaboration links or other information management procedures to content objects.
• Create virtual views of the content units stored in the repository that can feed various dissemination services.

The SIS content architecture will work with an integrated workflow management capability that supports task definition, process tracking and task driven content access for collaborative processes.

Metadata

Well-developed and properly maintained metadata will be the ontological roadmap to the SIS content. It will:

• Leverage the value of SFI’s knowledge assets.
• Provide for efficient, high precision access for the widest use and sharing of SFI content.
• Maximize interoperability, unambiguous electronic interchange and content mining applications.

MD-1: Metadata development and maintenance must be through an easy-to-use tool – the tool must be capable of consistently enforcing concept relationships and notifying the user of inconsistencies.

MD-2: An SIS metadata scheme is to be XML Dublin Core compliant.

MD-3: The metadata tool must support the vetting of self-archived metadata or metadata wrappers from imported content to assure Dublin Core compliance as a minimum.

MD-4: The SIS will need to establish a procedure in the metadata tool to test the effectiveness of metadata as new domain coverage is developed before it is deployed to the SIS.

MD-5: Support persistent URLs – each URL as persistent as the digital object it points to.

MD-6: Rich media will need content indexing processes that provide dynamic internal content references perhaps modeled on an SGML Hypermedia metadata framework or other schemes for treating such content.

MD-7: Metadata definitions should be developed with some consideration for their eventual incorporation into an XML Topic Map framework.

MD-8: Accommodate DOIs and other standard form mechanisms for accomplishing digital rights management.
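As a rough illustration of MD-2 and MD-3, the Python sketch below builds a simple (unqualified) Dublin Core record in XML and rejects any field that is not one of the fifteen Dublin Core elements. The "record" wrapper element, the function name and the sample values are assumptions for this example; an actual SIS schema would be defined during implementation.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements of unqualified Dublin Core.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields: dict[str, list[str]]) -> ET.Element:
    """Build a record, refusing any field that is not a Dublin Core element."""
    unknown = sorted(set(fields) - DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    record = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        for value in values:  # every Dublin Core element is repeatable
            ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


# Sample metadata (invented for illustration).
record = dublin_core_record({
    "title": ["An Example Working Paper"],
    "creator": ["Doe, Jane"],
    "date": ["2004-11-01"],
    "type": ["Text"],
    "identifier": ["SFI-WP-00-000"],
})
xml_text = ET.tostring(record, encoding="unicode")
```

The element-name check is the simplest form of the "vetting" in MD-3; a fuller tool would also validate values, for example date formats and controlled vocabularies.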

CONTENT SPECIFICATIONS

This section summarizes the information elements most likely to be part of the various content segments. This is a draft tabulation that must be refined as part of the SIS implementation process.

People Content

PC-1: Name, user class, contact data – (all entries have this content)

PC-2: Biographical thumbnail*

PC-3: Institutional / organization affiliation – and status (current / former)

PC-4: Statement of current research interest*

PC-5: Links to collaborators*

PC-6: A CV – in the entrant’s own format or structured to an SFI-defined template*

PC-7: Links to publications in the SIS repository*

PC-8: Links to publications in external repositories*

PC-9: Links to other authored content forms – audio/video, archived web presentations, presentations at other sites

PC-10: Links to SFI working papers*

PC-11: Links to other works-in-progress as permitted

PC-12: Photo in an industry acceptable format*

PC-13: Link to researcher’s home page if applicable

PC-14: Link to researcher’s home institute*

PC-15: Staff descriptors include a statement of SFI role and responsibilities.

PC-16: Current or former SFI collaborators, contacts or sponsors.

PC-17: Links to institute forms stored in other repository segments – applications, records of donor contribution, etc.

PC-18: Confidential information associated with a person – comments, fellowship status, application resolution, etc. would only be available to authorized users and stored in a separate controlled access segment for SFI management and human resources.

(*= Suggested minimum data value set for faculty and researchers)

Research Content

This content is the primary intellectual asset of the SIS repository and consists of:

RC-1: Papers, books and monographs, published articles, proceedings.

RC-2: Separate rich media objects including still images, moving images/video, animation, and audio.

RC-3: Material from SFI events such as workshops, conferences, colloquia, meetings, etc.

RC-4: Work in progress research – documentary materials, research notebooks, presentations, link arrays to related research, data collections, project-specific software and all other research data representations (e.g. captured display images of research process results, instrumentation readings, field data, survey forms, images, maps, etc. in digital form) that a researcher or workgroup member has associated with a project.

RC-5: SFI-produced software used in research projects. License information would be kept as part of the Institutional content segment.

RC-6: Datasets gathered as part of research data collection and computational production - with a description of the data and applied SIS metadata.

Institutional Content

Security access control is critical for this content so there will be defined access levels to ensure privacy on sensitive data. This content, however, will not contain anything that is specific to human resources administration, other personnel data, compensation information and employment agreements. Other sensitive administration and management content including institute financial information and donor contribution data also will be maintained separately.

IC-1: SFI institutional and administrative documents

IC-2: Event management information that includes past and prospective future events and covers everything from class of event (workshop, meeting, conference, etc.) to dates, location and logistical support information.

IC-3: Institute infrastructure and support information such as for visitors, housing resources and tabulations of equipment locations.

IC-4: Donation and contributor descriptive information – solicitation campaign materials. Campaign results tabulations. (Actual financial data in terms of donations and contributions will be stored in the Institute’s financial database separate from the SIS.)

IC-5: Agreements, contracts, grant proposals, MOUs and those documents fundamental to the Institute’s operation and business development.

IC-6: Email and attachments may be another class of data for this segment, but more likely that content would physically reside in the SFI email application specific repository that can be linked to from the master SIS repository as necessary.

IC-7: Copyright and content right-to-use information for the content objects in the SIS repository or linked to from the SIS access processes, as well as lists of SFI information made available to other organizations.

IC-8: SFI acquired software registration and license information plus related reference and training documentation.

IC-9: Forms – web-based: requests, applications, business, procedural, research-based (surveys, etc.)

Educational Content

EC-1: e-Learning packages of content and procedures in each package’s native process language. So long as the tools are SCORM compliant the SIS content should be usable for courseware and learning module development.

EC-2: Summer school program content and ancillary information – syllabi, instructor guides, bibliographies and other material used as teaching aids.

EC-3: Any Learning Management System defined content structures.

EC-4: Course reading lists, bibliographies and reference material.

Business Network Content

BN-1: The names and contact information about current and past business network members. Additional information includes level of membership, timeframes and any special research interests.

BN-2: Access level controls for content that is the result of a Business network event that will operate as time line embargo management.

BN-3: Fundraising summary data by past or prospective campaign that is culled from people / organization content sources elsewhere in the repository.

International Program Content

IP-1: Identification of SFI affiliates - country and location.

IP-2: Website registrants from outside the USA for notification about future events.

IP-3: Access to the SIS people segment for all attendees at International SFI sponsored events.

ACCESS SUPPORT SERVICES

The SIS implementation will result in a new website and ancillary support processes. Part of that new architecture should be an advanced content access tool.

Search Engine

SE-1: Based on Open Source technology -- ideally, an open source search engine that can be easily enhanced with advanced navigation and link traversing processes – navigate through link domains to direct and related topology or theme clusters or domains.

SE-2: Multi-language capable – Unicode compliant.

SE-3: Full XML and Web Services compliance and an API library to optimize any middleware based extended functionality.

SE-4: Easy integration with leading edge browsers to support additional processes such as advanced visualization and link traversing through an appropriate API or middleware solution.

SE-5: Exploit the SIS Metadata to allow semantic / probabilistic inference to optimize content treatment using access enhancing ontology approaches.

SE-6: Search by metadata and/or object content – full text as well as rich media (so long as any rich media is represented by the metadata).

SE-7: Retrieval accuracy that optimizes relevance.

SE-8: Retrieval speed across multiple databases for federated searching – including non-SFI content where there are established access license or MOU relationships.

SE-9: Scalable / transparent effect of multiple user load – no significant transaction speed impact - nominal target is sub-second response on the internal SFI network for basic transactions – complex search result presentation rendering should be no more than 3X basic transaction speed.

SE-10: Content load real-time indexing for transparent dynamic updating.

SE-11: Rendered results to obviously identify content type and source.

SE-12: An ideal search engine should provide some set of user-preferred search command modes – classic command mode consistent with Z39.50, purely natural language, and perhaps variations such as QBE, guided searching, or question answering.

SE-13: Provide for dynamic ranking and weighting or API connects to have those functions incorporated in the search results.

SE-14: Support user profile sets for automatic alerting based on user or system defined triggers – such as a repository update event or a metadata mining process.

SE-15: Well-designed user interface with full set of customization tools to allow for SFI-designed user-level dynamic UIs for access control and presentation. At a minimum, some parts of the site should be available in certain other languages.

SE-16: The ability to serve as a gateway for federated search processes.

SE-17: Post search result process to refine results with weighting or ranking features as well as organize results by researcher name, titles, metadata values, etc.
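The post-search step in SE-17 can be pictured with a small Python sketch that takes scored hits and organizes them by researcher name, best-scoring first within each group. The Hit class, the function name and the scores are hypothetical; a real search engine would supply its own result objects and relevance values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    title: str
    researcher: str
    score: float   # relevance score as returned by the (hypothetical) search engine


def group_by_researcher(hits: list[Hit]) -> dict[str, list[str]]:
    """Group titles by researcher name; within a group, highest score first."""
    groups: dict[str, list[Hit]] = defaultdict(list)
    for hit in hits:
        groups[hit.researcher].append(hit)
    return {
        name: [h.title for h in sorted(groups[name], key=lambda h: h.score, reverse=True)]
        for name in sorted(groups)   # researchers in alphabetical order
    }


# Sample results (invented for illustration).
hits = [
    Hit("Paper A", "Smith", 0.4),
    Hit("Paper B", "Jones", 0.9),
    Hit("Paper C", "Smith", 0.7),
]
organized = group_by_researcher(hits)
```

The same pattern applies to organizing by title or by any metadata value: change the grouping key and the sort key.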

SIS IMPLEMENTATION AND PLANNING CONSIDERATIONS

Building the SIS will impact staff and management. The IT team in particular will be impacted and plans should be put in place to bolster that resource for the SIS implementation. An IT staff resource expansion or reconfiguration of resources likely will be needed to support the fully operational SIS.

Most significant, however, is that the path to a long-term operational SIS includes a commitment to system and repository resource maintenance. From experience with similar projects, it is not unusual for annual ongoing system support costs to be 25-50% of the upfront implementation costs. The use of open source as compared to proprietary technology may create a different post-implementation cost picture. That scenario would reflect the typically low (possibly zero) acquisition cost of open source tools and the unavoidable expense of their ongoing support, which is basically no different than if the system were an all-proprietary solution. There will be baseline expenses for sustaining and nurturing the SIS that the Institute must plan on no matter what the implementation costs are.

There will be benefits to the Institute from the SIS in terms of improved operational efficiencies and greater opportunity to leverage SFI intellectual assets that may even support new revenue generation endeavors. And, most important, the core users of the SIS will have new and more effective means to support their science research challenges.

TECHNOLOGY OVERVIEW

Some baseline technology requirements have been identified to be adopted across all SIS components. These are:

• Implemented repository architecture, data models, metadata and information structures must be based on open standards.
• The SIS implementation should minimize imposed technology changes on legacy content.
• The implemented technology must be platform neutral for the widest scope of prospective users.

The selected technology must be proven, scalable, flexible and reliable. The technology must be compliant to any SFI standards for information technology. The system should be easy to learn and maintain with a minimum of SFI support. Ease of maintenance and operational support is essential for the SIS environment.

The SIS will need to operate from secure servers. There will be many levels of content that will have inclusion or exclusion access controls by user class or designated user lists or even individuals. Some information will always have restricted access and will be private to limited user sets. All of these factors require a flexible yet robust and secure technology base for the SIS.

SIS Tool Set

The SIS, when it is fully implemented, will consist of the content segments described above, software components and integration processes that the SIS implementation vendor will develop. This “component glue” will provide the user, content and process linkages necessary for the SIS to meet its operational goals. The tools that are available as Open Source or proprietary offerings fall into the following categories.

• Content Treatment
• Content Management
• Collaboration
• Dissemination
• Workflow
• E-Learning
• Search Engine
• Database Systems

Process Integration Services to be Implemented

• Administrative Services
• Access Support Services
• Back Office Services

NEXT STEPS

The Detailed Requirements Definition will inform the solicitation of bids from organizations with the background and experience needed to construct the SIS. Upon vendor selection this document will help drive the design and software specifications that decompose the requirements into a structure that can be related to a designated technology solution set.

SFI should consider a staged implementation of the SIS. It is recommended that the initial phase include the acquisition of the content management tool for all repository components of the SIS. From that and the associated calendaring tools the new administrative and business development capabilities can begin to be developed.

The SIS implementation can consist of two major phases. The whole implementation process should be modular. To undertake the whole effort as a monolithic project will not be as effective or ultimately as successful as a well-developed phased implementation. It is possible that upon evaluation of the implementation vendor’s plans, the two phases can be decomposed into smaller staged steps.

Provisional SIS Project Implementation Plan

1. Approval of the Detailed Requirements Definition
2. Provisional budget approval
3. Develop an RFP based on the Detailed Requirements Definition
4. Identify prospective vendors
5. Invite bids from most suitable vendors
6. Select vendor
7. Work with the vendor to develop the SIS design
8. Acquire Phase 1 system components and any other required technology
9. Work with the vendor to develop system software specifications
10. Begin the new website reengineering
11. Design and develop the administrative and Institute support services capabilities
12. Design and start the development of the Business Network and International Programs services
13. Launch the metadata development effort
14. Start content conversion / transformation for the research component to the new architecture – start with the working papers
15. Start the workflow task definition process for content dissemination processes
16. Phase 1 vendor and SFI installation and start up including parallel operations
17. Training – Phase 1
18. Launch first phase operational support
19. Begin using the SIS Phase 1
20. Revise as necessary the operations of the Phase 1 features and functionality
21. Acquire Phase 2 system components and any other required technology
22. As Phase 2 portions are implemented conduct testing and develop preliminary training materials
23. Phase 2 vendor and SFI installation and start up including parallel operations
24. Launch Phase 2 operational support integrating with ongoing operations from Phase 1
25. Training – Phase 2
26. Begin using the SIS Phase 2
27. Confirm operational support and maintenance
28. Revise as necessary the operations of the SIS features and functionality
29. Design and develop the content creation and dissemination processes
30. Launch moderated discussion groups and weblogs
31. Pilot the dissemination process with the Bulletin and working papers and begin conversion to other content dissemination requirements
32. Complete any remaining Administrative support, Business Network and International Programs services
33. Continue the repository build out and content digitization with the remaining archives, other source data such as selected video content and other legacy material
34. Web site completed with all new features such as remote presentations and multi-language support

OTHER REQUIREMENTS FACTORS

OAI

SFI should determine if and to what degree it wants to participate in the Open Archives Initiative (OAI). Providing complete SFI publications is one approach. That may be a problem with content that has appeared or will appear in copyrighted products. The OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) provides another means through an application-independent interoperability framework based on metadata harvesting. As long as the metadata is used for resource discovery and points back to the SIS, it may be a good starting point for SFI.
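For a sense of what metadata harvesting looks like in practice, here is a small Python sketch that builds an OAI-PMH ListRecords request. The verb and the metadataPrefix, from and set arguments are defined by the OAI-PMH protocol; the base URL and the function name are placeholders invented for this example, since no SIS endpoint exists yet.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",     # unqualified Dublin Core, required of all OAI-PMH repositories
    from_date: Optional[str] = None,     # optional lower bound for selective harvesting (YYYY-MM-DD)
    set_spec: Optional[str] = None,      # optional set, e.g. a hypothetical "workingpapers" set
) -> str:
    """Build the URL a harvester would request to list a repository's records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; the real base URL would be chosen at implementation time.
url = list_records_url("https://repository.example.org/oai", from_date="2004-01-01")
```

A harvester receives only the metadata records in response; each record's identifier would point back to the full object in the SIS, which is the arrangement suggested above.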

DRM

As part of the SIS design, SFI should consider implementing capabilities to support DRM protocols. Digital Object Identifier (DOI) is one increasingly common DRM method. Others may come into play that can be served by automated mechanisms as well as informal content-right-to-use arrangements that are established manually. For example, content right-to-use arrangements may be negotiated as part of speaker agreements. The specifics may allow the Institute to use the speaker supplied presentation and collateral material internally only, or to a wider audience with user class level access controls. And with speakers or visitors who have material to share, the system will have to have access level values maintained that may include rules for content removal from the SIS after an agreed-upon timeline. Other arrangements may call for timeframe triggers such as access embargos that some Business Network content on SIS would come under.
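The access-level rules described here (user class restrictions, embargoes, and removal after an agreed-upon timeline) could be represented along the following lines. This Python sketch is a simplification under stated assumptions: the RightsRule class, its field names and the user class labels are hypothetical, and the embargo is assumed to block every user class until it lifts.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RightsRule:
    allowed_classes: frozenset[str]        # user classes permitted to view the content
    embargo_until: Optional[date] = None   # content is not visible before this date
    remove_after: Optional[date] = None    # content must be withdrawn after this date


def may_access(rule: RightsRule, user_class: str, today: date) -> bool:
    """Apply the removal date, then the embargo, then the user class restriction."""
    if rule.remove_after is not None and today > rule.remove_after:
        return False
    if rule.embargo_until is not None and today < rule.embargo_until:
        return False
    return user_class in rule.allowed_classes


# Sample rule (invented): a speaker presentation embargoed until June 2005,
# limited to two user classes, and to be removed after one further year.
rule = RightsRule(
    allowed_classes=frozenset({"staff", "business_network"}),
    embargo_until=date(2005, 6, 1),
    remove_after=date(2006, 6, 1),
)
during_embargo = may_access(rule, "staff", date(2005, 5, 1))
after_embargo = may_access(rule, "staff", date(2005, 7, 1))
```

Rules negotiated manually, such as those in speaker agreements, would be entered as the same kind of record so that the system can enforce them automatically.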

OTHER MATERIAL IN THE DRD REPORT

In the full Detailed Requirements Definition document, there are a few other sections that describe characteristics of the prospective SIS. The additional material includes:

• A review of relevant Open Source technology,
• Definition of probable SIS user classes,
• Content volume estimates,
• And, a scenario outline describing what an SIS in action might look like.

Information Products at SFI: Context

November 2004

This is a paper produced by Ginger Richardson which summarizes the types of information and objectives of the SIS.

RATIONALE

A professional and integrated information system is no longer a luxury but is the hallmark of a forward-looking, efficient organization. People expect state-of-the-art products from SFI in this domain but unfortunately the current interface does not reflect an institution at the cutting edge of science. As SFI continues the process of refining its research goals it is imperative that it present a current, representative spectrum of its intellectual work.

We also recognize that providing access to technology and information is one of the most important things this institution can do to nurture science in developing countries.

OBJECTIVES

General:
To develop an active, comprehensive knowledge repository and associated dissemination tools
To facilitate scientific collaborations within our national and international communities

Technical:
Increased perspective on the knowledge available: better search, categorization, discovery tools
Improved information farming
Collaborative content building: real-time communication, ad-hoc forums
Improved presentation/dissemination media

STRUCTURING MEDIA

Web: delivery venue for public relations, education, research dissemination
Publication Artifacts: traditional/electronic
Institutional Repository Site: published material, preprints, gray material, etc.
Library Site
Database Platforms

ACTION

Consensus on an action plan
Survey technology and develop solution framework
Assemble team and resources
Estimate costs
Timeline and delivery schedule

Information Products at SFI: Survey of Current Available Data

November 2004

Intellectual Products

Annual Research Reports
Bibliography of work in refereed publications
Bulletin
Business Network discussion sites
Modeling demos (a few sites)
Library: Online catalog, online journals, searches, user services
Research Topic summaries: synthetic, top-down report; parts outdated
Researcher sites (on an individual basis, don’t exist for entire community)
Working Papers and abstracts
Workshop/Working Group Sites: abstracts, discussion sites

Information

Business Network: Overview, members, events (online registration), projects, discussion sites
Directories: Inhouse coordinates (phone, email, url, office location), computing guides
Education
  Information/application materials for:
    Complex Systems Summer Schools
    Other schools
    Postdoctoral, graduate and undergraduate students
    Secondary Programs
Events: Calendar, talks (seminars, colloquia, public lectures), workshops
FAQ’s
Gifting Opportunities
General Information about SFI
Institutional forms
International Program: Fellowships, program information, link to CSSS
People: Board of Trustees, Science Board, External Faculty, Resident Researchers, Staff (coordinates: phone, email, url, office location)
Policies: Working group and workshop guidelines, reimbursements
Visitors: Calendar of visitors, visitor guidelines and how to visit, travel info and driving directions, Santa Fe city info, accommodations, etc.

Available to Inhouse Readers Only

Online report of computing problem or computing work order

Relatively speaking, there is a great deal of administrative information at the site, but considerably fewer intellectual products.

February 8, 2005

Next steps for developing knowledge resources at SFI

To: Santa Fe Institute Community

From: Ginger Richardson

Over the past several months several of us have had discussions about scholarly communications at SFI. We've even undertaken some ad hoc projects including first steps at redesigning our web site, work on a pilot module exploring a semantic classification system, and development of web-based curricula material for the CSSS. These have been good warm-up moves; we're producing deliverables requested by Bill Melton for the International Program, and we've got a much better sense of the scope of the issue and how to approach it.

Our next step is to convene a group to explore Big Picture Issues about the nature of an SFI-wide institutional repository. Three big questions to consider are:

o What should be the general boundaries of the institutional repository? (Scholarly content, administrative content, one or both?)
o Second, what is a high-level description of the content? In other words, what is the "stuff" (e.g. text, format documents (ala PDF), data sets, video, audio, simulations, discussion boards etc.) that will be in the repository and what not?
o Finally, what do we think are the high level functional requirements for the content? That is generally, how do we want to be able to create and manage the content, and what do we want to be able to do with the content once in the repository?

As noted, several of us have discussed these questions in one form or another, but not comprehensively with all interested parties at hand. I invite anyone who has input to join us for a meeting to this end on Tuesday, March 1 at 1:30 p.m. in the Medium Conference Room. My intention is that we gather three or at most four times over the next several weeks. We will produce a working report by April 15. Assuming that report inspires a critical mass of SFI backing, some hardy (self-selected) subset of us will move on to a detailed requirement phase that will flesh out more fully what the needs are in terms of creation/populating the IR; management; and use.

More on the project dynamics when we get together.

Please respond by Feb. 25 if you plan to attend. In the meantime, here are a couple of general information URLs about digital collections and getting started (thanks to Bae).

o http://www.imls.gov/pubs/forumframework.htm
o http://www.nyu.edu/its/humanities/ninchguide
o http://www.cdpheritage.org/resource/metadata/wsdcmbp/index.html

Please look them over before the meeting; if you have other relevant materials you'd like to distribute to the group prior to the first meeting, let me know.

_______________________________________________

SIS_Requirements.pdf

SIS_Report.pdf

Crosswalks

2) MARC 21 to DC Crosswalk from the Library of Congress

This covers both qualified and unqualified DC.

MARC to DC: http://www.loc.gov/marc/marc2dc.html This one provides the mapping in the form of an easy-to-read table.

DC to MARC: http://www.loc.gov/marc/dccross.html

As for whether or not "report number" maps to dc:relation...MARC #088 is "rept.#"; therefore, your report numbers would map to dc:identifier rather than relation. See the back page of the copy of the TRI mappings that I gave you.

I have not been able to find a better explanation of each element, other than what is in the DCMI web site. I did find a page full of links to crosswalks, though: http://www.ukoln.ac.uk/metadata/interoperability/

Regards, Jewel
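As a small illustration of how such a crosswalk might be applied, the Python sketch below maps a few MARC fields to Dublin Core elements. Only three mappings are shown: 088 (report number) to identifier follows the note above, and 245 to title and 100 to creator follow the Library of Congress table. The dictionary, the function name and the sample record are made up for this example; the full table at the URL above should be treated as the authority.

```python
# Partial, illustrative MARC-to-Dublin-Core mapping (not the complete crosswalk).
MARC_TO_DC = {
    "245": "title",
    "100": "creator",
    "088": "identifier",   # report number, per the note above
}


def crosswalk(marc_fields: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Convert (MARC tag, value) pairs to Dublin Core, skipping unmapped tags."""
    dc: dict[str, list[str]] = {}
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            continue   # tag not covered by this partial mapping
        dc.setdefault(element, []).append(value)
    return dc


# Sample record (invented for illustration).
marc_record = [
    ("100", "Doe, Jane"),
    ("245", "An Example Working Paper"),
    ("088", "SFI-WP-00-000"),
    ("300", "12 p."),   # physical description: not in the partial mapping, so dropped
]
dc_record = crosswalk(marc_record)
```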

Preservation

A major document which discusses preservation is: DDIIGGIITTAALL PPRREESSEERRVVAATTIIOONN AANNDD PPEERRMMAANNEENNTT AACCCCEESSSS TTOO SSCCIIEENNTTIIFFIICC IINNFFOORRMMAATTIIOONN:: TTHHEE SSTTAATTEE OOFF TTHHEE PPRRAACCTTIICCEE Gail Hodge Information International Associates, Inc. Evelyn Frangakis CENDI Digital Preservation Task Group/ National Agricultural Library A Report Sponsored by The International Council for Scientific and Technical Information (ICSTI) And CENDI US Federal Information Managers Group February 2004 Revised April 2004

[Cutting and pasting the title page from this document presents an interesting case in the perils of digital preservation.] It can be found in this directory: mba/institutional repository/digitalpreservation04.pdf
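As an aside, doubled characters like the ones in the pasted title can sometimes be repaired mechanically. The following is a rough sketch, assuming every character in a damaged word is repeated exactly twice; the function name `undouble` is made up for this example, and the heuristic will wrongly shorten a genuine word that happens to be built from repeated pairs.

```python
def undouble(text):
    """Collapse tokens in which every character appears twice in a row."""
    repaired = []
    for token in text.split(" "):
        # A token counts as "doubled" only if it has even length and each
        # character at an even position equals the one right after it.
        pairs_match = (
            len(token) > 0
            and len(token) % 2 == 0
            and all(token[i] == token[i + 1] for i in range(0, len(token), 2))
        )
        repaired.append(token[::2] if pairs_match else token)
    return " ".join(repaired)


garbled = "DDIIGGIITTAALL PPRREESSEERRVVAATTIIOONN"
print(undouble(garbled))  # prints "DIGITAL PRESERVATION"
```

Words that were pasted correctly, such as the author names, pass through unchanged because they fail the pair check.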

Here's the case for LOCKSS from an ACS press release:

CLOCKSS Participation

At its 231st National Meeting in Atlanta, the Publications Division of ACS announced that it will become a member publisher of CLOCKSS (Controlled LOCKSS, which stands for "Lots of Copies Keeps Stuff Safe"), a collaborative archiving initiative including both publishing organizations and libraries. Over a two-year period, ACS Publications will be an active participant in the testing of the CLOCKSS technology and social model, to support a "large dark archive" that is both fail-safe and has a clear process for providing continuing access for orphaned materials.

The initiative, which began early in 2006, will use a representative portion of each publisher member's corpus. It will replicate this content, using a version of the existing LOCKSS software, within a small international network of institutions and libraries - thus providing the prospect of secure, long-term availability of scholarly content on a global scale. Management of the process and the content will be exercised by a joint board of publishers and librarians, to ensure that all decisions are community-based.

"We are delighted to be joining the CLOCKSS initiative as a member publisher," reported Bob Bovenschulte, President, ACS Publications. "ACS Publications was one of the first publishers to digitize its back file, starting with the very first issue of the Journal of the American Chemical Society published in 1879. This effort was a vital part of our mission to provide global access to our content. Participating in a collaborative and innovative effort like CLOCKSS is a logical next step, to ensure the long-term preservation of this material. We want our customers, authors, and readers to be confident that this content will be available for future generations of researchers."

For further information about CLOCKSS, please visit www.lockss.org/clockss/.

Project Management Ideas

From: Michael Leach <mrleach@fas.harvard.edu>
Date: October 20, 2004 1:26:03 PM MDT
To: Margaret Alexander <mba@santafe.edu>
Subject: Re: Harvard Sciences Digital Library update

Margaret,

Each of the Harvard science libraries participating in the HSDL/Dspace pilot project developed support for the project before we actually set up the software on our servers. In general, we either:

A) spoke at departmental faculty meetings; or

B) spoke at library faculty committee meetings; or

C) communicated directly with specific research groups in each department.

(FYI, the departments covered in this process are: Molecular & Cellular Biology, Organismic & Evolutionary Biology, Physics, Astronomy, Mathematics, and Earth & Planetary Sciences.)

Certain formats, collections, etc. were identified, with most being in digital form already. This included:

1) Two+ years of PhD theses from the Physics Dept., which were already online (but not searchable);

2) Sets of research articles identified by the "green" or "gold" status of the publishers;

3) Datasets (and their owners) looking for preservation;

4) A digital video collection from one department.

These are the materials now being uploaded into the pilot project version of Dspace.

Once the pilot project phase is complete, we will return to the faculty in each department and show them "what we have" and "what we can do for them", supported by the faculty already participating. We do intend to input materials that are only in print form, but that will not occur till later this month and in early November. There are public-domain materials in a number of libraries which fit our criteria for input.

The organization of this project has many simultaneous lines:

i) Training: offering training and skill development to staff in our libraries so they can create the metadata and upload the digital objects (given the paucity of materials currently available, this has proven a bit of a challenge--luckily the Dspace software is user friendly);

ii) Policy Development: This has become our most time-consuming job. Issues like replacement; "removal" of items; security group creation; etc. are taking up a lot of time. We are creating a document that will highlight these issues, and hope to have it available within the next 6 weeks;

iii) Outreach: members of the steering committee have been speaking to fellow librarians here at Harvard, meeting with other, like-minded groups at Harvard (e.g. our data center and the iCommons folks), and, of course, corresponding with folks (like you) outside of Harvard (there is a great opportunity to learn from each other here).

iv) Metadata issues: related to form of input, dealing with special characters and mathematical symbols, and the development or adoption of other metadata schemes for our Dspace (for instance, we are involved in examining dataset schemes for our scientific datasets, since Dublin Core is inadequate for this).

There are other issues we have "run across" as well, but I do not think they impact upon your specific questions below.

Let me know if any of this information helps. If you have further questions, or comments, please share. Thanks.

Michael

Michael R. Leach
Director
Kummel Library of Geological Sciences and Physics Research Library
Harvard University
24 Oxford St., Cambridge, MA 02138 USA
Kummel: 617-495-2029 (voice); 617-495-4711 (fax)
Physics: 617-495-2878 (voice); 617-495-0416 (fax)
mrleach@fas.harvard.edu or leach@eps.harvard.edu or leach@physics.harvard.edu

Notes from a book on Effective Project Management

These are Margaret's notes, applied to the case at SFI, from Effective Project Management (Wysocki, 3rd ed., Wiley, 2003):

Basic Project Management

There is a difference between wants and needs--we should carefully assess if the "want" is appropriate to our needs. Not everyone's wants should become needs.

70% of group projects fail because the group either doesn't adapt to changes that come along the way or isn't clear about how to manage change so that the project can be completed.

Three types of cases usually characterize projects: 1) when the goal and the solution are clear; 2) when the solution is not clear, but the goal is; and 3) when the goal is not clear. For #3 see Chapt. 19. [I think we're in #3 with hints of #2 (for example, using a commercial product or open source software--though our discussion resulted in a preference for open source).]

In the case of #3 we make a guess at what the goal is and start a work cycle. Later on, we re-define the goal and make a new work cycle to achieve the goal. This may happen several times. See below for more details.

[This book assumes there is a client--in our case, we might want to identify who are the beneficiaries of our IT project. Keeping these people in mind throughout the project may keep us on track: inhouse, researchers, public.]

Every project has at least five constraints: 1. scope (scope creep is change to the original plan that needs to be managed; hope creep is when someone gets behind and hopes to catch up to everyone else; effort creep is when someone seems to expend a lot of effort but doesn't get the job done; feature creep is add-ons that the project doesn't need), 2. quality, 3. cost, 4. time, and 5. resources.

Here's the traditional way to define the scope: What's the problem/opportunity addressed; What's the goal; How do we accomplish it; How will we know if it's successful; What are the risks?--all of these can change during the course of the project. p. 23 has a list of the phases of a traditional project; p. 37 has a list of "candidate risk drivers" which is pretty interesting.

"Just in Time" Planning

The basic premise of adaptive project management is that the scope is variable and is adjusted at each iteration of the project. Management avoids parts of plans which are speculative and focuses only on activities which will clearly be part of the final solution. Later on, speculative portions of the project may be clear and can be incorporated in a later iteration. Planning is done in segments where each chunk of work only represents several weeks. Planning is a "just in time" activity because the goals of the project can change.

Overall planning begins with a Project Overview Statement which includes the following elements. This can be written by one person and presented to everyone else for their comments. It also provides the team with an outline of what's going on.

1. statement of the problem or opportunity--facts, truths

2. general goals of what the project will accomplish--what are everyone's expectations

3. project objectives--how it will get done

4. outcomes--how will everyone be satisfied

5. risks

The above points should be placed in a document that anyone can read. #3 and #5 should be expanded and detailed for the team. At any point there should be two versions of the overview statement: a clear, simple one for everyone, including funders, and a detailed one, which could be renamed, for the team.

See attachment for a draft of the project overview statement.

The next step is to develop a work breakdown structure. In the work, it's important to identify any limiting factors which might hold up the project. These need to be addressed as they come up. For instance, it will always take some time to identify best practices unless the team member is already an expert. In our case, we will almost never be expert and a written statement of best practices for each component of our project should be mandatory. This will become the documentation for the project. Best practices should include policies and procedures. By the way, will we document our project? Look into this.

Copyright

Semantic Web

Bibliography: Books and papers on information systems

1. Print copies of relevant papers in the areas of metadata, copyright, institutional repositories, and electronic publishing are shelved next to Margaret's desk.

2. An HTML version of the Open Access Bibliography: Liberating Scholarly Literature with E-Prints and Open Access Journals (OAB) is now available.

http://www.digital-scholarship.com/oab/oab.htm

The HTML version of the book was created from the final draft using a complex set of digital transformations. Consequently, there may be minor variations between it and the print and Acrobat versions, which are the definitive versions of the book.

The OAB provides an overview of open access concepts, and it presents over 1,300 selected English-language books, conference papers (including some digital video presentations), debates, editorials, e-prints, journal and magazine articles, news articles, technical reports, and other printed and electronic sources that are useful in understanding the open access movement's efforts to provide free access to and unfettered use of scholarly literature. Most sources have been published between 1999 and August 31, 2004; however, a limited number of key sources published prior to 1999 are also included. Where possible, links are provided to sources that are freely available on the Internet (approximately 78 percent of the bibliography's references have such links).

Metadata

From Santa Fe Institute Events Wiki
