Metadata
Here's a simple description, in PowerPoint form, of library metadata:
documents/metadata-demystified.pdf
Here's an example of managing a metadata project:
documents/metadata-strategy.pdf
Metadata Standards
From Jewel (AISTI consultant who assisted SFI in Feb. 2004)
comm.nsdl.org/download.php/196/PC3Draft1b.pdf
(If you have problems accessing the paper, let me know.)
I'll see if I can get a hold of the final version, but in the meantime, you may wish to look at section 1.2.13.1, which provides a definition of native metadata, and other types of metadata for a collection.
This document may be useful as you redesign your systems/db in the next phase. ;-) I will also read through the entire document; at this point, I've skimmed through it. At that point, if you don't mind, I may make some further recommendations for SFI, based on what we have discussed.
Metadata Preservation
Victoria McCargar, The Seybold Report: "Since the mid-1990s, it has become increasingly clear that information stored digitally is terribly fragile...The task of identifying all the risk factors and putting preservation solutions in place has barely begun."
http://content.seyboldreport.com/TSR/subs/0421/disappearing_data.php
SFI Reports (chronological order)
Early Discussion of Open Archives at SFI-2002
Date: Mon, 4 Nov 2002 13:29:19 -0700 (MST)
From: Margaret Alexander <mba@santafe.edu>
Subject: eprint server
To: ellen@pele.santafe.edu, gumerman@pele.santafe.edu, rkbv@pele.santafe.edu, grr@pele.santafe.edu
Cc: cmachado@aisti.org, mba@pele.santafe.edu
Last week George, Ronda, Tim, Ginger, and I had a preliminary meeting about putting SFI's working papers on a new e-print server that is being developed by AISTI, our consortium of sci-tech libraries. During a D.C. meeting with other libraries about e-print servers, most speakers concurred that policy issues took longer to resolve than technical ones.
MIT Libraries developed a repository for the electronic scholarship of its university called DSpace. In the process of creating DSpace, they compiled a list of institutional questions/policies that had to be answered or written before the electronic repository went into operation. Here they are:
1. Who is qualified to submit electronic content? At SFI this question is taken care of by our policies for working paper submission. (In the future, SFI may not want to limit its submissions to just working papers. However, our "first step" in this endeavor should be working papers.)
Ronda: I agree. We can, however, readdress the working paper policies as needed.
2. What is the character of submissions? At SFI this question is taken care of by our policies for working paper submissions. (At other repositories learning objects such as course materials, theses, and dissertations may be submitted.)
3. What forms of submissions are accepted? At SFI this question is taken care of by our policies for working paper submissions.
4. What are our expectations for retention? This question probably applies to learning objects which are obsolete when a course is completed. At SFI we might ask what would happen if we decide to withdraw a working paper--is this covered in our current policy? At UCLA the policy is that the citation (metadata) cannot be removed but the content can be.
Ronda: I agree that we may want to delete the electronic versions of the papers but not the actual citation from the archive. (Currently, we leave the number but delete the title and author to avoid confusion.) I think there needs to be a comments field or option which specifies that this paper has been withdrawn, and maybe by whom, such as "withdrawn by author."
5. How do repository policies interact with other policies? At SFI we may be simple enough administratively to ignore this concern.
Ronda: One possible conflict is that students and nonfaculty members must have their working papers reviewed by an SFI faculty member or Science Board member. If the archive is truly open (anyone can submit), then the designation as an SFI working paper would have to be delayed until this review took place and the individual's connection to SFI was verified.
6. Are there privacy issues? At SFI, this question might entail asking each individual working paper author if it's o.k. to submit SFI back files to the new e-print server.
Ronda: Since the authors have already consented to listing the paper as an SFI working paper, and we are simply changing how our papers are presented, I believe we can make the transition to another server by notifying the authors of our intent rather than getting permission in writing from each one. They can contact us if they object. Note that some past authors may be difficult to contact.
If you think about it, these papers have been in the public domain for some time now, so they are not proprietary. And as long as the server remains open, we are not violating the original "agreement" with the authors. We just need to remember that we don't own the copyright for any of the papers.
7. What metadata will we require? Is the metadata at SFI consistent, complete, and accurate? If not, who will do the work?
Ronda: For SFI's working papers, we have a great deal of information in a FileMaker database, and this data has been checked. We do not have electronic copies of all papers back to 1989, nor do we monitor the quality of these papers per se (we simply check to see if they print correctly). We will have to decide later whether we want to recreate key historical papers, particularly those that are frequently requested, and what to do about quality control.
The conversion of our database to the required metadata format is also an open issue.
The following questions need to be asked about the AISTI eprint server:
8. What if the eprint server fails? At SFI, since we already have a working paper site, we are protected.
Ronda: Not if we defer to the server instead. I thought this is what George said at our meeting--it would replace what we have. We will continue to track key information about each paper and the abstract, but will have to decide whether to, and how to, archive electronic copies. And we would need a plan for how to make this information available should the eprint server fail.
9. What if revenue comes from the eprint server? We also need to make certain
that the eprint server remains open and does not require licensing.
10. Who incurs the costs of running the eprint server over the long-term? Initially, the eprint server has been funded by LANL with a grant from AISTI.
11. How will the eprint server be used? For example, can outside researchers use the repository as a "sand box" for their own research?
Ronda: I think having a repository that accepts a variety of papers is fine as long as we protect those papers classified as SFI working papers. Hence, I think it would be helpful to have some classification of articles by broad topics, such as complexity, as well as by key words and specific paper series. For example, if an author submits a paper, it is not listed as an SFI working paper until someone designated at SFI agrees or verifies this.
I am very excited about this project. I hope the researchers are as enthusiastic. Good luck at your presentation.
Let me know your thoughts on how these questions should be answered. On Nov. 13 I will introduce the eprint server at the noontime faculty meeting.
Margaret
Complexity Portal-2003
Date: Sun, 16 Nov 2003 14:19:37 -0700
From: "Marcus G. Daniels" <mgd@santafe.edu>
To: Brent Jones <brent@santafe.edu>
CC: "James P. Crutchfield" <chaos@santafe.edu>, Kevin Drennan <kevin@santafe.edu>, Supriya Krishnamurthy <supriya@santafe.edu>, John Miller <miller@santafe.edu>, Sarah Knutson <srk@santafe.edu>, walter@santafe.edu, Ginger Richardson <grr@santafe.edu>, mba@santafe.edu
Subject: Re: Updated Report

Brent Jones wrote:
> I would also suggest that Margaret be there, not for technical input, but because she has ideas and concerns of how SFI should fit in with the rest of the electronic library world. We need to make sure we're all thinking along the same lines. I think we need to meet as a group, and not just Kevin, Marcus and myself, as this is pretty much the core of SFI as the Complexity Portal.
FYI, I installed DSpace on an Econophysics machine at SFI. (I understand that Laura, coordinating with Margaret, is learning how to upload Working Papers into an offsite DSpace server.)
DSpace is a Java server package / web app that maintains digital library collections. The main web page is http://www.dspace.org, but the DSpace-in-use page is https://dspace.mit.edu/index.jsp. Roughly, DSpace is to digital collections (e.g. PDFs, QuickTime, MP3s) as Millennium (the software hosted at UNM for tracking SFI's library collection) is to books and journals. Millennium is a large and complex commercial software package used by libraries around the world. It maintains the database of the library collection, tracks circulation (including over the web), and provides optional modules for cataloging and features for ordering/receipt/payment/budgeting of library material. The company that makes Millennium, Innovative Interfaces Inc., is sort of the Microsoft of the field, and the software is not cheap.
The analogy is not perfect, as the overlap between journals and digital journals is not a terribly important one. Some things change, e.g. circulation changes to some extent due to the possible no-cost redistributability of digital works. Other things don't change, such as the determination of `subject' categorizations. It's quite common to use conventional library software like Millennium to provide library patrons with a web interface to collections that subsume digital works.
One practical problem for SFI is that we don't have direct control over the Millennium software configuration. UNM runs that. If you go to the Libros web site and find some digital file reference, many of these references are not `links'; they are just ad-hoc embeddings of URLs that aren't clickable. Perhaps at UNM there are software clients in the library configured such that these references are clickable, but at least the web interface isn't perfect in this regard. So one question is the extent to which the digital works are cataloged, and another question is whether the Millennium web module is configured in a way that can be integrated into the SFI Complexity Portal.
At SFI, we have uncataloged works, and the cataloging that is done is outsourced. By that, I mean that Working Papers are basically invisible to the library world. Google will pick things up based on keywords as the abstracts of the Working Papers are on the SFI website, but a normal literature search (say as performed by a reference librarian doing his everyday job for an assistant professor) might not. An ISI citation search (a commercial indexer of journals) won't find Working Papers, and neither will a canonical OCLC search. OCLC is a consortium of worldwide libraries (including Library of Congress) that manage shared cataloging records and tools.
We also don't list all of SFI's digital subscriptions with Libros, so that's another obstacle to using the existing UNM Millennium server as a tool to query SFI's digital holdings. With regard to the Complexity Portal web site, the way it would work is that a user would submit a request via the Z39.50 protocol to Millennium, and it would send back USMARC records that could be taken apart and formatted as HTML. This approach has two advantages. First, it integrates SFI into the library community as a good citizen, and second, it provides a single interface to query the physical and digital holdings of SFI. Z39.50 is a standard protocol libraries can use to query each other, and there are open-source packages for using it. The larger computing committee report has details on this.
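To make that flow concrete, here is a minimal sketch in Python, assuming the open-source PyZ3950 package (a ZOOM-style Z39.50 client) and pymarc (with its older method-style accessors) for taking USMARC records apart. The host, port, database name, and query are placeholders; the real Millennium target settings would have to come from UNM.

 # Sketch: query a Z39.50 target and render the returned USMARC
 # records as a simple HTML list. Host/port/database are hypothetical.
 from PyZ3950 import zoom
 import pymarc
 
 conn = zoom.Connection('libros.example.edu', 210)  # placeholder target
 conn.databaseName = 'INNOPAC'                      # assumed database name
 conn.preferredRecordSyntax = 'USMARC'
 
 results = conn.search(zoom.Query('CCL', 'ti="genetic algorithms"'))
 items = []
 for rec in results:
     marc = pymarc.Record(data=rec.data)  # take the USMARC record apart
     items.append('<li>%s</li>' % marc.title())
 conn.close()
 
 print('<ul>%s</ul>' % ''.join(items))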
Cataloging serves several purposes. The first is to put identifiers on a work, e.g. a physical home, authors, publishers, dates, page ranges, etc. Second, it adds classification information so that patrons of a library know where to go looking for a work in the library. The call number encodes information about what subfield the work relates to, plus additional numbers to discriminate it from other works in that subfield within that library. Third, the cataloging records can be extended with additional information, such as detailed subject headings. A work can have multiple abstract subject attributes, and systematically encoding this information can help people find works of interest. Finally, and most importantly, the collective process of cataloging by librarians across different libraries, and the sharing of this information in a systematic way, leads to deeper insights into how works relate to one another.
For the most part this is done systematically using MARC records. MARC records encode a lot of information, and creating them is a labor-intensive process. It's labor-intensive because even when the subject seems obvious (what subject is a book with a title like "Landscape Terrain Variation" -- unless you skim the book, is it about how to design your flowerbeds, or about the problems a population might encounter in dealing with changes in its habitat?), there is still a need to look up the subjects in tables and assign the right codes. A typical MARC record is about a page of text. To give an idea of the resolution of the subject headings, Stuart Kauffman's _The Origins of Order_ is just listed under Biology, but Melanie Mitchell's _An Introduction to Genetic Algorithms_ is listed under various Genetics headings, Mathematical Models, Computer Simulation, and Algorithms. Stephen Wolfram's _A New Kind of Science_ (gasp) is listed under Cellular Automata and Computational Complexity. It's possible to introduce new subject headings, and there are various committees at MARC, OCLC, and the Library of Congress that evolve the standard set, but one would have to contact these organizations and make an argument. (Say, if you wanted to introduce "Econophysics" as a subject heading.) The current Library of Congress guide to subject headings has 270,000 entries.
Basically, the upside of coding MARC records is that it is the bread and butter of librarians; it is well understood. The generation of MARC records can potentially be useful for search, due to the thought given to the cataloging. I say "generation" to discriminate from "use". If SFI cataloged the Working Papers, `deep' subject headings could be given that would be meaningful and useful to SFI-affiliated researchers (or anyone else for that matter). However, if a work already exists and anyone else has it, one can consume other libraries' MARC records for local use (most frequently, the Library of Congress MARC records). The downside of _generation_ of new records (i.e. the labor, not the submission itself) is that the cost of entry is high enough that less complex alternatives can be attractive.
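As a small illustration of consuming records rather than generating them, here is a hedged sketch, again in Python with pymarc, that reads a file of MARC records (say, copied from the Library of Congress) and lists each work's topical subject headings; the file name is hypothetical.

 # Sketch: read a batch of MARC records from a local file and print
 # each title with its topical subject headings (MARC field 650).
 from pymarc import MARCReader
 
 with open('loc_records.mrc', 'rb') as fh:  # hypothetical file name
     for record in MARCReader(fh):
         print(record.title())              # title from field 245
         for field in record.get_fields('650'):
             print('  subject:', field.format_field())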
One alternative, in use by DSpace, is the Dublin Core. Dublin Core doesn't require all of the detail of MARC records in order to be correct, but it can encode and use that detail if the information is provided. With DSpace, contributors enter some basic information about a document (author, title, e-mail) and then upload a paper. Then, as an optional step, there is a tool in the package to do cataloging (a different person can do this). The point is, with DSpace what you get may or may not be cataloged, and may or may not be cataloged completely (relative to what would be done by librarians in a conventional package like Millennium). This characteristic could be an advantage or a disadvantage. Note that existing DSpace servers (e.g. see the MIT DSpace URL above) don't replace the main library catalog; they just augment it, presumably with (digital) material that doesn't warrant the level of cataloging attention given to material in the main library: doctoral theses, working papers, internal documents, and the like.
DSpace has two potential advantages, but neither has really been realized yet in the current version. The first, more general idea is that it can be told about different kinds of digital content. In principle, DSpace could be taught how to take apart an archive of a simulation and run it in some web context. This could be difficult or impossible to do with a package like Millennium because of the server-side customizations that could be needed. There doesn't seem to be any code in DSpace as of the current version to facilitate this; it's just an idea. Second, DSpace aims to provide searching of full-text contents in the next version. This is not the domain of the main Millennium package, but there seem to be some products from the company that may provide this to some extent.
In terms of exactly what DSpace is and how it is set up, here is some background. DSpace is an open-source Java project being led by MIT Libraries and Hewlett-Packard. It runs inside of the Tomcat JSP server. JSP, or Java Server Pages, is a way to write dynamic web pages in Java; its primary competitors are PHP (kind of low-tech in comparison) and ASP (from Microsoft). Tomcat is free from Apache. DSpace uses JDBC (Java -> SQL) interfaces for maintaining its databases. In other words, DSpace stores stuff in an RDBMS. They suggest Postgres (which is how I set it up), but MySQL should work too. So it ought to be possible to query the DSpace database from outside of DSpace and collect or install information.
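As a sketch of what an outside query could look like, here is a Python fragment assuming Postgres and guessing at the table and column names; the actual DSpace schema would need to be checked against the release in use.

 # Sketch: query DSpace's metadata tables directly, bypassing the web
 # app. Table and column names below are assumptions, not a documented
 # schema; verify them against the installed DSpace release.
 import psycopg2  # Postgres driver, matching the suggested RDBMS
 
 conn = psycopg2.connect(dbname='dspace', user='dspace')
 cur = conn.cursor()
 cur.execute("""
     SELECT dv.item_id, dv.text_value
     FROM dcvalue dv
     JOIN dctyperegistry dt ON dt.dc_type_id = dv.dc_type_id
     WHERE dt.element = 'title' AND dt.qualifier IS NULL
 """)
 for item_id, title in cur.fetchall():
     print(item_id, title)
 cur.close()
 conn.close()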
Specifically, it should be possible with a little bit of technical work to import the whole of the cleaned working paper bibliography into DSpace in batch (rather than by full re-entry), and to keep the EndNote version in sync through a procedure similar to the retrieval of citations from ISI (rather than co-entry into EndNote, FileMaker, and DSpace).
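A rough sketch of that batch step, assuming DSpace's simple archive import layout (one directory per item, each holding a dublin_core.xml and a contents file) and a hypothetical CSV export of the bibliography; the real FileMaker/EndNote export fields would differ.

 # Sketch: turn an exported working-paper bibliography into DSpace's
 # simple archive layout for batch import. The input file name, its
 # columns, and the PDF file names are all hypothetical.
 import csv, os
 from xml.sax.saxutils import escape
 
 with open('working_papers.csv') as fh:
     for i, row in enumerate(csv.DictReader(fh)):
         item = os.path.join('import', 'item_%04d' % i)
         os.makedirs(item, exist_ok=True)
         values = [('title', 'none', row['title']),
                   ('contributor', 'author', row['author']),
                   ('date', 'issued', row['year']),
                   ('identifier', 'other', row['wp_number'])]
         with open(os.path.join(item, 'dublin_core.xml'), 'w') as out:
             out.write('<dublin_core>\n')
             for element, qualifier, value in values:
                 out.write('  <dcvalue element="%s" qualifier="%s">%s</dcvalue>\n'
                           % (element, qualifier, escape(value)))
             out.write('</dublin_core>\n')
         with open(os.path.join(item, 'contents'), 'w') as out:
             out.write(row['pdf_filename'] + '\n')  # bitstream list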
The use of Millennium and/or DSpace as a tool for the Complexity Portal implies certain priorities in terms of how SFI thinks about the accessibility of resources needed by scientists at SFI and how the work done by SFI scientists is made available to the world. Some of these tools introduce technical constraints the committee should be thinking about.
Information Products at SFI: Context
November 2004
RATIONALE
A professional and integrated information system is no longer a luxury; it is the hallmark of a forward-looking, efficient organization. People expect state-of-the-art products from SFI in this domain, but unfortunately the current interface does not reflect an institution at the cutting edge of science. As SFI continues the process of refining its research goals, it is imperative that it present a current, representative spectrum of its intellectual work.
We also recognize that providing access to technology and information is one of the most important things this institution can do to nurture science in developing countries.
OBJECTIVES
General:
To develop an active, comprehensive knowledge repository and associated dissemination tools
To facilitate scientific collaborations within our national and international communities
Technical:
Increased perspective on the knowledge available: better search, categorization, and discovery tools
Improved information farming
Collaborative content building: real-time communication, ad hoc forums
Improved presentation/dissemination media
STRUCTURING MEDIA
Web: delivery venue for public relations, education, research dissemination
Publication Artifacts: traditional/electronic
Institutional Repository Site: published material, preprints, gray material, etc.
Library Site
Database Platforms
ACTION
Consensus on an action plan
Survey technology and develop a solution framework
Assemble team and resources
Estimate costs
Timeline and delivery schedule
Information Products at SFI: Survey of Currently Available Data
November 2004
Intellectual Products
Annual Research Reports
Bibliography of work in refereed publications
Bulletin
Business Network discussion sites
Modeling demos (a few sites)
Library: Online catalog, online journals, searches, user services
Research Topic summaries: synthetic, top-down report; parts outdated
Researcher sites (on an individual basis; don't exist for the entire community)
Working Papers and abstracts
Workshop/Working Group Sites: abstracts, discussion sites
Information
Business Network: Overview, members, events (online registration), projects, discussion sites
Directories: In-house coordinates (phone, email, URL, office location), computing guides
Education: Information/application materials for Complex Systems Summer Schools, other schools, postdoctoral, graduate, and undergraduate students, and secondary programs
Events: Calendar, talks (seminars, colloquia, public lectures), workshops
FAQs
Gifting Opportunities
General Information about SFI
Institutional forms
International Program: Fellowships, program information, link to CSSS
People: Board of Trustees, Science Board, External Faculty, Resident Researchers, Staff (coordinates: phone, email, URL, office location)
Policies: Working group and workshop guidelines, reimbursements
Visitors: Calendar of visitors, visitor guidelines and how to visit, travel info and driving directions, Santa Fe city info, accommodations, etc.
Available to In-house Readers Only
Online report of computing problems or computing work orders
Relatively speaking, there is a great deal of administrative information at the site, but considerably fewer intellectual products.
February 8, 2005
Next steps for developing knowledge resources at SFI
To: Santa Fe Institute Community
From: Ginger Richardson
Over the past several months several of us have had discussions about scholarly communications at SFI. We've even undertaken some ad hoc projects, including first steps at redesigning our web site, work on a pilot module exploring a semantic classification system, and development of web-based curricula material for the CSSS. These have been good warm-up moves; we're producing deliverables requested by Bill Melton for the International Program, and we've got a much better sense of the scope of the issue and how to approach it.
Our next step is to convene a group to explore Big Picture Issues about the nature of an SFI-wide institutional repository. Three big questions to consider are:
o What should be the general boundaries of the institutional repository? (Scholarly content, administrative content, or both?)
o What is a high-level description of the content? In other words, what is the "stuff" (e.g. text, formatted documents (a la PDF), data sets, video, audio, simulations, discussion boards, etc.) that will be in the repository, and what not?
o What do we think are the high-level functional requirements for the content? That is, generally, how do we want to be able to create and manage the content, and what do we want to be able to do with the content once it is in the repository?
As noted, several of us have discussed these questions in one form or another, but not comprehensively with all interested parties at hand. I invite anyone who has input to join us for a meeting to this end on Tuesday, March 1 at 1:30 p.m. in the Medium Conference Room. My intention is that we gather three or at most four times over the next several weeks. We will produce a working report by April 15. Assuming that report inspires a critical mass of SFI backing, some hardy (self-selected) subset of us will move on to a detailed requirements phase that will flesh out more fully what the needs are in terms of creating/populating the IR; management; and use.
More on the project dynamics when we get together.
Please respond by Feb. 25 if you plan to attend. In the meantime, here are a couple of general information URLs about digital collections and getting started (thanks to Bae).
o http://www.imls.gov/pubs/forumframework.htm
o http://www.nyu.edu/its/humanities/ninchguide
o http://www.cdpheritage.org/resource/metadata/wsdcmbp/index.html
Please look them over before the meeting; if you have other relevant materials you'd like to distribute to the group prior to the first meeting, let me know.
_______________________________________________
SIS_Requirements.pdf
SIS_Report.pdf
Crosswalks
2) MARC 21 to DC Crosswalk from the Library of Congress
This covers both qualified and unqualified DC.
MARC to DC: http://www.loc.gov/marc/marc2dc.html
This one provides the mapping in the form of an easy-to-read table.
DC to MARC: http://www.loc.gov/marc/dccross.html
As for whether or not "report number" maps to dc:relation... MARC #088 is "rept. #"; therefore, your report numbers would map to dc:identifier rather than dc:relation. See the back page of the copy of the TRI mappings that I gave you.
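To make the mapping concrete, here is a tiny illustrative crosswalk in Python. The table below is a simplified excerpt in the spirit of the LC mapping (with the 088 report number going to dc:identifier, as discussed), not the authoritative version, and the sample values are made up.

 # Sketch: crosswalk a few MARC fields to unqualified Dublin Core.
 # A simplified excerpt, not the full LC table at
 # http://www.loc.gov/marc/marc2dc.html.
 MARC_TO_DC = {
     '088': 'identifier',   # report number -> dc:identifier (see above)
     '100': 'contributor',  # personal name main entry
     '245': 'title',
     '260': 'publisher',
     '650': 'subject',      # topical subject heading
 }
 
 def crosswalk(marc_fields):
     """Map {MARC tag: value} pairs to (dc element, value) pairs."""
     return [(MARC_TO_DC[tag], value)
             for tag, value in marc_fields.items() if tag in MARC_TO_DC]
 
 # Made-up example: a working paper's report number becomes dc:identifier.
 print(crosswalk({'088': '02-11-063', '245': 'Sample Working Paper'}))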
I have not been able to find a better explanation of each element, other than what is in the DCMI web site. I did find a page full of links to crosswalks, though: http://www.ukoln.ac.uk/metadata/interoperability/
Regards, Jewel
