Friday, November 21, 2008

Muddiest Point

no muddiest point this week

Wednesday, November 12, 2008

greenstone...ALMOST THERE!

I feel like I am SO CLOSE to having greenstone up and running on my mac.

As near as I can tell, the only thing I still need to do is add the following directive to the httpd.conf file:

ScriptAlias /gsdl/cgi-bin "/opt/greenstone/cgi-bin"
<Directory "/opt/greenstone/cgi-bin">
Options None
AllowOverride None
</Directory>

Alias /gsdl "/opt/greenstone"
<Directory "/opt/greenstone">
Options Indexes MultiViews FollowSymLinks
AllowOverride None
Order allow,deny
Allow from all
</Directory>



So here's my question...where is the httpd.conf file, and how do I add this directive to it? Help!
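For anyone with the same question, here's my best guess at how to track it down (untested on my part; the path depends on whether it's the Apache bundled with OS X or one built from source):

# ask Apache where it was configured to look (prints HTTPD_ROOT and SERVER_CONFIG_FILE)
apachectl -V | grep -E 'HTTPD_ROOT|SERVER_CONFIG_FILE'

# common locations:
#   /etc/apache2/httpd.conf              (Apache bundled with Mac OS X)
#   /usr/local/apache2/conf/httpd.conf   (Apache built from source)

sudo nano /etc/apache2/httpd.conf    # paste the ScriptAlias/Alias block at the end of the file
sudo apachectl configtest            # check the syntax
sudo apachectl graceful              # reload Apache so the change takes effect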

week 11 readings

Arms:
viewpoint analysis, a technique from software development. The idea is to identify the various stakeholders in a system and view the system from each of their viewpoints.

Before computer networks, an emphasis on the organizational viewpoint was natural. When libraries were defined by their buildings, an individual patron used a very small number of libraries, perhaps the local public library or a university library.

Interoperability research assumes that there are many digital libraries: the challenge is how to encourage collaboration among independent digital libraries with differing missions and resources

From the user's viewpoint, technology is irrelevant and organizations are of secondary importance. Separate organizations, each with their own identity, can easily become an obstacle.

For the past decade, many people have carried out research and development on separate digital libraries and technical interoperability among them. As the early work matures, it would be easy for digital libraries research to become inbred, focusing on detailed refinement of the same agenda. Alternatively, we can think of the digital library from the user's viewpoint

About twenty years ago, independent computer networks began to merge into the single unified Internet that we take for granted today. Perhaps now is the time for digital libraries to strive for the same transition, to a single Digital Library

Roush:
The digitization of the world's enormous store of library books--an effort dating to the early 1990s in the United Kingdom, the United States, and elsewhere--has been a slow, expensive, and underfunded process.

Google's efforts and others like it will force libraries and librarians to reëxamine their core principles -- including their commitment to spreading knowledge freely

some librarians are very concerned about the terms of access and are very concerned that a commercial entity will have control over materials that libraries have collected

libraries, which allow many readers to use the same book, have always enjoyed something of an exemption from copyright law. Now the mass digitization of library books threatens to make their content just as portable -- or piracy prone, depending on one's point of view -- as digital music.

libraries in the United States are gaining users, despite the advent of the Web, and that libraries are being constructed or renovated at an unprecedented rate (architect Rem Koolhaas's Seattle Central Library, for example, is the new jewel of that city's downtown)

Digitization itself, of course, is no small challenge. Scanning the pages of brittle old books at high speed without damaging them is a problem that's still being addressed, as is the question of how to store and preserve their content once it's in digital form. The Google initiative has also amplified a long-standing debate among librarians, authors, publishers, and technologists over how to guarantee the fullest possible access to digitized books, including those still under copyright

Optical character recognition (OCR) technology cannot yet interpret handwritten script, so exposing the content of these books to today's search engines requires typing their texts into separate files linked to the original images

digitization machines: a fleet of proprietary robotic cameras, still under development, that will turn the digitization of printed books into a true assembly-line process and, in theory, lower the cost to about $10 per book, compared to a minimum of $30 per book today.

Google will give each participating library a copy of the books it has digitized while keeping another for itself. Initially, Google will use its copy to augment its existing Google Print program, which mixes relevant snippets from recently published books into the usual results returned by its Web search tool.

Each library may do whatever it likes with the digital scans of its own holdings -- as long as it doesn't share them with companies that could use them to compete with Google. Such limitations may prove uncomfortable, but most librarians say they can live with them.

free and open access is exactly what public libraries, as storehouses of printed books and periodicals, have traditionally provided. But the very fact that digital files are so much easier to share than physical books (which scares publishers just as MP3 file sharing scares record companies) could lead to limits on redistribution that prevent libraries from giving patrons as much access to their digital collections as they would like.

idea that there are some things you can exploit for commercial purposes for a certain amount of time, and then you play the open game

American Library Association is one of the loudest advocates of proposed legislation to reinforce the "fair use" provisions of federal copyright law, which entitle the public to republish portions of copyrighted works for purposes of commentary or criticism

Mass digitization may eventually force a redefinition of fair use, some librarians believe. The more public-domain literature that appears on the Web through Google Print, the greater the likelihood that citizens will demand an equitable but low-cost way to view the much larger mass of copyrighted books.

Social Aspects of Digital Libraries:
digital libraries represent a set of significant social problems that require human and technological resources to solve.

Digital libraries are a set of electronic resources and associated technical capabilities for creating, searching, and using information. In this sense they are an extension and enhancement of information storage and retrieval systems that manipulate digital data in any medium (text, images, sounds; static or dynamic images) and exist in distributed networks.

Digital libraries are constructed -- collected and organized -- by a community of users, and their functional capabilities support the information needs and uses of that community. They are a component of communities in which individuals and groups interact with each other, using data, information, and knowledge resources and systems.

While it is possible to build systems independent of human activities that will satisfy technical specifications, systems that work for people must be based on analyses of learning and other life activities. Empirical research on users should be influencing design in three ways: (1) by discovering which functionalities user communities regard as priorities; (2) by developing basic analytical categories that influence the design of system architecture; and (3) by generating integrated design processes that include empirical research and user community participation throughout the design cycle.

three themes:
  • Human-centered research issues: a focus on people, both as individual users and as members of groups and communities, communicators, creators, users, learners, or managers of information. We are concerned with groups and communities as units of analysis as well as with individual users.
  • Artifact-centered research issues: a focus on creating, organizing, representing, storing, and retrieving the artifacts of human communication.
  • Systems-centered research issues: a focus on digital libraries as systems that enable interaction with these artifacts and that support related communication processes.
Individual users of information technology are studied in communication, library and information science, education, psychology, human factors, and linguistics, among others. Most of the research in these disciplines views the individual as an actor who employs the technology for instrumental purposes.

Among the better understood topics at this level are the relationship between work practices and the design of systems and user interfaces; evolution, implementation, and evaluation of information technologies, especially in organizations; and user perceptions of and participation in development. A substantial body of work extending over several decades has demonstrated enduring inequities in the distribution of and access to information and related technologies across social groups.

Heterogeneous populations and applications:
Institutions/cultural objects of study
Information literacy skills:
Designing for richness:
Studies of situated use:
Design world/Content world interface:
Tools for content creators:

Making artifacts useful within a community:
Making artifacts useful to multiple communities:
Dynamic artifacts:
Hybrid digital libraries:
Professional practices and principles:
Human vs. automated indexing:
Legacy data
Hierarchies of description
Portability:
Artifactual relationships
Level of representation

Community-based development tools
Multiple interfaces
Social interfaces:
Mediating interaction:
Intelligent agents, user models:
Information presentation
Open architecture:
Development methods:
Tools for accessing and filtering information:

Participatory design:
Studying new activities
Levels of evaluation:
Iterative methods
Tailoring methods:

recommending that research be conducted on these themes, that scholars from multiple disciplines be encouraged to develop joint projects, that scholars and practitioners work together, and that digital libraries be developed and evaluated in operational, as well as experimental, work environments.

week 10 muddiest point

Is there any chance that we could have some sort of lab session and go over how to use Greenstone in the lab?

Friday, November 7, 2008

reading notes, week 10

Arms ch. 8
In discussing the usability of a computer system, it is easy to focus on the design of the interface between the user and the computer, but usability is a property of the total system. All the components must work together smoothly to create an effective and convenient digital library, for both the patrons, and for the librarians and systems administrators.

In any computer system, the user interface is built on a conceptual model that describes the manner in which the system is used.

The introduction of browsers, notably Mosaic in 1993, provided a stimulus to the quality of user interfaces for networked applications.

Mobile code gives the designer of a web site the ability to create web pages that incorporate computer programs

Interface design is partly an art, but a number of general principles have emerged from recent research. Consistency is important to users, in appearance, controls, and function. Users need feedback; they need to understand what the computer system is doing and why they see certain results. They should be able to interrupt or reverse actions. Error handling should be simple and easy to comprehend. Skilled users should be offered shortcuts, while beginners have simple, well-defined options. Above all the user should feel in control.

Research into functional design provides designers with choices about what functions belong on which of the various computers and the relationships between them.

A presentation profile is an interesting concept which has recently emerged. Managers of a digital library associate guidelines with stored information. The guidelines suggest how the objects might be presented to the user. For example, the profile might recommend two ways to render an object, offering a choice of a small file size or the full detail.

Few computer systems are completely reliable and digital libraries depend upon many subsystems scattered across the Internet

Kling and Elliott
"Systems usability" refers to how well people can exploit a computer system's intended functionality. Usability can characterize any aspect of the ways that people interact with a system, even its installation and maintenance.

Two key forms of DL usability - interface and organizational. The interface dimensions are centered around an individual's effective acclimation to a user interface, while the organizational dimensions are concerned with how computer systems can be effectively integrated into work practices of specific organizations.

interface usability dimensions:

1. Learnability - Ease of learning such that the user can quickly begin using it.

2. Efficiency - Ability of user to use the system with high level of productivity.

3. Memorability - Capability of user to easily remember how to use the system after not using it for some period.

4. Errors - System should have low error rate with few user errors and easy recovery from them. Also no catastrophic errors.

organizational usability dimensions include:

1. Accessibility - Ease with which people can locate specific computer systems, gain physical access and electronic access to their electronic corpuses. This dimension refers to both physical proximity and administrative/social restrictions on using specific systems.

2. Compatibility - Level of compatibility of file transfers from system to system.

3. Integrability into work practices - How smoothly the system fits into a person or group's work practices.

4. Social-organizational expertise - The extent to which people can obtain training and consulting to learn to use systems and can find help with problems in usage.

A great deal of people's satisfaction is influenced by the size and content of the corpus of a DL service

"Design for usability" is a new term that refers to the design of computer systems so that they can be effectively integrated into the work practices of specific organizations.

The usability engineering life cycle model includes these stages proposed as a paradigm for companies to follow:

1. Know the user - Study intended users and use of the product. At a minimum, visit customer site to study user's current and desired tasks, and to understand the evolution of the user and the job.

2. Competitive analysis - Analyze existing products according to usability guidelines and perform user tests with products.

3. Setting usability goals - Establish minimal acceptable level of usability and estimate the financial impact on cost of users' time.

4. Parallel design - Use several designers to explore different design alternatives before deciding on one final design.

5. Participatory design - Include end-users throughout design phase.

6. Coordinated design of the total interface - Maintain consistency across screen layouts, documentation, on-line help systems, and tutorials.

7. Apply guidelines and heuristic analysis - Select user interface guideline appropriate for situation.

8. Prototyping - Build prototype to pretest on end-users.

9. Empirical testing - Test end-users on specific usability attributes.

10. Iterative design - Capture design rationale through iterative testing and design.

11. Collect feedback from field use - Gather usability work from field studies for future design.

The organizationally sensitive model of "design for usability" is a new model. It refers to the design of computer systems so that they can be effectively integrated into the work practices of specific organizations. It goes beyond the focus on user interfaces. "Design for usability" includes the infrastructure of computing resources which are necessary for supporting and accommodating people as they learn to maintain and use systems. "Design for usability" encourages system designers either to accommodate to end-users' mix of skills, work practices, and resources or to try to alter them


Saracevic:
Usability has been used widely in digital library evaluation, but there is no uniform definition of what it covers in the digital library context. Usability is a very general criterion that covers a lot of ground and includes many specific criteria - it is a meta term. ISO defines usability "as the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use"

users have many difficulties with digital libraries, such as:
– they usually do not fully understand them
– they hold a different conception of a digital library from operators or designers
– they lack familiarity with the range of capabilities, content, and interactions provided by a digital library
– they often engage in blind alley interactions

in use, more often than not, digital library users and digital libraries are in an adversarial position

The ultimate evaluation of digital libraries will revolve around assessing transformation of their context - determining possible enhancing changes in institutions, learning, scholarly publishing, disciplines, small worlds, and ultimately society due to digital libraries.

Shneiderman and Plaisant
Human-Computer Interaction

Requirements analysis:
1- ascertain user needs
2- ensure proper reliability
3- context of use & appropriate standardization, integration, consistency, portability

evaluation:
1- time to learn
2- speed of performance
3- rate of errors by user
4- retention over time
5- subjective satisfaction

motivations:
1- life-critical systems
2- industrial/commercial
3- office/home/entertainment
4- exploratory/creative/collaborative
5- sociotechnical

Universal Usability:
1- Variations in physical abilities & physical workplaces
2- diverse cognitive & perceptual abilities
3- personality differences
4- cultural & international diversity
5- users with disabilities
6- older adult users
7- designing for/with children
8- hardware and software diversity

muddiest point week 9

when can we expect to get our midterm grades?

Wednesday, October 22, 2008

Muddiest Point - Week 8

Are there no readings this week?

Friday, October 17, 2008

muddiest point - week 7

no real muddiest point, just still haven't been able to install apache/greenstone

Reading Notes - Week 8

Miller - Federated Searching: Put It in Its Place

The overwhelming success of Google offers powerful evidence as to which search model users prefer.

the universe of available content is no longer limited to that stored within the library walls. Moreover, the type of content required by users is often not cataloged by most libraries.

Providing books and other cataloged material is only one aspect of the modern library's charter.

Google has taught us, quite powerfully, that the user just wants a search box. Arguments as to whether or not this is "best" for the user are moot—it doesn't matter if it's best if nobody uses it.

Hane - The Truth about federated searching

Federated searching is a hot topic that seems to be gaining traction in libraries everywhere

It's very difficult to manage authentication for subscription databases, particularly for remote users

It's impossible to perform a relevancy ranking that's totally relevant.

You can't get better results with a federated search engine than you can with the native database search. The same content is being searched, and a federated engine does not enhance the native database's search interface. Federated searching cannot improve on the native databases' search capabilities. It can only use them.

Lossau - Search Engine Technology and Digital Libraries

Libraries see themselves as central information providers for their clientele, at universities or research institutions. But how do they define academic content?

Libraries still see themselves as a place of collections rather than as an information "gateway". Other concerns of libraries are grounded in the fact that there is no guarantee that a remote host will maintain its resources in the long-term.

A paper from Michael Bergman on the "Deep Web" [2] highlights the dimensions we have to consider. Bergman talks about one billion individual documents in the "visible" [3] and nearly 550 billion documents on 200,000 web sites in the "deep" web

Libraries are increasingly hesitant to support big, monolithic and centralised portal solutions equipped with an all-inclusive search interface which would only add another link to the local, customer-oriented information services.

particularly at universities, libraries deal with a range of users with often different usage behaviours. It almost goes without saying that an undergraduate has other demands for information than a qualified researcher, and their usage behaviours can vary substantially. Young undergraduates will try much harder to transfer their general information seeking behaviour (using internet search engines) to the specific, academic environment, while established researchers have better accommodated the use of specific search tools

Current digital library systems integrate predominantly online library catalogues and databases with some full text repositories (e.g. e-journals)

The continual exponential growth in the volume of online web content as described above makes it unrealistic to believe that one library can build one big, all-inclusive academic web index. Even to provide a substantial part, such as indexing the academic online content of one country, would mean a major challenge to one institution. Thus, collaboration is required among libraries

Lynch - Z39.50 Information Retrieval

Z39.50 is one of the few examples we have to date of a protocol that actually goes beyond codifying mechanism and moves into the area of standardizing shared semantic knowledge. The extent to which this should be a goal of the protocol has been an ongoing source of controversy and tension within the developer community

"Information Retrieval (Z39.50); Application Service Definition and Protocol Specification, ANSI/NISO Z39.50-1995" -- is a protocol which specifies data structures and interchange rules that allow a client machine (called an "origin" in the standard) to search databases on a server machine (called a "target" in the standard) and retrieve records that are identified as a result of such a search

Z39.50 has its roots in efforts dating back to the 1970s to allow standardized means of cross-database searching among a handful of (rather homogeneous) major bibliographic databases hosted by organizations such as the Library of Congress, the Online Computer Library Center (OCLC), and the Research Libraries Information Network

Z39.50 becomes linked to the semantics of the databases being searched in two primary areas: the attribute sets used to describe the access points being searched, and the record syntax (and related record composition control parameters in PRESENT) that are used to actually transfer records back from server to client.

OAI-PMH

designed to enable greater interoperability between digital libraries. simpler than Z39.50

works with structured data, specifically XML

document-like objects

primary purpose is to define a standard way to move metadata from point a to point b within the virtual information space of the www
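To make that concrete, a sketch of what harvesting looks like on the wire (example.org is just a placeholder; the verbs and the oai_dc metadata prefix come from the protocol itself). Each request is a plain HTTP GET, and the repository answers with an XML envelope containing the metadata records:

http://example.org/oai?verb=Identify
http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
http://example.org/oai?verb=GetRecord&identifier=oai:example.org:123&metadataPrefix=oai_dc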

OAI formed 1999

OAI-PMH 2000

formal public opening 2001 "open day"

NOT inherently open access, nor traditional ARCHIVES

Friday, October 10, 2008

Reading Notes - Week 7

Henzinger et al.

users of web search engines tend to examine only the first page of search results. for commercially-oriented web sites whose income depends on their traffic, it is in their interest to be ranked within the top 10 results for a query relevant to the content of the web site.

to achieve high rankings, authors either use a text-based approach, a link-based approach, a cloaking approach, or a combination thereof.

traditional research in information retrieval has not had to deal with this problem of malicious content in the corpora.

the web is full of noisy, low-quality, unreliable, and indeed contradictory content. In designing a high-quality search engine, one has to start with the assumption that a typical document cannot be "trusted" in isolation, rather it is the synthesis of a large number of low-quality documents that provides the best set of results.

layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data.

There are two ways to try to improve ranking. one is to concentrate on a small set of keywords, and try to improve perceived relevance for that set of keywords. another technique is to try and increase the number of keywords for which the document is perceived relevant by a search engine.

a common approach is for an author to put a link farm at the bottom of every page in a site, where a link farm is a collection of links that points to every other page in that site, or indeed to any site that the author controls.

doorway pages are web pages that consist entirely of links. they are not intended to be viewed by humans; rather, they are constructed in a way that makes it very likely that search engines will discover them.

cloaking involves serving entirely different content to a search engine crawler than to other users.

while there has been a great deal of research on determining the relevance of documents, the issue of document quality or accuracy has not received much attention.

another promising area of research is to combine established link-analysis quality judgments with text-based judgments.

three assumed web conventions:
1) anchor text is meant to be descriptive
2) assume that if a web page author includes a link to another page, it is because the author believes that readers of the source page will find the destination page interesting and relevant.
3) META tags: currently the primary way to include metadata within HTML. content META tag used to describe the content of the document.

duplicate hosts are the single largest source of duplicate pages on the web, so solving the duplicate hosts problem can result in a significantly improved web crawler.

vaguely-structured data: information on these web pages is not structured in a database sense, typically it's much closer to prose than to data, but it does have some structure, often unintentional, exhibited through the use of HTML markup. not typically the intent of the webpage author to describe the document's semantics.

Hawking, pt. 1

search engines cannot and should not index every page on the web.

crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. when the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue.
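A toy sketch of that loop, just to make the queue idea concrete (my own illustration, not from the article; single-threaded Python with no politeness rules or robots.txt handling):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    queue, seen = deque([seed_url]), {seed_url}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue                       # unreachable or broken page: skip it
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links against the current page
            if absolute not in seen:       # only previously unseen URLs join the queue
                seen.add(absolute)
                queue.append(absolute)
    return seen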

even hundredfold parallelism is not sufficient to achieve the necessary crawling rate.

robots.txt file to determine whether the webmaster has specified that some or all of the site should not be crawled.

search engine companies use manual and automated analysis of link patterns and content to identify spam sites that are then included in a blacklist.

crawlers are highly complex parallel systems, communicating with millions of different web servers, among which can be found every conceivable failure mode, all manner of deliberate and accidental crawler traps, and every variety of noncompliance with published standards.

Hawking, pt. 2

search engines use an inverted file to rapidly identify, for each indexing term, the documents that contain a particular word or phrase.

in the first phase, scanning, the indexer scans the text of each input document.

in the second phase, inversion, the indexer sorts the temporary file into term number order, with the document number as the secondary sort key.
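A minimal sketch of the two phases on a toy scale (my own illustration; real indexers stream postings through temporary files far too large to sort in memory):

from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict of doc_id -> text. Returns term -> sorted list of doc_ids."""
    # Phase 1, scanning: walk each document and emit (term, doc_id) postings.
    postings = []
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings.append((term, doc_id))
    # Phase 2, inversion: sort by term (doc_id as secondary key), then group into posting lists.
    postings.sort()
    index = defaultdict(list)
    for term, doc_id in postings:
        if not index[term] or index[term][-1] != doc_id:   # skip duplicate doc ids
            index[term].append(doc_id)
    return dict(index)

docs = {1: "digital libraries store digital objects",
        2: "search engines index web documents"}
print(build_inverted_index(docs)["digital"])   # -> [1]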

the scale of the inversion problem for a web-sized crawl is enormous.

there is a strong economic incentive for search engines to use caching to reduce the cost of answering queries.

Lesk, ch. 4

Muddiest Point - Week 6

no muddiest point this week!

Friday, October 3, 2008

Reading Notes - Week 6

Hedstrom: Research Challenges in Digital Archiving and Long-term Preservation

Future research capabilities will be seriously compromised without significant investments in research and the development of digital archives.

Digital collections are vast, heterogeneous, and growing at a rate that outpaces our ability to manage and preserve them.

Human labor is the greatest cost factor in digital preservation.

need systems that are: self-sustaining, self-monitoring, self-repairing.

redundancy, replication, security against intentional attacks & technological failures, issues of forward migration: critical

Economic and policy research needs span a wide range of issues such as incentives for organizations to invest in digital archives and incentives for depositors to place content in repositories.

questions of intellectual property rights, privacy, and trust.

digital preservation will not scale without tools and technologies that automate many aspects of the preservation process and that support human decision-making.

models needed to support: selection, choice of preservation strategies, costs/benefits of various levels of description/metadata.

it is important to recognize that metadata, schemas, and ontologies are dynamic

managing schema evolution is a major research issue.

Research issues in the area of naming and authorization include development of methods for unique and persistent naming of archived digital objects, tools for certification and authentication of preserved digital objects, methods for version control, and interoperability among naming mechanisms

research is needed on the requirements for a shared and scalable infrastructure to support digital archiving

a metadata schema registry is also needed


Littman: Actualized Preservation Threats

Chronicling America: the three goals for the program are to support the digitization of historically significant newspapers, facilitate public access via a web site, and provide for the long-term preservation of these materials by constructing a digital repository.

made the explicit decision not to "trust" the repository until some later point; stored and backed up in a completely separate environment

four preservation threat categories: media failure, hardware failure, software failures, operator errors

a number of hard drive failures; in one case a second problem occurred while storage system was rebuilding; resulted in the loss of a small amount of data from the system. fortunately, file system diagnostics were able to identify & restore corrupted files

first software failure was failure to successfully validate digital objects created by awardees; gaps remained in validation that allowed awardees to submit METS records that passed validation and were ingested into the repository, but did not conform to the appropriate NDNP profile.

transformation failure: transformation of the METS record has proven to be complex and error prone; the transformation that put the original METS record inline was stripping the XML markup.

XMS file system was corrupted, resulting in the loss of some data

most significant threats to preservation occurred as a result of operator errors. deletion of a large number of files from a section of a file system; lack of auditing capabilities contributed to this problem.

mistakes performed during ingest

already implemented some significant architectural changes to address.


Lavoie: Technology Watch Report

digital preservation – securing the long-term persistence of information in digital form

cultural heritage institutions, businesses, government agencies, etc. – with the need to take steps to secure the long-term viability of the digital materials in their custody. Many of these entities do not perceive an archival function within the scope of their organizational mission.

no perceived consensus on the needs and requirements for maintaining digital information over the long-term. A unifying framework that could fill this gap would be invaluable in terms of encouraging dialog and collaboration among participants in standards-building activities, as well as identifying areas most likely to benefit from standards development.

two primary functions for an archival repository: first, to preserve information – i.e., to secure its long-term persistence – and second, to provide access to the archived information

obtain sufficient intellectual property rights, along with custody of the items, to authorize the procedures necessary to meet preservation objectives. For example, if the OAIS must create a new version of the archived item so that it can be rendered by current technologies, it must have the explicit right to do so.

must not only preserve information, but also a sufficient portion of its associated context to ensure that the information is understandable, and ultimately, useable by future generations. "Contextual information" that might be preserved includes, but is not limited to, a description of the structure or format in which the information is stored, explanations of how and why the information was created, and even its appropriate interpretation.

first functional component is Ingest, the set of processes responsible for accepting information submitted by Producers and preparing it for inclusion in the archival store.

Archival Storage. This is the portion of the archival system that manages the long-term storage and maintenance of digital materials entrusted to the OAIS.

Data Management is the third functional component of an OAIS. The Data Management function maintains databases of descriptive metadata identifying and describing the archived information in support of the OAIS’s finding aids; it also manages the administrative data supporting the OAIS’s internal system operations, such as system performance data or access statistics

Preservation Planning. This service is responsible for mapping out the OAIS’s preservation strategy, as well as recommending appropriate revisions to this strategy in response to evolving conditions in the OAIS environment.

Access is the fifth functional component of an OAIS-type archive. As its name suggests, the Access function manages the processes and services by which Consumers – and especially the Designated Community – locate, request, and receive delivery of items residing in the OAIS’s archival store.

Administration. The Administration function is responsible for managing the day-to-day operations of the OAIS, as well as coordinating the activities of the other five high-level OAIS services

OAIS information model is built around the concept of an information package: a conceptualization of the structure of information as it moves into, through, and out of the archival system. An information package consists of the digital object that is the focus of preservation, along with metadata necessary to support its long-term preservation and access, bound into a single logical package

Submission Information Package, or SIP, is the version of the information package that is transferred from the Producer to the OAIS when information is ingested into the archive.

Archival Information Package, or AIP, is the version of the information package that is stored and preserved by the OAIS.

Dissemination Information Package, or DIP, is the version of the information package delivered to the Consumer in response to an access request.

Taken together, the Content Information and Preservation Description Information represent the archived digital content, the metadata necessary to render and understand it, and the metadata necessary to support its preservation.

Jones/Beagrie: Introduction & Digital Preservation

growing awareness of the significant challenges associated with ensuring continued access to these materials, even in the short term.

The need to create and have widespread access to digital materials has raced ahead of the level of general awareness and understanding of what it takes to manage them effectively.

institutions that have not played a role in preserving traditional collections do not have a strong sense of playing a role in preserving digital materials. Individual researchers were keen to "do the right thing" but frequently lacked the clear guidance and institutional backing to enable them to feel confident of what they should be doing

Digital preservation has many parallels with traditional preservation in matters of broad principle but differs markedly at the operational level and never more so than in the wide range of decision makers who play a crucial role at various stages in the lifecycle of a digital resource

While there is as yet only largely anecdotal evidence, it is certain that many potentially valuable digital materials have already been lost.

Machine Dependency. Digital materials all require specific hardware and software in order to access them

The speed of changes in technology means that the timeframe during which action must be taken is very much shorter than for paper

Fragility of the media. The media digital materials are stored on is inherently unstable and without suitable storage conditions and management can deteriorate very quickly

The ease with which changes can be made and the need to make some changes in order to manage the material means that there are challenges associated with ensuring the continued integrity, authenticity, and history of digital materials.

The implications of allocating priorities are much more severe than for paper.

The nature of the technology requires a life-cycle management approach to be taken to its maintenance

widely acknowledged that the most cost-effective means of ensuring continued access to important digital materials is to consider the preservation implications as early as possible, preferably at creation, and actively to plan for their management throughout their lifecycle.

All public institutions such as archives, libraries, and museums need to be involved in applying their professional skills and expertise to the long-term preservation of digital materials, just as they have taken a role in the preservation of traditional materials.

Preservation costs are expected to be greater in the digital environment than for traditional paper collections

need actively to manage inevitable changes in technology at regular intervals and over a (potentially) infinite timeframe.

lack of standardisation in both the resources themselves and the licensing agreements

as yet unresolved means of reliably and accurately rendering certain digital objects so that they do not lose essential information after technology changes

for some time to come digital preservation may be an additional cost on top of the costs for traditional collections unless cost savings can be realised

Because digital material is machine dependent, it is not possible to access the information unless there is appropriate hardware, and associated software which will make it intelligible.

While it is technically feasible to alter records in a paper environment, the relative ease with which this can be achieved in the digital environment, either deliberately or inadvertently, has given this issue more pressing urgency

Although computer storage is increasing in scale and its relative cost is decreasing constantly, the quantity of data and our ability to capture it with relative ease still matches or exceeds it in a number of areas.

approaches to digital preservation:
- Preserve the original software (and possibly hardware) that was used to create and access the information. This is the technology preservation strategy.
- Program future powerful computer systems to emulate older, obsolete computer platforms and operating systems as required. This is the technology emulation strategy.
- Ensure that the digital information is re-encoded in new formats before the old format becomes obsolete. This is the digital information migration strategy.

The dramatic speed of technological change means that few organisations have been able even fully to articulate what their needs are in this area, much less employ or develop staff with appropriate skills

Roles are also changing within as well as between institutions. Assigning responsibility for preservation of digital materials acquired and/or created by an organisation will inevitably require involvement with personnel from different parts of the organisation working together

Some consideration also needs to be given in the selection to the level of redundancy needed to ensure digital preservation. A level of redundancy with multiple copies held in different repositories is inherent in traditional print materials and has contributed to their preservation over centuries

The IPR issues in digital materials are arguably more complex and significant than for traditional media and if not addressed can impede or even prevent preservation activities. Consideration may need to be given not only to content but to any associated software

Muddiest Point

No muddiest point this week, but I am still unable to install apache on my mac and thus cannot run greenstone.

Sunday, September 28, 2008

assignment 2 - photo link

here's a photobucket album with my pictures!

Friday, September 26, 2008

Muddiest Point

Will we actually be doing any XML coding for our term projects, or is this something we just need to understand conceptually?

Reading Notes - Week 5

Bryan:
XML is a subset of the Standard Generalized Markup Language.

XML allows users to:

  • bring multiple files together to form compound documents
  • identify where illustrations are to be incorporated into text files, and the format used to encode each illustration
  • provide processing control information to supporting programs, such as document validators and browsers
  • add editorial comments to a file
XML is a formal language that can be used to pass information about the component parts of a document to another computer system

provides a formal syntax for describing the relationships between the entities, elements and attributes that make up an XML document

users must create a Document Type Definition that formally identifies the relationships between the various elements that form their documents

Where elements can have variable forms, or need to be linked together, they can be given suitable attributes to specify the properties to be applied to them

An XML file normally consists of three types of markup, the first two of which are optional:

  1. An XML processing instruction identifying the version of XML being used, the way in which it is encoded, and whether it references other files or not (see the sketch after this list)
  2. A document type declaration that either contains the formal markup declarations in its internal subset (between square brackets) or references a file containing the relevant markup declarations (the external subset)
  3. A fully-tagged document instance which consists of a root element, whose element type name must match that assigned as the document type name in the document type declaration, within which all other markup is nested.
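A small invented example showing all three parts together - the processing instruction, a document type declaration with an internal subset, and a document instance whose root element matches the declared document type name:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<memo>
  <to>Library staff</to>
  <body>Greenstone training is on Friday.</body>
</memo>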
XML-coded files are, by their nature, ideal for storing in databases. Because XML files are both object-orientated and hierarchical in nature they can be adapted to virtually any type of database, though care sometimes needs to be taken to ensure that enough structural data is retained in the database to reconstruct the original file


Ogbuji:

XML is based on Standard Generalized Markup Language (SGML), defined in ISO 8879:1986 [ISO Standard]. It represents a significant simplification of SGML, and includes adjustments that make it better suited to the Web environment.

an entity catalog can be used to specify the location from which an XML processor loads a DTD, given the system and public identifiers for that DTD. System identifiers are usually given by Uniform Resource Identifiers (URIs)

**
A URI is just an extension of the familiar URLs from use in Web browsers and the like. All URLs are also URIs, but URLs also add URNs** [is this true? i thought the opposite...]

In XML namespaces each vocabulary is called a namespace and there is a special syntax for expressing vocabulary markers. Each element or attribute name can be connected to one namespace
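For instance, a made-up fragment where two vocabularies are kept apart by namespace prefixes (the dc URI is the standard Dublin Core one; the inv namespace is invented):

<book xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:inv="http://example.org/inventory">
  <dc:title>Digital Libraries</dc:title>
  <inv:shelf>QA76.9</inv:shelf>
</book>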

XML Base [W3C Recommendation] provides a means of associating XML elements with URIs in order to more precisely specify how relative URIs are resolved in relevant XML processing actions.

the XML Infoset, defines an abstract way of describing an XML document as a series of objects, called information items, with specialized properties. This abstract data set incorporates aspects of XML documents defined in XML 1.0, XML Namespaces, and XML Base. The XML Infoset is used as the foundation of several other specifications that try to break down XML documents

a physical representation of an XML document, called the canonical form, accounts for the variations allowed in XML syntax without changing meaning

XPointer, language that can be used to refer to fragments of an XML document

XLink offers such links (simple links), as well as more complex links that can have multiple end-points (extended links), and even links that are not expressed in the linked documents, but rather in special hub documents (called linkbases).


XML Tutorial:

The purpose of an XML Schema is to define the legal building blocks of an XML document, just like a DTD.

An XML Schema:

  • defines elements that can appear in a document
  • defines attributes that can appear in a document
  • defines which elements are child elements
  • defines the order of child elements
  • defines the number of child elements
  • defines whether an element is empty or can include text
  • defines data types for elements and attributes
  • defines default and fixed values for elements and attributes
XML Schema became a W3C Recommendation on 2 May 2001.

One of the greatest strengths of XML Schemas is the support for data types.

With support for data types:

  • It is easier to describe allowable document content
  • It is easier to validate the correctness of data
  • It is easier to work with data from a database
  • It is easier to define data facets (restrictions on data)
  • It is easier to define data patterns (data formats)
  • It is easier to convert data between different data types
A simple element is an XML element that can contain only text

Simple elements cannot have attributes. If an element has attributes, it is considered to be of a complex type. But the attribute itself is always declared as a simple type.

A complex element is an XML element that contains other elements and/or attributes.

There are four kinds of complex elements:

  • empty elements
  • elements that contain only other elements
  • elements that contain only text
  • elements that contain both other elements and text
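A small invented schema fragment illustrating the difference: title is a simple element containing only text, while book is a complex element containing child elements and an attribute.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title" type="xs:string"/>
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="title"/>
        <xs:element name="year" type="xs:integer"/>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>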

WSDL is a schema-based language for describing Web services and how to access them.

WSDL describes a web service, along with the message format and protocol details for the web service.


Bergholz:

Extensible Markup Language (XML), a semantic language that lets you meaningfully annotate text. Meaningful annotation is, in essence, what XML is all about.

DTDs let users specify the set of tags, the order of tags, and the attributes associated with each

Elements can have zero or more attributes, which are declared using the !ATTLIST tag
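For example, an invented DTD fragment declaring an element and two of its attributes with !ELEMENT and !ATTLIST:

<!ELEMENT article (title, author+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ATTLIST article
          id   ID              #REQUIRED
          lang (en | de | fr)  "en">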

Using namespaces avoids name clashes (that is, situations where the same tag name is used in different contexts). For instance, a namespace can identify whether an address is a postal address, an e-mail address, or an IP address

Unfortunately, namespaces and DTDs do not work well together

XML extends HTML's linking capabilities with three supporting languages:
  • XLink (http://www.w3.org/TR/xlink/), which describes how two documents can be linked;
  • XPointer, which enables addressing individual parts of an XML document; and
  • XPath, which is used by XPointer to describe location paths.

The Extensible Stylesheet Language (XSL) is actually two languages: a transformation language (called XSL Transformations, or XSLT) and a formatting language (XSL formatting objects). Although DTDs were the first proposal to provide for a standardized data exchange between users, they have disadvantages. Their expressive power seems limited, and their syntax is not XML. Several approaches address these disadvantages by defining a schema language (rather than a grammar) for XML documents:
  • document definition markup language (DDML), formerly known as XSchema,
  • document content description (DCD),
  • schema for object-oriented XML (SOX), and
  • XML-Data (replaced by DCD).
The W3C's XML Schema activity takes these four proposals into consideration.

Friday, September 19, 2008

Reading Notes

Setting the Stage
metadata: the sum total of what one can say about any information object at any level of aggregation
- content (intrinsic) - what is contained/about
- context (extrinsic) - who what when etc about creation
- structure (either int or ext) - associations b/t or among independent info objects

library metadata: includes indexes, abstracts & catalog records created according to cataloging rules (MARC, LCSH etc)

archival & manuscript metadata: accession records, finding aids, catalog records.
- MARC Archival and Manuscript Control (AMC), no MARC format for bib control

not as much emph on structure for lib/arch, but always important even before digitization.
- growing as comp capabilities increase
- structure can be exploited for searching, etc but need specific metadata

metadata:
- certifies authenticity & degree completeness of content
- establishes & documents context of content
- identifies & exploits structural rel's that exist b/t & w/in info objects
- provides range of intell access points for increasingly diverse range of users
- provides some of info an info prof might have provided in a physical setting

repositories also metadata for admin, accession, preserving, use of coll's
- personal info mgmt, recordkeeping

Dublin Core Metadata Element Set

Table 1 - Types
1) Administrative (acqu info, rights & repro, documentation of legal access req's, location info, selection criteria, version control, audit trails)
2) Descriptive (cat records, finding aids, spec indexes, hyperlink rel's, annotations by users, md for recordkeeping systems)
3) Preservation (documentation of phys condition, actions taken to preserve)
4) Technical (hard/software documentation, digitization info, track sys resp time, auth & sec data)
5) Use (exhibit records, use & user tracking, re-use & multi-version info)

Attributes of metadata:
- Source (Int or Ext)
- Method of creation (auto/comp or manual/human)
- Nature (lay/nonspecialist v. expert - but often orig is lay)
- Status (static, dynamic, long- or short-term)
- Structure (structured or no)
- Semantics (Controlled or no)
- Level (collection or item)

Life-Cycle of Info Object:
1) Creation & Multi-Versioning
2) Organization
3) Searching & Retrieval
4) Utilization
5) Preservation & Disposition

Little-Known facts about metadata:
- doesn't have to be digital
- more than just description
- variety of sources
- continue to accrue during life of info object/sys
- one info obj's metadata can simultaneously be another's data

Why Important?
- Increased accessibility
- retention of context
- expanding use
- multi-versioning
- legal issues
- preservation
- system improvement & economics


Border Crossings

DC: simple, modular, extensible metadata

prob's w/ user-created metadata

structured md important in managing intellectual assets

md for images critical for searchability/discoverability

controlled v. uncontrolled vocab (folksonomy)
- still not sure of future role

"Railroad Gage Dilemma" - why no common formal model?

demand for international/multicultural approach


Witten 2.2

19th c goals for bib sys: finding, collocation, choice

Today: 5 goals
1) locate entities
2) identify
3) select
4) acquire
5) navigate rel's between

principle entities: documents, works, editions, authors, subjects.
- titles & sub classifications are attributes of works, not own entitities

doc's: fundamental hierarchical structure that can more faithfully be reflected in DL than physical

"helpful DLs need to present users an image of stability & continuity" (p.49)

works: disembodied contents of a document

edition - also "version, release, revision"

authority control

subjects: extraction: phrases analyzed for gram & lexical structure
- key phrase assignment (auto class), easier for scientific than literary

LCSH controlled vocab rel's: equivalence, hierarchical, associative


Witten 5.4 - 5.7

MARC & Dublin Core, BibTeX & Refer

MARC: AACR2R guidelines
- authority files & bib records

Dublin Core: specifically for non-specialist use
- title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights
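A sketch of what a record using those elements might look like in XML (an invented record; only some of the fifteen elements are filled in, which Dublin Core allows):

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Field Notes on Appalachian Folk Music</dc:title>
  <dc:creator>Smith, Jane</dc:creator>
  <dc:subject>Folk music -- Appalachian Region</dc:subject>
  <dc:date>1938</dc:date>
  <dc:type>Sound</dc:type>
  <dc:format>audio/mpeg</dc:format>
  <dc:identifier>http://example.org/items/1234</dc:identifier>
  <dc:language>en</dc:language>
  <dc:rights>Public domain</dc:rights>
</metadata>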

BibTeX: math/sci

Refer: basis of endnote

metadata even more imp for images & multimedia
- image files contain some info

TIFF: tags. integers & ASCII text. dozen or so mandatory.
- most DLs use for images

MPEG-7: multimedia content description interface
- still pics, graphics, 3D models, audio, speech, video, combinations
- stored or streamed
- complex/extensible: DDL (desc def lang) uses XML syntax
- temporal series, spectral vector
- some automatic, some by hand

text-mining: auto metadata extraction
- structured markup s/a XML

Greenstone includes lang ident, extracting acronyms & key phrases, generating phrase hierarchies.

some info easy to extract: url/email, money, date/time, names
- generic entity extraction

bib ref's / citation info can be located & parsed

n-grams procedure for lang ID, can assign language metadata
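A toy sketch of the idea (my own illustration, not from Witten): build character trigram profiles from known-language samples and label a new text by whichever profile it overlaps most.

from collections import Counter

def trigram_profile(text, top=300):
    """Most common character trigrams in a text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2)).most_common(top)

def identify_language(text, samples):
    """samples: dict of language name -> sample text in that language."""
    target = dict(trigram_profile(text))
    def overlap(lang):
        return sum(min(count, target.get(gram, 0))
                   for gram, count in trigram_profile(samples[lang]))
    return max(samples, key=overlap)

samples = {"english": "the quick brown fox jumps over the lazy dog and the cat",
           "german": "der schnelle braune fuchs springt ueber den faulen hund"}
print(identify_language("the dog and the fox", samples))   # -> english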

key-phrase: assignment/extraction
- assignment: compare doc to each entry in list (Y or N)
- extraction: phrases from doc listed & best chosen (easier)

IDing phrases: content word, phrase delimiters, maximal length

Wednesday, September 17, 2008

muddiest point

My "muddiest point" this week is installing and running greenstone. It appears that the only version available for mac is the web version, which requires me to download a web server called Apache...I've done so but am not sure what to do with Apache.

Maybe I am making things way more complicated than necessary, but some help would be appreciated!

Friday, September 12, 2008

Reading Notes week 3

I don't have all of the readings in front of me at the moment, so I'm going to just do a brief commentary on them as a whole.

I have to say that I'm still confused by the concept of DOIs. Unfortunately I read the Lynch article last...I feel like reading that one first would have perhaps made things a bit clearer. I understand the motivation behind developing a DOI system, in particular the concept of persistence. However, I do not necessarily see how they would work. Lynch raised some good points that helped me to understand that this system is not yet fully fleshed out, and that was helpful. In particular the statement that "Today's standard browsers do not yet understand URNs and how to invoke resolvers to convert them to URLs, but hopefully this support will be forthcoming in the not too distant future."

I suppose one of the main things that confused me was the need for something similar to an ISBN for digital content. Perhaps (probably) I am over-simplifying things, but it seems to me that an ISBN or ISSN is necessary because large numbers of copies exist for printed works. For most digital content, only one or a small few copies exist and this/these can be accessed by multiple users concurrently. To be honest, I am not well versed in Intellectual Property or Copyright issues at all, and this is probably what is confusing me the most.

Another thing that stood out to me was the issue of not being able to learn an object's DOI unless the object carries it as a label. If I wanted to know the ISBN of a particular book, there are ways for me to find it by searching with other, known identifiers such as the title. It seems to me that there is a huge issue in general with the naming of web-based content. Digital files can be given filenames that could serve as some sort of searchable identifier, but in general web pages and sites that host digital content are haphazardly named and authoring information is inconsistently revealed, if at all.

Lesk and Arms Chapter 9 were straightforward and I do not really have much to say about them. In particular, Arms provided a good refresher for me on some concepts that I am somewhat familiar with, but could always learn more about!

I hope this was sufficient for this week's posting...

about me =)

It occurred to me that since I missed the first class session and had to run out right at the end of last week's class (to get to another), I haven't really met anyone in this class yet. And to top it off, comcast had my internet off all week for no good reason! So, here's a little bit more about me, in case anyone is looking for a potential group member...

I grew up in Louisville, KY and then attended college at Emory University in Atlanta, GA. I have a BA in Linguistics and Psychology with a minor in Russian (although I've pretty much lost all ability to speak it, oops). I'm hoping to get a second masters in the not-too-distant future, most likely in cognitive science. I'd like to work in an academic library and focus on services for science/math.

After finishing my BA I worked at an educational non-profit organization teaching kids from 3rd-8th grade, and when I couldn't make rent on that "salary" anymore (ha) I took a job working for an advertising consultant doing generic office grunt work. I also did a good amount of publication and presentation design. I was graphics editor and then editor-in-chief of my college's humor magazine, so I have lots of experience with that sort of thing.

I never really thought of myself as especially tech-savvy, but recently I've decided that maybe I'm selling myself a bit short. I don't have any formal training in anything, but I can generally figure out most things that are put in front of me. As I mentioned, I have lots of design experience and can certainly contribute that.

I've thrown a couple of ideas for the project around in my head, but I'm basically open to anything. At the moment I'm still confused about copyright issues for the materials we use, so I'm not trying to get too attached to any one idea.

I am a talker. In case you didn't notice, ha. I can go on at length on pretty much any topic. But some things that I am most interested in are music (especially independent rock, hip-hop, and electronic stuff), crafts, magazines, pets (especially cats), kitschy robot stuff, and language.

I'll be out of town this weekend (back for class Monday), but feel free to email me if you'd be interested in working together.

Muddiest Point - Week 3

I am still rather confused over the issue of using copyrighted materials for our term projects. It would be great if someone could go into detail about what types of materials are appropriate/ok to use and what are not...

Monday, September 8, 2008

An Architecture for Information in Digital Libraries

Main bldg blocks: digital objects, handles, repositories

Purpose of IA is to rep riches & variety of library info
-digital object: way of structuring info in dig form, some of which may be metadata & includes a unique identifier called a handle.
-DOs often in sets, structure depends on info
-material can be divided into cat's (SGML, WWW objects, comp prog's, digitized radio prog's etc)
-user interface: browser & client svcs
-repository: interface called Repository Access Protocol
-handle system: unique identifiers
-search system

Issues in structuring info:
-dig materials frequently related to others by relationships (part/whole etc)
-same item may be in different formats
-diff versions created often (mult copies, or time-based)
-obj's have diff rights & permissions
-users access from diff comp sys's & networks

key-metadata: info to store, replicate, transmit obj w/out providing access to the content. includes terms & cond's and handle

digital material: used to store DL materials
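
To keep these abstractions straight, here is a little Python sketch of my own (not code from the paper): a digital object bundles its data with key-metadata, the key-metadata always carries the handle and the terms & conditions, and a repository stores and hands back objects by handle without having to expose their content.

from dataclasses import dataclass, field

@dataclass
class KeyMetadata:
    handle: str                   # unique, persistent identifier
    terms_and_conditions: str     # rights & permissions statement

@dataclass
class DigitalObject:
    key_metadata: KeyMetadata
    data: bytes = b""             # the content itself (text, image, program, ...)
    other_metadata: dict = field(default_factory=dict)

@dataclass
class Repository:
    objects: dict = field(default_factory=dict)

    def deposit(self, obj: DigitalObject) -> str:
        """Store an object; it is retrieved again via its handle (cf. the handle system)."""
        self.objects[obj.key_metadata.handle] = obj
        return obj.key_metadata.handle

repo = Repository()
repo.deposit(DigitalObject(KeyMetadata("hdl:example/123", "no restrictions"), b"report text"))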

A Framework for Building Open Digital Libraries

"Open" DLs build directly on concepts/philosophies of Open Archives Initiative

Most existing systems classified as DLs resulted from custom-built software dev projects - each involves intensive design, implementation & testing cycles.
- why repeat effort?
- some software toolkits: Dienst, Repository-in-a-box

Most programming environments adopt a component model

Oct 1999 - OAI launched.
- focus on high-level communication among systems & simplicity of protocol
- OAI Protocol for Metadata Harvesting (OAI-PMH): system of interconnected components
- OAI protocol can be thought of as glue that binds together components of a larger DL (or collaborative system of DLs)

DLs modeled as networks of extended OAs, with each OA being a source of data and/or provider of services.
- this approach closely resembles the way physical libraries work
- research & production DLs differ

ODLs guided by a set of design principles & operationalized with aid of OAI-PMH extensions. proven techniques from internet development
- simplicity of protocols, openness of standards, layering of semantics, independence of components, loose coupling of systems, purposeful orthogonality, reuse

Formal Principles:
1) All DL svcs should be encapsulated w/in components that are extensions of OAs
2) All access to DL svcs should be through their extended OAI interfaces
3) Semantics of OAI Protocol should be extended or overloaded as allowed by OAI protocol, but w/out contradicting essential meaning
4) All DL svcs should get access to other data sources using extended OAI protocol
5) DLs should be constructed as networks of extended OAs

OAI harvester obtains a data stream, from which indices for searching are created.
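
A minimal Python sketch of the harvesting step (my own illustration, not code from the paper): issue a ListRecords request against a repository's OAI-PMH base URL and pull out the Dublin Core titles. The base URL below is a placeholder, and resumption tokens and error handling are left out.

import urllib.request
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"   # Dublin Core namespace used by oai_dc records

def harvest_titles(base_url):
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return [title.text for title in tree.iter(DC_NS + "title")]

# titles = harvest_titles("http://example.org/oai")   # placeholder base URL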

Components in prototype systems:
- Union: combine metadata from mult srcs
- Filter: reformat metadata from non-OAI data srcs
- Search: search-engine functionality
- Browse: category-driven browsing fx'lity
- Recent: sample of recently-added items

Tested on NDLTD system - good feedback

Designing by principles & implementing for real-world scenario

Interoperability for Digital Objects & Repositories

Cornell & CNRI: open architecture, confederated DLs, goal of interoperability & extensibility. Allows flexible interaction of existing services & augmentation of the infrastructure with new services.

Interoperability: broad problem domain. typically investigated w/in specific scope (community, classification of info, IT area, etc)
- creating a general framework for info access & integration across domains
- goal to enable communities w/ different info & tech to achieve general level of info sharing
Definition: ability of DL components or services to be functionally & logically interchangeable by virtue of their having been implemented in accordance with a set of well-defined, publicly known interfaces.

some approaches:
1) standardization
2) distributed object request architectures (eg CORBA)
3) remote procedure calls
4) mediation
5) mobile computing
Cornell/CNRI approach:
1) agreement on common abstractions
2) definition of open interfaces to services/components that implement the abstractions
3) creation of extensibility mechanism for introducing new functionality into arch w/out interfering w/ core interoperability

Principal abstractions:
1) repository: different content managed in uniform manner
2) Digital Object: datastreams/elements (MIME-typed sequence of bytes)
3) Disseminator: extend behavior of DOs & enable interaction
4) AccessManager

Disseminator Types: set of op's that extends basic functionality of a DO. (book - "get next page" "translate text" etc)
- signatures
- DOs can be used to make diss types avail in interface

Servlets: executable program capable of performing the set of op's defined for specific diss types.
- equivalence achieved when diff servlets operate on diff types of underlying datastreams to produce equivalent results.
- stored & registered in infrastructure in uniquely named DO
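
A toy Python illustration of the disseminator/servlet idea, not the paper's actual implementation: a disseminator type is a named set of operations (its signature), and a servlet implements that signature over a particular kind of underlying datastream. Two servlets over different storage formats can answer the same "book" operations, which is the equivalence described above.

class BookDisseminator:
    """The disseminator type's signature: operations any 'book' servlet must support."""
    def get_page(self, number: int) -> str:
        raise NotImplementedError

class PlainTextBookServlet(BookDisseminator):
    """Servlet over a plain-text datastream with one page per form feed."""
    def __init__(self, datastream: str):
        self.pages = datastream.split("\f")

    def get_page(self, number: int) -> str:
        return self.pages[number]

class LineOrientedBookServlet(BookDisseminator):
    """Servlet over a line-oriented datastream, treating every 40 lines as a page."""
    def __init__(self, datastream: str):
        lines = datastream.splitlines()
        self.pages = ["\n".join(lines[i:i + 40]) for i in range(0, len(lines), 40)]

    def get_page(self, number: int) -> str:
        return self.pages[number]

# Either servlet satisfies the same signature, so a client can ask for page 0
# without caring how the underlying datastream is stored.
print(PlainTextBookServlet("page one\fpage two").get_page(0))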

Extensibility: key is clean separation of object structure, diss types, & mechanisms that implement extended functionality.
- can endow DO w/ additional functionality
- new interfaces can be added

Interoperability Experiments:
- IT0: Protocol & Syntactic Interoperability
- IT1: Functional & Semantic
  - 1.1: DO Access
  - 1.2: DO Creation
  - 1.3: Extensible Access
- IT2: Interoperability of Extensibility Mechanisms
  - 2.1: Ability to dynamically load signatures & servlets
  - 2.2: Demonstrate flexibility w/ which new diss types are dynamically added to infrastructure

Arms chapter 2

I really enjoyed this article as a refresher on some concepts that I'm familiar with, but hadn't known all the specifics about.


*****
Internet- collection of networks. LAN & Wide-Area Networks. Based on protocol of ARPAnet (TCP/IP)
*IP: Internet Protocol. Joins together the network segments that constitute the internet. An address is four numbers, each 0-255, stored as 4 bytes. Segments connected by routers. Info travels in packets.
*TCP: Transport Control Protocol. Divides msg into packets, labels each w/ destination IP & Sequence #, sends them out on network. Receiving comp acknowledges receipt & reassembles.
- guarantees error-free delivery, but not prompt.
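
Two of these points are easy to see in a few lines of Python (my own illustration, not from Arms): an IPv4 address really is four numbers stored as 4 bytes, and TCP hands applications an ordered, error-checked byte stream while packetizing, sequencing, and retransmission happen underneath.

import socket

packed = socket.inet_aton("192.0.2.17")   # an IPv4 address packed into 4 bytes
print(len(packed), list(packed))          # 4 [192, 0, 2, 17]

# Opening a TCP connection: packetizing, sequencing, acknowledgement, and
# retransmission all happen underneath this simple byte-stream interface.
# with socket.create_connection(("example.org", 80), timeout=5) as conn:
#     conn.sendall(b"HEAD / HTTP/1.0\r\nHost: example.org\r\n\r\n")
#     print(conn.recv(200))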

TCP/IP Suite of Programs:
*Terminal Emulation: telnet
*File Transfer: FTP
*Email: SMTP (Simple Mail Transfer Protocol)

Scientific publishing on the internet:
*RFC's (request for comment)
*IETF (Internet Engineering Task Force)
*Los Alamos E-print archives

HTML, HTTP, MIME, URLs
*MIME specifies data type.
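
Quick illustration of MIME specifying data type, using Python's standard mimetypes module: the filename extension maps to the MIME type a server would send and a browser would use to decide how to render the content.

import mimetypes

for name in ["paper.pdf", "photo.jpeg", "notes.txt", "index.html"]:
    print(name, "->", mimetypes.guess_type(name)[0])
# paper.pdf -> application/pdf
# photo.jpeg -> image/jpeg
# notes.txt -> text/plain
# index.html -> text/html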

Conventions:
*web sites
*home page
*buttons
*hierarchical organization

web is giant step to build DLs on...not just detour until real thing comes along

Thursday, September 4, 2008

Suleman & Fox: A Framework for Building Open Digital Libraries

-- "Open" DLs build directly on concepts/philosophies of Open Archives Initiative.

-- Most existing systems classified as DLs resulted from custom-built software dev projects - each involves intensive design, implementation & testing cycles. why repeat effort? some software toolkits: Dienst, Repository-in-a-box etc

-- Most programming enviro's adopt a component model.

-- OAI launched October 1999. Focus on high-level comm among systems & simplicity of protocol.

-- OAI Protocol for Metadata Harvesting (OAI-PMH): system of interconnected components. OAI protocol can be thought of as glue that binds together components of a larger DL (or collaborative system of DLs)

** DLs modeled as networks of extended OAs, with each OA being a source of data and/or provider of services. This approach closely resembles the way physical libraries work.

-- research & production DLs differ

-- ODLs guided by a set of design principles & operationalized with aid of OAI-PMH extensions. proven techniques from internet development: simplicity of protocols, openness of standards, loose coupling of systems, purposeful orthogonality, reuse whenever possible.

-- Formal Principles:
1. All DL services should be encapsulated within components that are extensions of OAs
2. All access to DL services should be through their extended OAI interfaces
3. Semantics of OAI protocol should be extended or overloaded as allowed by OAI protocol, but without contradicting essential meaning.
4. All DL services should get access to other data sources using extended OAI protocol
5. DLs should be constructed as networks of extended OAs.

-- OAI harvester obtains a data stream, from which indices for searching are created.

-- Components in prototype systems:
1. Union: combine metadata from mult sources
2. Filter: reformat metadata from non-OAI data sources
3. Search: search-engine functionality
4. Browse: category-driven browsing functionality
5. Recent: sample of recently added items

-- Tested on NDLTD system w/ good feedback.

-- Designing by principles & implementing for real-world scenario.

ARMS ch. 2 - link broken?

I'm having trouble accessing the ARMS ch. 2 reading. The link brings up an error message on the site that the article cannot be found. Is anyone else having this problem?

(also posted to Bb discussion board)

Tuesday, September 2, 2008

SOPAC 2.0

found this article on a blog I read for my internship and thought it was interesting, relevant & exciting!

http://tametheweb.com/2008/09/02/on-sopac-change-and-mr-john-blyberg/

Monday, September 1, 2008

Paepcke et al: Dewey Meets Turing: Librarians, Computer Scientists, and the Digital Libraries Initiative

I appreciated that this reading was rather relaxed and approachable, especially after finishing the DELOS article which was quite technical. I have entered my written notes, hopefully this is not too informal.

-- 1994: NSF launches DLI
*3 interested parties: librarians, comp scientists, publishers
*spawned google & other developments
-- DLI has changed work & private activities for nearly everyone

-- For scientists: exciting new work informed by librarianship
*resolution of tension between novel research & valuable products
-- For librarians: increased funding opportunities, improved impact of services
*OPACs were all most lib's had. expertise was needed

-- Advent of web: changed plans & got in the way
*blurred distinction b/t consumers & producers
-- Problem for CS: bound by per-project agreements & copyright. Web removed restrictions.
* broadening of relevance
-- Web threatened pillar of lib'ship...left w/out connection to recognizable, traditional library fx's
* Both valued predictability & repeatability - web led to laissez faire attitude toward info retrieval (did not particularly upset public)

-- Librarians feel they haven't gotten adequate funding for coll. dev.
-- CS's couldn't understand vast amounts of time devoted to structures like metadata

-- Core fx of lib'ship remains despite new, technical infrastructure
*importance of collections re-emerging
-- Direct connection b/t lib's & scholarly authors
-- Broadened opportunities in lib sci as a result of DLI

Setting the Foundations of Digital Libraries: the DELOS Manifesto

This reading lacked the depth of explanation that the other readings have had, probably owing to its purpose as an overview to a larger work. Some of the terminology was a bit confusing to me, especially the distinction between a Digital Library System and a Digital Library Management System.

I don't have as many thoughts/reflections on this article, as it was fairly practical and informative without raising questions or topics that are up for debate. I have left below the notes that I cut/pasted from the reading, as well as a few of my own notes (designated with a "--"). I hope this is sufficient...

*****

"Generally accepted conceptions have shifted from a content-centric system that merely supports the organization and provision of access to particular collections of data and information, to a person-centric system that delivers innovative, evolving, and personalized services to users."

"expectations of the capabilities of Digital Libraries have evolved from handling mostly centrally located text to synthesizing distributed multimedia document collections, sensor data, mobile information, and pervasive computing services"

-- point out that definition and expectations of DL's change even as we try to define them.

"three types of relevant "systems" in this area: Digital Library, Digital Library System, and Digital Library Management System."

"Digital Library (DL)
A possibly virtual organization that comprehensively collects, manages, and preserves for the long term rich digital content, and offers to its user communities specialized functionality on that content, of measurable quality and according to codified policies.

"Digital Library System (DLS)
A software system that is based on a defined (possibly distributed) architecture and provides all functionality required by a particular Digital Library. Users interact with a Digital Library through the corresponding Digital Library System.

"Digital Library Management System (DLMS) A generic software system that provides the appropriate software infrastructure both (i) to produce and administer a Digital Library System incorporating the suite of functionality considered foundational for Digital Libraries and (ii) to integrate additional software offering more refined, specialized, or advanced functionality."

"Six core concepts provide a foundation for Digital Libraries. Five of them appear in the definition of Digital Library: Content, User, Functionality, Quality, and Policy; the sixth one emerges in the definition of Digital Library System: Architecture."

"We envisage actors interacting with Digital Library Systems playing four different and complementary roles: DL End-Users, DL Designers, DL System Administrators, and DL Application Developers. "

"Digital libraries need to obtain a corresponding Reference Model in order to consolidate the diversity of existing approaches into a cohesive and consistent whole, to offer a mechanism for enabling the comparison of different DLs, to provide a common basis for communication within the DL community, and to help focus further advancement"

"Reference Architecture is an architectural design pattern indicating an abstract solution to implementing the concepts and relationships identified in the Reference Model."

"Concrete Architecture - At this level, the Reference Architecture is actualised by replacing the mechanisms envisaged in the Reference Architecture with concrete standards and specifications."

LESK, ch. 1

Right from the start, I found this reading a bit more practical and informative than Borgman.

I found the extended definition on pp. 2-3 quite helpful, with the following four main points:
1. DL must have content
2. Content needs to be stored and retrieved.
3. Content must be made accessible.
4. Content must be delivered to user.
Followed by the introduction of the new costs and legal issues surrounding digital collections.

Section 1.2 seemed quite straightforward to me with some good points, especially that "For more than a decade, nearly every word printed and typed has been prepared on a computer. Paradoxically, until very recently most reading has been from paper." The focus on the interplay between technology, economics, and user-driven information usage felt like a good overview of the challenges facing digital libraries, and digital information sources generally.

I particularly enjoyed the in-depth (albeit lengthy) discussion in section 1.4 of changing prices and capacities for different info storage and retrieval technologies. The level of detail really drove home the challenges of storing and maintaining such vast quantities of information, even as our resources improve.

It feels like I have been hearing about Vannevar Bush in every single class I've had this week, so section 1.3 of the chapter was interesting to me mostly for the contrast it draws between Bush and Warren Weaver. Reading about the different emphases of their research reminded me of events in the history of Psychology, specifically the emphasis throughout much of the early twentieth century on behaviorist theories until the so-called "cognitive revolution" in the 60s. And in fact, the chapter later alluded to that same revolution on p.24 (section 1.5), with references to scientists such as Chomsky and Oettinger.

I have a BA in Linguistics (full disclosure, ha), so section 1.5 was highly interesting to me. However, I did feel that it understated the real challenges and shortcomings of attempts to capture the intricacies of natural human language with computers and other machines, perhaps necessarily so for reasons of length.

The background info on the history of the internet and the programs/interfaces that are commonly used today was quite informative and clear, in particular the discussion of Google's groundbreaking method of ranking search results to provide better information, and not just a lot of it. On a side note, I was talking to my mother on the phone last night and she told me that for the first time ever she successfully used Google to find information she was looking for (a substitution for self-rising flour in a recipe). That she managed this only now, after nearly six years of regular computer usage (yes, she was a late adopter), really drove the point home to me: there are very real challenges involved in making digital information an effective tool for retrieving high-quality information.

As stated in section 1.7, I felt that the two most important questions lingering over the development of DL's are:
1. What do people want to do with available technologies?
2. What content should be provided? And specifically, which content can be provided entirely digitally, and which types will never be as effective/adequate in a digital form?

Sunday, August 31, 2008

muddiest point - week one

As I have just joined the class through add/drop, I missed the class session on Monday. On that note, I suppose my "muddiest point" would just be what exactly is due when between now and the next class on Sept. 8. I know which readings are due for the first two sessions, just not exactly which notes/responses are due when.

Borgman, Christine: From Gutenberg to the Global Information Infrastructure

This reading seemed a bit repetitive and long-winded to me, although I suppose it was good to get a more in-depth perspective on the conflicts of interest in digital library research and implementation. The point that stood out to me as most important was that databases and other forms of stored digital information may comprise the bulk of a DL, but these collections will not be usable and worthwhile unless they are accessible to a community of users and the information they contain is of value to that community.

In particular I liked this sentence from p. 46: "Libraries collect content on the basis of the information needs of their user communities; the medium in which the content is captured is a secondary concern."

I did agree with the use of Borgman's working definition on p.48 as a way to encompass both sides of the research/practice divide, as well as the statement on p.50 that "digital libraries are themselves becoming 'enabling technologies' for other applications." However, I was a bit confused about the emphasis on a definition for a "global digital library." Is this concept meant in the same sense as H.G. Wells' goal of attaining a single encyclopedia that contains all the world's knowledge? If not, I guess I just don't understand the importance of defining this beyond the loose concept of individual digital libraries that are accessible via a single medium, in this case the internet.

One concern that I did have was the aging of the information in this article...as it was written in 2000, I am curious to see how much of the information is still valid and up-to-date in 2008. For example, p. 50 states that "Digital libraries research now falls under the human-centered systems program of CIC" and I would like to know if this is still the case or if it has changed in the past 8 years. The final paragraph on this same page laments the challenges of DL's as an interdisciplinary problem and notes that some researchers have yet to hear the term "digital libraries." Surely this is not still the case in 2008, or at least not to nearly the same degree?

welcome

Hello all, I have just add/dropped into this course and will post my notes ASAP. I look forward to meeting you all on the 8th!