Wednesday, October 22, 2008

Muddiest Point - Week 8

Are there no readings this week?

Friday, October 17, 2008

Muddiest Point - Week 7

no real muddiest point, just still haven't been able to install Apache/Greenstone

Reading Notes - Week 8

Miller - Federated Searching: Put It in Its Place

The overwhelming success of Google offers powerful evidence as to which search model users prefer.

the universe of available content is no longer limited to that stored within the library walls. Moreover, the type of content required by users is often not cataloged by most libraries.

Providing books and other cataloged material is only one aspect of the modern library's charter.

Google has taught us, quite powerfully, that the user just wants a search box. Arguments as to whether or not this is "best" for the user are moot—it doesn't matter if it's best if nobody uses it.

Hane - The Truth About Federated Searching

Federated searching is a hot topic that seems to be gaining traction in libraries everywhere

It's very difficult to manage authentication for subscription databases, particularly for remote users

It's impossible to perform a relevancy ranking that's totally relevant.

You can't get better results with a federated search engine than you can with the native database search. The same content is being searched, and a federated engine does not enhance the native database's search interface. Federated searching cannot improve on the native databases' search capabilities. It can only use them.
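Just to make the mechanics concrete for myself, a rough sketch of the fan-out-and-merge idea behind federated searching. The backend objects and their search() methods are hypothetical placeholders, not any vendor's real API; the point is that the federated layer can only call the native interfaces and interleave what they return.

```python
# Rough sketch of federated search fan-out (hypothetical backends, not a real API).
from concurrent.futures import ThreadPoolExecutor

def federated_search(backends, query):
    """Send the same query to every native search interface and merge the results."""
    with ThreadPoolExecutor() as pool:
        # Each backend exposes only its own native search capabilities.
        result_sets = list(pool.map(lambda b: b.search(query), backends))
    # The federated layer can only interleave what the native engines return;
    # it cannot add ranking signals they do not expose.
    merged = []
    for results in result_sets:
        merged.extend(results)
    return merged
```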

Lossau - Search Engine Technology and Digital Libraries

Libraries see themselves as central information providers for their clientele, at universities or research institutions. But how do they define academic content?

Libraries still see themselves as a place of collections rather than as an information "gateway". Other concerns of libraries are grounded in the fact that there is no guarantee that a remote host will maintain its resources in the long-term.

A paper from Michael Bergman on the "Deep Web" [2] highlights the dimensions we have to consider. Bergman talks about one billion individual documents in the "visible" web [3] and nearly 550 billion documents on 200,000 web sites in the "deep" web.

Libraries are increasingly hesitant to support big, monolithic and centralised portal solutions equipped with an all-inclusive search interface which would only add another link to the local, customer-oriented information services.

particularly at universities, libraries deal with a range of users with often different usage behaviours. It almost goes without saying that an undergraduate has other demands for information than a qualified researcher, and their usage behaviours can vary substantially. Young undergraduates will try much harder to transfer their general information seeking behaviour (using internet search engines) to the specific, academic environment, while established researchers have better accommodated the use of specific search tools

Current digital library systems integrate predominantly online library catalogues and databases with some full text repositories (e.g. e-journals).

The continual exponential growth in the volume of online web content as described above makes it unrealistic to believe that one library can build one big, all-inclusive academic web index. Even to provide a substantial part, such as indexing the academic online content of one country, would mean a major challenge to one institution. Thus, collaboration is required among libraries

Lynch - Z39.50 Information Retrieval

Z39.50 is one of the few examples we have to date of a protocol that actually goes beyond codifying mechanism and moves into the area of standardizing shared semantic knowledge. The extent to which this should be a goal of the protocol has been an ongoing source of controversy and tension within the developer community

"Information Retrieval (Z39.50); Application Service Definition and Protocol Specification, ANSI/NISO Z39.50-1995" -- is a protocol which specifies data structures and interchange rules that allow a client machine (called an "origin" in the standard) to search databases on a server machine (called a "target" in the standard) and retrieve records that are identified as a result of such a search

Z39.50 has its roots in efforts dating back to the 1970s to allow standardized means of cross-database searching among a handful of (rather homogeneous) major bibliographic databases hosted by organizations such as the Library of Congress, the Online Computer Library Center (OCLC), and the Research Libraries Information Network

Z39.50 becomes linked to the semantics of the databases being searched in two primary areas: the attribute sets used to describe the access points being searched, and the record syntax (and related record composition control parameters in PRESENT) that are used to actually transfer records back from server to client.
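To keep the origin/target vocabulary straight, here is a toy model of the exchange (not a real Z39.50 client): the origin names an access point through a Bib-1 use attribute and asks for a record syntax, and the target runs the search and returns matching records. The catalogue structure and functions are invented for illustration.

```python
# Toy model of a Z39.50 search exchange (illustration only, not a real client).
from dataclasses import dataclass

@dataclass
class SearchRequest:              # built by the origin (client)
    database: str
    use_attribute: int            # Bib-1 access point, e.g. 4 = Title, 1003 = Author
    term: str
    record_syntax: str            # e.g. "USMARC" or "SUTRS"

@dataclass
class SearchResponse:             # returned by the target (server)
    hit_count: int
    records: list                 # records serialized in the requested syntax

def target_search(catalogue: dict, request: SearchRequest) -> SearchResponse:
    """'catalogue' is a made-up dict mapping a use attribute to indexed strings."""
    candidates = catalogue.get(request.use_attribute, [])
    hits = [rec for rec in candidates if request.term.lower() in rec.lower()]
    return SearchResponse(hit_count=len(hits), records=hits)

# Example: search the Title access point of a pretend target.
catalogue = {4: ["Digital Libraries", "Search Engine Technology"]}
response = target_search(catalogue, SearchRequest("books", 4, "digital", "SUTRS"))
```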

OAI-PMH

designed to enable greater interoperability between digital libraries. simpler than Z39.50

works with structured data, specifically XML

document-like objects

primary purpose is to define a standard way to move metadata from point A to point B within the virtual information space of the WWW

OAI formed 1999

OAI-PMH 2000

formal public opening 2001 "open day"

NOT inherently open access, nor traditional ARCHIVES
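A minimal sketch of what "moving metadata from point A to point B" looks like on the wire: a harvester issues a ListRecords request and walks the XML response. The repository URL is a placeholder; the verb, metadataPrefix, and namespace are the standard ones from the protocol.

```python
# Minimal OAI-PMH harvest sketch; https://example.org/oai is a placeholder endpoint.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_record_identifiers(metadata_prefix="oai_dc"):
    url = f"{BASE_URL}?verb=ListRecords&metadataPrefix={metadata_prefix}"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    # Each <record> carries a header (identifier, datestamp) plus a metadata block.
    for record in tree.findall(".//oai:record", OAI_NS):
        yield record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)

for identifier in list_record_identifiers():
    print(identifier)
```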

Friday, October 10, 2008

Reading Notes - Week 7

Henzinger et al.

users of web search engines tend to examine only the first page of search results. for commercially-oriented web sites whose income depends on their traffic, it is in their interest to be ranked within the top 10 results for a query relevant to the content of the web site.

to achieve high rankings, authors either use a text-based approach, a link-based approach, a cloaking approach, or a combination thereof.

traditional research in information retrieval has not had to deal with this problem of malicious content in the corpora.

the web is full of noisy, low-quality, unreliable, and indeed contradictory content. In designing a high-quality search engine, one has to start with the assumption that a typical document cannot be "trusted" in isolation; rather, it is the synthesis of a large number of low-quality documents that provides the best set of results.

layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data.

There are two ways to try to improve ranking. One is to concentrate on a small set of keywords and try to improve perceived relevance for that set of keywords. Another technique is to try to increase the number of keywords for which the document is perceived relevant by a search engine.

a common approach is for an author to put a link farm at the bottom of every page in a site, where a link farm is a collection of links that points to every other page in that site, or indeed to any site that the author controls.

doorway pages are web pages that consist entirely of links. they are not intended to be viewed by humans; rather, they are constructed in a way that makes it very likely that search engines will discover them.

cloaking involves serving entirely different content to a search engine crawler than to other users.

while there has been a great deal of research on determining the relevance of documents, the issue of document quality or accuracy has not received much attention.

another promising area of research is to combine established link-analysis quality judgments with text-based judgments.

three assumed web conventions:
1) anchor text is meant to be descriptive
2) assume that if a web page author includes a link to another page, it is because the author believes that readers of the source page will find the destination page interesting and relevant.
3) META tags: currently the primary way to include metadata within HTML. the content META tag is used to describe the content of the document (see the small parsing sketch after this list).
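The parsing sketch mentioned above: a crawler-side scan that pulls out anchor text and the description META tag using Python's standard html.parser. The sample HTML fragment is made up.

```python
# Small sketch: extract anchor text and the "description" META tag from a page.
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.anchors = []          # [href, anchor text] pairs
        self.meta_description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.in_anchor = True
            self.anchors.append([attrs["href"], ""])
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.meta_description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor and self.anchors:
            self.anchors[-1][1] += data   # accumulate the descriptive anchor text

scanner = PageScanner()
scanner.feed('<meta name="description" content="Course notes">'
             '<a href="/week8">Federated searching notes</a>')
print(scanner.meta_description, scanner.anchors)
```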

duplicate hosts are the single largest source of duplicate pages on the web, so solving the duplicate hosts problem can result in a significantly improved web crawler.

vaguely-structured data: information on these web pages is not structured in a database sense; typically it's much closer to prose than to data, but it does have some structure, often unintentional, exhibited through the use of HTML markup. it is not typically the intent of the webpage author to describe the document's semantics.

Hawking, pt. 1

search engines cannot and should not index every page on the web.

crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. when the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue.
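A stripped-down version of that loop, just to fix the idea; no politeness delays, parallelism, or real HTML parsing, and the regex link extraction is a crude stand-in.

```python
# Minimal sketch of the crawl loop: fetch, extract links, enqueue unseen URLs.
import re
import urllib.request
from collections import deque

def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                       # skip pages that fail to fetch
        # Crude link extraction for illustration; a real crawler parses the HTML.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```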

even hundredfold parallelism is not sufficient to achieve the necessary crawling rate.

crawlers check each site's robots.txt file to determine whether the webmaster has specified that some or all of the site should not be crawled.
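The Python standard library already wraps this check; a quick sketch with placeholder crawler and site names.

```python
# Sketch of the robots.txt check using the standard library parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")   # placeholder site
rp.read()                                      # fetch and parse the file

if rp.can_fetch("ExampleCrawler", "https://example.org/private/page.html"):
    print("allowed to crawl")
else:
    print("webmaster has excluded this path")
```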

search engine companies use manual and automated analysis of link patterns and content to identify spam sites that are then included in a blacklist.

crawlers are highly complex parallel systems, communicating with millions of different web servers, among which can be found every conceivable failure mode, all manner of deliberate and accidental crawler traps, and every variety of noncompliance with published standards.

Hawking, pt. 2

search engines use an inverted file to rapidly map indexing terms to the documents that contain a particular word or phrase.

in the first phase, scanning, the indexer scans the text of each input document.

in the second phase, inversion, the indexer sorts the temporary file into term number order, with the document number as the secondary sort key.
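A toy version of the two phases, scanning then inversion, on an absurdly small "crawl"; term and document numbering is simplified.

```python
# Toy two-phase index build: scan documents into postings, then invert by sorting.
from collections import defaultdict

def build_inverted_file(documents):
    term_ids = {}
    postings = []                                   # the "temporary file"
    # Phase 1: scanning - emit a (term number, document number) posting per word.
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            term_id = term_ids.setdefault(word, len(term_ids))
            postings.append((term_id, doc_id))
    # Phase 2: inversion - sort by term number, then document number, and group.
    postings.sort()
    inverted = defaultdict(list)
    for term_id, doc_id in postings:
        if not inverted[term_id] or inverted[term_id][-1] != doc_id:
            inverted[term_id].append(doc_id)
    return term_ids, dict(inverted)

terms, index = build_inverted_file(["the web is big", "the index is inverted"])
```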

the scale of the inversion problem for a web-sized crawl is enormous.

there is a strong economic incentive for search engines to use caching to reduce the cost of answering queries.
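A sketch of the idea with a simple in-memory LRU cache; evaluate_query here is a stand-in for the expensive work of consulting the inverted file.

```python
# Result caching sketch: popular queries are answered from memory.
from functools import lru_cache

def evaluate_query(query):
    # Placeholder for the expensive lookup against the inverted file.
    return sorted(set(query.lower().split()))

@lru_cache(maxsize=10_000)
def cached_results(query):
    return tuple(evaluate_query(query))

cached_results("digital libraries")   # computed
cached_results("digital libraries")   # served from the cache
```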

Lesk, ch. 4

Muddiest Point - Week 6

no muddiest point this week!

Friday, October 3, 2008

Reading Notes - Week 6

Hedstrom: Research Challenges in Digital Archiving and Long-term Preservation

Future research capabilities will be seriously compromised without significant investments in research and the development of digital archives.

Digital collections are vast, heterogeneous, and growing at a rate that outpaces our ability to manage and preserve them.

Human labor is the greatest cost factor in digital preservation.

need systems that are: self-sustaining, self-monitoring, self-repairing.

redundancy, replication, security against intentional attacks & technological failures, issues of forward migration: critical
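Roughly what a self-monitoring/self-repairing loop could look like at the file level, assuming checksums recorded at ingest and at least one replica; the paths and the manifest format are invented.

```python
# Sketch of a fixity audit: verify checksums and repair from a replica on mismatch.
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def audit_and_repair(archive_dir: Path, replica_dir: Path, manifest: dict):
    """manifest maps relative file names to the checksums recorded at ingest."""
    for name, expected in manifest.items():
        target = archive_dir / name
        if not target.exists() or sha256(target) != expected:
            # Self-repair: copy the file back from a replica that still verifies.
            replica = replica_dir / name
            if replica.exists() and sha256(replica) == expected:
                shutil.copy2(replica, target)
```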

Economic and policy research needs span a wide range of issues such as incentives for organizations to invest in digital archives and incentives for depositors to place content in repositories.

questions of intellectual property rights, privacy, and trust.

digital preservation will not scale without tools and technologies that automate many aspects of the preservation process and that support human decision-making.

models needed to support: selection, choice of preservation strategies, costs/benefits of various levels of description/metadata.

it is important to recognize that metadata, schemas, and ontologies are dynamic

managing schema evolution is a major research issue.

Research issues in the area of naming and authorization include development of methods for unique and persistent naming of archived digital objects, tools for certification and authentication of preserved digital objects, methods for version control, and interoperability among naming mechanisms

research is needed on the requirements for a shared and scalable infrastructure to support digital archiving

a metadata schema registry is also needed


Littman: Actualized Preservation Threats

Chronicling America: the three goals of the program are to support the digitization of historically significant newspapers, facilitate public access via a web site, and provide for the long-term preservation of these materials by constructing a digital repository.

made the explicit decision not to "trust" the repository until some later point; stored and backed up in a completely separate environment

four preservation threat categories: media failure, hardware failure, software failures, operator errors

a number of hard drive failures; in one case a second problem occurred while the storage system was rebuilding, resulting in the loss of a small amount of data from the system. fortunately, file system diagnostics were able to identify & restore the corrupted files

first software failure was the failure to successfully validate digital objects created by awardees; gaps remained in validation that allowed awardees to submit METS records that passed validation and were ingested into the repository, but did not conform to the appropriate NDNP profile.

transformation failure: transformation of the METS record has proven to be complex and error prone; the transformation that put the original METS record inline was stripping the XML markup.

the XFS file system was corrupted, resulting in the loss of some data

most significant threats to preservation occurred as a result of operator errors. deletion of a large number of files from a section of a file system; lack of auditing capabilities contributed to this problem.

mistakes performed during ingest

already implemented some significant architectural changes to address these problems.


Lavoie: Technology Watch Report

digital preservation – securing the long-term persistence of information in digital form

cultural heritage institutions, businesses, government agencies, etc. – with the need to take steps to secure the long-term viability of the digital materials in their custody. Many of these entities do not perceive an archival function within the scope of their organizational mission.

no perceived consensus on the needs and requirements for maintaining digital information over the long-term. A unifying framework that could fill this gap would be invaluable in terms of encouraging dialog and collaboration among participants in standards-building activities, as well as identifying areas most likely to benefit from standards development.

two primary functions for an archival repository: first, to preserve information – i.e., to secure its long-term persistence – and second, to provide access to the archived information

obtain sufficient intellectual property rights, along with custody of the items, to authorize the procedures necessary to meet preservation objectives. For example, if the OAIS must create a new version of the archived item so that it can be rendered by current technologies, it must have the explicit right to do so.

must not only preserve information, but also a sufficient portion of its associated context to ensure that the information is understandable, and ultimately, useable by future generations. "Contextual information" that might be preserved includes, but is not limited to, a description of the structure or format in which the information is stored, explanations of how and why the information was created, and even its appropriate interpretation.

first functional component is Ingest, the set of processes responsible for accepting information submitted by Producers and preparing it for inclusion in the archival store.

Archival Storage. This is the portion of the archival system that manages the long-term storage and maintenance of digital materials entrusted to the OAIS.

Data Management is the third functional component of an OAIS. The Data Management function maintains databases of descriptive metadata identifying and describing the archived information in support of the OAIS’s finding aids; it also manages the administrative data supporting the OAIS’s internal system operations, such as system performance data or access statistics

Preservation Planning. This service is responsible for mapping out the OAIS’s preservation strategy, as well as recommending appropriate revisions to this strategy in response to evolving conditions in the OAIS environment.

Access is the fifth functional component of an OAIS-type archive. As its name suggests, the Access function manages the processes and services by which Consumers – and especially the Designated Community – locate, request, and receive delivery of items residing in the OAIS’s archival store.

Administration. The Administration function is responsible for managing the day-to-day operations of the OAIS, as well as coordinating the activities of the other five high-level OAIS services

OAIS information model is built around the concept of an information package: a conceptualization of the structure of information as it moves into, through, and out of the archival system. An information package consists of the digital object that is the focus of preservation, along with metadata necessary to support its long-term preservation and access, bound into a single logical package

Submission Information Package, or SIP, is the version of the information package that is transferred from the Producer to the OAIS when information is ingested into the archive.

Archival Information Package, or AIP, is the version of the information package that is stored and preserved by the OAIS.

Dissemination Information Package, or DIP, is the version of the information package delivered to the Consumer in response to an access request.

Taken together, the Content Information and Preservation Description Information represent the archived digital content, the metadata necessary to render and understand it, and the metadata necessary to support its preservation.
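To keep the package pieces straight, a rough data-structure sketch of an information package; the field names are mine, not taken from the standard's data dictionary. The same logical structure shows up as a SIP at ingest, an AIP in archival storage, and a DIP on delivery.

```python
# Rough sketch of the OAIS information-package idea as a data structure.
from dataclasses import dataclass, field

@dataclass
class InformationPackage:
    content: bytes                                             # the digital object being preserved
    representation_info: dict = field(default_factory=dict)    # how to render and understand it
    preservation_description: dict = field(default_factory=dict)  # provenance, fixity, context

# Placeholder content; a SIP like this is transformed into an AIP at ingest
# and repackaged as a DIP when a Consumer requests access.
sip = InformationPackage(content=b"...", representation_info={"format": "TIFF 6.0"})
```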

Jones/Beagrie: Introduction & Digital Preservation

growing awareness of the significant challenges associated with ensuring continued access to these materials, even in the short term.

The need to create and have widespread access to digital materials has raced ahead of the level of general awareness and understanding of what it takes to manage them effectively.

institutions that have not played a role in preserving traditional collections do not have a strong sense of playing a role in preserving digital materials. Individual researchers were keen to "do the right thing" but frequently lacked the clear guidance and institutional backing to enable them to feel confident of what they should be doing

Digital preservation has many parallels with traditional preservation in matters of broad principle but differs markedly at the operational level and never more so than in the wide range of decision makers who play a crucial role at various stages in the lifecycle of a digital resource

While there is as yet only largely anecdotal evidence, it is certain that many potentially valuable digital materials have already been lost.

Machine Dependency. Digital materials all require specific hardware and software in order to access them

The speed of changes in technology means that the timeframe during which action must be taken is very much shorter than for paper

Fragility of the media. The media digital materials are stored on is inherently unstable and, without suitable storage conditions and management, can deteriorate very quickly

The ease with which changes can be made and the need to make some changes in order to manage the material means that there are challenges associated with ensuring the continued integrity, authenticity, and history of digital materials.

The implications of allocating priorities are much more severe than for paper.

The nature of the technology requires a life-cycle management approach to be taken to its maintenance

widely acknowledged that the most cost-effective means of ensuring continued access to important digital materials is to consider the preservation implications as early as possible, preferably at creation, and actively to plan for their management throughout their lifecycle.

All public institutions such as archives, libraries, and museums need to be involved in applying their professional skills and expertise to the long-term preservation of digital materials, just as they have taken a role in the preservation of traditional materials.

Preservation costs are expected to be greater in the digital environment than for traditional paper collections

need actively to manage inevitable changes in technology at regular intervals and over a (potentially) infinite timeframe.

lack of standardisation in both the resources themselves and the licensing agreements

as yet unresolved means of reliably and accurately rendering certain digital objects so that they do not lose essential information after technology changes

for some time to come digital preservation may be an additional cost on top of the costs for traditional collections unless cost savings can be realised

Because digital material is machine dependent, it is not possible to access the information unless there is appropriate hardware, and associated software which will make it intelligible.

While it is technically feasible to alter records in a paper environment, the relative ease with which this can be achieved in the digital environment, either deliberately or inadvertently, has given this issue more pressing urgency

Although computer storage is increasing in scale and its relative cost is decreasing constantly, the quantity of data and our ability to capture it with relative ease still matches or exceeds it in a number of areas.

approaches to digital preservation:
-Preserve the original software (and possibly hardware) that was used to create and access the information. This is the technology preservation strategy.
-Program future powerful computer systems to emulate older, obsolete computer platforms and operating systems as required. This is the technology emulation strategy.
-Ensure that the digital information is re-encoded in new formats before the old format becomes obsolete. This is the digital information migration strategy (a rough sketch of this one follows the list).
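The migration sketch mentioned above, using image masters as an example and assuming the Pillow library; the TIFF-to-PNG choice and the directory paths are only illustrative.

```python
# Illustrative migration pass: re-encode TIFF masters as PNG (assumes Pillow).
from pathlib import Path
from PIL import Image

def migrate_tiffs(source_dir: str, target_dir: str):
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for tiff in Path(source_dir).glob("*.tif"):
        # Re-encode the content in a current, widely supported format.
        # The original is kept alongside the migrated copy, not replaced.
        Image.open(tiff).save(out / (tiff.stem + ".png"))

migrate_tiffs("masters/tiff", "masters/png-migrated")
```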

The dramatic speed of technological change means that few organisations have been able even fully to articulate what their needs are in this area, much less employ or develop staff with appropriate skills

Roles are also changing within as well as between institutions. Assigning responsibility for preservation of digital materials acquired and/or created by an organisation will inevitably require involvement with personnel from different parts of the organisation working together

Some consideration also needs to be given in the selection to the level of redundancy needed to ensure digital preservation. A level of redundancy with multiple copies held in different repositories is inherent in traditional print materials and has contributed to their preservation over centuries

The IPR issues in digital materials are arguably more complex and significant than for traditional media and if not addressed can impede or even prevent preservation activities. Consideration may need to be given not only to content but to any associated software

Muddiest Point

No muddiest point this week, but I am still unable to install Apache on my Mac and thus cannot run Greenstone.