Friday, September 19, 2008

Reading Notes

Setting the Stage
metadata: the sum total of what one can say about any information object at any level of aggregation
- content (intrinsic) - what is contained/about
- context (extrinsic) - who what when etc about creation
- structure (either int or ext) - associations w/in or among independent info objects

library metadata: includes indexes, abstracts & catalog records created according to cataloging rules (MARC, LCSH etc)

archival & manuscript metadata: accession records, finding aids, catalog records.
- MARC Archival and Manuscript Control (AMC), now integrated into the MARC format for bib control

not as much emph on structure for lib/arch, but always important even before digitization.
- growing as comp capabilities increase
- structure can be exploited for searching, etc but need specific metadata

metadata:
- certifies authenticity & degree completeness of content
- establishes & documents context of content
- identifies & exploits structural rel's that exist b/t & w/in info objects
- provides range of intell access points for increasingly diverse range of users
- provides some of info an info prof might have provided in a physical setting

repositories also create metadata for admin, accession, preserving, use of coll's
- personal info mgmt, recordkeeping

Dublin Core Metadata Element Set

Table 1 - Types
1) Administrative (acqu info, rights & repro, documentation of legal access req's, location info, selection criteria, version control, audit trails)
2) Descriptive (cat records, finding aids, spec indexes, hyperlink rel's, annotations by users, md for recordkeeping systems)
3) Preservation (documentation of phys condition, actions taken to preserve)
4) Technical (hard/software documentation, digitization info, track sys resp time, auth & sec data)
5) Use (exhibit records, use & user tracking, re-use & multi-version info)

Attributes of metadata:
- Source (Int or Ext)
- Method of creation (auto/comp or manual/human)
- Nature (lay/nonspecialist v. expert - but often orig is lay)
- Status (static, dynamic, long- or short-term)
- Structure (structured or no)
- Semantics (Controlled or no)
- Level (collection or item)

Life-Cycle of Info Object:
1) Creation & Multi-Versioning
2) Organization
3) Searching & Retrieval
4) Utilization
5) Preservation & Disposition

Little-Known facts about metadata:
- doesn't have to be digital
- more than just description
- variety of sources
- continue to accrue during life of info object/sys
- one info obj's metadata can simultaneously be another's data

Why Important?
- Increased accessibility
- retention of context
- expanding use
- multi-versioning
- legal issues
- preservation
- system improvement & economics


Border Crossings

DC: simple, modular, extensible metadata

prob's w/ user-created metadata

structured md important in managing intellectual assets

md for images critical for searchability/discoverability

controlled v. uncontrolled vocab (folksonomy)
- still not sure of future role

"Railroad Gage Dilemma" - why no common formal model?

demand for international/multicultural approach


Witten 2.2

19th c goals for bib sys: finding, collocation, choice

Today: 5 goals
1) locate entities
2) identify
3) select
4) acquire
5) navigate rel's between

principal entities: documents, works, editions, authors, subjects.
- titles & subj classifications are attributes of works, not own entities

doc's: fundamental hierarchical structure that can more faithfully be reflected in DL than physical

"helpful DLs need to present users an image of stability & continuity" (p.49)

works: disembodied contents of a document

edition - also "version, release, revision"

authority control

subjects: extraction: phrases analyzed for gram & lexical structure
- key phrase assignment (auto class), easier for scientific than literary

LCSH controlled vocab rel's: equivalence, hierarchical, associative
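
The three relationship types can be pictured as a small data structure. This is only a sketch under my own assumptions: the `Heading` and `Vocabulary` classes are hypothetical names, and the headings are toy examples for illustration, not actual LCSH entries.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    label: str
    used_for: set = field(default_factory=set)   # equivalence: non-preferred terms
    broader: set = field(default_factory=set)    # hierarchical: broader headings
    narrower: set = field(default_factory=set)   # hierarchical: narrower headings
    related: set = field(default_factory=set)    # associative: related headings


class Vocabulary:
    """Hypothetical controlled vocabulary holding the three relationship types."""

    def __init__(self):
        self.headings = {}
        self.use = {}  # non-preferred term -> preferred heading

    def heading(self, label):
        return self.headings.setdefault(label, Heading(label))

    def add_equivalent(self, preferred, variant):
        self.heading(preferred).used_for.add(variant)
        self.use[variant] = preferred

    def add_broader(self, narrow, broad):
        self.heading(narrow).broader.add(broad)
        self.heading(broad).narrower.add(narrow)

    def add_related(self, a, b):
        self.heading(a).related.add(b)
        self.heading(b).related.add(a)

    def resolve(self, term):
        """Map a user's term to the preferred heading, if one is recorded."""
        return self.use.get(term, term)

    def expand(self, term):
        """Preferred heading plus its narrower headings (one level down)."""
        label = self.resolve(term)
        found = self.headings.get(label)
        return {label} if found is None else {label} | found.narrower


# Toy entries, made up for illustration.
vocab = Vocabulary()
vocab.add_equivalent("Automobiles", "Cars")
vocab.add_broader("Sports cars", "Automobiles")
vocab.add_related("Automobiles", "Roads")
```

The point of the equivalence link is that a search on "Cars" lands on the one preferred heading; the hierarchical link then lets the search widen or narrow from there.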


Witten 5.4 - 5.7

MARC & Dublin Core, BibTeX & Refer

MARC: AACR2R guidelines
- authority files & bib records

Dublin Core: specifically for non-specialist use
- title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights
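
To make the fifteen elements concrete, here is a rough sketch of a record serialized as XML. The `dc_record_to_xml` helper, the `metadata` wrapper element, and the sample record are my own assumptions for illustration; only the element names and the `dc` namespace come from Dublin Core itself.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element-set namespace

DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def dc_record_to_xml(record):
    """Serialize a dict of Dublin Core elements to an XML string.

    Every element is optional and repeatable, so a value may be a
    single string or a list of strings.
    """
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper name is an arbitrary choice
    for name in DC_ELEMENTS:
        values = record.get(name, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up record for illustration only.
example = {
    "title": "Reading Notes on Metadata",
    "creator": ["A. Student"],
    "date": "2008-09-19",
    "language": "en",
}
xml_text = dc_record_to_xml(example)
```

Nothing here needs cataloging expertise to fill in, which fits the non-specialist aim.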

BibTeX: math/sci

Refer: basis of EndNote

metadata even more imp for images & multimedia
- image files contain some info

TIFF: tags. integers & ASCII text. dozen or so mandatory.
- most DLs use for images
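
A rough sketch of how the tags sit in the file: an 8-byte header points to a directory of 12-byte entries, each holding a tag number, a field type, a count, and a value or offset. The `read_tiff_tags` and `build_sample` names are hypothetical, only a few field types are handled, and the sample bytes are a synthetic fragment with three tags, not a complete valid image.

```python
import struct

# A few well-known tag numbers; unknown tags are reported by number.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 270: "ImageDescription"}
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4}   # BYTE, ASCII, SHORT, LONG
TYPE_CODES = {1: "B", 3: "H", 4: "I"}   # struct codes for the integer types


def read_tiff_tags(data):
    """Read the first image file directory (IFD) of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number 42")
    (n_entries,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(n_entries):
        start = ifd_offset + 2 + 12 * i
        tag, ftype, count = struct.unpack(order + "HHI", data[start:start + 8])
        size = TYPE_SIZES.get(ftype)
        if size is None:
            continue  # field type not handled in this sketch
        nbytes = size * count
        if nbytes <= 4:
            # Small values are stored inline in the entry itself.
            raw = data[start + 8:start + 8 + nbytes]
        else:
            # Larger values live elsewhere; the entry holds their offset.
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + nbytes]
        if ftype == 2:
            value = raw.rstrip(b"\x00").decode("ascii")
        else:
            values = struct.unpack(order + TYPE_CODES[ftype] * count, raw)
            value = values[0] if count == 1 else values
        tags[TAG_NAMES.get(tag, tag)] = value
    return tags


def build_sample():
    """Build a synthetic little-endian fragment with three tags."""
    desc = b"scanned page\x00"
    text_offset = 8 + 2 + 3 * 12 + 4  # header + entry count + entries + next-IFD
    parts = [
        struct.pack("<2sHI", b"II", 42, 8),
        struct.pack("<H", 3),
        struct.pack("<HHIHH", 256, 3, 1, 640, 0),   # ImageWidth, SHORT
        struct.pack("<HHIHH", 257, 3, 1, 480, 0),   # ImageLength, SHORT
        struct.pack("<HHII", 270, 2, len(desc), text_offset),  # ASCII text
        struct.pack("<I", 0),                        # no further IFDs
        desc,
    ]
    return b"".join(parts)


sample_tags = read_tiff_tags(build_sample())
```

This shows both kinds of value from the notes: integers (width, length) and ASCII text (the description).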

MPEG-7: multimedia content description interface
- still pics, graphics, 3D models, audio, speech, video, combinations
- stored or streamed
- complex/extensible: DDL (desc def lang) uses XML syntax
- temporal series, spectral vector
- some automatic, some by hand

text-mining: auto metadata extraction
- structured markup s/a XML

Greenstone includes lang ident, extracting acronyms & key phrases, generating phrase hierarchies.

some info easy to extract: url/email, money, date/time, names
- generic entity extraction
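
The "easy" cases are easy because they follow surface patterns. A minimal sketch with regular expressions, assuming deliberately simplified patterns of my own (US-style dollar amounts, ISO dates, 24-hour times); names are left out because they usually need word lists or more context than a pattern can give.

```python
import re

# Simplified, illustrative patterns; real text needs broader ones.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}\b"),
}


def extract_entities(text):
    """Return every match for each pattern, keyed by entity type."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}


sample = (
    "Contact jane.doe@example.org or see http://example.org/notes "
    "by 2008-09-19 at 14:30; the fee is $1,250.00."
)
entities = extract_entities(sample)
```

Each hit could then be stored as a metadata value for the document it came from.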

bib ref's / citation info can be located & parsed

n-grams procedure for lang ID, can assign language metadata
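
A minimal sketch of the n-gram idea, assuming character trigrams and a rank-order ("out-of-place") comparison between a document's profile and each language's profile. The two training sentences are tiny stand-ins I made up; a usable profile would be built from far more text.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams in the text (0 = most frequent)."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a max penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples for illustration only.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "es": "el zorro salta sobre el perro perezoso y el gato se sienta "
          "en la alfombra con los otros animales",
}
profiles = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

guess_en = identify_language("the dog and the cat", profiles)
guess_es = identify_language("el perro y el gato", profiles)
```

The winning label can be written straight into a language metadata field for the document.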

key-phrase: assignment/extraction
- assignment: compare doc to each entry in list (Y or N)
- extraction: phrases from doc listed & best chosen (easier)

IDing phrases: content word, phrase delimiters, maximal length
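
The extraction side can be sketched as follows: break the text at phrase delimiters, list word sequences up to a maximal length, and keep only those that start and end with a content word. The stopword list is a small assumption of mine, and ranking by raw frequency is a stand-in for the "best chosen" step, which the notes do not spell out.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {
    "a", "an", "the", "of", "for", "in", "on", "and", "or",
    "to", "can", "be", "is", "are", "with", "by",
}


def candidate_phrases(text, max_len=3):
    """Count candidate phrases of up to max_len words."""
    candidates = Counter()
    # Punctuation acts as a phrase delimiter: no candidate spans it.
    for chunk in re.split(r"[.,;:!?()\[\]\"\n]+", text.lower()):
        words = re.findall(r"[a-z][a-z'-]*", chunk)
        for start in range(len(words)):
            for length in range(1, max_len + 1):
                phrase = words[start:start + length]
                if len(phrase) < length:
                    break  # ran off the end of the chunk
                # A candidate must begin and end with a content word.
                if phrase[0] in STOPWORDS or phrase[-1] in STOPWORDS:
                    continue
                candidates[" ".join(phrase)] += 1
    return candidates


def extract_key_phrases(text, k=5, max_len=3):
    """Return the k most frequent candidates, preferring longer phrases on ties."""
    counts = candidate_phrases(text, max_len)
    ranked = sorted(counts, key=lambda p: (-counts[p], -len(p.split()), p))
    return ranked[:k]


sample_text = (
    "Digital libraries need metadata. Metadata for digital libraries "
    "can be extracted automatically; digital libraries benefit."
)
top_phrases = extract_key_phrases(sample_text, k=3)
```

Assignment would differ in that the candidates come from a fixed list and each one gets a yes/no decision against the document.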
