Setting the Stage
metadata: the sum total of what one can say about any information object at any level of aggregation
- content (intrinsic) - what is contained/about
- context (extrinsic) - who what when etc about creation
- structure (either int or ext) - associations w/in or among independent info objects
library metadata: includes indexes, abstracts & catalog records created according to cataloging rules (MARC, LCSH etc)
archival & manuscript metadata: accession records, finding aids, catalog records.
- MARC Archival and Manuscript Control (AMC), now part of MARC format for bib control
not as much emph on structure for lib/arch, but always important even before digitization.
- growing as comp capabilities increase
- structure can be exploited for searching, etc but need specific metadata
metadata:
- certifies authenticity & degree completeness of content
- establishes & documents context of content
- identifies & exploits structural rel's that exist b/t & w/in info objects
- provides range of intell access points for increasingly diverse range of users
- provides some of info an info prof might have provided in a physical setting
repositories also create metadata for admin, accession, preserving, use of coll's
- personal info mgmt, recordkeeping
Dublin Core Metadata Element Set
Table 1 - Types
1) Administrative (acqu info, rights & repro, documentation of legal access req's, location info, selection criteria, version control, audit trails)
2) Descriptive (cat records, finding aids, spec indexes, hyperlink rel's, annotations by users, md for recordkeeping systems)
3) Preservation (documentation of phys condition, actions taken to preserve)
4) Technical (hard/software documentation, digitization info, track sys resp time, auth & sec data)
5) Use (exhibit records, use & user tracking, re-use & multi-version info)
Attributes of metadata:
- Source (Int or Ext)
- Method of creation (auto/comp or manual/human)
- Nature (lay/nonspecialist v. expert - but often orig is lay)
- Status (static, dynamic, long- or short-term)
- Structure (structured or no)
- Semantics (Controlled or no)
- Level (collection or item)
Life-Cycle of Info Object:
1) Creation & Multi-Versioning
2) Organization
3) Searching & Retrieval
4) Utilization
5) Preservation & Disposition
Little-Known facts about metadata:
- doesn't have to be digital
- more than just description
- variety of sources
- continue to accrue during life of info object/sys
- one info obj's metadata can simultaneously be another's data
Why Important?
- Increased accessibility
- retention of context
- expanding use
- multi-versioning
- legal issues
- preservation
- system improvement & economics
Border Crossings
DC: simple, modular, extensible metadata
prob's w/ user-created metadata
structured md important in managing intellectual assets
md for images critical for searchability/discoverability
controlled v. uncontrolled vocab (folksonomy)
- still not sure of future role
"Railroad Gage Dilemma" - why no common formal model?
demand for international/multicultural approach
Witten 2.2
19th c goals for bib sys: finding, collocation, choice
Today: 5 goals
1) locate entities
2) identify
3) select
4) acquire
5) navigate rel's between
principal entities: documents, works, editions, authors, subjects.
- titles & sub classifications are attributes of works, not own entities
doc's: fundamental hierarchical structure that can more faithfully be reflected in DL than physical
"helpful DLs need to present users an image of stability & continuity" (p.49)
works: disembodied contents of a document
edition - also "version, release, revision"
authority control
subjects: extraction: phrases analyzed for gram & lexical structure
- key phrase assignment (auto class), easier for scientific than literary
LCSH controlled vocab rel's: equivalence, hierarchical, associative
Witten 5.4 - 5.7
MARC & Dublin Core, BibTeX & Refer
MARC: AACR2R guidelines
- authority files & bib records
Dublin Core: specifically for non-specialist use
- title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights
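A minimal sketch of how the fifteen elements above could be held as a record. The function name `make_dc_record` and the sample values are hypothetical, chosen only for illustration; values are stored as lists on the assumption that elements are optional and repeatable.

```python
# The fifteen element names listed in the notes above.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

def make_dc_record(**fields):
    """Build a record, rejecting names outside the element set.

    Assumption: every element is optional and repeatable, so each
    value is normalized to a list.
    """
    record = {}
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        record[name] = list(value) if isinstance(value, (list, tuple)) else [value]
    return record

# Hypothetical example record (values invented for illustration).
record = make_dc_record(
    title="Notes on metadata readings",
    creator=["A. Student"],
    date="2008-09-19",
    language="en",
)
print(record)
```

Rejecting unknown names keeps the vocabulary closed, which is what makes records from different non-specialist creators comparable.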
BibTeX: math/sci
Refer: basis of EndNote
metadata even more imp for images & multimedia
- image files contain some info
TIFF: tags. integers & ASCII text. dozen or so mandatory.
- most DLs use for images
MPEG-7: multimedia content description interface
- still pics, graphics, 3D models, audio, speech, video, combinations
- stored or streamed
- complex/extensible: DDL (desc def lang) uses XML syntax
- temporal series, spectral vector
- some automatic, some by hand
text-mining: auto metadata extraction
- structured markup s/a XML
Greenstone includes lang ident, extracting acronyms & key phrases, generating phrase hierarchies.
some info easy to extract: url/email, money, date/time, names
- generic entity extraction
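The "easy to extract" items can be illustrated with regular expressions. This is a rough sketch, not how Greenstone or any other system does it: the patterns are deliberately loose assumptions, and the names `PATTERNS` and `extract_entities` are made up for the example.

```python
import re

# Deliberately loose patterns (assumptions); real extractors handle far more variation.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>\"]+"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO-style dates only
}

def extract_entities(text):
    """Return every match for each entity type as a dict of lists."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}

sample = ("Mail jo@example.org by 2008-09-19; "
          "see http://example.org/dl for the $1,250.00 grant.")
found = extract_entities(sample)
for kind, values in found.items():
    print(kind, values)
```

Names are the hard case in that list: they usually need a name list or a trained model rather than a pattern, so they are left out here.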
bib ref's / citation info can be located & parsed
n-grams procedure for lang ID, can assign language metadata
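A toy sketch of the n-gram idea: build a character-trigram profile per language, then pick the language whose profile overlaps most with the document. The two training sentences, the overlap score, and the function names are all assumptions for illustration; real profiles come from much larger samples and usually use a rank-based distance.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with a space at each edge."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Toy training samples (assumptions); real profiles need far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "es": "el zorro salta sobre el perro perezoso y el gato duerme en la casa",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify_language(text):
    """Return the language whose profile shares the most n-gram occurrences."""
    grams = char_ngrams(text)

    def overlap(profile):
        return sum(count for gram, count in grams.items() if gram in profile)

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

# Assign the result as language metadata on a (hypothetical) document record.
doc = {"text": "the dog and the cat"}
doc["language"] = identify_language(doc["text"])
print(doc)
```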
key-phrase: assignment/extraction
- assignment: compare doc to each entry in list (Y or N)
- extraction: phrases from doc listed & best chosen (easier)
IDing phrases: content word, phrase delimiters, maximal length
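The extraction route can be sketched using the three ideas just listed: candidates stay inside phrase delimiters, are capped at a maximal length, and must begin and end with a content word. Ranking by raw frequency, the stopword list, and all names here are assumptions for illustration; a real extractor would score candidates with a trained model.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "is", "to", "on"}
MAX_WORDS = 3  # maximal phrase length

def candidate_phrases(text):
    """Yield word sequences that stay inside phrase delimiters, are at most
    MAX_WORDS long, and begin and end with a content (non-stop) word."""
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        words = re.findall(r"[a-z]+", chunk)
        for start in range(len(words)):
            for end in range(start + 1, min(start + MAX_WORDS, len(words)) + 1):
                phrase = words[start:end]
                if phrase[0] not in STOPWORDS and phrase[-1] not in STOPWORDS:
                    yield " ".join(phrase)

def extract_key_phrases(text, top=3):
    """Rank candidates by frequency, breaking ties in favor of longer phrases."""
    counts = Counter(candidate_phrases(text))
    ranked = sorted(counts, key=lambda p: (-counts[p], -len(p.split()), p))
    return ranked[:top]

text = ("Digital libraries need metadata. Metadata extraction helps digital "
        "libraries, and digital libraries reuse extracted metadata.")
print(extract_key_phrases(text))
```

Assignment would differ only in the candidate source: instead of generating phrases from the document, each entry of a fixed list gets a yes/no decision.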
Friday, September 19, 2008