Indigenous Digital Archive: Opensource tools for creating effective access to & collaboration with mass-digitized archival documents
OCR can only take you so far with some of the most important documents shaping the 19th & 20th century; we'll build collaborative tools.
Over 10,000 Native American children from 140 tribes attended Carlisle Indian Industrial School during its nearly 40 years of operation. Only 158 ever graduated. The number of children who died in the school is yet unknown; at least some are represented in the cemetery, as they are in the 24 other government Indian boarding schools of this era. Children were often taken from their parents without consent. It was policy to make a break between them and their communities and their culture.
Being able to travel to a distant archive, with access only during business hours, takes luxury or luck, and such chances are rare if they are offered at all. Where you live shouldn't be such a bar to what you can find out about your own history!
A federal court recently ruled in favor of an Oklahoma tribe's claim of the federal government being delinquent on payments owed to it dating back to 1932. The tribe's attorney noted that among the resources the tribe spent on their claim was 6 years retrieving documents from the US National Archives. Six years! How much more information could be immediately useful to families and communities today if it were accessible? (Traditional dance, Ohkay Owingeh, New Mexico; CC BY Einar E. Kvaran)
How to build tools that are interoperable and not stuck within one system's silo? Work with an international community forging agreed-on and useful standards! This overview of the Indigenous Digital Archive project comes from the International Image Interoperability Framework (IIIF.io) Community talks at the National Gallery of Art in May.
Map from the tribal community around the former Mt Pleasant Indian Industrial School, the Saginaw Chippewa, who host an annual Reconciliation Day while still searching for many basic facts about what happened at the school, such as how many children died there. (American Indian Boarding Schools: A Supplementary Curriculum Guide, Ziibiwing Center of Anishinabe Culture & Lifeways, 2011. http://www.sagchip.org/ziibiwing/planyourvisit/educators/curriculumMain.htm)
Describe your project.
To build tools enabling efficient access to and collaboration with mass digitized archival documents and photos, the Indigenous Digital Archive project will create an opensource toolkit for the popular opensource online content system Omeka. Our toolkit for online digital engagement and collaboration builds on existing international standards: the application programming interfaces (APIs) for image interoperability (http://IIIF.io) and the Open Annotation format. Our software developer, Digirati, has laid important groundwork in this area in building interoperable viewers for the British Library and others (the Universal Viewer) and in computer-aided tagging (http://digirati.com/powertagging).
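As a rough illustration of what image interoperability means in practice, the sketch below builds a request URL following the IIIF Image API pattern of identifier, region, size, rotation, quality, and format. The server address and page identifier are hypothetical placeholders, not the project's actual endpoints, and the `max` size keyword assumes a server implementing Image API 2.1 or later.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The identifier is percent-encoded so that any slashes it contains
    are not mistaken for path separators.
    """
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and page identifier, for illustration only.
BASE = "https://example.org/iiif"
PAGE = "reel-001/page-0042"

# The whole page at the largest size the server offers.
print(iiif_image_url(BASE, PAGE))

# The top half of the page, scaled to 800 pixels wide.
print(iiif_image_url(BASE, PAGE, region="pct:0,0,100,50", size="800,"))
```

Because any compliant viewer can form the same URLs, a page image served this way is not tied to one system's interface.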
For our use case, our initial subject is 19th and 20th century public government documents that due to typescript quality or inclusion of handwriting or photos are highly resistant to computer recognition (OCR). The first focus is records of US government Indian boarding schools. We will draw on new work in user interface design, particularly the intriguing explorations in building user friendly “generous interfaces” full of effective visualizations and informative leads for users.
In addition to other crowdsourcing volunteers, a cohort of members of New Mexico’s 23 tribes will inform interface design and conduct sustained user testing over a year of collaborative work online and in person.
We have identified open public records related to Native land claims and the government boarding schools of the late 19th to early 20th century as both a priority for Native communities in our region and otherwise unavailable to them. We will acquire digital images of at least 140,000 pages (140 reels) of previously microfilmed records held by the US National Archives, ingest them into a hosted server instance of Omeka, and apply the new opensource toolkit layer. We will then make the records available through a new online environment characterized by rich interaction and collaboration tools, with an enhanced user experience made possible by using generous interface design principles to replace the standard search box.
We also have a commitment to work with the Digital Public Library of America to increase discoverability through http://dp.la and improve the user experience through DPLA working with our International Image Interoperability Framework (IIIF) endpoints, helping develop this capacity of DPLA.
How does this project advance the library field?
There is a need to be able to access large quantities of archival documents, such as government records, without waiting for the bottleneck of a staff member performing detailed tasks like indexing. Indeed, given scarce resources, such indexing will never happen for many important information-bearing records. Additionally, there is interest in repositories sharing with researchers the authority of describing the records, and drawing on that expertise.
The bulk of records created in the 1800s and into the 1900s, a time when many government institutions were taking shape and having powerful impacts, are highly resistant to automated Optical Character Recognition (OCR). Whether because of irregular typescript, handwriting, or other characteristics, they can't be effectively OCR'd and keyword searched.
Crowdsourcing has already become one approach to improving access to data, or creating machine-readable versions of analog data. Current interfaces are designed to collect a limited range of structured data according to a particular research design, and are becoming increasingly sophisticated at directing the workflow (e.g., a “line-at-a-time, queue-oriented, multi-track transcription workflow”). However, these are not tools for exploring collections beyond a specific kind of encounter. The narrow task orientation and gamification of some interfaces can mean that a user is presented with a single image of text that is deeply interesting to them, only to see it whisked away after they've completed a transcription task, with no clue of where it came from and no ability to see the whole item. In these applications, you're often creating data for someone else, rather than for yourself and your peers.
Tools are still needed to enable work with mass digitized documents. For mass digitized archival documents, access needs are not always met by transcription. This is not only because full transcription or OCR correction is usually much more time consuming than selecting a tag (a name, event, concept, or place) that would be meaningful for someone looking for the content, but also because what would be used as a keyword often does not actually appear in the text. (For example, a derivative, alternate, or misspelled form of a name is used, or what would receive a keyword tag of “boarding school deaths” appears in euphemistic language.)
The need to create online access to allow collections to reach a wider group of users means that repositories do continue to look to mass digitization as part of their strategies. This project would create a toolkit layer based on international interoperability standards that allows online collaboration to provide navigation points, keyword tagging, and annotation. IIIF and Open Annotation standards mean the results will be extendable to other and future systems, and best situated for long term digital preservation.
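To make the idea of a portable keyword tag concrete, here is a minimal sketch of how one tag on a page region might be recorded. It uses the vocabulary of the W3C Web Annotation Data Model, the standardized successor to Open Annotation; that choice, the function name, and the canvas address are assumptions made for illustration and do not describe the project's actual data.

```python
import json


def make_tag_annotation(canvas_uri, tag, xywh=None):
    """Return a tagging annotation as a JSON-serializable dict.

    canvas_uri identifies the page being tagged. xywh, if given, is an
    (x, y, width, height) region in pixels, appended to the target as a
    media fragment so the tag points at part of the page.
    """
    target = canvas_uri
    if xywh is not None:
        target = f"{canvas_uri}#xywh={','.join(str(v) for v in xywh)}"
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": tag, "purpose": "tagging"},
        "target": target,
    }


# Hypothetical canvas URI; the tag is a term that need not appear in the page text.
annotation = make_tag_annotation(
    "https://example.org/iiif/reel-001/canvas/p42",
    "boarding school deaths",
    xywh=(100, 200, 300, 50),
)
print(json.dumps(annotation, indent=2))
```

Since the tag lives in a standard record that points at the page rather than inside one application's database, another system can read it and attach it to the same image region.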
Who is the audience and what are their information needs?
The use case of the IDA project will address the absence of access to open public government records relating to the buildup to and operations of the US government boarding and day schools for Indians in the period of the Indian Wars up to the reforms of the “Indian New Deal” in the 1930s, and records related to tribal land claims in the same period. These records are not currently available within New Mexico, where they were created. The information is sought by Indigenous peoples of New Mexico and others. To take a small example, the New Mexico State Coordinator of Tribal Libraries often receives reference requests for information contained in the documents the project will make available. She notes that having even just a pilot project of documents of student names online (http://native-docs.org) fills a need no one has been able to respond to before: connecting people affected by these government policies, generations onward, with the information they’re seeking. Online access is essential as few can afford to take time off, travel, and support research during business hours at a repository. A recent court case shows the need for information from public documents of this era: a federal court ruled in favor of an Oklahoma tribe's claims for overdue payments from the federal government dating back to 1932, but their attorney noted they had to spend 6 years getting the documents from the US National Archives. New tools are needed to prevent this kind of bar to access to information.
The audience for our use case includes members of the 23 tribes of New Mexico plus Hopi (geographically separate but culturally and genealogically related), and other descendants of boarding school students separated from their home communities and sent to boarding schools in New Mexico.
Part of the design of the government Indian boarding school was to widely disperse and mix students to achieve fuller separation from their communities. This means, for example, that viewing the records from the boarding schools in New Mexico does not show all students from New Mexico, so a longer range goal is to have access in the Indigenous Digital Archive interface to the records of the government Indian boarding and day schools from across the US, as well as other government records in this period before government policy was changed to Indian Self Determination. The audience for our use case will include descendants of Indian boarding school attendees, staff, and other community members throughout the nation.
Creating effective access locally is particularly important at this time as this is a window of opportunity where tribal people in New Mexico have the benefit of understanding the records with the input of those who are elders today who were young children at the time of the creation of the later records, and others who still have first-hand stories from their parents or grandparents in the 1920s-1930s and even earlier.
Please list your team members and their qualifications.
George Oates (Good, Form & Spectacle), User Interface Design. George designed Flickr! George is a world leader in developing generous interface design. She's developing innovative user experiences in exploring data for the British Museum and the Wellcome Library, and has consulted for numerous cultural institutions including the Smithsonian and Historypin.
Tom Crane, Adam Christie, Edward Silverton, Software Engineering (Digirati). Digirati Tech Lead Tom Crane has worked on large projects for Microsoft, Sony, Oxford University Press, English Heritage, the Wellcome Library and many others, focusing on web publishing and content management. He shows how creative systems integration can be used to connect cultural heritage collection data, digitization output and content management systems, using linked data and semantic web technologies. He is an editor of the international IIIF specification. Adam and Edward, Senior Consultants, have developed apps and systems for clients including the British Library, Wellcome Trust, and Sun Microsystems. Digirati has debuted a new tool assisting management of computer-aided keyword tagging (http://digirati.com/powertagging).
Dr. Anna Naruta-Moya, Project Director. Formerly archivist for the US National Archives and for the Hoover Institution Archives of Stanford University, she has experience in the paper version of “big data.” A 2015 Getty Institute Summer Fellow, UC Berkeley PhD, member of the Society of American Archivists Archival Standards Committee, Academy of Certified Archivists, SAA Digital Archives Specialist, Research Associate Prof Univ. of New Mexico. She is married to the Tewa artist Daniel Moya (P’o Suwae Ge Owingeh), raised on the reservation by his grandmother, a 2nd generation early gov't Indian boarding school student (of "the Starving Years").
Caren Gala (Nambe), Communications Coordinator. Caren has played major roles in planning, organizing, and executing major events such as the Southwestern Association for Indian Arts Santa Fe Indian Market and the International Folk Art Market.
Dr. Robert Sanderson, Technical Advisory Panel member, is Information Standards Architect for Stanford University Digital Library Systems and Services, and an editor of the IIIF and Open Annotation international standards.
Glen Robson, Technical Advisory Panel member, is Head of Systems Unit for the National Library of Wales, an adopter of IIIF and the Open Annotation W3C formats. Adopting these standards has allowed marked advances in the usability of the library's collections, allowing, for example, better interaction with digitized newspapers (http://newspapers.library.wales) and the Cynefin: Mapping Wales' Sense of Place project (http://cynefin.archiveswales.org.uk), in which people volunteer to transcribe and geolocate entries in church tithing records to create a map and database that speaks to detailed land use and community histories.
Advisory Panel members and qualifications are detailed in the attachment “Advisory Panel”.
Organization name and location (City, State).
The Museum of Indian Arts and Culture (of the State of New Mexico), Santa Fe, New Mexico, leads the IDA project in collaboration with the Indian Pueblo Cultural Center (jointly operated by all 19 Pueblo tribes) and the State Library Tribal Libraries Program.