Humanities Data: a Hands-on Approach 2016

Making the Most of Messy Data

Conveners: Megan Senseney and Andrea Thomer
Hashtags: #dhcuration and #DHOxSS
Computers: Students are not required to bring their own laptops for this workshop. Desktop computers will be provided by DHOxSS.

Abstract:

Humanists have data. Moreover, advances in the methodologies and approaches of digital humanities research have exposed the importance of maintaining research data and digital information in a manner that preserves its meaning and usefulness. Data curation is the active and ongoing management of data through its lifecycle of interest. Purposeful curation provides the foundation for a range of related activities from analyzing and visualizing research data to promoting access and reuse across a broader scholarly community.

This workshop will provide a hands-on introduction to a suite of useful tools, methods, and concepts for managing, organizing, cleaning, and processing data in digital humanities projects. Sessions will cover a range of topics, including information organization, data modelling, data quality and cleaning, and workflows. Participants will be introduced to humanities data from a selection of real-world digital humanities projects, and these datasets will serve as project case studies for use with each tool introduced throughout the week. At the end of the week, participants will present on their experiences working with the case studies.

The programme is aimed at humanities researchers—whether traditional faculty or alternative academic professionals—and may also be of interest to librarians, archivists, cultural heritage specialists, other information professionals, and advanced graduate students. Sessions will be led by experts from the iSchool at Illinois's Center for Informatics Research in Science and Scholarship and the HathiTrust Research Center as well as the University of Oxford’s Bodleian Libraries, Oxford e-Research Centre, and Oxford Internet Institute.

Timetable

Time	Monday	Tuesday	Wednesday	Thursday	Friday
11:00 - 12:30	Introductions (David De Roure); Introduction to Humanities Data (Allen Renear and Andrea Thomer)	Information Organization (Allen Renear)	Contextual Data Modeling (Neil Jefferies)	Provenance, Reproducibility, and Research Workflows (David De Roure)	Further Topics in Data Curation (David Weigl and Andrea Thomer)
Lunch	Venue: St Anne's College, Dining Room
14:00 - 16:00	Hands On with GitHub (Andrea Thomer)	Introduction to Data Quality (Andrea Thomer and Bertram Ludäscher); Hands on with OpenRefine (Andrea Thomer)	Hands on with SQLite (Bertram Ludäscher)	Hands on Provenance, Reproducibility (Bertram Ludäscher and David De Roure)	Further hands on with GitHub, OpenRefine, SQLite, and YesWorkflow (Andrea Thomer and Bertram Ludäscher); Participant presentations on their work with DH use cases (Andrea Thomer and Bertram Ludäscher)
16:30 - 17:30	Data and Project Management (Andrea Thomer)	Hands on with OpenRefine [Continued]	The Physical and Digital via the Meta: A Hands-On Linked Data in a Musicological Case Study (Kevin Page)	From Project to Preservation: Institutional Repositories (David Tomkins)	Closing Discussion (Andrea Thomer, Bertram Ludäscher, and Allen Renear)

Schedule Details

Monday

11:00 - 12:30

Introductions
David De Roure

Introductions should come first. We want to know about you, your projects, and your data.

Introduction to Humanities Data
Allen Renear and Andrea Thomer

In this session, we’ll review some of the unique characteristics and challenges in working with humanities data. We will also introduce the “messy” dataset we’ll be cleaning throughout the week, and review the workshop agenda.

14:00 - 16:00

Hands On with GitHub
Andrea Thomer

GitHub is a git-based web repository service for code, documentation, and data. We’ll explain what all those words mean and provide a brief overview of this useful service.

16:30 - 17:30

Data and Project Management
Andrea Thomer

Moving away from agency-required data management plans, this session discusses data management within the overarching context of digital project management. We will also introduce Zenhub, a free project management plugin for GitHub.

Tuesday

11:00 - 12:30

Information Organization
Allen Renear

An overview of basic strategies for information organization through structured tables, trees, and triples along with a discussion of different levels of information representation.

14:00 - 16:00

Introduction to Data Quality
Andrea Thomer and Bertram Ludäscher

This session introduces key concepts in data quality and cleaning, including stakeholder analysis, fitness for use, and provenance.

Hands on with OpenRefine
Andrea Thomer

OpenRefine is a “free, open source power tool for working with messy data and improving it.” We’ll demo this tool and prepare you to work with OpenRefine on your own in this session.

16:30 - 17:30

Hands on with OpenRefine [Continued]

Wednesday

11:00 - 12:30

Contextual Data Modeling
Neil Jefferies

Building upon concepts from Information Organization, this session approaches data modeling through deeper considerations of context, provenance, and evidence.

14:00 - 16:00

Hands on with SQLite
Bertram Ludäscher

SQLite is one of the most widely deployed database engines in the world. It’s lightweight, relatively easy to learn, and can be an important asset in your data curation arsenal. Participants will dive in with a hands-on introduction to database structures and data profiling.

16:30 - 17:30

The Physical and Digital via the Meta: A Hands-On Linked Data in a Musicological Case Study
Kevin Page

A data curation case study from the MetaMuSAK project that explores a musicological annotation effort through digital tools, data capture strategies, RDF representation models, and explorations of linked data outputs.

Thursday

11:00 - 12:30

Provenance, Reproducibility, and Research Workflows
David De Roure

This session will explore how and why scholars capture their personal research workflows to ensure documentation of provenance and support reproducibility and reuse.

14:00 - 16:00

Hands on Provenance, Reproducibility
Bertram Ludäscher and David De Roure

Concepts from the morning session will be put to use through a demonstration of the YesWorkflow initiative. Participants will also have an opportunity to explore, alter, and run annotated scripts from the workshop’s shared dataset.

16:30 - 17:30

From Project to Preservation: Institutional Repositories
David Tomkins

What happens to your data when your project is complete? This session provides an overview of archiving and data management from the perspective of institutional repositories.

Friday

11:00 - 12:30

Further Topics in Data Curation
David Weigl and Andrea Thomer

This session will be a set of lightning-round demos and discussions of special topics in data curation including data integration, data visualization, and non-computational workflows.

11:00-11:30 -- Data integration with Karma (David Weigl)
11:30-12:00 -- Data visualization using Bookworm (Andrea Thomer)
12:00-12:30 -- Capturing non-computational workflows (Andrea Thomer)

14:00 - 16:00

Further hands on with GitHub, OpenRefine, SQLite, and YesWorkflow
Andrea Thomer and Bertram Ludäscher available for consultation

Each group will have a final opportunity to revisit tools from the week and continue working on their use cases.

Participant presentations on their work with DH use cases
Andrea Thomer and Bertram Ludäscher available for consultation

Every working group had the same dataset, but each use case posed a different set of challenges. What choices did you make based on your use scenarios? What issues did you encounter and how might you resolve them?

16:30 - 17:30

Closing Discussion
Andrea Thomer, Bertram Ludäscher, and Allen Renear

What can you do to improve your own curatorial practices in the near term and in the long term? What are the key lessons you learned from your week as data curators?