From Text to Tech 2016

Corpus and Computational Linguistics for powerful text processing in the Humanities

  • Conveners: Gard Jenset, Barbara McGillivray and Gabor Toth
  • Hashtags: #text2tech and #DHOxSS
  • Computers: Students are not required to bring their own laptops for this workshop. Desktop computers will be provided by DHOxSS.

 

Abstract:

With large amounts of text becoming available through digitization efforts, there is a growing need for automatic analyses in the Digital Humanities to support distant reading. This workshop, originating from the HiCor research network, will impart some of the basics for working computationally and quantitatively with texts. It will take a hands-on approach to processing text, including cleaning and adding automatic linguistic annotation using freely available computational tools and the Python programming language, a very flexible tool with a wide range of applications in Humanities research. 

The workshop proceeds in a stepwise manner, with an introduction to corpus linguistics followed by basic programming in Python. The workshop will also teach how to explore texts quantitatively, for example by creating frequency lists and visualizations, and more advanced types of analysis, such as topic modelling. The practical sessions are accompanied by lectures that discuss research which demonstrates concretely how Python and corpus linguistics can be applied to answer questions in a range of humanistic disciplines. The workshop rounds off with a practical problem-solving session covering the topics of the week. 

No prior knowledge of programming is required, but attendees should be comfortable with identifying file paths on their own computer and installing software.

Timetable

11:00 - 12:30

  • Monday: Why should you learn Python? (Gard Jenset); Close versus distant reading and linguistic analysis in the Humanities (Gabor Toth)
  • Tuesday: Introduction to programming in Python (Gard Jenset)
  • Wednesday: Corpus methods and social identity in historical texts (Heather Froelich)
  • Thursday: Creativity is what we say it is: using corpus linguistics to identify key aspects of creativity (Anna Jordanous)
  • Friday: Corpora do what? On theory, method and data in Digital Humanities (Knut Melvær)

Lunch

Venue: St Anne's College, Dining Room

14:00 - 16:00

  • Monday: Introduction to Corpora (Barbara McGillivray); Corpus tools (Gabor Toth)
  • Tuesday: Basic natural language processing (NLP) with Python (Gard Jenset); Going further with NLP in Python (Barbara McGillivray)
  • Wednesday: Python and more NLTK (Gabor Toth)
  • Thursday: Extracting information from text (Barbara McGillivray); Topic Modelling (Gard Jenset)
  • Friday: Problem solving session

16:30 - 17:30

  • Monday: Corpus tools [Continued]
  • Tuesday: Going further with NLP in Python [Continued]
  • Wednesday: Python and more NLTK [Continued]
  • Thursday: Topic Modelling [Continued]
  • Friday: Problem solving session [Continued]

 

Schedule Details

Monday

11:00 - 12:30

Why should you learn Python?
Gard Jenset

This introductory session gives an overview of the workshop and discusses why programming is important in Digital Humanities.
 

Close versus distant reading and linguistic analysis in the Humanities
Gabor Toth

14:00 - 16:00

Introduction to Corpora
Barbara McGillivray

The session will give an introduction to the main concepts of corpus linguistics, including corpus creation and corpus processing for research in Digital Humanities.
 

Corpus tools
Gabor Toth

This session will introduce participants to selected tools for querying corpora, such as NoSketch Engine and Corpus Bench.

16:30 - 17:30

Corpus tools [Continued]
 

Tuesday

11:00 - 12:30

Introduction to programming in Python
Gard Jenset

The session provides a basic introduction to programming for digital humanities using the Python language. Among the topics covered are assignments and variables, data types, conditional statements, and reading/writing data.
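As a rough illustration of the topics listed for this session, the following minimal sketch touches on assignment, data types, a conditional statement, and reading and writing a file. It is not taken from the course materials: the function name `describe`, the file name, and the sample sentence are all invented for the example.

```python
import os
import tempfile

# Assignment and basic data types
title = "From Text to Tech"     # a string
year = 2016                     # an integer
days = ["Monday", "Tuesday"]    # a list of strings


def describe(word):
    """Conditional statement: label a word by its length (hypothetical helper)."""
    if len(word) > 6:
        return "long"
    elif len(word) > 3:
        return "medium"
    else:
        return "short"


# Writing data: the file name is an assumption made for this example
path = os.path.join(tempfile.gettempdir(), "text2tech_example.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("corpus linguistics for the humanities\n")

# Reading the data back and splitting it into words
with open(path, encoding="utf-8") as handle:
    words = handle.read().split()

# Apply the conditional to every word and collect the results in a dictionary
labels = {word: describe(word) for word in words}
print(title, year, days)
print(labels)

# Remove the temporary file again
os.remove(path)
```

The `with` statement closes the file automatically, which is why the sketch never calls `close()` explicitly.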

14:00 - 16:00

Basic natural language processing (NLP) with Python
Gard Jenset

The session gives an introduction to working with linguistic data in Python. Topics include simple regular expressions and other methods for handling text data.
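To give a flavour of what "simple regular expressions" can mean in practice, here is a small sketch using Python's standard `re` module. The sample sentence and the specific patterns are assumptions chosen for illustration, not the examples used in the session.

```python
import re

# Sample sentence invented for this example
text = "In 1599 Shakespeare's company built the Globe; in 1613 it burned down."

# Find all four-digit numbers (here: years)
years = re.findall(r"\b\d{4}\b", text)

# A simple tokeniser: lowercase letter sequences, keeping internal apostrophes
tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Collapse runs of whitespace, a common cleaning step for digitised text
clean = re.sub(r"\s+", " ", "  messy   text \n here ").strip()

print(years)
print(tokens)
print(clean)
```

A pattern-based tokeniser like this is deliberately crude: it drops digits and punctuation, which may or may not suit a given research question.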
 

Going further with NLP in Python
Barbara McGillivray

This session introduces the NLTK library and shows how it can be used for tasks such as stemming and part-of-speech tagging with Python.

16:30 - 17:30

Going further with NLP in Python [Continued]
 

Wednesday

11:00 - 12:30

Corpus methods and social identity in historical texts
Heather Froelich

This session will explore how researchers can use evidence from the Historical Thesaurus of the OED in combination with corpus methods to investigate lexical features of social identity, with the language of Shakespeare and his contemporaries as a case study.

14:00 - 16:00

Python and more NLTK
Gabor Toth

Corpus linguistics with Python: The session provides an introduction to doing corpus linguistics in Python and NLTK. Topics include collocations, frequency lists, and key words.
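The idea behind frequency lists and collocations can be sketched with the standard library alone. The example below uses `collections.Counter` as a stand-in for NLTK (which provides its own tools, such as `FreqDist`), and raw bigram counts as a very rough proxy for collocation measures; the toy text is invented for illustration.

```python
import re
from collections import Counter

# Toy text, invented for this example
text = "the cat sat on the mat and the dog sat on the log"

# Tokenise into lowercase words
tokens = re.findall(r"[a-z]+", text.lower())

# Frequency list: how often does each word occur?
freq = Counter(tokens)

# Adjacent word pairs (bigrams); frequent pairs are collocation candidates
bigrams = Counter(zip(tokens, tokens[1:]))

for word, count in freq.most_common(3):
    print(word, count)

for pair, count in bigrams.most_common(2):
    print(" ".join(pair), count)
```

On real corpora, raw counts are dominated by function words such as "the", which is one reason association measures and stop-word filtering are commonly used instead.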

16:30 - 17:30

Python and more NLTK [Continued]
 

Thursday

11:00 - 12:30

Creativity is what we say it is: using corpus linguistics to identify key aspects of creativity
Anna Jordanous

As a concept, creativity is complex and multi-dimensional, encompassing many related aspects, abilities, properties and behaviours. Using techniques from the field of statistical natural language processing, we have identified a collection of fourteen key components of creativity. Words were identified which appeared significantly often in connection with discussions of the concept, and a measure of lexical similarity was used to cluster these words. A number of distinct themes emerged, which collectively contribute to our understanding of how creativity is composed.

14:00 - 16:00

Extracting information from text
Barbara McGillivray

The session gives an introduction to how Python and the NLTK library can be used to extract structured information such as named entities from unstructured text.
 

Topic Modelling
Gard Jenset

This session gives a non-technical introduction to topic modelling along with examples of Python code.

16:30 - 17:30

Topic Modelling [Continued]
 

Friday

11:00 - 12:30

Corpora do what? On theory, method and data in Digital Humanities
Knut Melvær

Having stumbled my way into the Digital Humanities, I have had to overcome an array of challenges when it comes to messy data, undocumented and buggy software, the rapid advancements in the tech world and the scarcity of theorizing about what digital methods such as “distant reading” really tell us. In this session I will invite you to explore some of these issues and discuss how we can make DH more approachable with regard to theory and method.

14:00 - 16:00

Problem solving session

The session will provide an opportunity to apply the skills taught during the week, with instructors present to provide guidance.

16:30 - 17:30

Problem solving session [Continued]