THE INCREDIBLE DIGITAL COMIC LIBRARY FRAMEWORK

Introduction

Comics offer an interesting challenge to libraries. Traditional integration has focused on enabling public access to existing material and its bibliographic data, while semantic web technologies and standards, alongside machine learning approaches, have been used to explore how the medium can fit into the evolving concept, and reality, of a digital library.

In this project, we sketch out a potential framework for what a semantic digital library dedicated to this popular literary medium might look like.

Our approach is split into five parts:

Definitions and state of the art in comics integration for digital libraries
Review of three existing resources
Outline of potential users
Sketch of a Framework for Semantic Digital Libraries for Comics
Suggestions for implementation of framework on reviewed resources

A note on Definition

“Defining comics entails cutting a Gordian-knotted enigma wrapped in a mystery” - R. C. Harvey, 2001

This project deals with comics as a medium, a way of communicating information that involves the combination of text and drawn images, and as a format, the ways in which this communication is physically or digitally delivered to a user. Complicating things further are the connotations associated with the word ‘comics’ itself, the most important being its direct evocation of the US and British take on the medium, which was popularized in the early 20th century and has become synonymous with the word.

We’ll use the word comics to broadly refer to the medium, regardless of its format, and when necessary we will refer to specific instances of formats. The three of us have an interest in, and experience of, comics beyond their US/British connotations (Italian fumetti, French bandes dessinées, and Japanese manga) as well as of both physical and digital formats. To be a comics fan today means having an intrinsic understanding, if not outright knowledge, of the medium’s historical and cultural versatility and the fluidity of its format.

Background & State of the Art

Following their beginnings in the first half of the 19th century as social satire, comics were long thought of as a non-serious literary format due in part to their intrinsic combination of texts and images. Their continuous popularity throughout the 20th century in the United States, Europe, and Japan would ultimately change this perception.

Today, many fields within the humanities - such as literary studies or library and information science - treat comics as a serious subject of study while GLAM institutions - including the Library of Congress and Bibliothèque Nationale de France - consider them a valid acquisition for preservation and an investment for public engagement.

Below is an overview of existing research and projects dealing with comics and semantic digital technologies.

Corpora

Graphic Narrative Corpus

Produced by Hybrid Narrativity, a DH research project focused on graphic novels at the universities of Paderborn and Potsdam in Germany. The GNC was conceived as a monitor corpus, with regular updates, and currently contains around 240 titles, starting from the mid-1970s, of what the group defines as graphic narratives including “fictional and non-fictional texts, such as graphic novels and memoirs, graphic journalism, and [...] graphic fantasy, including comic books.” The project’s website allows users to browse the corpus by author and title, and showcases some statistical information via a series of basic visualisations.

Visual Language Research Corpus

Produced by Visual Language Lab, a research group devoted to linguistic and cognitive research on visual language and multimodality primarily based at Tilburg University. Where the GNC focuses on the text of comics, the VLRC is instead a corpus of annotated comics focused on enabling analysis of visual languages worldwide. It features over 36,000 panels from European, Asian, and American comics from 1940 to the present and its annotations include coding of panel frames, semantic relationships between panels, layout, and multimodality. The corpus is still in the process of being released to the public, expected sometime in 2022, and the group’s research continues with the TINTIN project which seeks to analyse “visual narratives as a window into language and cognition.”

Datasets

DCM772, eBDtheque, COMICS, ICDAR2019FGC, Manga109

Datasets of annotated comic pages appear to be few and far between, and are all the direct product of academic research, in particular research into machine learning and the automation of tasks such as character recognition, object metadata, image indexing, and multimodality. While many of these datasets are described as being public, in truth none of them are easily accessible to the public.

Both DCM772 and eBDtheque were produced by academics at the University of La Rochelle in France, and require a specific access request. The former includes 772 images taken from the Digital Comic Museum, an archive of public domain golden age comics which is assessed in this project, and annotations relating to bounding, character, and text recognition, while the latter includes 100 pages from European, Japanese, and American comics with manual annotations and metadata for text lines, balloons, and panels.

ICDAR2019FGC is a dataset extracted from DCM772, the result of an academic competition, and further focuses on fine-grained classification of characters.

COMICS is the largest of the datasets, with over 1.2 million panels also extracted from the DCM archive, and annotated to enable narrative prediction based on a combination of text, images, and their bounding. It was created by academics at the University of Maryland, College Park and University of Colorado, Boulder.

Lastly, Manga109 is the only dataset we could find devoted exclusively to the Japanese version of the medium, produced by the Aizawa Yamasaki Matsui Laboratory, at the University of Tokyo. It includes 109 annotated manga volumes released between the 1970s and 2010s and intended for academic research or non-profit use (a subset is available for commercial use). Annotations include character faces and bodies, text, and frame. The project also includes a simple Python API to read annotation data while a dataset of onomatopoeias is currently under construction.
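
To illustrate the kind of annotation data described above, the sketch below reads frame, text, face, and body boxes from a small XML fragment using only the Python standard library. The fragment is invented for illustration and only loosely modelled on the general shape of such annotation files: the element names, attribute names, title, and values are all assumptions, and this is not the project's own Python API.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation fragment: names and values are assumptions,
# not a copy of any real dataset file.
SAMPLE = """\
<book title="ExampleManga">
  <pages>
    <page index="0" width="1654" height="1170">
      <frame id="f1" xmin="10" ymin="20" xmax="400" ymax="600"/>
      <text id="t1" xmin="50" ymin="60" xmax="120" ymax="200">Hello!</text>
    </page>
    <page index="1" width="1654" height="1170">
      <frame id="f2" xmin="0" ymin="0" xmax="800" ymax="500"/>
      <face id="c1" xmin="100" ymin="100" xmax="180" ymax="190"/>
      <body id="b1" xmin="90" ymin="90" xmax="300" ymax="480"/>
    </page>
  </pages>
</book>
"""


def read_annotations(xml_text):
    """Return a list of (page_index, kind, box, text) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        index = int(page.get("index"))
        for item in page:
            # Each annotation carries a bounding box as four integer attributes.
            box = tuple(int(item.get(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
            rows.append((index, item.tag, box, (item.text or "").strip()))
    return rows


annotations = read_annotations(SAMPLE)
# Count how many annotations of each kind (frame, text, face, body) were found.
counts = Counter(kind for _, kind, _, _ in annotations)
```

A flat list of tuples like this is easy to load into a table for counting panels per page or measuring how much of a page is covered by text.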

GCD data dump

The Grand Comics Database is one of the three resources we assess in this project. The GCD is focused on cataloguing comics bibliographic data, with images of covers the only non-textual asset within the database. As part of their cataloguing and preservation efforts, the GCD makes their database available as either a MySQL or name-value dump that uses a table-driven schema.
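
As a rough idea of what working with a relational dump of bibliographic data can look like, the sketch below builds a tiny in-memory database and counts issues per series. SQLite stands in for MySQL so the example stays self-contained, and the table names, column names, and rows are simplified assumptions made up for illustration, not the GCD's actual schema or data.

```python
import sqlite3

# In-memory stand-in for a loaded dump; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE series (
    id INTEGER PRIMARY KEY,
    name TEXT,
    year_began INTEGER,
    publisher_id INTEGER REFERENCES publisher(id)
);
CREATE TABLE issue (
    id INTEGER PRIMARY KEY,
    series_id INTEGER REFERENCES series(id),
    number TEXT
);
INSERT INTO publisher VALUES (1, 'Example Press');
INSERT INTO series VALUES (1, 'Example Adventures', 1941, 1);
INSERT INTO series VALUES (2, 'Sample Stories', 1952, 1);
INSERT INTO issue VALUES (1, 1, '1');
INSERT INTO issue VALUES (2, 1, '2');
INSERT INTO issue VALUES (3, 2, '1');
""")


def issues_per_series(connection):
    """Return (series, year began, publisher, issue count), oldest series first."""
    query = """
        SELECT s.name, s.year_began, p.name, COUNT(i.id)
        FROM series AS s
        JOIN publisher AS p ON p.id = s.publisher_id
        LEFT JOIN issue AS i ON i.series_id = s.id
        GROUP BY s.id
        ORDER BY s.year_began
    """
    return connection.execute(query).fetchall()


rows = issues_per_series(conn)
```

The same join pattern is what a researcher would need to write by hand against a dump, which is part of why a dump is less approachable than a dedicated service.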

Comics as Data: North America

The CaDNA is an ongoing project started in 2018 that “examines library catalog data to explore geographies of publishing and library collecting policies in North American comics.” The project draws from Michigan State University Library (MSUL) Comic Art Collection and has been driven by the university’s Graphic Possibilities Research Workshop, which has used edit-a-thons as a means to explore and expand the original MARC data. This data has been released publicly via GitHub, and has been used to create visualisations.

Ontologies

Comic Book Ontology (CBO)

Created by Sean Petiya as part of his 2014 thesis at Kent State University, CBO is the first publicly available RDFS/OWL vocabulary dedicated to describing comics both in narrative and physical form. Petiya hopes that its conceptualization of the medium would allow it to be used to describe a variety of comics formats while also allowing “for further describing other aspects of comics culture and scholarship, or connecting, community created data to Semantic Web applications, such as next-generation library catalogs.”

Schema.org's ComicSeries type

In 2014, Schema.org added the ComicSeries type and related properties, placed under its Periodical type, which allows for the integration of comics-related information into the wider Linked Open Data world. Marvel, one of the largest and best known comics publishers in the world, was one of the groups that pushed for the addition of the ComicSeries type, and while they do not accurately detail the origins of their own schema (used in the API), it appears to use ComicSeries in some way.
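
A minimal sketch of how such a description could be produced is given below: it builds a JSON-LD document for a single issue that belongs to a ComicSeries. ComicIssue is one of the related Schema.org types, and issueNumber, datePublished, isPartOf, name, and publisher are existing Schema.org properties; the helper function name and the series, publisher, and date values are hypothetical.

```python
import json


def comic_issue_jsonld(series_name, publisher, issue_number, date_published):
    """Describe one issue and the series it belongs to as Schema.org JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "ComicIssue",
        "issueNumber": issue_number,
        "datePublished": date_published,
        "isPartOf": {
            "@type": "ComicSeries",
            "name": series_name,
            "publisher": {"@type": "Organization", "name": publisher},
        },
    }


# Invented example values, for illustration only.
doc = comic_issue_jsonld("Example Adventures", "Example Press", "1", "1941-03")
serialized = json.dumps(doc, indent=2)
```

The serialized string can be embedded in a web page inside a script element of type application/ld+json, which is one common way of exposing Schema.org data to crawlers.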

Metadata Framework for Manga

Proposed in 2009 by academics at the University of Tsukuba, the Metadata Framework for Manga defines three key aspects of Japanese comics - bibliographic, structural, and intellectual entities - with digital as the primary format. The framework is built on top of the FRBR and TV-Anytime models, extending these to allow various descriptions of manga that are key to their actual usage such as translations and narrative structure.

A Comics Ontology

Proposed by Paul Rissen in the early 2010s, this ontology is another early attempt at creating a way to “describe the form of comics, and their narrative content, rather than the bibliographical information.” It is cited by the eBDtheque creators as an inspiration for their own custom vocabulary used in their project.

Markup

Comic Book Markup Language, Advanced Comic Book Format, CoMet, ComicsML

XML schemas are an obvious tool to help describe, model, and integrate comics - especially digital ones - into the open web, libraries, and archives. The Comic Book Markup Language, Advanced Comic Book Format, CoMet and ComicsML are the most notable, and commonly referenced, efforts. They all began in the early 2010s and take various approaches to extending XML for the purposes of representing document structure, visual sequences, textual features, descriptive information, and metadata.

The difference in approaches also includes how XML is extended: CBML is based on TEI; ComicsML is available as a Document Type Definition; ACBF and CoMet are both defined as XML Schema Definitions, with the former extensively detailed while the latter is narrower and makes use of the Dublin Core Metadata Element Set.

Apps/API

Comics Factory, Marvel API, Comic Vine API, Metron

Digital applications and interfaces offer an ideal way to make existing comics-related data and tools available to the public to both educate and entertain.

Yet in our research we’ve only found one example of a non-commercial, public-funded app that combines cultural heritage and digital humanities: BnF’s BDnF: The Comics Factory, which lets users create digital comics using the library’s collections.

Alongside this there are a small number of APIs that deal with comics data including the aforementioned Marvel API, which grants access to information about the company’s publications; the Comic Vine API, giving access to data from Gamespot’s comic book wiki for non-commercial use; and Metron, a community-driven effort intended as a counterpoint to Comic Vine’s corporate standing which is building an open database with REST API for comics and makes use of bibliographic data from the GCD.

Reviews

One of the starting points for our project was to look at, and assess, three existing digital libraries dedicated to comics. From this assessment we developed the idea of sketching out a framework that could be used to enhance existing comics-focused digital libraries or create new ones.

The three resources we selected were:

Digital Comic Museum

The Digital Comic Museum (DCM) is a digital library of comic books in the public domain, established in 2010. The library was originally started in 2006, under the name GoldenAgeComics.co.uk. Dealing strictly with public domain comics from the 1930s to late 1950s, it directly held the books for download, rather than pointing to torrents, and proved very popular. Since 2010, the library has experienced a gradual decline and, after a server move, the underlying GAC software and database became increasingly unstable. In addition, the DCM has had to deal with spammers and several attacks.

Downloadable public domain golden age comics.
Scans uploaded by users.
Scans viewable via image carousel, downloads reserved for members.
Run by volunteers.
Financed by donors: donations pay server bills, bandwidth and future expansion and give user accounts VIP status (unlimited access to the website).

Metadata: submitted by uploaders or taken from the Grand Comics Database. Inconsistent and can’t be searched directly.
Technologies: powered by Simple Machines Forum (SMF), an open-source, PHP forum/message board solution. Uses MySQL for database management.

Grand Comics Database

The Grand Comics Database™ (GCD) is a nonprofit, internet-based organization of international volunteers dedicated to building a database covering all printed comics worldwide. It aspires to be the world's most comprehensive online comics database for comic readers, collectors, scholars, and professionals. The GCD catalogues information on creators, story details, reprints, and other potentially useful information.

Its history goes back to 1978 when the GCD's immediate predecessor, APA-I (Amateur Press Alliance for Indexing), was formed by a few fans interested in exchanging information on comic books in index form. In late 1993 and early 1994, three members of APA-I interested in comic books started up an e-mail correspondence. Tim Stroup, Bob Klein, and Jonathan E. Ingersoll soon began sharing indexing information in a common format using electronic media for storage and distribution. By March 1994, they had formed a new group to create an electronic version of APA-I related to comic books, giving it the name Grand Comic-Book Database and the goal to 'contain information on every comic book ever published.'

Defines comic book as 50% or more art and/or pictures which tell a story. This definition excludes non-physical formats, such as web or digital comics, but it includes small print run fanzines, promotional giveaway comics, and mini-comics. Although syndicated comic strips are not indexed, listings include mentions of comic books reprinting newspaper strips.
International: GCD has comic books from many countries representing over forty languages.
Run by a team of volunteers, who answer to a chart of guidelines.
Membership is regulated by the Board and gives users responsibilities and advantages inside the community.

User reviews and feedback.
Tutorials for contributors.
Metadata: uses a custom schema based on public and private tables, in the process of being updated to reflect Django architecture.
Technologies: the GCD is implemented as a Django-based web application, using MySQL as a database, with Haystack and Elasticsearch for search. Underlying technology detailed on GitHub.

Internet Archive

Essentially a category page within the wider Internet Archive, a non-profit initiative that has been building a free digital library of internet sites and cultural artifacts in digital form since 1996. The comic books category page is one of hundreds which is used to organize the information collected by IA and uploaded by its users, and part of the more than 38 million books and texts it holds.

User reviews.
About page detailing creator of category, contributors, and stats for views, items uploaded, regions accessing content, and collection (category) metadata (identifiers, language, media type, filters).
Detailed metadata on OCR / scanning process.
High quality scans available via a JS viewer in browser.
Accessibility functionalities including Text-To-Speech and different view types.
Multiple download options including text / pdf files, OCR, and torrent files.

Collection pages have a set of basic filters for media types, publishers, years, topics, etc.
Collection is international, and allows inclusion of sub collections (for specific comics/publishers). However, it also includes non-comics content such as novelizations.
Metadata: based on Dublin Core, extended for specific needs and user control. Included with every item as an .xml file. Metadata powers recommendation system at the bottom of each page.
Technologies: IA primarily uses HTML5 and jQuery for its front-end services, but gives no public details for the library's underlying infrastructure.
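
As a sketch of how a per-item metadata file of this kind could be read, the example below parses a small XML fragment into a dictionary using the Python standard library. The fragment is invented: the identifier, the values, and the exact set of field names are assumptions chosen to resemble Dublin Core-style elements, not a copy of a real item's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-item metadata file; field names and values are assumptions.
SAMPLE_META = """\
<metadata>
  <identifier>example-comic-001</identifier>
  <title>Example Adventures 001</title>
  <mediatype>texts</mediatype>
  <collection>comics</collection>
  <language>eng</language>
  <subject>golden age</subject>
  <subject>adventure</subject>
</metadata>
"""


def read_item_metadata(xml_text):
    """Collect every field into a list, since fields such as subject can repeat."""
    fields = {}
    for element in ET.fromstring(xml_text):
        fields.setdefault(element.tag, []).append((element.text or "").strip())
    return fields


meta = read_item_metadata(SAMPLE_META)
```

Keeping every field as a list avoids silently dropping repeated values, which matters for fields like subject that drive filters and recommendations.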

Overview

#	DCM	GCD	IA
Pros	Public domain content. User-submitted content. Data used by researchers.	Well-structured metadata. Detailed documentation. User contributions with tutorials. User reviews. International collection. Data sharing.	Detailed scanning and OCR metadata. Metadata-powered recommendations. User-friendly reader with accessibility options. Multiple download options. Search filters. User reviews. International collection.
Cons	Outdated front and back-end infrastructure. Scarce, inconsistent metadata. Small scope due to golden age focus. No OCR.	Outdated, text-heavy front-end. Data only available as MySQL or name/value. No Machine Learning implementations. Catalogue only focus. Limited to physical media.	Includes non-comics material. Limited focus due to being one collection among thousands. Front-end limited to generic IA interface. Limited bibliographic information.

Summary

#	DCM	GCD	IA
Content	Golden age comics.	Physical comics only, international.	Any comics-related printed and digital content, international.
User Contribution	Yes.	Yes.	Yes.
Metadata	Limited, inconsistent.	Custom, well-structured.	Based on DC, well-structured.
Machine Learning	None.	None.	OCR.
Aesthetics	Outdated, text-heavy.	Outdated, text-heavy.	Minimalist, site-wide template.

Users

In recent years, attention in the design of digital cultural objects has shifted from the technological and technical aspects to the whole experience of the user. For digital libraries in particular, it’s become evident that the quality and importance of content doesn’t matter when the knowledge can’t be delivered to users in a way that is comprehensible and useful for them.

In order to delineate some possible user types for our framework we looked at existing surveys of comics readers, user studies in digital libraries, and existing documentation in libraries we reviewed.

Surveys

Using the results of five different surveys on comics readers from Italy, France, and North America we can see some generic trends about comics usage:

More popular with men than women.
More popular with younger demographics (25 and under).
Women prefer graphic novels.
Print is still the preferred medium but digital is increasing in readership.
Comics are an important part of public library offerings.
Readers enjoy a variety of types of comics, with manga a particularly important type regardless of the origin of the reader.

Digital libraries

To understand the needs of digital library users we need to first understand who they are and why they might want to browse.

A comparative study conducted by Walsh et al. (2016), reviewing existing literature on user characterization in digital archives, museums and libraries, identifies three different ways to categorize users: by expertise level, by information needs, and by motivation and role.

Expertise level: four types of users incl. professional, lay, experienced, and novice.
Information needs: three types incl. general, educational, specialist.
Motivation & role: identifies a type of ‘unaffiliated scholar’ that covers journalists, teachers, writers, and other creative types with an interest in Cultural Heritage. Motivations can include using web 2.0 tools, online collaboration, learning, needing librarian/archivist help for research.

A 2017 study on users of Europeana highlighted the following motivations for users conducting searches on the site:

Using information to create a new work (books, exhibitions, presentation, academic studies).
Cultural Heritage interest.
Professional needs (research, fact checking).
Teaching.

A study for the National Museum of Military History in Copenhagen introduces a new type of user, hobbyists, that follows the definition of serious leisure introduced by Robert Stebbins. The study defined two types of hobbyists: collectors (interested in objects) and liberal arts enthusiasts (interested in historical and humanities knowledge).

DCM, GCD, IA

Digital Comic Museum: offers little data on users aside from a 2010 forum survey that found nearly half of users to be aged 40 or over.
Grand Comics Database: a 2009 survey of 416 visitors found the majority reading language to be English and sex to be male (97%). Based on results to questions about reasons for use of the site, GCD highlighted three user types incl. interactive users (contributors to catalogue), research users, and personal users (no reasons given).
Internet Archive: no user survey or meaningful information available, though the website has some general statistical information about user accounts, forum posts, and server usage.

Summary

Expertise level: Based on our research we think that users of comics-focused digital libraries can likely be split into three levels of expertise similar to those mentioned by Walsh but with the lay and experienced combined into one type, which also bears similarities to the hobbyist type. These types can be summarized as researcher, contributor, and reader.
Information needs: educational (researcher), specialist (contributor), and general (reader).
Motivation and role: research, contribution, and personal enjoyment (reading).

User Type

We would like to propose the following categories of user for our framework:

Researcher: motivated by the production of academic or scholarly work, with a likely need for high quality metadata and services such as APIs. We believe that identifying a “professional” type would mostly be misleading (should it be a literature professor doing research on comics, a comic book editor, a comic artist?) but a researcher (academic or journalistic) type seems to be an ideal fit for the more engaged users of existing libraries such as GCD and DCM.
Contributor: someone with an interest in comics high enough to want to contribute to the growth of a digital library’s offerings, and capable of learning the implications of using specific metadata schemas. Could be a researcher, but is more likely to be a fan motivated by the activity itself, especially when it comes to scanning material or adding entries to a catalogue.
Reader: motivated by the desire to read first and foremost, but capable of crossing over into the other two types. Primarily passive and not necessarily interested in what happens behind the scenes though likely to have an opinion on how the library’s schemas impact their ability to find things.

In the design of a digital library for comics it would be important to take into consideration both the needs of the reader, as a baseline user, who would benefit from an easy to use interface, and of researchers and contributors, who need to be able to add to or manipulate existing information.
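
The three user types can also be written down as a small data structure, which may be useful when deciding which services a library should expose to whom. The sketch below is one possible encoding, not part of any reviewed library: the information needs and motivations follow the summary above, while the class names and the wording of the listed services are our own illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class UserType(Enum):
    RESEARCHER = "researcher"
    CONTRIBUTOR = "contributor"
    READER = "reader"


@dataclass(frozen=True)
class UserProfile:
    user_type: UserType
    information_need: str
    motivation: str
    services: tuple  # illustrative wording, not an exhaustive list


PROFILES = {
    UserType.RESEARCHER: UserProfile(
        UserType.RESEARCHER, "educational", "research",
        ("high quality metadata", "API access"),
    ),
    UserType.CONTRIBUTOR: UserProfile(
        UserType.CONTRIBUTOR, "specialist", "contribution",
        ("cataloguing tools", "schema documentation"),
    ),
    UserType.READER: UserProfile(
        UserType.READER, "general", "personal enjoyment",
        ("easy to use interface",),
    ),
}


def services_for(user_type):
    """Look up the services a given user type is assumed to need."""
    return PROFILES[user_type].services


reader_services = services_for(UserType.READER)
```

Because a reader can cross over into the other two types, a real system would more likely attach several profiles to one account than assign a single fixed type.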

Recommendations

Each library varies in content offerings, quality of metadata, and usability. Their choices of what to focus on (e.g. golden age comics, indexing, public contribution) create the limitations of the library itself and determine which type of user it is best suited for.

DCM is arguably the weakest of the three libraries we've reviewed though it does have the benefit of offering public domain material, which is clearly of interest to researcher users.
GCD offers the most robust user contribution system, ideal for contributor users.
Internet Archive offers the best user experience for actual consumption of comic books, and thus for reader users.

Based on the framework we've outlined we have made some recommendations for improving each of the resources reviewed.

DCM

Implement use of generic bibliographic metadata standards, such as MARC, and dedicated ones, such as the Comic Book Ontology.
Implement ML-powered services such as network visualisation, statistics about comics (e.g. authors, nationality, gender etc…).
Enable additional user contribution such as bibliographic cataloguing, annotation campaigns to enable OCR.
Make data more easily available via a dedicated service.
Modernise front-end design to move away from antiquated web solutions like message board software.

GCD

Align existing schema with LOD standards, such as Schema.org or DC, or with bibliographic standards.
Implement ML-powered services such as network visualisation, statistics about comics (e.g. authors, nationality, gender etc…).
Make data more easily available via an API rather than a dump.
Implement semantic search and recommendation solutions, aided by a knowledge graph.
Modernise front-end design to make content more engaging, could also allow for better back-end interface for contributors.
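
Aligning a custom schema with a standard such as DC usually starts with a crosswalk, a table stating which local field corresponds to which standard term. The sketch below shows the idea with a plain dictionary. The Dublin Core terms used (title, publisher, date, language) are real, but the local field names and the sample record are hypothetical and do not reflect the GCD's actual schema.

```python
# Hypothetical local field names mapped to Dublin Core terms.
CROSSWALK = {
    "series_name": "dcterms:title",
    "publisher_name": "dcterms:publisher",
    "publication_date": "dcterms:date",
    "language_code": "dcterms:language",
}


def to_dublin_core(record, crosswalk=CROSSWALK):
    """Split a record into mapped terms and fields with no known equivalent."""
    mapped = {}
    unmapped = []
    for field, value in record.items():
        term = crosswalk.get(field)
        if term is None:
            # Keep track of fields the standard cannot express directly.
            unmapped.append(field)
        else:
            mapped[term] = value
    return mapped, unmapped


# Invented record, for illustration only.
record = {
    "series_name": "Example Adventures",
    "publisher_name": "Example Press",
    "publication_date": "1941-03",
    "language_code": "en",
    "indexer_notes": "cover scan missing",
}
mapped, unmapped = to_dublin_core(record)
```

Reporting the unmapped fields is deliberate: they show where a generic standard falls short and where a dedicated vocabulary for comics would be needed.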

IA

Improve textual information in about pages.
Implement dedicated standards for specific medium, e.g. Comic Book Ontology for comics collection. Ultimately this could take the form of spinning off such collections into their own mini-sites which could be tied to existing collaborations IA has with libraries and archives.
Implement ML-powered services such as network visualisation, statistics about comics (e.g. authors, nationality, gender etc…).

Bibliography:

* Clément Guérin, Christophe Rigaud, Karell Bertet, Arnaud Revel. “An ontology-based framework for the automated analysis and interpretation of comic books' images”. Information Sciences, Elsevier, 2017, 378, pp. 109-130. ⟨10.1016/j.ins.2016.10.032⟩. ⟨hal-01387033⟩

* Laubrock, J. and Dunst, A. (2020), “Computational Approaches to Comics Analysis”. Top Cogn Sci, 12: 274-310. https://doi.org/10.1111/tops.12476

* Morozumi, Ayako & Nomura, Satomi & Nagamori, Mitsuharu & Sugimoto, Shigeo. (2009). “Metadata Framework for Manga: A Multi-paradigm Metadata Description Framework for Digital Comics”.

* Nguyen, Nhu-Van, Christophe Rigaud, and Jean-Christophe Burie. 2018. "Digital Comics Image Indexing Based on Deep Learning" Journal of Imaging 4, no. 7: 89. https://doi.org/10.3390/jimaging4070089

* Petiya, Sean. "Building a Semantic Web of Comics: Publishing Linked Data in HTML/RDFa Using a Comic Book Ontology and Metadata Application Profiles." Master's thesis, Kent State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=kent1416791055

Sitography:

* A Comics Ontology. URL: http://www.r4isstatic.com/231

* Advanced Comic Book Format (ACBF). URL: https://launchpad.net/acbf/

* BDnF: The Comics Factory. URL: https://bdnf.bnf.fr/EN/index.html

* Comic Book Markup Language (CBML). URL: https://dcl.ils.indiana.edu/cbml/

* Comic Book Ontology (CBO). URL: https://comicmeta.org/cbo/

* Comic Metadata (CoMet). URL: http://www.denvog.com/comet/

* Comic Vine API. URL: https://comicvine.gamespot.com/api/

* Comics as Data: North America (CADNA). URL: https://graphicpossibilities.hcommons.org/comics-as-data-north-america-cadna/

* ComicsML (XML for digital comics). URL: http://comicsml.jmac.org

* Digital Comic Museum (DCM). URL: https://digitalcomicmuseum.com

* Grand Comics Database (GCD). URL: https://www.comics.org

* Internet Archive’s Comic Books and Graphic Novels Collections. URL: https://archive.org/details/comics

* Marvel API. URL: https://developer.marvel.com

* Schema.org’s ComicSeries type. URL: https://schema.org/ComicSeries

* Metron. URL: https://metron.cloud

* The Graphic Narrative Corpus (GNC). URL: https://groups.uni-paderborn.de/graphic-literature/gncorpus/corpus.php

* The Visual Language Research Corpus (VLRC). URL: https://www.visuallanguagelab.com/vlrc

Surveys:

* Chi è il lettore di fumetti in Italia? (Who is the comic reader in Italy?), 2021, AIE (Italian Editors Association), Italy

* Public Library Graphic Novels Survey, 2018, Library Journal, North America

* BD reader in France, 2020, IPSOS, France

* Manga Planet survey, 2019, online

* Charbonneau, Olivier (2005) Adult Graphic Novel Readers: A Survey in a Montreal Library. Young Adult Library Services, 3 (4). pp. 39-42. ISSN 1541-4302

Image Sources:

* [photograph of comic covers]. (2021). Brett Jordan. Unsplash. https://unsplash.com/photos/CsZQ50xO35I

* [photograph of comic covers]. (2020). Dev. Unsplash. https://unsplash.com/photos/d2Py_uhXJQo

* [photograph of an open comic book]. (2018). Miika Laaksonen. Unsplash. https://unsplash.com/photos/nUL9aPgGvgM

* [photograph of comic covers]. (2021). Erik Mclean. Unsplash. https://unsplash.com/photos/27kCu7bXGEI

* [photograph of comic covers]. (2021). Jonathan Cooper. Unsplash. https://unsplash.com/photos/OOKnZA6nhqA

* [photograph of a comic page]. (2020). Brett Jordan. Unsplash. https://unsplash.com/photos/CsZQ50xO35I

Meet the team

This project was created and presented as part of the requirements for the Semantic Digital Libraries course within the Digital Humanities & Digital Knowledge Master's at the University of Bologna, a/y 2021-2022.

Laurent Fintoni

French but exiled abroad for so long he now writes in English.

Camilla Neri

Graduating in Digital Humanities, Webtoon connoisseur.

Laura Travaglini

master of none trying to get a master degree in digital humanities