How I Read a Wikipedia Article

When I come to the question of how a site was created, I ask myself: what is the purpose of this site, who created it and when, how did it develop into what I'm seeing now, and, if I want to use its content, what are the copyright issues? Wikipedia (wikipedia.org) is a good site to deconstruct with these questions in mind.

Take the page "Digital humanities" as an example. To find the copyright terms and basic information such as the site's mission, scroll to the bottom of the page, where there are links to the "Creative Commons Attribution-ShareAlike License," the "Terms of Use" and "Privacy Policy," "About Wikipedia," "Developers," "Statistics," and so on.

To find out who created the page, go to "View history" at the top right. There you can see who created the page, when, and who has contributed to it, from the creation of the page all the way to the most recent edit. You can click a contributor's username to see their information, though some users have created user pages and some have not. If you want to communicate with a user, click "talk" beside the username and leave a message; you can also read their conversations with other people who have left messages there. To see a user's contributions, click "contribs" beside the username.

To see the group discussion about the page, rather than a discussion with an individual user on their user page, click "Talk" at the top left of the "Digital humanities" page. There you can read people's debates, suggestions, and questions about the page. To trace how the content has been edited, you can select any two entries in the history and compare them, or open one entry and click through to the previous, newer, or latest version to see the differences.
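
If you would rather pull this revision history programmatically, Wikipedia's public MediaWiki API exposes the same information. Here is a minimal Python sketch; the article title and the number of revisions requested are just example values.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Digital humanities",   # example article title
    "rvprop": "user|timestamp|comment",
    "rvlimit": 10,                    # the ten most recent edits
    "format": "json",
}

data = requests.get(API, params=params).json()
page = next(iter(data["query"]["pages"].values()))
for rev in page["revisions"]:
    print(rev["timestamp"], rev["user"], "-", rev.get("comment", ""))
```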

The neutrality and accuracy of the content are two common controversies around crowdsourced websites, because the credentials of contributors are unknown and human beings have biases. Checking contributors' user pages sometimes turns up information about them and sometimes does not. For example, while viewing the editing history of the page, I tried to open the user pages of the main contributors; here is what I found. Catsandthings holds graduate degrees in the humanities and is interested in Wikipedia and education. Sophia Chang has been blocked by Wikipedia. Elijahmeeks is the Digital Humanities Specialist at Stanford University and, once upon a time, studied Wikipedia and open-source culture. Simon Mahony is a Classicist by training and Director of the UCL Centre for Digital Humanities at the Department of Information Studies, University College London. ARK is Rudolf Ammann (@rkammann), a researcher at the UCL Centre for Digital Humanities. Others simply don't have a user page. Reading these mini-biographies, and noticing the missing ones, you may question the authority of the content. It makes me think that you can use Wikipedia as a starting point for your research, but your research can't stop there, because Wikipedia does not guarantee that it is 100% accurate or neutral. Two other good examples of this caution come from Rosenzweig's "Can History Be Open Source?":

In the 25 biographies I read closely, I found clear-cut factual errors in only 4. Most were small and inconsequential. Frederick Law Olmsted is said to have managed the Mariposa mining estate after the Civil War, rather than in 1863. And some errors simply repeat widely held but inaccurate beliefs, such as that Haym Salomon personally loaned hundreds of thousands of dollars to the American government during the Revolution and was never repaid. (In fact, the money merely passed through his bank accounts.) Both Encarta and the Encyclopedia Britannica offer up the same myth. The 10,000-word essay on Franklin Roosevelt was the only one with multiple errors. Again, some are small or widely accepted, such as the false claim (made by Roosevelt supporters during the 1932 election) that FDR wrote the Haitian constitution or that Roosevelt money was crucial to his first election to public office in 1910. But two are more significant—the suggestion that a switch by Al Smith's (rather than John Nance Garner's) delegates gave Roosevelt the 1932 nomination and the statement that the Supreme Court overruled the National Industrial Recovery Act (NIRA) in 1937, rather than 1935.

Wikipedia tries to address the problems of accuracy and bias. For example, it requires users to register before creating articles, and it allows anyone to view a page's history. It has evolved intricate rules by which participants can be temporarily or even permanently banned from Wikipedia for inappropriate behavior, and it has set up an elaborate structure of "administrators," "bureaucrats," "stewards," "developers," and elected trustees to oversee the project, along with an arbitration committee (Rosenzweig, "Can History Be Open Source?"). But sometimes these measures become overkill: "The website can be as ugly and bitter as 4chan and as mind-numbingly bureaucratic as a Kafka story" (Auerbach, "Encyclopedia Frown").

So when I read a Wikipedia article, I take it as a starting point for research rather than the end. It provides valuable information, but it also requires your own effort to check its accuracy. There are ways to do this, such as looking at the sources and the editing history, and using other tools such as books and search engines. Overall, Wikipedia promotes the flow of information, provides new entry points into knowledge, and inspires further research. Readers just need to keep the accuracy and bias issues in mind when they read an article.

Comparing Tools

Voyant, Kepler.gl, and Palladio are useful digital humanities tools. They help researchers make sense of a large collection of information through a "macro view." By processing a large pool of data, these programs may show researchers things they had not noticed or realized before, opening new possibilities. However, they emphasize different aspects of data processing.

Voyant highlights text analysis. It makes reading a corpus easier by showing a word-frequency list (word cloud), word-distribution plots (graphs), a summary of keywords, and the contexts in which the keywords appear; the interface is composed of these panels. Voyant can be applied to texts in a wide range of contexts, such as literature, historical narrative, and language teaching. Take the WPA slave narrative collection as an example: with Voyant, a user can see the keywords of the narratives from each state, or the places mentioned in each state, and can grasp the similarities and differences among the states by analyzing the corpus. Voyant makes the analysis of such a large collection much easier, faster, and more informative. This text-mining tool is a good way to learn about texts in new ways.
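
To give a sense of what the word cloud summarizes, here is a rough Python sketch of the underlying word-frequency count. It is not Voyant's actual implementation; the filename and the tiny stop-word list are placeholders.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list; Voyant ships with much larger ones.
STOPWORDS = {"the", "and", "a", "an", "to", "of", "in", "was", "i", "he", "she", "it"}

# Hypothetical filename for one state's narratives.
with open("arkansas_narratives.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

freq = Counter(w for w in words if w not in STOPWORDS)
for word, count in freq.most_common(20):   # the 20 biggest words in the "cloud"
    print(f"{word:15s} {count}")
```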

Kepler.gl highlights the geospatial side of the data. It is a visualization tool for processing geospatial data, such as the location data in the WPA slave narrative collection. Through Kepler.gl, you can see a map view of the location information: where the interviewees were interviewed, where they had been enslaved, and so on. By clicking a dot on the map, you can see the information attached to that location (the interviewee, the interviewer, where the place is) and the relationships between sets of information. To use the tool, a user uploads a dataset, and Kepler.gl automatically detects the variables and data points and lays them out in a polished map visualization. You can also add filters, apply scales, and perform visual aggregations on the fly. Kepler.gl is powerful for analyzing location-specific data and showing the results on clean, data-driven maps.
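
Kepler.gl also ships a Jupyter widget (the keplergl Python package), so the same upload-and-detect step can be scripted. A minimal sketch, where the CSV name and its columns are assumptions for illustration:

```python
import pandas as pd
from keplergl import KeplerGl   # pip install keplergl

# Hypothetical CSV with one row per interview and columns such as
# interviewee, interviewer, latitude, longitude.
df = pd.read_csv("wpa_interviews.csv")

m = KeplerGl(height=500)
m.add_data(data=df, name="WPA interviews")   # kepler.gl auto-detects latitude/longitude columns
m   # in a Jupyter notebook this renders the interactive map; filters and scales are then set in the UI
```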

Palladio is a tool that highlights visualizing relationships and networks, with a map view. It stresses the relationships and networks more than the map itself, so it is not a mapping program but a "network" program. With Palladio, you can upload data and visualize it in the browser without any barriers. In the Map view, you can see any coordinate data as points on a map, and relationships between distinct points can be connected by lines, with the arc of the line representing the flow of the relationship. In the Graph view, you can visualize the relationships between any two dimensions of your data: the information is displayed as nodes connected by lines, nodes can be scaled to reflect their relative magnitude within your data, and the display of links and labels can be toggled on and off. But you cannot get the information about a location (a point on the map) just by clicking the point; Palladio stresses relationships more than the specific information about a place on the map.
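
As a rough analogue of Palladio's Graph view, the same source/target idea can be sketched with networkx in Python. This is only an illustration; the CSV layout (interviewer and topic columns) is an assumption.

```python
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical CSV with "interviewer" and "topic" columns (Palladio's source and target).
edges = pd.read_csv("interviews.csv")
G = nx.from_pandas_edgelist(edges, source="interviewer", target="topic")

# Scale each node by its degree, roughly what Palladio's "size nodes" option does.
nx.draw_networkx(G, node_size=[100 * G.degree(n) for n in G.nodes()], font_size=8)
plt.show()
```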

These three tools highlight different sides of text analysis and data visualization. In Kepler.gl you can add a layer of lines to make a map similar to one made in Palladio. The tools complement each other well because they emphasize different aspects of data processing and visualize different sides of the same data. For example, you can use information found with Voyant in Palladio: find the most frequently used word in the corpus, then map its relationship to each state in Palladio. Or you can work further with the location-specific data by mapping the keywords in Kepler.gl.

In short, Voyant, Kepler.gl, and Palladio open a new window for study and research in the humanities. By processing and visualizing data, they make texts easier to read and help researchers make new discoveries about them.

Reflection on Palladio

I find Palladio very useful for visualizing networks or relationships, with a map view. It is interactive, so you can choose the two things between which you want to find a relationship.

For example, say you want to see interviewers and their topics. You choose these two dimensions, with interviewer as the source and topics as the target, and Palladio maps out a network showing the relationships between interviewers and topics. You can choose to highlight one or both dimensions to make the relationship easier to see, and you can click "size nodes" to scale the nodes. In short, Palladio is a very interactive, user-friendly tool for showing networks and relationships. Its downside is that, despite its map view, it doesn't show geography the way a standard map does, and you can't get information about the nodes (places, people, etc.) by hovering or clicking. Its strength is that it shows networks and relationships very well.
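
A small, non-visual complement to what "size nodes" shows is simply counting how many distinct topics each interviewer touches. This is only a sketch, and the file and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical CSV with "interviewer" and "topic" columns.
edges = pd.read_csv("interviews.csv")

# Number of distinct topics per interviewer, largest first;
# roughly the quantity that "size nodes" encodes visually.
topic_counts = edges.groupby("interviewer")["topic"].nunique().sort_values(ascending=False)
print(topic_counts.head(10))
```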

Reflection on Kepler

Kepler.gl is a mapping tool that can visualize location-based data on maps of various forms. There are simple points, arcs, heatmaps, and many other types of marks that can reveal what is hidden in a collection of data. Each type has its strengths and weaknesses.

Simple points (a point map) show locations well, with more detail appearing when you click a dot, but they can't show a trend or density, while clusters better show the intensity of subjects, which may also reveal a "trend." A network map, as its name suggests, better shows networks, or the movement of people or goods. I am most impressed with the network map, whether of arcs or of lines, because it can clearly show two types of data and their connections by overlapping parts of them.
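
For the arc or line layer, the data itself just needs a pair of coordinates per row, one for each end of the connection. A hedged sketch using the keplergl Jupyter widget, with assumed file and column names; the layer type itself is still chosen in the kepler.gl panel.

```python
import pandas as pd
from keplergl import KeplerGl   # pip install keplergl

# Hypothetical CSV: one row per person, with the coordinates of where they were
# enslaved (origin_lat, origin_lng) and where they were interviewed (dest_lat, dest_lng).
moves = pd.read_csv("enslaved_to_interviewed.csv")

m = KeplerGl(height=500)
m.add_data(data=moves, name="movements")
m   # in the notebook, choose an Arc (or Line) layer in the panel and point it at the two coordinate pairs
```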

In short, Kepler.gl is a very useful mapping tool for visualizing data. It can display multiple layers of data at the same time, making it a powerful tool for visualizing the many facets of a collection as well as the possible connections among those facets.

Thoughts on Voyant

This software helps researchers analyze a large collection in the form of data and graphs and grasp the overall shape of the corpus as well as each specific document. Take the WPA slave narrative collection as an example: Voyant can show how frequently a word appears in the corpus, in each state, and in each specific document, where it appears, and so on. These frequencies are displayed as a word cloud, graphs, a summary, and contexts. The tool is really useful for finding large-scale patterns that can't be found by reading each document closely or by summarizing with the human brain alone.

For example, look at the word "slaves." Since the interviewees were all former slaves, you might assume it would be a frequent word in every state. However, by examining the word in the word cloud of each state, you will find this assumption is not true. Click "Scale" in the Cirrus panel, then click "Documents" to pick a state, then hover over the word in the new cloud. You will find that in some states, such as Arkansas, "slaves" barely shows up at all, while in Maryland and Florida it is a big word. This difference, discovered from the clouds and graphs, tells us that former slaves in different states may have held different key memories of their experience. If you only look at the corpus-level cloud, you can't see such differences.
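
The same spot-check can be done outside Voyant by comparing the relative frequency of one word across per-state files. A small Python sketch; the filenames are placeholders and the numbers it would print are not real results.

```python
from collections import Counter
import re

# Placeholder filenames: one plain-text file of narratives per state.
files = {"Arkansas": "arkansas.txt", "Maryland": "maryland.txt", "Florida": "florida.txt"}

for state, path in files.items():
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(words)
    rate = counts["slaves"] / len(words) if words else 0.0
    print(f"{state:10s} {counts['slaves']:5d} occurrences  ({rate:.4%} of all words)")
```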

In short, Voyant is a good tool for researchers to analyze texts and visualize them in the form of clouds, graphs, summaries, and contexts. It helps researchers make new discoveries in a large collection from a bird's-eye view.

Review of Metadata

I mainly used the Library of Congress metadata for the images I use. Generally the metadata describes the title, contributor, year of publication, genre, size, location, rights, and how to cite the item.

It does not include features such as the actual size of the objects, their texture, weight, or smell, or a 360-degree image of the objects.

The metadata allows you to ask questions about the item itself, for example a picture or a video: Where is this picture from? Who took it? What or who is in it? Can I use it, and to what extent? If I want to cite this source, what exactly should the format and content of the citation be?

But the metadata doesn't let you ask questions about the objects or people inside the pictures. For example, for the bottled-milk image I use: How big is such a bottle? Is it the same milk we drink today, or did milk 70 years ago taste different? How accurately does this picture render the scene? Is the light in the picture (judging from the shadow on the wall) sunshine or bulb light? What is the purpose of organizing the bottles in this way? In short, you can't find much information in the metadata about the objects in the pictures themselves. The link to the bottled-milk image is here: https://www.loc.gov/item/2017813072/
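
For what the record does contain, the Library of Congress serves item metadata as JSON when you append ?fo=json to the item URL. A hedged sketch: the specific field names I probe for are guesses, so the code also just lists whatever keys the record actually carries.

```python
import requests

# Appending ?fo=json to a loc.gov item URL returns the record as JSON.
resp = requests.get("https://www.loc.gov/item/2017813072/", params={"fo": "json"})
item = resp.json().get("item", {})

print(sorted(item.keys()))   # list whichever descriptive fields the record actually carries
for field in ("title", "date", "contributor_names"):   # assumed field names, may differ
    print(field, "->", item.get(field))
```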

Database Review: Artstor

Artstor is a non-profit organization that builds and distributes the Digital Library, an online resource of more than 2 million images in the arts, architecture, humanities, and sciences, and Shared Shelf, a Web-based cataloging and image management software service that allows institutions to catalog, edit, store, and share local collections.

Artstor provides faculty and students with a complete image resource in a wide array of subjects with the breadth and depth to add context and examine influences beyond the confines of your discipline.

With approximately 300 collections composed of over 2.5 million images (and growing), scholars can examine wide-ranging material such as Native American art from the Smithsonian, treasures from the Louvre, and panoramic, 360-degree views of the Hagia Sophia in a single, easy-to-use resource.

Artstor also supports study across disciplines, including anthropology from Harvard’s Peabody Museum, archaeology from Erich Lessing Culture and Fine Art Archives, and modern history from Magnum Photos, making it a resource for your whole institution.

The Artstor Digital Library provides straightforward access to curated images from reliable sources that have been rights-cleared for use in education and research — you are free to use them in classroom instruction and handouts, presentations, student assignments, and other noncommercial educational and scholarly activities.

And unlike results from Google or other search engines, the images come with high-quality metadata from the collection catalogers, curators, institutions, and artists themselves.

The Artstor Digital Library is here: https://library.artstor.org/#/; you can also get Artstor through https://www.artstor.org/get-artstor/.

Artstor users must create a free account to save and share images, and must turn off their popup blockers.

Artstor has a few notable features:
1. You can quickly download groups of images into PowerPoint presentations with citation data in the notes field; each image links directly back to Artstor, allowing you to zoom in on details.
2. The IIIF image viewer lets you view images full screen and compare up to 10 items at once while zooming in on details.
3. You can curate groups of images for lectures and papers, access and download them from anywhere, and share them in many formats, even on your course management system.
4. With the click of a button, students can turn image groups into flashcards using quiz mode or generate automatic image citations in APA, MLA, and Chicago styles.

 

Thoughts on Digitization

After learning about digitization, I realized that it has a few shortcomings. I now have more perspectives from which to view it.

First, I find that digitization cannot capture everything you see and feel in print. There are many reasons, such as pixel resolution and the intrinsic limitations of some technologies. It is easier to capture the image and the color, but in other respects digitization has room to improve. For example, you can't feel the texture, and you don't know the actual size, or even the actual colors, from cameras or scanners alone. There is a "margin of error."

Second, I think pictures make the most sense for digitization at present. With current technologies, pixels and colors are easier problems to solve than texture and size: you can use high-resolution cameras to make images clearer, and colors can be adjusted to match what you see with the naked eye. Video can partially solve some of the problems that still images have, but videos require more storage space to preserve, and only some people are able to make good ones.

Third, working with digitized representations definitely affects our understanding of different kinds of items, but the extent depends on what the items are and what you use them for. For example, as a historian I find digitized texts and images very helpful because they save a lot of time and trips for archival work. But that helpfulness rests on the fact that my research only needs facts rather than nuances. Art historians surely have higher standards and requirements for image quality, as the pictures in "Building Meaning in Digitized Photographs" show. The nuances in these images affect our understanding of the items we see. This is only one example, among many, of how digitized things and printed things shape our understanding differently. So it is really a matter of what you expect to gain from digitized things.

There are other "margins of error" in digitized things that affect our ability to use them. For example, technological limitations might make us miss the best parts of an item because of the "understatement" of its color, texture, and size. Or the selections that librarians and archivists make about what to digitize limit our access to other works: we can't see those works from afar because they are not digitized. These factors, among others, shape our ability to use digital humanities.

 
