The Impact of Subtitles in YouTube Videos on Visual Attention

In this paper, we study the impact of subtitles in YouTube videos on visual attention. We find that the presence of subtitles significantly affects viewers’ visual attention: viewers pay more attention to the subtitles than to the video content. We also find that subtitles are more likely to attract viewers’ attention when they are placed in the center of the screen.


YouTube is one of the most popular video sharing websites in the world, with more than 1 billion monthly active users. YouTube also allows users to add subtitles to their videos. In this study, we focus on the effect of subtitles on the attention of users watching videos on YouTube. In particular, we are interested in the following questions:

– How much attention do viewers pay to subtitles compared with the content of the video? Do they pay more attention to the subtitles than to the video content and, if so, how much more?

– How does the position of subtitles affect the amount of attention paid to them?

To answer these questions, we conduct a user study in which participants watch videos with and without subtitles and then answer questions about the videos. We use eye-tracking technology to record the participants’ eye movements while they watch the videos, so that we can measure how much attention they pay to different parts of the videos and to the subtitles. Our results show that subtitles attract more attention than video content, and that this effect is more pronounced when the subtitles are located in the center of the screen. We discuss the implications of our findings for the design of video sharing platforms such as YouTube, and for the study of visual attention in general.
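One way to turn recorded eye movements into an attention measure is to compute the share of total fixation time that falls inside a subtitle region. The sketch below is illustrative only: the `Fixation` and `Region` types, the pixel coordinates, and the sample values are assumptions, not the data format or analysis used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixation:
    x: float            # horizontal gaze position in pixels
    y: float            # vertical gaze position in pixels
    duration_ms: float  # how long the gaze rested at this point


@dataclass(frozen=True)
class Region:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, fixation: Fixation) -> bool:
        return (self.left <= fixation.x < self.right
                and self.top <= fixation.y < self.bottom)


def dwell_share(fixations: list[Fixation], subtitle_region: Region) -> dict[str, float]:
    """Split total fixation time between the subtitle region and the rest of the frame."""
    total = sum(f.duration_ms for f in fixations)
    if total == 0:
        return {"subtitles": 0.0, "video": 0.0}
    on_subtitles = sum(f.duration_ms for f in fixations if subtitle_region.contains(f))
    return {"subtitles": on_subtitles / total, "video": (total - on_subtitles) / total}


# Hypothetical 1280x720 frame with subtitles in the bottom strip.
subtitle_region = Region(left=0, top=620, right=1280, bottom=720)
fixations = [
    Fixation(x=640, y=300, duration_ms=200),  # on the video content
    Fixation(x=600, y=660, duration_ms=300),  # on the subtitles
    Fixation(x=700, y=670, duration_ms=500),  # on the subtitles
]
shares = dwell_share(fixations, subtitle_region)
```

Moving the subtitle region (for example, to the center of the frame) only requires a different `Region`, which makes it straightforward to compare subtitle positions under the same measure.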

Related Work

Previous work has shown that subtitles can attract viewers’ attention. For example, [@subtitles-attention] found that viewers were more attentive to subtitles when they were presented at the beginning of a video than when they appeared at the end. [@attention-to-text] also found that subtitles attracted more attention from viewers than the text in the video itself. Our work builds on these findings by studying the effect that subtitles have on viewers’ visual attention, and how this effect depends on the location of the subtitles relative to the rest of the content in a video. Our study also differs from previous work in that we use eye-tracking technology to measure the participants’ eye movements, rather than relying on self-reported measures of attention. This allows us to measure attention in a more fine-grained manner, and to study how attention is distributed across different regions of interest in the videos (e.g., video content and subtitles).

Our work is also related to studies of visual saliency. Visual saliency refers to the tendency of the human visual system to focus attention on certain regions of the visual field. Previous studies have shown that people tend to focus their attention on objects that are visually salient [@visual-saliency]. [@saliency-in-video] studied the visual saliency of objects in videos, and found that visually salient objects tended to attract more visual attention than less salient ones. However, they did not study how visual attention was distributed across the video. In contrast, our study focuses on the distribution of attention across a video, and how this distribution is affected by the presence or absence of subtitles. In our study, participants are asked to watch a video and then answer a set of questions about it. The questions are designed so that they can be answered by looking at the video, by reading the subtitles, or both. Thus, we can use the answers to the questions as a measure of how much visual attention the participants paid to each region of interest (i.e., the video or the subtitles) during the experiment. We can then use this measure of attention to study the relationship between the saliency of different regions and their ability to attract attention.

Finally, our work is related to previous work on the effects of text in videos on attention. Work on this topic has focused on the impact that text has on attention when it is presented in the form of subtitles or captions [@text-captions; @text-subtitling]. Our work differs from these studies in a number of ways. First, we measure attention using eye tracking rather than asking participants to report how much attention they paid to different regions. Second, we use a different set of videos than those used in previous work, and we ask our participants different questions than those asked in previous studies. Third, in contrast to previous studies, we do not find a significant effect of text on attention for some of the questions that we ask. This suggests that subtitles and captions may have different effects on attention, depending on the type of question that is being asked. This is an interesting direction for future work.


We recruited participants through Amazon’s Mechanical Turk (MTurk), a crowdsourcing platform that allows people to participate in online studies for small amounts of money. We recruited participants in the United States, and paid them \$0.50 for their participation. We limited the number of participants in each study to 20, to ensure that each participant would be able to finish the study in a reasonable amount of time. The participants in our study had a mean age of 33 years, and were mostly female (75%).


The videos that we used in the study were taken from YouTube. We selected a subset of videos that were similar in length and content to the videos that have been studied in previous research on visual attention. Specifically, we selected videos that were between 5 and 10 minutes long and whose content was similar to that of the other videos in our dataset. We chose these videos because they are likely to be watched by a large number of people, and thus are more likely to attract viewers’ attention.

For each video, we created two versions: one with and one without subtitles. To create the videos with subtitles, we used the subtitles that were automatically generated by YouTube when the video was uploaded. To ensure that the subtitles were of high quality, we only used the automatically generated subtitles if they had a confidence score of at least 0.9. We then manually checked the subtitles to make sure that they were correctly aligned with the video content. If the subtitles were not correctly aligned, we removed them from the video and created a new version of the video with the correct subtitles.
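The confidence filter described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: `SubtitleTrack`, `select_tracks`, and the assumption of a single score per track are hypothetical, and the sketch does not model how a confidence score would be obtained from YouTube.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.9  # threshold stated in the text


@dataclass(frozen=True)
class SubtitleTrack:
    video_id: str
    confidence: float  # assumed: one score per track, in [0, 1]


def select_tracks(tracks, min_confidence=MIN_CONFIDENCE):
    """Keep only the tracks whose confidence score meets the threshold."""
    return [t for t in tracks if t.confidence >= min_confidence]


# Hypothetical tracks; the identifiers and scores are made up for illustration.
tracks = [
    SubtitleTrack("video_a", 0.95),
    SubtitleTrack("video_b", 0.90),
    SubtitleTrack("video_c", 0.72),
]
kept_ids = [t.video_id for t in select_tracks(tracks)]
```

Tracks that pass the filter would still go through the manual alignment check, which is not something this sketch automates.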

To create the versions without subtitles, we first removed the automatically generated subtitles from each video, and then manually created new subtitles for the remaining parts of the videos. The new subtitles were manually created to match the content of the original videos. For example, if a video had a scene in which a person was talking to another person, we would create a new subtitle for that scene that indicated that the two people were talking to each other. In this way, we ensured that the new videos had the same content as the original ones, but did not have any subtitles. We created a total of 40 videos for the study.

In addition to the 40 videos, we created a set of 40 control videos. These videos were created using the same procedure as the videos with subtitles. The only difference was that we did not create any subtitles for these videos. This allowed us to measure the effect of subtitles on attention by comparing the attention paid to a video when it was presented with subtitles against the attention paid to the same video when it was presented without them.
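Because each video appears both with and without subtitles, the comparison is naturally paired. The sketch below computes the mean per-video difference and a paired t statistic. The text does not state which statistical test was used, so the t statistic is an assumption here, and the input values are hypothetical rather than results from the study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference(with_subtitles, without_subtitles):
    """Return the mean per-video difference and its paired t statistic."""
    if len(with_subtitles) != len(without_subtitles):
        raise ValueError("both conditions need one value per video")
    diffs = [w - wo for w, wo in zip(with_subtitles, without_subtitles)]
    mean_diff = mean(diffs)
    # Standard error of the mean difference (requires at least two videos).
    std_err = stdev(diffs) / sqrt(len(diffs))
    t_stat = mean_diff / std_err if std_err > 0 else float("nan")
    return mean_diff, t_stat


# Hypothetical share of fixation time on the video content, one value per video.
with_subs = [0.40, 0.35, 0.50, 0.45]
without_subs = [0.90, 0.85, 0.95, 0.90]
mean_diff, t_stat = paired_difference(with_subs, without_subs)
```

A negative `mean_diff` in this setup would indicate that less fixation time went to the video content when subtitles were present.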


We asked our participants a set of questions about each video. These questions were designed to measure different aspects of attention, and are described in more detail below. The questions were presented in a randomized order that was the same for each participant, and participants were asked to answer them as quickly and accurately as possible. Each question was displayed for a maximum of 2 seconds. Participants were then allowed to take as much time as they needed to answer, after which they pressed the space bar to move on to the next question.
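A question order can be both random and identical for every participant if it is shuffled once with a fixed seed. The sketch below shows that idea. The function name, the seed value, and the question labels are hypothetical, and the text does not say how the order was actually generated.

```python
import random


def question_order(questions, seed=2024):
    """Shuffle the questions with a fixed seed so every participant sees the same order."""
    rng = random.Random(seed)  # private generator; does not touch global random state
    order = list(questions)    # copy so the caller's list is left unchanged
    rng.shuffle(order)
    return order


# Hypothetical question labels.
questions = ["q1", "q2", "q3", "q4", "q5"]
order_for_first_participant = question_order(questions)
order_for_second_participant = question_order(questions)
```

Using a different seed per video would give each video its own order while keeping that order fixed across participants.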