June’s New Questions and Answers from the Chicago Manual of Style

Read this and older Q&A sections of the Chicago Manual of Style at www.chicagomanualofstyle.org/qanda

Did you know Adept’s go-to style guide updates its Q&A section every month with new questions? Were you just wondering whether to list the website as Twitter or X in your citation of a tweet? Then this is your lucky month! You can find the answer below! Here is the latest batch of CMOS’ questions and answers.

Thanks for reading Capturing Voices! Subscribe for free to receive new posts and support my work.

Q. Would it be “the Color Purple musical” or “The Color Purple musical”?

A. The musical version of The Color Purple would be referred to as “the Color Purple musical”—where “the” is part of the surrounding text (and the The in the title has been omitted). A “the” belonging to the text could also be used before a title that doesn’t include an initial The. For example, a musical version of Star Trek might be referred to as “the Star Trek musical.”

Or consider other scenarios where a title that does include an initial The is used attributively (i.e., modifies another noun—like musical in the examples above). If you were to retain the The in the following example (where the title modifies character), the result would be clearly awkward:

Which Great Gatsby character do you dislike most?

not

Which The Great Gatsby character do you dislike most?

There’s no “the” at all in the first version of that example—which would also be true if you were to refer to “a Color Purple musical,” where the indefinite article “a” displaces the definite article The in the title. In general, when the title of a work is used attributively, be prepared to omit an initial The in favor of the surrounding text. See also CMOS 8.169.

Q. My question is regarding CMOS 2.12 on paragraph format—specifically, the directive to “let the word processor determine the breaks at the ends of lines.” This rule is for manuscripts, but I would like to know if it applies to websites. Are there exceptions?

A. Whether your document is a manuscript in Microsoft Word or an article published online as reflowable text, it’s usually best to let lines break where they will. But there are some exceptions in both contexts.

If you use Chicago-style spaced ellipses in your manuscript . . . like that, you’ll want to put a nonbreaking space before and after the middle dot. It isn’t mandatory at the manuscript stage—ellipses are usually formatted by whoever prepares a text for publication—but broken ellipses look bad.
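As a rough illustration of the ellipsis advice above, this Python sketch swaps the ordinary spaces inside a spaced ellipsis for nonbreaking spaces (U+00A0). The function name `protect_ellipses` is hypothetical, and treating a period followed by an ellipsis (four dots) as a single unbreakable run is an assumption, not something the Q&A spells out.

```python
import re

NBSP = "\u00a0"  # Unicode NO-BREAK SPACE

# Three or four dots separated by single ordinary spaces.
# Assumption: a sentence-ending period plus ellipsis (four dots) is protected as one run.
SPACED_ELLIPSIS = re.compile(r"\.(?: \.){2,3}")


def protect_ellipses(text: str) -> str:
    """Replace the spaces between the dots of a spaced ellipsis with nonbreaking spaces."""
    return SPACED_ELLIPSIS.sub(lambda m: m.group(0).replace(" ", NBSP), text)


if __name__ == "__main__":
    # repr() makes the nonbreaking spaces visible as \xa0
    print(repr(protect_ellipses("your manuscript . . . like that")))
```

The space before the first dot and after the last dot is left alone, so the line can still break on either side of the ellipsis but not in the middle of it.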

In the published version of a document—as in an e-book or other reflowable format—there are some additional places where nonbreaking spaces may be added for publication. Some are optional:

  • Between initials in names like E. B. White

  • Between a parenthetical enumerator—e.g., (1) and (2) or (a) and (b)—and the word that follows

  • Between a numeral and an abbreviated unit of measure (e.g., 1 kg)

Others, like the nonbreaking spaces in spaced ellipses, would be required:

  • Between groups of digits in SI-style numerals like 33 333,33 (for 33,333.33), as described in CMOS 9.55

  • Between consecutive single and double quotation marks separated by a space, as described in CMOS 6.11 and in a related post at Shop Talk

For some additional considerations, start with CMOS 6.121 and 7.36.

Q. Would you spell out 150,000?

A. Use numerals for 150,000. The applicable principles are as follows:

  • Spell out numbers one through one hundred (Chicago’s general rule).

  • Spell out multiples of one through one hundred used in combination with hundred, thousand, or hundred thousand.

So you would spell out “five thousand” and “one hundred thousand” but use digits for 150,000—because 150 would normally be rendered as a numeral.
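The two principles above can be sketched as a small Python check. This is only an approximation of the general rule as stated in this answer; the function name `spell_out` is hypothetical, and it deliberately ignores million, billion, and the other cases covered elsewhere in CMOS.

```python
def spell_out(n: int) -> bool:
    """Return True if the general rule described above would spell out n."""
    # Principle 1: one through one hundred are spelled out.
    if 1 <= n <= 100:
        return True
    # Principle 2: a multiple of one through one hundred combined with
    # hundred, thousand, or hundred thousand is spelled out.
    for unit in (100, 1_000, 100_000):
        if n % unit == 0 and 1 <= n // unit <= 100:
            return True
    return False


if __name__ == "__main__":
    for number in (5_000, 100_000, 150_000):
        print(number, "spell out" if spell_out(number) else "use numerals")
```

For 150,000 the only clean division is 150 × 1,000, and 150 falls outside one through one hundred, so the check returns False and the number stays in digits.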

But if you’re following Chicago’s alternative rule of using digits for 10 and up, all such larger numbers are usually given as numerals. Rather than, for example, “fifteen thousand” or “15 thousand,” you’d write 15,000.

For more details, see CMOS 9.2, 9.3, and 9.4. For numbers with million, billion, and so forth, see CMOS 9.8.

Q. I am editing a nineteenth-century American diary, and I often want to omit passages that span a paragraph break. If I use, say, the first sentence of the first paragraph, then the second sentence of the second paragraph, how should it look? Using two ellipses looks weird to me. Or maybe I don’t need to indicate the new paragraph at all?

A. If you’re running the quotation in with the surrounding text instead of presenting it as a block quotation, there’s no need to signal the paragraph break; simply use ellipses for the omitted part as recommended in CMOS 13.50, 13.53, and 13.54. But if you’re using a block quotation (as for one hundred words or more), then show the paragraph break as follows:

Let’s pretend that the words in this extract (which is another term for block quotation) have been reproduced from the beginning of the first paragraph of a quoted source. This is the first paragraph continued, but our quotation is interrupted after this sentence—a break that’s signaled after a sentence-ending period by the three spaced dots of a Chicago-style ellipsis, like this. . . .
 . . . This is the second sentence from the next paragraph of the quoted source. Note how the ellipsis at the beginning of this paragraph (the second ellipsis in this quotation) is preceded by a paragraph indent.

If the second paragraph in the block quotation above had started with the beginning of the quoted paragraph in the original, then the second ellipsis would have been omitted; see CMOS 13.56. But be careful. If the intended meaning of the original text wouldn’t be clear even to readers who haven’t consulted that same source, make adjustments until it is.

Q. I am seeing everywhere now that people are putting acronyms in parentheses instead of words, as in “Food and Drug Administration (FDA)” versus “FDA (Food and Drug Administration).” Can you explain to me why this is becoming more common? Parentheses have always been intended for additional information or words of further explanation, which is the opposite of an acronym. It just seems so backwards to me, and if you’re searching for what the acronym stands for, it’s hard to find because the acronym is in the parentheses and used from then on. Please help me understand the logic people are following with this style.

A. It makes sense to put the abbreviation first when the abbreviation is the better-known term—as is arguably the case for the FDA. But there’s no rule against putting the abbreviation in parentheses. In fact, when you introduce an abbreviation primarily as a space-saving device, the convention is to put the abbreviation in parentheses the first time it appears. For example,

According to the Abbreviation Appreciation Society (AAS) . . .

which is shorthand for

According to the Abbreviation Appreciation Society (which we’ll hereinafter refer to as AAS for the sake of convenience) . . .

And though it’s true that you lose a bit of clarity through abbreviation, there are a couple of strategies that can help readers. First, consider reintroducing the spelled-out term alongside the abbreviation in each new chapter or other major division in which it appears. And if your text features many otherwise unfamiliar abbreviations, consider adding a list as described in CMOS 1.44.

Q. I am citing a specific tweet according to the guidelines in CMOS 14.209. But if the tweet was published before July 2023, should I list the website as Twitter or X? Thanks!

A. Whether it’s a book from the 1970s or a post on social media, sources are generally cited as published. For books, that means recording the publisher’s name as listed on the title page, even if that name has changed or no longer exists. But when you cite an older tweet, the URL in the citation will direct readers to that same tweet (if it hasn’t been deleted) but on what is now called X (and whether the domain is twitter.com or x.com).

To make the situation clear even for readers who may not be aware of the change, add “now X” in parentheses after “Twitter” in your source citation: “. . . Twitter (now X) . . .” A post published after the name change would be cited as having been published on X (no need to add “formerly Twitter”).

Q. After years of using Chicago citation form, I have begun to wonder: What about all the folks who get left out of the citations, who go unrecognized for their work? For example, in a magazine article accompanied by striking and thoughtful illustrations or graphs or pictures, shouldn’t those workers get credit as well as the people who wrote the text? Often it’s those images that stay with us; often they are the only part of an article that people even take in. I guess I can freestyle my citations, but I wondered what your policy on this is. Thanks.

A. Though it’s nice when a footnote gives credit explicitly to one or more creators, the primary purpose of a source citation is to identify—concisely and unambiguously—the source of a quotation or other idea that is not your own. The responsibility for crediting the contributors to such a source lies with the source itself (as on the title page of a book or at the head of an article—or in a credit line that accompanies an illustration).

As you suggest, you can always name additional contributors if you want to. But unless the work of a particular illustrator or other contributor is essential to your reason for having consulted and cited the source—in which case the best place to give credit may be in the text rather than in a source citation—it’s usually best to stick to the basic citation format. Unnamed contributors, including anyone obscured behind et al. (“and others”), will simply have to take comfort in the fact that a source they’ve contributed to has been cited (and, one would hope, consulted).

(The forthcoming 18th edition of CMOS will include an example of how to credit an illustrator in addition to an author in a source citation.)

From the Washington Post Opinion

As the 80th anniversary of D-Day approaches, people of all ages are honoring the Americans who fought in World War II. But a few remaining citizens are remembering fighting in it themselves.

One is former Air Force gunner Mel Jenner, now 102. Follow this link to read the story. We transcribe these stories every day, and it’s amazing how little they have forgotten. This next paragraph from The Washington Post Opinion illustrates what we hear every day. They remember the names, they remember the faces, they remember their laugh, where they were from, where they were heading, the shock of realizing they weren’t coming back. I hope you can follow the link and read the article.

Jenner, as photojournalist David Burnett recounts, is not thinking about the war in the abstract. He is thinking about his best friend, Oscar McClure, then a young gunner as well, and watching as his friend waved his last goodbye from a neighboring plane. Far away from the speeches this week, that’s what D-Day still means to those who were there.

Mel Jenner, a veteran of the U.S. Army Air Corps and the Air Force, at his home in Orlando in March. (David Burnett/Contact Press Images)

Opinion

The B-17 blew apart in an instant. The memory has burned for 80 years.

For waist gunner Mel Jenner, a friend’s farewell in the skies over occupied France has echoed since 1944.

By David Burnett

June 3, 2024 at 6:00 a.m. EDT

When Online Content Disappears

We research words and content for our transcripts. Words that we aren’t sure of, we look up to verify spelling and validity. We got this article from https://www.pewresearch.org/?p=167501. If you’re intrigued, sign up for the Pew Research Newsletter.

Our object is to capture the spoken word, and people make vague references, sometimes to places that no longer exist. Some of the oral histories we transcribe reference events that happened a long time ago—World War II, for example. So we spend a lot of time researching. AI doesn’t verify terms, but we do—we try to verify everything we can.

How they did the study:

How we did this

Pew Research Center conducted the analysis to examine how often online content that once existed becomes inaccessible. One part of the study looks at a representative sample of webpages that existed over the past decade to see how many are still accessible today. For this analysis, we collected a sample of pages from the Common Crawl web repository for each year from 2013 to 2023. We then tried to access those pages to see how many still exist.

A second part of the study looks at the links on existing webpages to see how many of those links are still functional. We did this by collecting a large sample of pages from government websites, news websites and the online encyclopedia Wikipedia.

We identified relevant news domains using data from the audience metrics company comScore and relevant government domains (at multiple levels of government) using data from get.gov, the official administrator for the .gov domain. We collected the news and government pages via Common Crawl and the Wikipedia pages from an archive maintained by the Wikimedia Foundation. For each collection, we identified the links on those pages and followed them to their destination to see what share of those links point to sites that are no longer accessible.

A third part of the study looks at how often individual posts on social media sites are deleted or otherwise removed from public view. We did this by collecting a large sample of public tweets on the social media platform X (then known as Twitter) in real time using the Twitter Streaming API. We then tracked the status of those tweets for a period of three months using the Twitter Search API to monitor how many were still publicly available. Refer to the report methodology for more details.

The internet is an unimaginably vast repository of modern life, with hundreds of billions of indexed webpages. But even as users across the world rely on the web to access books, images, news articles and other resources, this content sometimes disappears from view.

A new Pew Research Center analysis shows just how fleeting online content actually is:

A quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible, as of October 2023. In most cases, this is because an individual page was deleted or removed on an otherwise functional website.

For older content, this trend is even starker. Some 38% of webpages that existed in 2013 are not available today, compared with 8% of pages that existed in 2023.

This “digital decay” occurs in many different online spaces. We examined the links that appear on government and news websites, as well as in the “References” section of Wikipedia pages as of spring 2023. This analysis found that:

23% of news webpages contain at least one broken link, as do 21% of webpages from government sites. News sites with a high level of site traffic and those with less are about equally likely to contain broken links. Local-level government webpages (those belonging to city governments) are especially likely to have broken links.

54% of Wikipedia pages contain at least one link in their “References” section that points to a page that no longer exists.

To see how digital decay plays out on social media, we also collected a real-time sample of tweets during spring 2023 on the social media platform X (then known as Twitter) and followed them for three months. We found that:

Nearly one-in-five tweets are no longer publicly visible on the site just months after being posted. In 60% of these cases, the account that originally posted the tweet was made private, suspended or deleted entirely. In the other 40%, the account holder deleted the individual tweet, but the account itself still existed.

Certain types of tweets tend to go away more often than others. More than 40% of tweets written in Turkish or Arabic are no longer visible on the site within three months of being posted. And tweets from accounts with the default profile settings are especially likely to disappear from public view.

How this report defines inaccessible links and webpages

There are many ways of defining whether something on the internet that used to exist is now inaccessible to people trying to reach it today. For instance, “inaccessible” could mean that:

The page no longer exists on its host server, or the host server itself no longer exists. Someone visiting this type of page would typically receive a variation on the “404 Not Found” server error instead of the content they were looking for.

The page address exists but its content has been changed – sometimes dramatically – from what it was originally.

The page exists but certain users – such as those with blindness or other visual impairments – might find it difficult or impossible to read.

For this report, we focused on the first of these: pages that no longer exist. The other definitions of accessibility are beyond the scope of this research.

Our approach is a straightforward way of measuring whether something online is accessible or not. But even so, there is some ambiguity.

First, there are dozens of status codes indicating a problem that a user might encounter when they try to access a page. Not all of them definitively indicate whether the page is permanently defunct or just temporarily unavailable. Second, for security reasons, many sites actively try to prevent the sort of automated data collection that we used to test our full list of links.

For these reasons, we used the most conservative estimate possible for deciding whether a site was actually accessible or not. We counted pages as inaccessible only if they returned one of nine error codes that definitively indicate that the page and/or its host server no longer exist or have become nonfunctional – regardless of how they are being accessed, and by whom. The full list of error codes that we included in our definition is in the methodology.
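The conservative counting rule described above might look something like the following Python sketch. The report's actual nine error codes are listed in its methodology and are not reproduced here, so the set below (404 and 410) is purely an illustrative assumption, as are the function names.

```python
# Assumption: placeholder codes only. The report's real list of nine codes
# is in its methodology and is not reproduced in this post.
ILLUSTRATIVE_DEFINITIVE_CODES = frozenset({404, 410})


def counts_as_inaccessible(status_code, definitive_codes=ILLUSTRATIVE_DEFINITIVE_CODES):
    """Count a page as inaccessible only if its status code is on the definitive list."""
    return status_code in definitive_codes


def share_inaccessible(status_codes, definitive_codes=ILLUSTRATIVE_DEFINITIVE_CODES):
    """Fraction of checked pages that count as inaccessible under the conservative rule."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no status codes to evaluate")
    broken = sum(counts_as_inaccessible(code, definitive_codes) for code in codes)
    return broken / len(codes)


if __name__ == "__main__":
    # Hypothetical results: 403 (blocked) and 503 (possibly temporary)
    # are NOT counted as inaccessible under this rule.
    print(share_inaccessible([200, 404, 403, 503, 410, 200, 200, 200]))
```

Ambiguous responses, such as a site blocking automated requests, fall through to "accessible," which is what keeps the estimate on the low side.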

Here are some of the findings from our analysis of digital decay in various online spaces.

Webpages from the last decade

To conduct this part of our analysis, we collected a random sample of just under 1 million webpages from the archives of Common Crawl, an internet archive service that periodically collects snapshots of the internet as it exists at different points in time. We sampled pages collected by Common Crawl each year from 2013 through 2023 (approximately 90,000 pages per year) and checked to see if those pages still exist today.

We found that 25% of all the pages we collected from 2013 through 2023 were no longer accessible as of October 2023. This figure is the sum of two different types of broken pages: 16% of pages are individually inaccessible but come from an otherwise functional root-level domain; the other 9% are inaccessible because their entire root domain is no longer functional.
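To make the two-part breakdown concrete, here is a small Python sketch that tallies broken pages by cause. The records are made up so that they reproduce the 16% and 9% figures quoted above; the field layout and the name `breakdown` are assumptions for illustration only.

```python
from collections import Counter


def breakdown(records):
    """Tally pages by status.

    Each record is a (page_works, domain_works) pair of booleans.
    """
    tally = Counter()
    for page_works, domain_works in records:
        if page_works:
            tally["accessible"] += 1
        elif domain_works:
            tally["page_only"] += 1      # page gone, root domain still functional
        else:
            tally["whole_domain"] += 1   # entire root domain no longer functional
    return tally


if __name__ == "__main__":
    # Hypothetical sample of 100 pages matching the shares in the text.
    sample = [(True, True)] * 75 + [(False, True)] * 16 + [(False, False)] * 9
    tally = breakdown(sample)
    broken = tally["page_only"] + tally["whole_domain"]
    print(tally["page_only"], tally["whole_domain"], broken)
```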

Not surprisingly, the older snapshots in our collection had the largest share of inaccessible links. Of the pages collected from the 2013 snapshot, 38% were no longer accessible in 2023. But even for pages collected in the 2021 snapshot, about one-in-five were no longer accessible just two years later.

Links on government websites

We sampled around 500,000 pages from government websites using the Common Crawl March/April 2023 snapshot of the internet, including a mix of different levels of government (federal, state, local and others). We found every link on each page and followed a random selection of those links to their destination to see if the pages they refer to still exist.

The vast majority go to secure HTTP pages (and have a URL starting with “https://”).

6% go to a static file, like a PDF document.

16% now redirect to a different URL than the one they originally pointed to.

When we followed these links, we found that 6% point to pages that are no longer accessible. Similar shares of internal and external links are no longer functional.

Links on news websites

For this analysis, we sampled 500,000 pages from 2,063 websites classified as “News/Information” by the audience metrics firm comScore. The pages were collected from the Common Crawl March/April 2023 snapshot of the internet.

Reference links on Wikipedia

For this analysis, we collected a random sample of 50,000 English-language Wikipedia pages and examined the links in their “References” section. The vast majority of these pages (82%) contain at least one reference link – that is, one that directs the reader to a webpage other than Wikipedia itself.

Posts on Twitter

For this analysis, we collected nearly 5 million tweets posted from March 8 to April 27, 2023, on the social media platform X, which at the time was known as Twitter. We did this using Twitter’s Streaming API, collecting 3,000 public tweets every 30 minutes in real time. This provided us with a representative sample of all tweets posted on the platform during that period. We monitored those tweets until June 15, 2023, and checked each day to see if they were still available on the site or not.

Which tweets tend to disappear?

Tweets were especially likely to be deleted or removed over the course of our collection period if they were:

Written in certain languages. Nearly half of all the Turkish-language tweets we collected – and a slightly smaller share of those written in Arabic – were no longer available at the end of the tracking period.

Posted by accounts using the site’s default profile settings. More than half of tweets from accounts using the default profile image were no longer available at the end of the tracking period, as were more than a third from accounts with a default bio field. Tweets from these accounts tend to disappear because the entire account has been deleted or made private, as opposed to the individual tweet being deleted.

Posted by unverified accounts.

We also found that removed or deleted tweets tended to come from newer accounts with relatively few followers and modest activity on the site. On average, tweets that were no longer visible on the site were posted by accounts around eight months younger than those whose tweets stayed on the site.

1% of tweets are removed within one hour

3% within a day

10% within a week

15% within a month

Put another way: Half of tweets that are eventually removed from the platform are unavailable within the first six days of being posted. And 90% of these tweets are unavailable within 46 days.
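A statement like "half of removed tweets are gone within six days" is a percentile of the time-to-removal distribution. The Python sketch below shows one way such a figure could be computed; the list of removal times is invented for illustration and is not the report's data, and the function name is hypothetical.

```python
def day_by_which_share_removed(removal_days, percent):
    """Smallest number of days by which at least `percent` of removed tweets were gone.

    removal_days: days between posting and removal, one entry per removed tweet.
    percent: whole-number percentage between 1 and 100.
    """
    if not removal_days or not 0 < percent <= 100:
        raise ValueError("need at least one removal time and a percent in 1..100")
    ordered = sorted(removal_days)
    # Ceiling of percent * n / 100 using integer arithmetic only.
    k = (percent * len(ordered) + 99) // 100
    return ordered[k - 1]


if __name__ == "__main__":
    # Hypothetical removal times, in days, for ten removed tweets.
    hypothetical_days = [1, 2, 3, 6, 6, 10, 20, 30, 46, 80]
    print(day_by_which_share_removed(hypothetical_days, 50))
    print(day_by_which_share_removed(hypothetical_days, 90))
```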

When Online Content Disappears

38% of webpages that existed in 2013 are no longer accessible a decade later

By Athena Chapekis, Samuel Bestvater, Emma Remy and Gonzalo Rivero

We research words and content for our transcripts. Words that we aren’t sure of, we look up to verify spelling and validity. We got this article from https://www.pewresearch.org/?p=167501. If you’re intrigued, sign up for the Pew Research Newsletter.

Our object is to capture the spoken word, and people make vague references, sometimes to places that no longer exist. Some of the oral histories we transcribe reference events that happened a long time ago—World War II, for example. So we spend a lot of time researching. AI doesn’t verify terms, but we do—we try to verify everything we can.

How they did the study:

How we did this

Pew Research Center conducted the analysis to examine how often online content that once existed becomes inaccessible. One part of the study looks at a representative sample of webpages that existed over the past decade to see how many are still accessible today. For this analysis, we collected a sample of pages from the Common Crawl web repository for each year from 2013 to 2023. We then tried to access those pages to see how many still exist.

A second part of the study looks at the links on existing webpages to see how many of those links are still functional. We did this by collecting a large sample of pages from government websites, news websites and the online encyclopedia Wikipedia.

We identified relevant news domains using data from the audience metrics company comScore and relevant government domains (at multiple levels of government) using data from get.gov, the official administrator for the .gov domain. We collected the news and government pages via Common Crawl and the Wikipedia pages from an archive maintained by the Wikimedia Foundation. For each collection, we identified the links on those pages and followed them to their destination to see what share of those links point to sites that are no longer accessible.

A third part of the study looks at how often individual posts on social media sites are deleted or otherwise removed from public view. We did this by collecting a large sample of public tweets on the social media platform X (then known as Twitter) in real time using the Twitter Streaming API. We then tracked the status of those tweets for a period of three months using the Twitter Search API to monitor how many were still publicly available. Refer to the report methodology for more details.

The internet is an unimaginably vast repository of modern life, with hundreds of billions of indexed webpages. But even as users across the world rely on the web to access books, images, news articles and other resources, this content sometimes disappears from view.

A new Pew Research Center analysis shows just how fleeting online content actually is:

  • A quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible, as of October 2023. In most cases, this is because an individual page was deleted or removed on an otherwise functional website.

  • For older content, this trend is even starker. Some 38% of webpages that existed in 2013 are not available today, compared with 8% of pages that existed in 2023.

This “digital decay” occurs in many different online spaces. We examined the links that appear on government and news websites, as well as in the “References” section of Wikipedia pages as of spring 2023. This analysis found that:

  • 23% of news webpages contain at least one broken link, as do 21% of webpages from government sites. News sites with a high level of site traffic and those with less are about equally likely to contain broken links. Local-level government webpages (those belonging to city governments) are especially likely to have broken links.

  • 54% of Wikipedia pages contain at least one link in their “References” section that points to a page that no longer exists.

To see how digital decay plays out on social media, we also collected a real-time sample of tweets during spring 2023 on the social media platform X (then known as Twitter) and followed them for three months. We found that:

  • Nearly one-in-five tweets are no longer publicly visible on the site just months after being posted. In 60% of these cases, the account that originally posted the tweet was made private, suspended or deleted entirely. In the other 40%, the account holder deleted the individual tweet, but the account itself still existed.

  • Certain types of tweets tend to go away more often than others. More than 40% of tweets written in Turkish or Arabic are no longer visible on the site within three months of being posted. And tweets from accounts with the default profile settings are especially likely to disappear from public view.

How this report defines inaccessible links and webpages

There are many ways of defining whether something on the internet that used to exist is now inaccessible to people trying to reach it today. For instance, “inaccessible” could mean that:

  • The page no longer exists on its host server, or the host server itself no longer exists. Someone visiting this type of page would typically receive a variation on the “404 Not Found” server error instead of the content they were looking for.

  • The page address exists but its content has been changed – sometimes dramatically – from what it was originally.

  • The page exists but certain users – such as those with blindness or other visual impairments – might find it difficult or impossible to read.

For this report, we focused on the first of these: pages that no longer exist. The other definitions of accessibility are beyond the scope of this research.

Our approach is a straightforward way of measuring whether something online is accessible or not. But even so, there is some ambiguity.

First, there are dozens of status codes indicating a problem that a user might encounter when they try to access a page. Not all of them definitively indicate whether the page is permanently defunct or just temporarily unavailable. Second, for security reasons, many sites actively try to prevent the sort of automated data collection that we used to test our full list of links.

For these reasons, we used the most conservative estimate possible for deciding whether a site was actually accessible or not. We counted pages as inaccessible only if they returned one of nine error codes that definitively indicate that the page and/or its host server no longer exist or have become nonfunctional – regardless of how they are being accessed, and by whom. The full list of error codes that we included in our definition is in the methodology.
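As a rough illustration of this conservative rule, the sketch below counts a page as inaccessible only when its response falls in a fixed set of definitive failures. The report's nine codes appear only in its methodology, so the set used here (404 and 410, plus a flag for a host that cannot be resolved) is an illustrative assumption, as are the function and parameter names.

```python
from typing import Optional

# Illustrative stand-in only: the report's nine error codes are listed in its
# methodology, not here. 404 (Not Found) and 410 (Gone) are standard HTTP codes.
DEFINITIVE_FAILURE_CODES = frozenset({404, 410})


def is_inaccessible(status_code: Optional[int], host_resolved: bool = True) -> bool:
    """Return True only for responses that definitively signal a dead page."""
    if not host_resolved:
        # Assumption: an unresolvable host means the server no longer exists.
        return True
    if status_code is None:
        # No response at all (for example, a timeout) is ambiguous: not counted.
        return False
    return status_code in DEFINITIVE_FAILURE_CODES


print(is_inaccessible(404))  # definitive failure
print(is_inaccessible(503))  # may be temporary, so not counted
```

Under this rule, ambiguous responses such as a 503 or a 403 from a site that blocks automated requests are treated as accessible, which keeps the estimate on the low side.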

Here are some of the findings from our analysis of digital decay in various online spaces.

Webpages from the last decade

To conduct this part of our analysis, we collected a random sample of just under 1 million webpages from the archives of Common Crawl, an internet archive service that periodically collects snapshots of the internet as it exists at different points in time. We sampled pages collected by Common Crawl each year from 2013 through 2023 (approximately 90,000 pages per year) and checked to see if those pages still exist today.

We found that 25% of all the pages we collected from 2013 through 2023 were no longer accessible as of October 2023. This figure is the sum of two different types of broken pages: 16% of pages are individually inaccessible but come from an otherwise functional root-level domain; the other 9% are inaccessible because their entire root domain is no longer functional.
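The way the two types of broken pages add up can be sketched as a simple tally. The labels, the function name and the 100-page sample below are hypothetical and chosen only to mirror the reported shares. They are not the Center's data.

```python
from collections import Counter


def broken_shares(outcomes):
    """Return (page_level, domain_level, total) shares of broken pages.

    `outcomes` is an iterable of labels: "ok", "page_dead" or "domain_dead".
    """
    counts = Counter(outcomes)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no outcomes to summarize")
    page = counts["page_dead"] / n
    domain = counts["domain_dead"] / n
    return page, domain, page + domain


# Hypothetical sample of 100 pages shaped to match the reported shares.
sample = ["page_dead"] * 16 + ["domain_dead"] * 9 + ["ok"] * 75
page, domain, total = broken_shares(sample)
print(f"{page:.0%} page-level + {domain:.0%} domain-level = {total:.0%} inaccessible")
```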

Not surprisingly, the older snapshots in our collection had the largest share of inaccessible links. Of the pages collected from the 2013 snapshot, 38% were no longer accessible in 2023. But even for pages collected in the 2021 snapshot, about one-in-five were no longer accessible just two years later.

Links on government websites

We sampled around 500,000 pages from government websites using the Common Crawl March/April 2023 snapshot of the internet, including a mix of different levels of government (federal, state, local and others). We found every link on each page and followed a random selection of those links to their destination to see if the pages they refer to still exist.

  • The vast majority go to secure HTTP pages (and have a URL starting with “https://”).

  • 6% go to a static file, like a PDF document.

  • 16% now redirect to a different URL than the one they originally pointed to.

When we followed these links, we found that 6% point to pages that are no longer accessible. Similar shares of internal and external links are no longer functional.
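The "find every link, then follow a random selection" step might look something like the sketch below, which uses only Python's standard library. The class and function names are hypothetical, and treating a link as internal when it shares the page's host is an assumption. The report does not spell out its rule here.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)  # resolve relative links
            if urlparse(url).scheme in ("http", "https"):
                self.links.append(url)


def sample_links(html, base_url, k, seed=0):
    """Find every link on a page, then return a random selection of up to k."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    rng = random.Random(seed)  # seeded so the selection is repeatable
    return rng.sample(collector.links, min(k, len(collector.links)))


def is_internal(link, base_url):
    """Assumption: a link is internal when it stays on the page's own host."""
    return urlparse(link).netloc == urlparse(base_url).netloc


page = '<a href="/about">About</a> <a href="https://example.org/x.pdf">PDF</a>'
for link in sample_links(page, "https://example.gov/home", k=2):
    print(link, "internal" if is_internal(link, "https://example.gov/home") else "external")
```

Each sampled link would then be requested and its response checked against whatever definition of "inaccessible" the analysis uses.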

Links on news websites

For this analysis, we sampled 500,000 pages from 2,063 websites classified as “News/Information” by the audience metrics firm comScore. The pages were collected from the Common Crawl March/April 2023 snapshot of the internet.

Reference links on Wikipedia

For this analysis, we collected a random sample of 50,000 English-language Wikipedia pages and examined the links in their “References” section. The vast majority of these pages (82%) contain at least one reference link – that is, one that directs the reader to a webpage other than Wikipedia itself.

Posts on Twitter

For this analysis, we collected nearly 5 million tweets posted from March 8 to April 27, 2023, on the social media platform X, which at the time was known as Twitter. We did this using Twitter’s Streaming API, collecting 3,000 public tweets every 30 minutes in real time. This provided us with a representative sample of all tweets posted on the platform during that period. We monitored those tweets until June 15, 2023, and checked each day to see if they were still available on the site or not.

Which tweets tend to disappear?

Tweets were especially likely to be deleted or removed over the course of our collection period if they were:

  • Written in certain languages. Nearly half of all the Turkish-language tweets we collected – and a slightly smaller share of those written in Arabic – were no longer available at the end of the tracking period.

  • Posted by accounts using the site’s default profile settings. More than half of tweets from accounts using the default profile image were no longer available at the end of the tracking period, as were more than a third from accounts with a default bio field. Tweets from these accounts tend to disappear because the entire account has been deleted or made private, as opposed to the individual tweet being deleted.

  • Posted by unverified accounts.

We also found that removed or deleted tweets tended to come from newer accounts with relatively few followers and modest activity on the site. On average, tweets that were no longer visible on the site were posted by accounts around eight months younger than those whose tweets stayed on the site.

  • 1% of tweets are removed within one hour

  • 3% within a day

  • 10% within a week

  • 15% within a month

Put another way: Half of tweets that are eventually removed from the platform are unavailable within the first six days of being posted. And 90% of these tweets are unavailable within 46 days.
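The two ways of framing these numbers use different denominators: the cumulative shares are out of all tweets collected, while the "half within six days" figure is a median among removed tweets only. A minimal sketch of both calculations follows. The function name, the 30-day month and the toy input are assumptions for illustration, not the Center's data or code.

```python
from statistics import median


def removal_summary(hours_to_removal, total_tweets):
    """Summarize removal timing.

    `hours_to_removal` holds one value per removed tweet. Tweets that were
    never removed are represented only through `total_tweets`.
    """
    # Assumption: a "month" is treated as 30 days.
    thresholds = {"hour": 1, "day": 24, "week": 24 * 7, "month": 24 * 30}
    # Cumulative shares use ALL collected tweets as the denominator.
    shares = {
        name: sum(h <= limit for h in hours_to_removal) / total_tweets
        for name, limit in thresholds.items()
    }
    # The median uses only the tweets that were eventually removed.
    median_days = median(hours_to_removal) / 24
    return shares, median_days


# Toy input: 10 tweets collected, 4 of them removed after these many hours.
shares, median_days = removal_summary([0.5, 20, 100, 500], total_tweets=10)
print(shares)
print(median_days)
```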