Abstract
The opioid epidemic persists in the U.S., with over 80,000 deaths annually since 2021, primarily driven by synthetic opioids. Responding to this evolving epidemic requires reliable and timely information. One source of data is social media platforms. We assessed the utility of Reddit data for surveillance, covering heroin, prescription, and synthetic drugs. We built a natural language processing pipeline to identify opioid-related content and created a cohort of 1,689,039 Reddit users, each assigned to a state based on their previous Reddit activity. We measured their opioid-related posts over time and compared rates against CDC overdose and NFLIS report rates. To simulate the real-world prediction of synthetic opioid overdose rates, we added near real-time Reddit data to a model relying on CDC mortality data with a typical 6-month reporting lag. Reddit data significantly improved the prediction accuracy of overdose rates. This work suggests that social media can help monitor drug epidemics.
Introduction
The United States is currently experiencing an opioid epidemic. The number of opioid-related deaths has been rising almost every year since 1999, increasing significantly in 2021 and 20221. Much of this increase has been driven by synthetic opioids, such as illicitly manufactured fentanyl and its analogs. Fentanyl is 50 times stronger than heroin and 100 times stronger than morphine1. Overdose rates from synthetic opioids have increased by over 22% from 2020 to 2022, with the 2021 annual rate of 73,838 deaths per US population being 22 times greater than the 2013 rate1. While fentanyl overdose rates have decreased slightly in 2023, these rates remain very high relative to past years1,2. To facilitate a more effective response to this epidemic, public health agencies must be able to quickly identify, monitor, and address both established and emerging patterns of non-medical use3.
To that end, the U.S. spends billions of dollars every year to conduct public health surveillance, and this surveillance relies on the integration of data from many different sources. For instance, the National Institute on Drug Abuse (NIDA) has several existing systems for tracking rates of non-medical drug use at the population level. NIDA integrates information on drug use from community epidemiologists and compiles annual surveys of thousands of students from hundreds of schools4. The National Vital Statistics program at the CDC collects data on deaths from regional entities within the United States, identifying causes of mortality, including drug overdoses5. The National Forensic Laboratory Information System (NFLIS) monitors national drug patterns through reports from state and local forensic labs6.
These systems are accepted by research and policy communities; they continue to grow in scope and have established benchmarks, methods, and resources for analysis. Despite these benefits, these systems have limitations in monitoring ongoing and developing drug epidemics. Specifically: (1) they are not real-time, (2) some survey outcomes are subject to interpretation and rely on an expert’s perception of another person’s health (e.g., that of an epidemiologist, doctor, or social worker), (3) they are sparse in terms of geographic, demographic, and/or disease coverage, (4) they are constrained in their ability to discover novel and unexpected trends, and (5) they rely on data collection methods that can be biased7. For example, a recent study found that the National Survey on Drug Use and Health (NSDUH) general population survey of 70,000 participants was inadequate for monitoring heroin use and described many of the above limitations8.
The U.S. Food and Drug Administration (FDA) and others have identified social media as a potentially useful source of real-world evidence for drug monitoring9,10. However, public health efforts have been slow to adopt social media surveillance due to challenges currently posed by the data. Social media datasets can be very large, and most of their content is not relevant to public health; mining and analyzing the data therefore requires additional expertise and resources11. Additionally, evidence of the utility of social media data is still relatively limited compared to evidence for traditional methods of public health surveillance.
Social media could offer unique benefits to epidemic monitoring, such as access to an observational data source that is potentially uncorrelated with the data streams currently used in public health surveillance. Moreover, the pseudo-anonymous nature of these platforms can make some users more inclined to discuss stigmatized behaviors and their health issues12,13. The data collected from social media are mostly first-person reports and are thus not subject to potential bias from expert interpretation. Users’ posts often include additional details about their lives or environments, providing contextual information often lacking from other reporting systems. Finally, social media data are available both retroactively and rapidly, potentially providing a robust longitudinal data set to identify trends.
The majority of research on drug and disease surveillance in social media has focused on Twitter (recently renamed X), although some research has used Reddit, Facebook, Instagram, or smaller online discussion forums14. A review article by Sarker et al. found over 1,000 articles since 2012 related to social media and drug use monitoring15. Several studies used Twitter data for geolocation, as Twitter post metadata can be used to infer a county-level location. Katsuki et al. found an association between prescription non-medical drug misuse and illicit online pharmacies on Twitter16. Chary et al. used a set of keywords to scan Twitter for opioid mentions and filtered the resulting mentions based on semantic distance to a set of manually labeled tweets. They found a strong correlation between normalized Twitter frequency values and opioid use statistics from the NSDUH17. A recent study by Sarker et al. followed 9006 Twitter posts from Pennsylvania and demonstrated that machine learning could derive drug use tweet rates that were correlated with county-level opioid overdose death rates18. Twitter data can be constrained by the character limit of posts and the (previously) increased oversight and decreased anonymity relative to Reddit14. Thus, some studies source Reddit data to study opioid use discussions online19. Pandrekar et al. used 51,537 opioid-related posts, performed topic modeling, and found the psychological categories of the opioid posts20. Chancellor et al. used Reddit posts about opioid recovery and discovered potential alternative treatments21. Garg et al. determined individual risk levels of fentanyl use using a machine-learning model trained on Reddit data22. Harrigian et al. attempted the first geolocation of Reddit posts using text-based heuristic schema with 45% accuracy on US data23. 
These prior studies have not established the usefulness of Reddit data to improve the accuracy of CDC-based forecasting of opioid overdose death rates at different geographic scales.
We aimed to benchmark the value of Reddit as an input to a time series prediction system. Reddit, a pseudo-anonymous platform, was founded in 2005 and has been in the top ten most-visited websites in the United States for over ten years24. User activity on the Reddit platform occurs in topic-oriented subreddits, denoted with a leading ‘r/’, where users can congregate to discuss a particular topic of interest. Community moderators and users work to keep discussions focused on particular topics. For example, ‘r/addiction’ is a subreddit where community members discuss and support each other in overcoming their addictions. In 2022, there were over 57 million daily active users on Reddit who contributed over 422 million posts and 2.86 billion comments25. Unlike other social media platforms like Twitter and Facebook, the majority of the historical activity on Reddit was archived and is publicly available26. Overall, 11% of adults in the U.S. report using Reddit27. The user base is skewed towards males, younger individuals, White individuals, those of Hispanic descent, and those that have at least some college education27. Although these demographics differ from those of the United States overall and would be limiting in the study of drugs used primarily in older populations, they may be sufficient for monitoring the rate of drug use in represented groups.
In this study, we create a cohort of 1,689,039 users and follow them over a period of 10+ years, measuring drug-related discussion over time. We create an approximate geolocation algorithm to map users to their likely locations. We examine the overall discussion rate of opioids on the Reddit platform, including that of individual drugs and drugs grouped by opioid class. We demonstrate that our social media-derived synthetic opioid trends are comparable to both CDC overdose death rates and NFLIS drug report rates across national and regional analyses. We also examine the utility of Reddit data in the predictive modeling of overdose death rates using autoregressive models. We focus on the relatively recent rise in discussions of synthetic opioids, including fentanyl, a powerful and deadly synthetic opioid that has risen in use in the United States. Finally, we examine recent changes during the COVID-19 pandemic. Our work demonstrates the utility of monitoring social media for conducting timely surveillance of the opioid epidemic.
Results
Summary statistics
Our final comment dataset included 6,344,026 opioid mentions from 6,065,600 comments. Heroin (and synonyms from the RedMed drug term lexicon) is the most frequently mentioned opioid on Reddit, followed by morphine, fentanyl, and oxycodone (Table 1, Fig. 1)28. We categorized mentions by opioid class. Opioids are either naturally occurring or synthesized and act as either agonists or antagonists (Fig. 1). Opioids that are agonists have the highest prevalence of non-medical use7. Fentanyl was the most frequently mentioned agonist, and Naloxone was the most frequently mentioned antagonist. Partial agonists and mixed agonists were mentioned less.
a The discussion about opioids on Reddit over time for the top 10 most discussed opioids. Fentanyl has become the most frequently mentioned non-heroin opioid in recent months. b Comment counts for opioids categorized by mu-opioid receptor activity. Mentions of full agonist opioids, those with the highest pain relief and non-medical use potential, have been on the rise since 2010 at a rate similar to heroin mentions and fully surpassing heroin mentions by mid-2017. This change in rates is not identifiable from analyzing the individual opioids.
Cohort statistics
We mapped 1,689,039 unique users to 46 states by exploiting the fact that some Reddit users have also posted in location-specific subreddits (e.g., r/Philadelphia). Within this cohort, 258,591 users mentioned an opioid on Reddit at some point in their comment history. Kendall’s Tau correlation between a state’s estimated 2020 population and our number of observed users was 0.595 (Supplementary Fig. 1). We followed this group from 2009 to December 2022, during which time they posted about 1.7 billion comments on Reddit.
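As a rough illustration of the rank correlation reported above, the following sketch computes Kendall's Tau between a population list and a user-count list. The numbers are invented for illustration, and the choice of the tau-b variant (which handles ties) is an assumption; the text does not state which variant was used.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: rank agreement between two paired lists, allowing ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                 # tied in both lists: counted nowhere
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1          # pair ordered the same way in both lists
            else:
                discordant += 1          # pair ordered in opposite ways
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x_only) * (untied + ties_y_only))

# Hypothetical values: state populations (millions) and mapped Reddit users (thousands).
state_population = [10, 20, 30, 40, 50]
mapped_users = [1, 3, 2, 5, 4]
tau = kendall_tau_b(state_population, mapped_users)
```

A value near 1 means that larger states tend to contribute more mapped users; the pairwise loop is quadratic, which is acceptable for a few dozen states.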
Benchmark: CDC vital statistics overdose data
We compared 12-month trailing (that is, aggregation across a 12-month rolling window) U.S. opioid comment rates on Reddit against 12-month trailing opioid overdose death rates for different geographies (U.S., region) and drug categories (Fig. 2). We explored lead/lag differences between the two time series using cross-correlations for different lag offsets (Fig. 2). We found similar trends between Reddit comment rates and overdose death rates, especially for synthetic opioids (with a simple time series cross-correlation of r = 0.89 with no lag). To guard against spuriously elevated correlations of time series data, we detrended the time series (first-order differencing), resulting in a cross-correlation of r = 0.59, which was still the highest with no lag. Stationarity of the differenced series was confirmed through the augmented Dickey-Fuller unit root test (p < 0.001) for synthetic opioids in both the CDC and Reddit time series, which supports reading the remaining correlation as more than a shared trend.
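The three operations described above (trailing 12-month aggregation, cross-correlation at different lag offsets, and first-order differencing) can be sketched as follows. This is a minimal standard-library illustration, not the pipeline used in the study: the function names and the toy series are assumptions, and the augmented Dickey-Fuller test is omitted because it would normally come from a statistics package.

```python
import math

def trailing_sum(values, window=12):
    """Sum over a trailing window; None until a full window is available."""
    return [None if i + 1 < window else sum(values[i + 1 - window:i + 1])
            for i in range(len(values))]

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlation(reddit, cdc, lag):
    """Correlate reddit[t] with cdc[t + lag]; a positive lag means CDC trails Reddit."""
    if lag >= 0:
        a, b = reddit[:len(reddit) - lag], cdc[lag:]
    else:
        a, b = reddit[-lag:], cdc[:len(cdc) + lag]
    return pearson(a, b)

def first_difference(series):
    """First-order differencing, used here as a simple detrending step."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Invented toy series: "cdc" is "reddit" delayed by two months.
reddit = [math.sin(0.5 * t) + 0.05 * t for t in range(48)]
cdc = [reddit[0], reddit[0]] + reddit[:-2]

correlations = {lag: lagged_correlation(reddit, cdc, lag) for lag in range(-6, 7)}
best_lag = max(correlations, key=correlations.get)

# Zero-lag correlation after detrending both series.
detrended_r = pearson(first_difference(reddit), first_difference(cdc))
```

In this toy setup the best lag recovers the two-month delay that was built into the data. On real series, the differenced correlation is the more conservative number because any shared linear trend has been removed.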
a–c 12-month trailing monthly U.S. opioid overdose CDC death rates per 10,000 people (top panel of subplots, in blue), and monthly opioid comment rates on Reddit per 10,000 total comments (bottom panel of subplots, in orange) for different drug categories, along with d–f: the results of 6-month lead/lag cross-correlation analysis. The cross-correlation analysis indicates that for synthetic opioids, a lead/lag of zero yields the best correlation, indicating that while the magnitude of changes may differ, the shape of the Reddit comment rate over time vs. the benchmark over time is relatively synchronous. The same trend is observed for heroin. Natural and Semi-synthetic CDC trends lag behind Reddit trends by about 5 months. The COVID-19 pandemic is shaded in red from January 2020 to December 2021 and is excluded from the cross-correlation analysis.
We also examined heroin and natural/semi-synthetic opioids and found similar trends for all years except 2020 and 2021. In general, Reddit posts seemed to be less accurate proxies for these substance classes than they were for synthetic opioids. Heroin had a correlation value of r = 0.71 with no lag, while Natural and Semi-synthetic opioids had a correlation of r = 0.62 when CDC rates trailed Reddit rates by five months (Fig. 2). After de-trending (first-order differencing), the correlation values became r = 0.60 for Heroin and r = 0.25 for natural and semi-synthetic drugs (stationarity was confirmed through augmented Dickey-Fuller tests in all cases, p’s < 0.05). While r values lowered after de-trending, the remaining time series were still positively correlated.
Benchmark: NFLIS data
We compared social media activity associated with opioid terms with National Forensic Laboratory Information System (NFLIS) drug report rates over time, which are semi-annual published reports that provide the estimated number of total drug reports submitted to laboratories with region-level geographic resolution (Fig. 3). We compared semi-annual Reddit opioid activity with semi-annual NFLIS drug report rates for 2014–2022 for fentanyl, heroin, and hydrocodone as other key drugs for the opioid epidemic. We compared rates over time at the national level and across the Midwest, West, South, and Northeast regions. For fentanyl, we observed Reddit trends to match NFLIS trends across all regions. All trend lines show divergence between Reddit and NFLIS during the COVID-19 pandemic in 2020 and 2021, which appeared to resolve by 2022.
Fentanyl (a), heroin (b), and hydrocodone (c) semi-annual NFLIS opioid report rates per 100,000 people (orange line, left axes) and semi-annual Reddit opioid comment rates per 100,000 total comments (blue line, right axes) from 2014 to 2022, plotted at the national (left) and regional levels. We observe similar trends between Reddit and NFLIS rates for all drugs except during the COVID-19 pandemic, shaded in red from January 2020 to December 2021.
Reddit and NFLIS time series for fentanyl correlated at r = 0.91; after detrending through first-order differencing, this reduced to r = 0.41, suggesting similarity of the time series even after the general trends were removed (stationarity confirmed through an augmented Dickey-Fuller test, p < 0.05). For heroin and hydrocodone, Reddit and NFLIS time series correlated at r = 0.63 and r = 0.77, respectively, but could not be detrended (stationarity could not be achieved using first or second-order differencing transformations).
Time Series Predictive Modeling
In real-world applications, CDC mortality estimates are available with at least six months of delay and thus are only available for the prediction of future mortality with such a lag. Social media data, however, is available in near real-time, which may yield increased predictive accuracy when added to the lagged CDC data. Thus, we assessed the utility of the Reddit data at the national level in predicting CDC overdose death rates using an autoregressive integrated moving average model (ARIMA) from 2015 through 2022. We used rolling origin forecasts to estimate the absolute error at 1-month prediction horizons of overdose death rates. We compared the performance of 6-month-lagged CDC data alone and in combination with 1-month-lagged Reddit data in forecasting monthly overdose death rates per 10,000 people (Fig. 4). We found that adding near real-time Reddit data as an exogenous variable significantly improved the model’s prediction accuracy at the national level for synthetic opioids (Wilcoxon signed-rank test of absolute errors, p = 0.019) (Fig. 4). We also report the comparative distribution of absolute errors for both models (Fig. 4). The combination Reddit/CDC model had more instances of lower absolute errors than the CDC model alone. The average absolute error, in normalized monthly opioid overdose deaths per 10,000 people, was 0.0287 for the CDC-only model and 0.0246 for the combination model. In both models, we see notably decreased performance during the COVID-19 years from 2020 until 2022. For ARIMA time series modeling of heroin and natural/semi-synthetic opioids, see Supplementary Figs. 2 and 3, respectively.
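The data-availability structure of this comparison can be sketched with a rolling-origin loop. To stay self-contained, the sketch below swaps the ARIMA model for a plain lagged linear regression fitted by least squares, so it illustrates only the evaluation design (6-month-lagged CDC data, optionally joined by 1-month-lagged Reddit data), not the study's model. All names and the synthetic series are assumptions.

```python
import math

CDC_LAG, REDDIT_LAG = 6, 1   # months of reporting delay assumed for each source

def solve_linear(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, targets):
    """Least squares via the normal equations (adequate for two or three predictors)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)

def features(cdc, reddit, t, use_reddit):
    """Predictors that would be available when forecasting month t."""
    row = [1.0, cdc[t - CDC_LAG]]
    if use_reddit:
        row.append(reddit[t - REDDIT_LAG])
    return row

def rolling_origin_errors(cdc, reddit, first_origin, use_reddit):
    """Refit at every origin using only targets already published (up to t - CDC_LAG)."""
    errors = []
    for t in range(first_origin, len(cdc)):
        train = range(CDC_LAG, t - CDC_LAG + 1)
        rows = [features(cdc, reddit, s, use_reddit) for s in train]
        coef = fit_ols(rows, [cdc[s] for s in train])
        prediction = sum(c * v for c, v in zip(coef, features(cdc, reddit, t, use_reddit)))
        errors.append(abs(prediction - cdc[t]))
    return errors

# Synthetic series in which the death rate follows last month's comment rate exactly.
reddit = [math.sin(0.3 * t) + 0.02 * t for t in range(60)]
cdc = [0.5] + [0.5 + 2.0 * reddit[t - 1] for t in range(1, 60)]

cdc_only_errors = rolling_origin_errors(cdc, reddit, first_origin=30, use_reddit=False)
combined_errors = rolling_origin_errors(cdc, reddit, first_origin=30, use_reddit=True)
mean_cdc_only = sum(cdc_only_errors) / len(cdc_only_errors)
mean_combined = sum(combined_errors) / len(combined_errors)
```

Because the synthetic death rate is constructed from the previous month's comment rate, the combined model wins by design here; real data would show a far smaller gap. The two lists of paired absolute errors are the kind of input that a paired test such as the Wilcoxon signed-rank test would then compare.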
a Methods summary. As an example, to obtain the mortality rate estimate for January 2023, we combined 6-month-lagged CDC data (blue trendline) with 1-month-lagged Reddit data (orange trendline). Grey bars show the observed monthly overdose death rates per 10,000 people as reported by the CDC once available; solid lines represent data incorporated into the prediction, but dashed line data is not included. In this example, the predicted overdose death rate for January 2023 is generated by CDC data from July 2022 alone (left) or in combination with Reddit data from December 2022 (right). b Predicted monthly overdose death rates per 10,000 people of rolling-origin forecast models are shown based on 1-month prediction horizons. Observed mortality is shown in grey, and monthly overdose death rates per 10,000 people as predicted by ARIMA models fitted on 6-month-lagged CDC overdose death (blue) are shown along with predictions from a model that additionally included 1-month-lagged Reddit data (orange). c The absolute errors over time of the monthly overdose deaths per 10,000 people predicted by the lagged CDC model alone (in blue) and the combined CDC/Reddit model (in orange); d shows the corresponding distributions.
Discussion
Since 2013, synthetic opioids like fentanyl have emerged as a major contributor to the opioid epidemic and the total number of opioid deaths1. The COVID-19 pandemic made the opioid epidemic worse, especially among those with pre-existing social, economic, and health disadvantages and potentially due to unemployment and isolation as a result of COVID-19 policies29,30,31. The ubiquity of social media platforms can provide a lens into the drug use of millions of Americans. The continued growth of opioid-related discussions on Reddit demonstrates that social discourse around drug use is expanding online (Fig. 1). In this paper, we benchmark a method for monitoring the online social footprint of drug epidemics spatiotemporally using Reddit. We find that Reddit can be a useful, alternative data stream as the number of opioid mentions (across all linguistic contexts) correlates robustly with other opioid statistics while also providing spatial granularity, long-term continuity, and additional ethnographic information that contextualizes drug mentions.
We found that Reddit comment rates have risen in sync with overdose (CDC) and laboratory (NFLIS) rates at both the national and regional levels and show strong correlations with CDC overdoses even after de-trending. The combined near real-time Reddit synthetic opioid ARIMA model had significantly better accuracy in predicting overdose death rates at the national level compared to the lagged CDC-only model, suggesting Reddit data provides incremental predictive validity.
Specifically, we observed that the Reddit comment rates for synthetic opioids closely follow the CDC benchmark rates over time at the national level (simple correlation with CDC data: r = 0.89; detrended: r = 0.59) (Fig. 2), with similar trends for heroin and less signal for natural and semi-synthetic opioids. For synthetic opioids and heroin, Reddit trends tend to match CDC trends best with no lag. We generally observed strong correlations between CDC and Reddit trends over time (r = 0.60 to r = 0.89), which reduced to r = 0.25 – 0.60 after trend differencing, suggesting that some of the measured correlations were due to linear trends in the data. However, the fact that stationary cross-correlations remained significant demonstrates that Reddit captures drug trend data over and above broad overall trends and points to the usefulness of Reddit as an alternative data stream to monitor the opioid epidemic.
Beyond broad drug categories reported by the CDC, we used data from the National Forensic Laboratory Information System (NFLIS), which reports only semi-annually but reports individual drugs. We found Reddit mentions of fentanyl to be highly similar to semi-annual NFLIS fentanyl laboratory data at both the regional and national levels (r = 0.91, detrended: r = 0.41; Fig. 3). We also observed heroin and hydrocodone comment rates to roughly match NFLIS benchmarks. This robust pattern of correlations indicates that Reddit surveillance may capture the use of single drugs, especially at the national level. This range of drug coverage on Reddit may allow for the monitoring of other drugs of interest that are not currently monitored by the NFLIS, including emerging drugs.
We wanted to assess how incorporating Reddit data might improve real-world drug overuse forecasting. We hypothesized that social media may provide more frequent and more recent trend information than traditional monitoring systems, such as the CDC Vital Statistics Provisional Drug Overdose data, which often lags by at least four months32. However, the wait time is much longer (sometimes years) for finalized overdose drug data to be released by the CDC. To model this scenario where delayed CDC estimates were available, but Reddit data was up to date, we created an ARIMA model for predicting overdose death rates in which the CDC data lagged by six months. We then compared the performance of this model to an ARIMA that included near real-time (1-month lag) Reddit mention rates as an exogenous variable (Fig. 4) and observed significant improvement in the prediction accuracy of overdose death rates from synthetic opioids. This demonstrates the utility of social media data to predict synthetic opioid trends in near real-time using a combination of social media and traditional reporting systems and points to the potential of future systems to build on this work.
These prediction accuracies matter: Fentanyl overdoses have been rising steadily, and a timely response can save lives1. Fentanyl is very potent, often undetectable, and can be laced into other recreational drugs. Fentanyl overdose can be treated using fast-acting opioid antagonists like naloxone, which need to be allocated and distributed as quickly as possible. Our results suggest that Reddit data may become part of an early-warning system that predicts geographic changes in fentanyl use, allowing harm reduction programs to adequately stock and distribute naloxone and deploy drug testing strips and other public health interventions.
In contrast, we observe that incorporating Reddit data into predictive models for heroin and natural/semi-synthetic opioids did not yield significant improvements in predictive accuracy. This may be because both heroin and natural/semi-synthetic opioid trends have been stable or decreasing over the period we were assessing. Future work should evaluate the model’s performance on 2024 fentanyl overdose predictions, given the recent downturn in fentanyl overdose deaths. We also attempted to achieve single-drug resolution in ARIMA models using NFLIS data. However, since NFLIS data is only reported semi-annually, we found that this increased forecast horizon dramatically reduced the prediction performance of the ARIMA models. This demonstrates that the frequency of data collection impacts the quality of the time series models and points to the value of near real-time social media data as a data source.
Using social media as a proxy for opioid overdose rates proved challenging during the COVID-19 pandemic. While societal changes resulted in unemployment and social upheaval, we observed that overall Reddit opioid comment rates increased dramatically (Fig. 1). In terms of a normalized rate metric in which opioid mentions are divided by the total number of comments (as seen in Figs. 2 and 3), we observed the rate to plateau or drop during 2020 and 2021, as other non-opioid discourse ‘drowned out’ opioid mentions. In contrast, the CDC observed that the rate of opioid overdoses drastically accelerated during the COVID-19 pandemic (Fig. 2, top). Reddit comment rates realigned with benchmark trends in 2022 after many pandemic measures had been lifted (and, presumably, Reddit discourse had normalized). These patterns suggest that our method, like many other monitoring systems, is not robust to large-scale behavioral changes and major systemic shocks such as the pandemic33. During this timeframe, patterns of discourse across social media platforms, including Reddit, changed drastically, which affected our normalized rates14. In our time series predictive modeling, we also saw that all models performed better pre-COVID (before 2020). This shows that while Reddit data improved models that included pandemic data, this system is more useful when trends are not affected by global societal upheaval.
Our approach employed a relatively simple location mapping strategy. It is based on the assumption that Reddit users are most likely to comment on the location subreddit for their own location, and it does not account for people who move cities or post in multiple location subreddits; incorrect location information adds noise to our results. Although this assumption may be reasonable at aggregate levels, the creation of geo-located cohorts with more sophisticated location-mapping techniques could yield an improved drug-monitoring system. Combining Reddit data with other social media data streams, such as X (Twitter), could further improve our results, and techniques have evolved to geolocate X users down to the county level34. Moreover, we do not perform a fine dissection of the linguistic content or demographic information surrounding opioid mentions. We elected to build our model using opioid mentions alone because people with opioid use disorder and other affected groups may discuss opioids in a variety of ways, and our rate metric sums across all these mentions. Demographic information is not directly available on Reddit. However, additional linguistic analysis and subreddit analysis can be used to infer location, interests, gender, and other demographic features of interest. This is of interest because the demographics of Reddit users differ from those of the US population, leaning towards younger male users.
Monitoring systems such as ours depend on the continuing availability of data, which has been changing in recent years. Before late 2023, Reddit data could be easily accessed through the PushShift Application Programming Interface (API)26. This was advantageous for a near real-time warning system. However, the PushShift API ceased to be available, and the official Reddit API, although more limited, would be an alternative for accessing new data. However, as of September 2024, up-to-date Reddit archives are circulating on the open web and can be accessed by researchers. Twitter accessibility has also changed since 2022, with more stakeholders asking for access for public health and civic purposes. Our work stresses the importance of social media data as a “common good” for public health and adds force to the demands made by the US Surgeon General in May 2024 and the regulation passed by the EU parliament that requires researchers to be given access to the data of large social media platforms35,36.
There are critical ethical considerations when using online discussions to monitor non-medical drug use, including ensuring that user anonymity is preserved. While Reddit is a pseudo-anonymous platform (some handles contain no identifying information, whereas others may be identifying to some degree), we do not share any usernames in this study, and we do not analyze any user on the individual level in accordance with ethical guidelines37. Our study relies on the creation of a cohort whose Reddit activity spans over a decade, and all analyses are performed aggregated by location. A formal system implemented to monitor social media data continuously with higher spatial resolution would have to balance protecting user anonymity and societal benefit from improved drug use surveillance.
If social media surveillance came to greater public awareness, people who use drugs might change their posting behavior and thereby change the reliability of our results. For example, the community could find or create new private communication platforms. Indeed, there have already been conversations on Reddit about moving such discussions to private chat rooms (such as Telegram and WhatsApp), which would hinder surveillance by third parties38. The degree to which these discussions occur in public versus private domains will be a key determinant of the performance of a potential future surveillance system.
A key limitation of this study is that we do not distinguish between drug mentions and probable drug use (or any behavioral context) when computing drug mention rates. Our results suggest that drug mention frequency is a reasonable indicator of drug use in a region, regardless of why the drugs are mentioned; however, we have not established that these drug mentions come from drug users or their close contacts. It is also conceivable that news cycles impact rates of Reddit drug mentions, but we did not observe such patterns in the data.
Further, the way drugs are referenced on social media is continuously evolving, which the research community is tracking. For example, a recent work by Alhamadani et al. uses the r/drugs and r/opiates subreddits to recognize evolving drug terms39. In addition, the increasing sophistication of Large Language Models (LLMs) may be helpful in assessing the likely context of drug mentions at scale, which may improve accuracy and provide contextual information, which we are exploring in current research. An advantage of the high volume of drug discussions on Reddit combined with the availability of LLMs to streamline Natural Language Processing tasks means that systems can be built relatively cost-effectively and quickly to spot emerging drug trends that are not yet monitored by the CDC or NFLIS – a true early warning system.
In conclusion, we introduce a cohort-based approach for monitoring geographical rates of opioid mentions on Reddit over time, which corresponds reasonably well to established official mortality and laboratory monitoring systems. We develop time-series forecasting models using both CDC and Reddit data, showing that Reddit data adds predictive power to these models, especially for synthetic opioids, such as fentanyl. Social media has the potential to provide near real-time information about evolving drug use trends across a wide variety of known and emerging drugs, and it contains additional ethnographic information that can be used to contextualize drug use. This work has practical utility for researchers and public health authorities currently relying on CDC or NFLIS data for time-sensitive information about changes in opioid death rates. Currently, CDC data is released with a delay from 4 months to 2 years, and our model can help mitigate the challenges in prediction accuracy introduced by this delay. More generally, our work demonstrates that Reddit contains valuable and unique public health information, adding to the growing evidence of the utility of social media for this type of public health surveillance and stressing the importance of researchers continuing to have access to such data14,15,16,17,18,19,20,21,22. With further development, social media surveillance systems could assist in identifying, monitoring, and predicting future drug epidemics.
Materials and methods
Reddit opioid comment database
Reddit comments and their associated metadata from January 2006 to December 2022 were downloaded from Pushshift.io and the Reddit API26. Our initial drug lists came from Lavertu et al. and the WHO’s Anatomical Therapeutic Chemical Classification System40,41,42,43. Recognizing that Reddit users often use non-standard drug vocabularies, we used the RedMed word embedding model, which was trained on a health-oriented subset of Reddit and which provides a lexicon of misspellings and synonyms for different drugs28. We initialized our opioid term list using terms linked to these two drug classes within the RedMed lexicon. To avoid the inclusion of ambiguous terms, we manually filtered the RedMed term list. We curated our opioid mention dataset from Pushshift data by selecting all comments that contained at least one of our opioid search terms.
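The final selection step, keeping every comment that contains at least one search term, can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the term list is a short hypothetical stand-in for the curated RedMed-derived lexicon, the comment dictionaries use an assumed `body` field, and whole-word, case-insensitive matching is an assumption about how terms were matched.

```python
import re

# Hypothetical, abbreviated term list; the curated RedMed-derived lexicon
# used in the study is much larger and is not reproduced here.
OPIOID_TERMS = ["fentanyl", "fent", "oxycodone", "oxy", "heroin"]


def build_term_pattern(terms):
    """Compile one case-insensitive regex that matches any term as a whole word."""
    # Longest terms first so longer terms are tried before their prefixes.
    ordered = sorted(set(terms), key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def select_opioid_comments(comments, pattern):
    """Keep comments whose body contains at least one search term."""
    return [c for c in comments if pattern.search(c["body"])]


comments = [
    {"id": "a1", "body": "Switched from oxy to Fentanyl patches last year."},
    {"id": "a2", "body": "The parent comment is about oxygen tanks."},
    {"id": "a3", "body": "Great hike this weekend!"},
]
pattern = build_term_pattern(OPIOID_TERMS)
opioid_comments = select_opioid_comments(comments, pattern)
print([c["id"] for c in opioid_comments])
```

The word-boundary anchors keep a short term such as "oxy" from matching inside "oxygen", which is one simple way to limit ambiguous hits; manual curation of the term list, as described above, is still needed for terms that are ambiguous as whole words.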
Estimating geolocation and creating a user cohort
Neither Reddit nor the Pushshift API provides location metadata for Reddit users, so we created a simple proxy to estimate user location. We extracted an initial set of location-specific subreddits from a Redditor-curated list44. We manually mapped these subreddits to their city, state, and region. We grouped our results at the national and regional level because we had sufficient Reddit signal at these geographic granularities. We identified all Reddit users who posted at least once in a location-based subreddit during the observation period from 2010 to 2020. We then filtered out users who had posted in multiple location-based subreddits. Our final cohort consisted of all Reddit users who had posted in a single location-based subreddit at least once. We note that the only inclusion criterion for this cohort was our ability to obtain a proxy location for an individual user; it was agnostic to whether they had ever mentioned opioids. We also selected a normalization strategy that minimized the impact inactive users had on the normalized rates: we normalized our opioid discussion rates by the number of comments made by the cohort instead of the number of users in the cohort.
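The cohort rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the subreddit-to-state mapping is a tiny hypothetical example, posts are represented as `(author, subreddit)` pairs, and "a single location-based subreddit" is read as exactly one distinct location subreddit per user.

```python
from collections import defaultdict

# Hypothetical mapping; the study mapped a Redditor-curated list of
# location subreddits to city, state, and region by hand.
SUBREDDIT_TO_STATE = {"seattle": "WA", "boston": "MA", "chicago": "IL"}


def build_location_cohort(posts, subreddit_to_state):
    """Assign a state to each user who posted in exactly one location subreddit."""
    location_subs = defaultdict(set)
    for author, subreddit in posts:
        key = subreddit.lower()
        if key in subreddit_to_state:
            location_subs[author].add(key)
    # Users seen in several location subreddits are dropped as ambiguous.
    return {
        author: subreddit_to_state[next(iter(subs))]
        for author, subs in location_subs.items()
        if len(subs) == 1
    }


posts = [
    ("u1", "Seattle"), ("u1", "AskReddit"), ("u1", "seattle"),
    ("u2", "boston"), ("u2", "chicago"),
    ("u3", "opiates"),
]
cohort = build_location_cohort(posts, SUBREDDIT_TO_STATE)
print(cohort)
```

Here `u1` is kept, `u2` is excluded for posting in two location subreddits, and `u3` never enters the cohort because no proxy location is available. Whether a user ever mentions opioids plays no role in membership.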
Opioid receptor activity-based classes
The opioids labeled as full agonists were morphine, codeine, oxycodone, pethidine, diamorphine, hydromorphone, levorphanol, methadone, fentanyl, sufentanil, remifentanil, tramadol, tapentadol, oxymorphone, and hydrocodone. The opioids labeled as partial agonists were buprenorphine, meptazinol, and loperamide. Mixed agonist opioids were nalorphine, pentazocine, nalbuphine, butorphanol, and dezocine. Opioid antagonists were naloxone, naltrexone, nalmefene, and diprenorphine. Heroin was maintained as an independent opioid class due to the volume of heroin mentions.
Opioid synthesis-based classes
Opioids labeled as synthetic were tramadol and fentanyl. The class of natural and semi-synthetic opioids consisted of morphine, codeine, hydrocodone, oxycodone, oxymorphone, hydromorphone, naloxone, buprenorphine, and naltrexone. Heroin was maintained as an independent opioid class, and the remaining opioids listed above and in Table 1 were grouped into the Opioid class. Methadone was excluded from these classes based on the CDC classification.
CDC vital statistics overdose data
The CDC vital statistics unit has published trailing monthly drug overdose data, which we collected for 2015–2022 for a subset of states and cities in the United States5. The data are 12-month trailing provisional overdose death counts for several different opioid categories (all opioids, heroin, semi-synthetic opioids, and synthetic opioids). To convert the overdose data to overdose death rates, we normalized the opioid death counts by the size of the relevant population (for the matching location and year, based on census data) and expressed the result per 10,000 people.
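As a worked illustration of this normalization, the sketch below converts counts to rates per 10,000 people. The figures are made up for the example and are not CDC or census values; the function name and the `(location, year)` keys are likewise assumptions.

```python
def rate_per_population(count, population, per=10_000):
    """Convert a raw count to a rate per `per` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * per


# Made-up illustrative numbers keyed by (location, year).
deaths = {("State A", 2020): 1_500, ("State B", 2020): 300}
population = {("State A", 2020): 7_500_000, ("State B", 2020): 1_000_000}

death_rates = {
    key: rate_per_population(deaths[key], population[key]) for key in deaths
}
print(death_rates)
```

The same helper covers a per-100,000 normalization by passing `per=100_000`.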
NFLIS data
The NFLIS monitors drug use trends in different communities in the U.S. through reports from state and local forensic laboratories, toxicology reports, and reports from morgues and hospitals6. These reports represent ~98% of the data from the 1.5 million annual U.S. drug cases. The laboratory network includes data from 50 states and 104 local forensic labs. NFLIS provides drug identification rates for 25 drugs that are most commonly identified in national laboratory reports. This data is reported on a semi-annual basis, and we collected data from 2014 to 2022 for the entire U.S. and for each region of the U.S.: South, Midwest, West, and Northeast. To convert the drug report data to drug report rates, we normalized the opioid report counts by the size of the relevant population (for the matching location and year, based on census data) and expressed the result per 100,000 people.
Summary statistics and unnormalized opioid mention counts
We counted the total number of mentions in our opioid database for each drug to create summary statistics for the opioid discussion on Reddit. We counted the number of comments by month for each drug and opioid category and visualized these counts over time (Fig. 1).
Normalized rates: comparing Reddit comment rates with CDC overdose data
To compare the Reddit opioid conversation with CDC overdose data, we calculated 12-month trailing opioid comment rates within our cohort for the geographies we covered and for the categories the CDC reported: all opioids, heroin, semi-synthetic opioids, and synthetic opioids. The numerator for the Reddit opioid comment rate was the 12-month trailing total number of comments (based on the opioid category) for the geography (U.S., region, state, city) and the denominator was the 12-month trailing total number of comments made by users within our location cohort for the geographic area (U.S., region, state, city). We normalized these rates per 10,000 total comments.
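The rate defined above can be sketched as follows, assuming monthly counts are already aggregated for one geography and one opioid category. The function names and the toy counts are illustrative assumptions; months without a full 12-month window are left undefined (`None`) in this sketch.

```python
def trailing_sum(values, window=12):
    """Sum of the current value and the previous window-1 values.

    Returns None for positions that do not yet have a full window.
    """
    out = []
    running = 0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        out.append(running if i >= window - 1 else None)
    return out


def trailing_comment_rate(opioid_counts, total_counts, window=12, per=10_000):
    """Trailing opioid comments per `per` total cohort comments."""
    numerators = trailing_sum(opioid_counts, window)
    denominators = trailing_sum(total_counts, window)
    return [
        None if n is None or d == 0 else n / d * per
        for n, d in zip(numerators, denominators)
    ]


# Toy monthly counts for one geography: 13 months of data.
opioid_counts = [5] * 12 + [11]
total_counts = [10_000] * 13
rates = trailing_comment_rate(opioid_counts, total_counts)
print(rates[11], rates[12])
```

Dividing by total cohort comments rather than by the number of cohort users means that users who stop commenting contribute to neither the numerator nor the denominator.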
We compared the 12-month trailing comment rates and the 12-month trailing overdose rates for each U.S. state. We grouped these rates and reported them for the entire U.S. and regions in the U.S. We visualized the 12-month trailing opioid comment rates and the 12-month trailing opioid overdose rates side by side for the entire U.S. and different regions in the U.S. for 2015–2022 (Fig. 3).
We compared the 12-month trailing comment rates and the 12-month trailing overdose rates for each category, for the entire U.S. and for regions in the U.S., using cross-correlation analysis, i.e., calculating correlations at different leading and lagging time intervals. We excluded the COVID-19 years, 2020 and 2021, from this analysis. We also performed an augmented Dickey-Fuller unit root test to evaluate the stationarity of the data. We then repeated the cross-correlation analysis described above after first- or second-order differencing to achieve stationarity.
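A minimal sketch of differencing followed by cross-correlation at several leads and lags is shown below. It uses only the standard library and a synthetic pair of series in which the second series is the first delayed by two periods; the function names and the sign convention (positive lag means the Reddit series leads) are assumptions of this sketch. The stationarity test itself is not implemented here; in practice it could be run with a statistics package, for example `adfuller` from `statsmodels.tsa.stattools`.

```python
from math import sqrt


def difference(series, order=1):
    """Apply first differencing `order` times."""
    values = list(series)
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def cross_correlation(reddit, cdc, max_lag):
    """Correlate reddit[t] with cdc[t + lag] for each lag in [-max_lag, max_lag].

    A positive lag means the Reddit series leads the CDC series.
    """
    if len(reddit) != len(cdc):
        raise ValueError("series must have equal length")
    n = len(reddit)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = reddit[: n - lag], cdc[lag:]
        else:
            x, y = reddit[-lag:], cdc[: n + lag]
        result[lag] = pearson(x, y)
    return result


# Synthetic data: the "CDC" series equals the "Reddit" series delayed by 2.
base = [1, 3, 2, 5, 4, 7, 6, 9, 8, 12, 10, 15]
reddit_series = base[2:]
cdc_series = base[:-2]

cc = cross_correlation(difference(reddit_series), difference(cdc_series), max_lag=3)
best_lag = max(cc, key=cc.get)
print(best_lag)
```

On this constructed example the correlation peaks at a lag of 2, recovering the delay that was built into the data.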
Normalized rates: comparing Reddit comment rates with NFLIS drug report data
To compare the Reddit opioid conversation with the NFLIS drug report rates, we started by calculating the total number of semi-annual opioid comments for the geographies and select drugs: heroin, oxycodone, hydrocodone, buprenorphine, and fentanyl. The numerator for this normalized rate was the total number of semi-annual comments for the drug and geography (U.S., region), and the denominator for this was the total number of semi-annual comments made by users with location data for the geographic area (U.S., region). We normalized these rates per 100,000 total comments.
We then compared the semi-annual comment rate and the semi-annual drug report rate for these drugs for the entire U.S., and for NFLIS reporting regions in the U.S. We visualized the semi-annual drug comment rate and the semi-annual drug report rate side by side for the entire U.S., and for different regions in the U.S. for 2014 to 2022.
ARIMA modeling
We fit our initial ARIMA model using 6-month-lagged national CDC rates as described above from 2015 to 2022 using the Python pmdarima auto_arima function45. We then used the rolling forecast function with an initial training dataset spanning two years and a predictive time horizon of 1 month45. We did not introduce seasonality in the model. We then calculated the average error of the predicted values against the true CDC monthly normalized overdose death rates (per 10,000 people) over time. We also fit an ARIMA model using the same CDC data but with real-time (1-month lagged) monthly Reddit comment rates (opioid comments per 10,000 total comments) incorporated as an exogenous variable. Finally, we fit the combined ARIMA model with the same parameters as with the Reddit-only model. To simulate real-world applications, we elected to lag the CDC data by 6 months, within the lag range reported by the CDC Vital Statistics Provisional Drug Overdose Death Counts32. Performance was evaluated via Wilcoxon signed-rank testing of absolute errors between predicted and observed values.
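The mechanics of a rolling one-step-ahead evaluation under two different reporting lags can be sketched as follows. This is not the ARIMA procedure described above: the two forecasters are deliberately simple stand-ins (a last-known-value baseline and a hypothetical Reddit-scaled variant) so that the sketch has no dependencies, and the data are synthetic, constructed so that the overdose rate is exactly proportional to the comment rate. All function names are assumptions. The comparison shows only how the lags enter the evaluation and says nothing about real data.

```python
def known_history(series, t, lag):
    """Values available when forecasting month t under a reporting lag.

    With a lag of `lag` months, the latest known index is t - lag.
    """
    return series[: max(t - lag + 1, 0)]


def last_value_forecast(cdc_known, reddit_known):
    """Stand-in baseline: carry the last reported CDC value forward."""
    return cdc_known[-1]


def reddit_scaled_forecast(cdc_known, reddit_known):
    """Stand-in for a model with an exogenous Reddit signal (not ARIMA).

    Scales the last reported CDC value by the change in the Reddit rate
    since the month of that CDC report.
    """
    reddit_then = reddit_known[len(cdc_known) - 1]
    return cdc_known[-1] * reddit_known[-1] / reddit_then


def rolling_abs_errors(cdc, reddit, forecaster,
                       cdc_lag=6, reddit_lag=1, initial_train=24):
    """Absolute one-step-ahead errors, using only data available at each step."""
    errors = []
    for t in range(initial_train, len(cdc)):
        cdc_known = known_history(cdc, t, cdc_lag)
        reddit_known = known_history(reddit, t, reddit_lag)
        errors.append(abs(forecaster(cdc_known, reddit_known) - cdc[t]))
    return errors


# Synthetic 36-month series; the CDC rate is exactly twice the Reddit rate.
reddit = [10 + t for t in range(36)]
cdc = [2 * value for value in reddit]

baseline_errors = rolling_abs_errors(cdc, reddit, last_value_forecast)
reddit_errors = rolling_abs_errors(cdc, reddit, reddit_scaled_forecast)
print(sum(baseline_errors) / len(baseline_errors))
print(sum(reddit_errors) / len(reddit_errors))
```

The two lists of absolute errors are paired by month, which is the form a paired test such as the Wilcoxon signed-rank test expects (for example `scipy.stats.wilcoxon`).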
Data availability
The datasets used and/or analyzed during the current study are available at https://github.com/smithdelaney/Reddit_Opioid_Tracking/. The raw data is available from Reddit.com and the CDC Vital Statistics and NFLIS sites.
Code availability
The analysis code used in this study is publicly available at the following URL: https://github.com/smithdelaney/Reddit_Opioid_Tracking.
References
1. Centers for Disease Control and Prevention. Drug Overdose Deaths. https://nida.nih.gov/research-topics/trends-statistics/overdose-death-rates#:~:text=Opioid%2Dinvolved%20overdose%20deaths%20rose,All%20Ages%2C%201999%2D2022 (2024).
2. Kiang, M. V. & Humphreys, K. Recent drug overdose mortality decline compared with pre-COVID-19 trend. JAMA Netw. Open 8, e2458090 (2025).
3. Humphreys, K. et al. Responding to the opioid crisis in North America and beyond: Recommendations of the Stanford-Lancet Commission. Lancet 399, 555–604 (2022).
4. National Institute on Drug Abuse. National Drug Early Warning System (NDEWS). https://ndews.org/about/ (2024).
5. Vital Statistics Rapid Release – Provisional Drug Overdose Data. https://www.cdc.gov/nchs/nvss/vsrr/drug-overdose-data.htm (2024).
6. Drug Enforcement Administration. National Forensic Laboratory Information System FAQ. https://www.nflis.deadiversion.usdoj.gov/FAQ.aspx (2024).
7. Pain Management and the Opioid Epidemic: Balancing Societal and Individual Benefits and Risks of Prescription Opioid Use. National Academies Press, Washington, D.C. https://doi.org/10.17226/24781 (2017).
8. Reuter, P., Caulkins, J. P. & Midgette, G. Heroin use cannot be measured adequately with a general population survey. Addiction 116, 2600–2609 (2021).
9. Sarker, A. et al. Utilizing social media data for pharmacovigilance: A review. J. Biomed. Inform. 54, 202–212 (2015).
10. Office of the Commissioner. Statement from FDA Commissioner Scott Gottlieb, M.D., on FDA’s new strategic framework to advance use of real-world evidence to support development of drugs and biologics. https://www.fda.gov/news-events/press-announcements/statement-fda-commissioner-scott-gottlieb-md-fdas-new-strategic-framework-advance-use-real-world (2018).
11. Lavertu, A., Vora, B., Giacomini, K. M., Altman, R. & Rensi, S. A new era in pharmacovigilance: toward real-world data and digital monitoring. Clin. Pharmacol. Therapeutics 109, 1197–1202 (2021).
12. Ong, A. D. & Weiss, D. J. The impact of anonymity on responses to sensitive questions. J. Appl. Soc. Psychol. 30, 1691–1708 (2000).
13. Hanson, C. L., Cannon, B., Burton, S. & Giraud-Carrier, C. An exploration of social circles and prescription drug abuse through Twitter. J. Med. Internet Res. 15, e189 (2013).
14. Carpenter, K. A. et al. Which social media platforms facilitate monitoring the opioid crisis? Preprint at https://doi.org/10.1101/2024.07.06.24310035 (2024).
15. Sarker, A., DeRoos, A. & Perrone, J. Mining social media for prescription medication abuse monitoring: a review and proposal for a data-centric framework. J. Am. Med. Inform. Assoc. 27, 315–329 (2020).
16. Katsuki, T., Mackey, T. K. & Cuomo, R. Establishing a link between prescription drug abuse and illicit online pharmacies: analysis of Twitter data. J. Med. Internet Res. 17, e280 (2015).
17. Chary, M. et al. Epidemiology from Tweets: estimating misuse of prescription opioids in the USA from social media. J. Med. Toxicol. 13, 278–286 (2017).
18. Sarker, A., Gonzalez-Hernandez, G., Ruan, Y. & Perrone, J. Machine learning and natural language processing for geolocation-centric monitoring and characterization of opioid-related social media chatter. JAMA Netw. Open 2, e1914672 (2019).
19. Almeida, A. et al. The use of natural language processing methods in Reddit to investigate opioid use: scoping review. JMIR Infodemiology 4, e51156 (2024).
20. Pandrekar, S. et al. Social media-based analysis of opioid epidemic using Reddit. AMIA Annu. Symp. Proc., 867–876 (2018).
21. Chancellor, S., Nitzburg, G., Hu, A., Zampieri, F. & De Choudhury, M. Discovering alternative treatments for opioid use recovery using social media. in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems 1–15 (Association for Computing Machinery, 2019). https://doi.org/10.1145/3290605.3300354.
22. Garg, S. et al. Detecting risk level in individuals misusing fentanyl utilizing posts from an online community on Reddit. Internet Interventions 26, 100467 (2021).
23. Harrigian, K. Geocoding without geotags: a text-based approach for Reddit (arXiv:1810.03067). arXiv. https://doi.org/10.48550/arXiv.1810.03067 (2018).
24. Statista. Reddit – Statistics & Facts. https://www.statista.com/topics/5672/reddit/#editorsPicks (2024).
25. Reddit Recap. https://www.redditinc.com/blog/reddit-recap-2022-global#:~:text=In%202022%2C%20redditors%20created%20430,%2C%20and%2024%2B%20billion%20upvotes (2022).
26. Baumgartner, J., Zannettou, S., Keegan, B., Squire, M. & Blackburn, J. The Pushshift Reddit Dataset. ICWSM 14, 830–839 (2020).
27. Pew Research Center. Demographics of Social Media Users and Adoption in the United States. https://www.pewresearch.org/internet/fact-sheet/social-media/.
28. Lavertu, A. & Altman, R. B. RedMed: Extending drug lexicons for social media applications. J. Biomed. Inform. 99, 103307 (2019).
29. Chiappini, S., Guirguis, A., John, A., Corkery, J. M. & Schifano, F. COVID-19: the hidden impact on mental health and drug addiction. Front. Psychiatry 11, 767 (2020).
30. Manchikanti, L. et al. COVID-19 and the opioid epidemic: two public health emergencies that intersect with chronic pain. Pain. Ther. 10, 269–286 (2021).
31. Gomes, T. et al. Trends in opioid toxicity-related deaths in the US before and after the start of the COVID-19 pandemic, 2011–2021. JAMA Netw. Open 6, e2322303 (2023).
32. CDC National Center for Health Statistics. https://www.cdc.gov/nchs/nvss/vsrr/drug-overdose-data.htm (2024).
33. Weiss, E. & Ariyachandra, T. Resilience during times of disruption: the role of data analytics in a healthcare system. J. Inf. Syst. Appl. Res. 17, 53–63 (2024).
34. Giorgi, S. et al. Predicting US county opioid poisoning mortality from multi-modal social media and psychological self-report data. Sci. Rep. 13, 9027 (2023).
35. U.S. Department of Health and Human Services. https://www.hhs.gov/about/news/2023/05/23/surgeon-general-issues-new-advisory-about-effects-social-media-use-has-youth-mental-health.html (2024).
36. European Commission. The Digital Services Act package | Shaping Europe’s digital future. https://digital-strategy.ec.europa.eu/en/policies/digital-services-act-package (2024).
37. Moreno, M. A., Goniu, N., Moreno, P. S. & Diekema, D. Ethics of social media research: common concerns and practical considerations. Cyberpsychol. Behav. Soc. Netw. 16, 708–713 (2013).
38. Rolando, S., Arrighetti, G., Fornero, E., Farucci, O. & Beccaria, F. Telegram as a space for peer-led harm reduction communities and netreach interventions. Contemp. Drug Probl. 50, 190–201 (2023).
39. Alhamadani, A., Sarkar, S., Behal, S., Alkulaib, L. & Lu, C.-T. The efficacy of PRISTINE: revealing concealed opioid crisis trends via Reddit examination. https://doi.org/10.21203/rs.3.rs-2758553/v1 (2023).
40. Lavertu, A., Hamamsy, T. & Altman, R. B. Quantifying the severity of adverse drug reactions using social media. Cold Spring Harb. Lab. https://doi.org/10.1101/2021.02.02.429445 (2021).
41. World Health Organization. The anatomical therapeutic chemical classification system with defined daily doses-ATC/DDD (2009).
42. Drug Enforcement Administration. Drug Scheduling. https://www.dea.gov/drug-scheduling.
43. Al-Garadi, M. A. et al. Text classification models for the automatic detection of nonmedical prescription medication use from social media. BMC Med. Inform. Decis. Mak. 21, 27 (2021).
44. Reddit Locations. https://www.reddit.com/r/LocationReddits/wiki/faq/northamerica (2024).
45. Alkaline ML Pmdarima. https://alkaline-ml.com/pmdarima/1.3.0/modules/generated/pmdarima.arima.AutoARIMA.html (2024).
Acknowledgements
The authors would like to thank Shashanka Subrahmanya for his assistance. This work is supported by the NIH DA057598 and MH125702. Delaney Smith is supported by the Stanford Biochemistry Department and NSF GRFP 2019286895; Keith Humphreys is supported by a Senior Research Career Scientist Award (RCS 04-141-3) from the Department of Veterans Affairs Health System Research Service; Doctors Eichstaedt, Altman, and Lembke are supported by the Stanford Institute for Human-Centered AI; Russ Altman is supported by the Chan Zuckerberg Biohub.
Author information
Contributions
D.A.S. analyzed the data, created some figures, and drafted the manuscript. A.L. developed the geo-location method used in this paper. A.S. processed raw Reddit data and created some figures and tables. T.H. analyzed the data. K.H. provided field expertise in addiction. M.K. provided expertise in statistical analysis and epidemiology. R.B.A. assembled the research team and provided overall management of the project. J.C.E. edited the manuscript and provided guidance in analyzing, interpreting, and communicating the results.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Smith, D.A., Lavertu, A., Salecha, A. et al. Monitoring the opioid epidemic via social media discussions.
npj Digit. Med. 8, 284 (2025). https://doi.org/10.1038/s41746-025-01642-x
DOI: https://doi.org/10.1038/s41746-025-01642-x