Background & Summary

Geo-simulation research (e.g., micro-simulation and agent-based modeling) often requires the generation of synthetic populations. These synthetic populations have been used to study a plethora of topics within urban systems, such as human mobility, public health, and disaster resilience1,2,3. Researchers (e.g., ref. 4) have emphasized the role of geographically explicit synthetic populations in geo-simulation and have created workflows to nest individuals into different spatial settings (e.g., home, school and workplace). Over time, modelers have also realized that social networks play an important role in human activities, as they drive interactions and lead to the emergence of aggregate patterns, especially in the case of agent-based modeling5,6,7. Researchers have likewise recognized the importance of incorporating social networks as additional layers in GIS8,9,10. By representing people’s relationships using nodes and edges, social networks are well suited to capturing complex human interactions (e.g., communication, information sharing and opinion dynamics)8. Integrating social networks into geo-simulation allows researchers to better understand how individuals’ interactions give rise to non-linear patterns at a large scale in different circumstances, such as green space usage, social segregation and disaster response2,11,12. In our study, we define three types of social networks (home, work and educational), as we would argue these capture most daily interactions13. As modelers, we often spend a significant amount of time creating synthetic populations, especially those grounded in data, because of the time needed to collect and preprocess the data and to generate the final synthetic population. Moreover, synthetic populations are often built for a specific purpose, which limits their use in other situations. The aim of this paper is therefore to build and provide a geographically explicit synthetic population along with its social networks, using open data including that from the latest U.S. Census, which can be used in a variety of geo-simulation models.

Over the last several years, there have been numerous efforts to create synthetic populations (see ref. 14 for a review); however, these synthetic populations seldom include social networks. Currently, there are several national-level synthetic population datasets (e.g., refs. 15,16,17). For example, the author of ref. 16 created a synthetic population for the whole of the US based on 2010 data at the census block level (roughly equivalent to 600 to 3,000 persons) and included various demographic information (e.g., housing type, age, sex, race, and ethnicity). In more recent work, the authors of ref. 17 built a synthetic population for the whole of Canada based on 2016 data and projected it out to 2042 using the Canadian dissemination areas (roughly equivalent to 400 to 700 persons). In their work, both demographic and socio-economic variables (e.g., educational background and income status) were included as agent attributes. Even though these two examples create synthetic populations, they present data at the level of an areal unit and assign individuals neither to specific locations nor to specific workplaces. At a more local level, efforts have been made to assign home and work locations (e.g., ref. 4), but social networks are still missing. In what we present below, we provide not only the code and the resulting data but also an explicit geographical location (i.e., latitude and longitude) for both home and work locations, and we include basic social networks. By doing so, we provide a way to enable the exploration of basic patterns of life.

When it comes to creating synthetic populations, several methods exist, each with its own strengths and weaknesses14. Generally speaking, traditional population synthesis methods can be broken into two main approaches: (1) synthetic reconstruction (SR), such as hierarchical sampling (HS) and iterative proportional fitting (IPF); and (2) combinatorial optimization (CO) or re-weighting, such as fitness computation procedures. Many approaches, especially those using CO or IPF, require micro-level data (e.g., the Public Use Microdata Sample (PUMS) in the US or Samples of Anonymized Records (SAR) in the UK) to calibrate the synthetic populations (e.g., refs. 15,17). Such approaches are also computationally expensive, and the micro-level data they rely on might not always be available. In contrast, HS is a flexible method that requires only data at the aggregated level (e.g., census tract level)18,19. Specifically, HS creates synthetic individuals in a specific order, based on the discrete attributes from the aggregated-level data that describe individuals’ characteristics (e.g., males aged from 0 to 4, who are then grouped into a household with a married couple and kids). Other than demographic information, our population will have attributes related to work or educational locations along with their explicit geographical locations. Adding these attributes requires our method to handle data in multiple formats (e.g., shapefiles), which needs additional computational resources. Thus, in this work, we utilize HS to generate demographic information because of its flexibility of implementation (e.g., less input data and greater computational efficiency). In what follows, we introduce our method and then present the basic information of the resulting dataset, such as data structure, data formats and a demonstration of sample data, along with the efforts to validate our methodology. The last section concludes this paper and discusses areas of application for this dataset.

Methods

Overview

As discussed above, our aim is to generate a geographically explicit synthetic population dataset, along with its social networks, for 330,526,186 individuals in the United States (i.e., the 50 states and Washington D.C.) in 2020. Each individual is assigned a latitude and longitude along with the demographic characteristics of gender, age, household information, household structure, and work or educational information, all of which are stored in the GeoPackage (i.e., .gpkg) format. The workplace information (i.e., workplace ID, latitude and longitude) is also stored in the GeoPackage format. As for the education facilities (i.e., daycares and schools), we use shapefiles (i.e., .shp) to store their unique IDs, latitude and longitude. The social network datasets are stored in comma-separated values (i.e., .csv) format. Figure 1 shows the workflow of the synthetic generation process, which the following sections describe in more detail.

Fig. 1 Data Generation Workflow and Resulting Datasets.

Data collection and preprocessing

Because this work generates a synthetic population for the whole U.S., we created various Python scripts to collect and preprocess data on a state-by-state basis. Table 1 shows all data collected and utilized in this work along with their sources. All data for the demographics and workplaces are from the latest 2020 Census data, while information about educational sites comes from 2015, which is the last time the data was updated20. Data preprocessing included: cleaning all data (e.g., removing duplicate and null records); integrating various data (e.g., census data, County Business Patterns and Origin-Destination Employment Statistics data) into the census tract boundaries; simplifying the road network topology to minimize its size while ensuring all road segments remain connected, and removing duplicate road segments from the network; and adding a unique identifier to each location in the education sites data.

Table 1 Input Data Sources.

Step 1: Create individuals and assign home locations

Within Step 1, which is shown in Fig. 1, there are four tasks: (1) creating individuals based on the 2020 US Census data; (2) grouping them into households; (3) placing each household generated in tasks 1 and 2 on residential roads; and (4) identifying the urban and rural population. As mentioned above, we utilized hierarchical sampling (HS) from the synthetic reconstruction (SR) family of methods to generate the individuals18. During the generation process, we created individuals within every census tract using gender and age group information extracted from the 2020 US Census data21. The method creates the exact number of individuals by iterating over the 18 age groups for both males and females, from ages 0 to 4 through to 85 and older. By doing so, we created individuals whose demographic information, such as age and gender, matches the U.S. Census distribution21.
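The individual-creation task can be illustrated with a minimal Python sketch. It is not the authors' released code: the tract ID, the two age groups and the counts are hypothetical stand-ins for the census tables, and the record layout is an assumption made for illustration only.

```python
from collections import Counter
from itertools import count

# Hypothetical tract-level counts keyed by (sex, age_group).
# Real inputs would cover all 18 age groups for each sex.
tract_counts = {
    ("male", "0-4"): 3,
    ("male", "85+"): 1,
    ("female", "0-4"): 2,
    ("female", "85+"): 2,
}

def create_individuals(tract_id, counts):
    """Create exactly one record per counted person, cell by cell."""
    ids = count(1)
    people = []
    for (sex, age_group), n in sorted(counts.items()):
        for _ in range(n):
            people.append({
                "id": f"{tract_id}-{next(ids)}",  # hypothetical ID scheme
                "tract": tract_id,
                "sex": sex,
                "age_group": age_group,
            })
    return people

people = create_individuals("T001", tract_counts)

# Because every cell is enumerated, the synthetic counts equal the input counts.
synthetic_counts = Counter((p["sex"], p["age_group"]) for p in people)
```

Enumerating each age and sex cell in turn is what makes the synthetic totals match the input counts exactly, rather than only in expectation.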

To group the generated individuals into households, we first created a set of empty household containers of 12 types for each census tract based on the definitions from the US Census21. Then, we randomly assigned individuals to households so as to match their household types and conditions. Table 2 shows these household types and the conditions and assumptions used to place individuals while ensuring they fit the characteristics expected for each household type. As for group quarters (which in the US refer to places like college residence halls, aging facilities and correctional institutions), the exact number of people reported as living in group quarters is assigned to them. Since we do not have information about the number of group quarters, we aggregate all group quarters in a census tract into one household and assign it a unique identifier. These assignment assumptions can be refined if readers so desire, which is one reason we provide the code used to generate the synthetic population.

Table 2 Household Types and Assumptions.

Once the households have been created, we give them a home location by using the road networks as a proxy for actual buildings. Our rationale is that assigning individuals/households a home location allows modelers to add movement to agents and thus to explore a wide range of issues (e.g., urban mobility, commuting activities, transportation). Meanwhile, using street networks rather than building footprints preserves a certain level of privacy. The home location is therefore extracted from the road network: we identified all residential roads within a tract and placed each household along these roads. We randomly assign households to any residential road and attempt to keep them 50 meters apart to distribute them evenly throughout the census tract. However, when this is not possible (as in dense urban areas), household locations are placed on top of each other. An example of this is shown in Fig. 2. Additionally, we tag individuals as urban or rural based on where they have been placed, utilizing US Census definitions22. At the end of this step, each generated synthetic individual has a unique ID along with basic attributes such as age, gender, household type, household ID, home location, and an urban or rural tag.
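The spacing rule described above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' implementation: it assumes a single road given as a polyline in projected coordinates (meters), fills slots in order rather than at random, and the function names are hypothetical.

```python
import math

def points_along(polyline, spacing):
    """Return points every `spacing` meters along a polyline [(x, y), ...]."""
    points = []
    next_at = 0.0     # distance along the line where the next point is due
    travelled = 0.0   # length of the segments already walked
    for (x1, y1), (x2, y2) in zip(polyline, polyline[1:]):
        seg = math.hypot(x2 - x1, y2 - y1)
        while seg > 0 and next_at <= travelled + seg:
            t = (next_at - travelled) / seg
            points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
            next_at += spacing
        travelled += seg
    return points

def place_households(n_households, polyline, spacing=50.0):
    """Assign each household a slot; reuse slots once capacity is exceeded."""
    slots = points_along(polyline, spacing)
    if not slots:
        raise ValueError("road has no usable length")
    # When households outnumber slots, locations stack on top of each other.
    return [slots[i % len(slots)] for i in range(n_households)]

road = [(0, 0), (100, 0), (100, 100)]   # hypothetical 200 m residential road
slots = points_along(road, 50.0)
homes = place_households(7, road)
```

With five slots and seven households, the sixth and seventh households share locations with the first two, which mirrors the stacking behavior noted for dense urban tracts.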

Fig. 2 Examples of Generated Household Locations in Census Tracts that are (a) Suburban and (b) High-Density Urban.

Step 2: Create locational information

As shown in Fig. 1, Step 2 comprises four tasks related to individuals’ daytime locations: (1) creating workplaces with unique IDs; (2) assigning workplaces to the working population; (3) placing each workplace generated in Task 1 along the roads; and (4) assigning educational sites to children. The workplace information is based on the County Business Patterns data23 and the Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES 8)24 from the 2020 Census. Using LODES 8, we can also extract the size of the employed population. We then assign individuals created in Step 1 a workplace as their daytime location. For individuals not assigned a workplace, the daytime location is their household location.

To assign each workplace a geographical location, we place the workplaces 20 meters apart along secondary roads and at the intersections of secondary and residential roads. A secondary road refers to a main road without limited access, including roads in the U.S. highway, state highway, or county highway systems25. Furthermore, we assigned children (aged 0 to 17) to the closest educational site appropriate for their age (i.e., daycare, elementary, middle, or high school), whose locations were sourced from the US Environmental Protection Agency (EPA) Office of Environmental Information (OEI)20. The assumptions used to assign children to educational sites are as follows: ages 0–4 to daycare, ages 5–11 to elementary school, ages 12–14 to middle school, and ages 15–17 to high school. These assumptions could be refined for different modeling circumstances.
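The age-based assignment rule can be illustrated with the following Python sketch. The age brackets come from the text; the use of great-circle (haversine) distance to decide which site is "closest", the site records and the coordinates are assumptions made for illustration.

```python
import math

def site_type_for_age(age):
    """Map a child's age to the educational site type stated in the text."""
    if 0 <= age <= 4:
        return "daycare"
    if 5 <= age <= 11:
        return "elementary"
    if 12 <= age <= 14:
        return "middle"
    if 15 <= age <= 17:
        return "high"
    return None  # adults are not assigned an educational site

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumed distance measure)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest_site(child, sites):
    """Return the ID of the closest site of the right type, or None."""
    kind = site_type_for_age(child["age"])
    candidates = [s for s in sites if s["type"] == kind]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: haversine_km(
        child["lat"], child["lon"], s["lat"], s["lon"]))
    return best["id"]

# Hypothetical sites and children.
sites = [
    {"id": "D1", "type": "daycare", "lat": 42.89, "lon": -78.88},
    {"id": "D2", "type": "daycare", "lat": 42.95, "lon": -78.80},
    {"id": "E1", "type": "elementary", "lat": 42.90, "lon": -78.87},
]
toddler = {"age": 3, "lat": 42.90, "lon": -78.87}
pupil = {"age": 7, "lat": 42.94, "lon": -78.81}
```

Keeping the age brackets in one small function makes them easy to change when a model needs different schooling assumptions.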

Step 3: Create social networks

Lastly, as mentioned above, social networks are included in our synthetic population. Three types of social networks are created based on (1) being in the same household, (2) working in the same workplace, or (3) attending the same educational site. Small-world networks26 are created for households, workplaces or education sites with more than 5 people. The number 5 was chosen to reflect the size of the core social group based on the work of Dunbar27, and the size of an individual’s educational and work networks ranges from a minimum of 0 to a maximum of 14. For households, workplaces or education sites with 5 or fewer people, everyone is fully connected. Within this step, we use the Python package networkx to create such social networks with its built-in function “newman_watts_strogatz_graph(n, k, p)”. Here, n is the number of people (i.e., nodes) in the group. To mimic the core social group of 5 people, we set k to 4, which means each person is initially connected to 4 others to make up a 5-person social group. We set p to 0.3, which is the probability of adding a new edge between non-adjacent nodes; this introduces variation in the edges, so that some individuals have more connections than others. It should be noted that, unlike the other networks, the work network was generated at the national level, because people work across state boundaries, and was then partitioned at the state level.
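The rule above can be sketched without external dependencies. The code below is a simplified stand-in for networkx's newman_watts_strogatz_graph, not a copy of it and not the authors' code: it builds a ring lattice with k = 4 neighbors per node and, for each lattice edge, adds one extra shortcut with probability p = 0.3. The function names and the way shortcut targets are chosen are assumptions.

```python
import random
from itertools import combinations

def group_network(members, k=4, p=0.3, core_size=5, seed=None):
    """Return a set of undirected edges (frozensets) for one group."""
    rng = random.Random(seed)
    n = len(members)
    if n <= core_size:
        # Small groups: everyone is connected to everyone else.
        return {frozenset(pair) for pair in combinations(members, 2)}
    # Ring lattice: each node is joined to its k nearest ring neighbors.
    lattice = [(i, (i + j) % n) for i in range(n) for j in range(1, k // 2 + 1)]
    edges = {frozenset(e) for e in lattice}
    # Shortcuts: edges are only added, never rewired or removed.
    for u, _ in lattice:
        if rng.random() < p:
            free = [w for w in range(n)
                    if w != u and frozenset((u, w)) not in edges]
            if free:
                edges.add(frozenset((u, rng.choice(free))))
    return {frozenset(members[i] for i in e) for e in edges}

def degree(edges, node):
    """Number of edges touching `node`."""
    return sum(node in e for e in edges)

household = group_network(["a", "b", "c", "d"])      # fully connected
workplace = group_network(list(range(20)), seed=1)   # small-world style
```

Because shortcuts are only ever added, every member of a large group keeps at least its 4 lattice neighbors, and some members gain a few more.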

Data Records

The dataset is available at OSF28. Interested readers can download the geographically explicit synthetic population along with its social networks by state. In total, we generated 330,526,186 individuals for America’s 50 states and Washington D.C., each of which has six resulting datasets. Table 3 shows the basic information of the resulting datasets, such as data format and detailed descriptions. Tables 4 and 5 describe the variables in the generated synthetic population and workplace datasets. Each individual has a set of geographical locations that represent their home, work or school addresses. Additionally, these individuals are not isolated; they are embedded in a larger social setting based on their household, working and studying relationships. As for the social network datasets, each row represents one social network: the first entry (index 0) of each row is the starting node and the remaining n entries are that node’s neighbors, where n differs depending on the network (n ∈ [0, 14]). To show how our synthetic networks can be related to geographically explicit locations, Fig. 3 shows a selected synthetic household’s geographic location and its members’ workplace, school and daycare locations, where the zoomed-in figure lays out the selected household’s social networks (i.e., educational, work and household). Figure 4 shows four sample networks extracted from the City of Buffalo in New York State, while the averaged properties of the resulting networks at the national level are presented in Table 6.
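The row layout of the social network files can be read with Python's standard csv module, as in the sketch below. The sample rows and ID values are hypothetical, and the sketch assumes the files have no header row; the actual files should be checked before relying on this.

```python
import csv
import io

# Hypothetical sample: first entry is the starting node, the rest are neighbors.
sample = "p1,p2,p3\np2,p1\np4\n"

def read_network(handle):
    """Parse rows of 'node,neighbor,neighbor,...' into an adjacency dict."""
    adjacency = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        node, *neighbors = row
        adjacency[node] = [x for x in neighbors if x]
    return adjacency

adjacency = read_network(io.StringIO(sample))
network_sizes = {node: len(nbrs) for node, nbrs in adjacency.items()}
```

For a real file, `io.StringIO(sample)` would be replaced by an open file handle; the resulting dictionary can then be passed to a graph library for analysis.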

Table 3 Summary of the Resulting Datasets.
Table 4 Population Dataset Variables.
Table 5 Workplace Dataset Variables.
Fig. 3 A Sample Household and its Home, Work and Educational Social Networks from the Generated Data.

Fig. 4 Sample of Generated Social Networks Extracted from the City of Buffalo, New York: (a) Household; (b) Work; (c) School; (d) Daycare.

Table 6 Resulting Networks for Whole U.S.

Technical Validation

In this work, we first conducted internal validation to ensure that our resulting synthetic population aligns with the input census data at multiple levels (i.e., individual, household and census tract). At the individual level, in addition to checking whether the number of synthetic individuals matches the census records by reporting the total absolute error (TAE) and absolute percentage difference (APD), we also ensured that the male and female populations in each of the 18 age groups matched the U.S. Census data, comparing the number of individuals in our synthetic population and in the Census for each age group using APD. Similar procedures were conducted at the household level, where we compared the total number of synthetic households to census records using APD. Additionally, at the census tract level, we compared the household size of the synthetic and census data by reporting TAE. TAE and APD are commonly used error metrics for quality checks and validation of generated synthetic populations14,18.

During the synthesis process, we found that some census tracts had errors. The majority of these problematic tracts are located in parks or nature reserves, and either had inconsistent counts with respect to the total numbers of males and females or had no data in the official US Census data. We therefore cannot generate a synthetic population for those problematic tracts, and we only used the remaining census tracts (i.e., Good Census Tracts) as input to generate the synthetic population. In total, there were 428 problematic tracts, which represent only 0.51% of all census tracts. Table 7 shows the TAE on population between All Census Tracts (i.e., 83,848 tracts) and Good Census Tracts (i.e., 83,420 tracts), which indicates that the population living in problematic tracts is only 0.279% of the whole U.S. (50 states and D.C.) population. In addition, the total synthetic population matches the population living in Good Census Tracts, which shows that our method generates the exact number of individuals from the input census tracts. As for the total number of households, we generated 126,442,118 households, which is 5,601 fewer than in the Good Census Tracts, but the difference is only 0.004%, as shown in Table 7. This could be due to how we handle group quarters or household assignments (e.g., Table 2).
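The two error metrics can be written out as a short Python sketch. The formulas below are the commonly used definitions and are an assumption here, since the text does not spell them out; the household figures are taken from the paragraph above, while the two-tract example is hypothetical.

```python
def total_absolute_error(synthetic, census):
    """TAE: sum of absolute count differences over all categories."""
    return sum(abs(synthetic.get(k, 0) - census[k]) for k in census)

def absolute_percentage_difference(synthetic_total, census_total):
    """APD: absolute difference as a percentage of the census value."""
    return abs(synthetic_total - census_total) / census_total * 100

# Household totals reported in the text: 126,442,118 synthetic households,
# which is 5,601 fewer than in the Good Census Tracts.
households_synthetic = 126_442_118
households_census = households_synthetic + 5_601
household_apd = absolute_percentage_difference(households_synthetic, households_census)

# Share of tracts flagged as problematic: 428 of 83,848.
problem_share = 428 / 83_848 * 100

# Hypothetical per-tract household counts for a TAE example.
tae = total_absolute_error({"tract_a": 10, "tract_b": 5},
                           {"tract_a": 12, "tract_b": 5})
```

Under these definitions, the household difference works out to roughly 0.004% and the problematic tracts to roughly 0.51% of all tracts, in line with the figures quoted above.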

Table 7 Whole U.S. Population and Household Validation.

As our method generates the synthetic population and assigns individuals to households based on household type, we also compared the male and female populations for the different age groups, and the households by type (excluding group quarter households), from the synthetic data with both the Good Census Tracts and All Census Tracts, reporting APD. As our method generates the identical number of individuals in each age group as the Good Census Tracts, the APD for both males and females is zero. When comparing synthetic data to All Census Tracts, the average APD is 0.3% for males and 0.2% for females. With respect to household types, the average APD is 0.3% when compared to All Census Tracts and 0.02% when compared to the Good Census Tracts. These low APDs indicate that our generated synthetic population differs only very slightly from the input census data, as shown in Fig. 5.

Fig. 5 Validation of the Synthetic Population at Different Levels: (a) Population under Different 18 Age Groups; (b) Household under Different Household Types.

Turning to average household size, based on the 2020 American Community Survey (ACS), the overall estimated average household size was 2.6 with a margin of error of ±0.01 (ref. 29). As Table 7 shows, the average household size from the 2020 Census data and from our synthetic population is 2.61, which is consistent with the ACS estimate. In addition, we compare the average household size at the census tract level, calculated as the tract’s total population divided by its total number of households, between the census data and our synthetic data. Figure 6 shows this comparison. Each blue dot represents a census tract. If a blue dot falls on the red line (i.e., the line of equality), there is no difference between our synthetic population and the census data in the average household size for that tract. If a blue dot is above the red line, the average household size in our synthetic data is larger; conversely, if it is below the red line, the average household size in our synthetic data is smaller. The closer the dots are to the red line, the smaller the differences are. To better show the distribution of the differences, we use the natural logarithm (ln) for both the x and y axes. Out of the total of 83,420 census tracts, our method reproduced the same average household size in 32,231 tracts, and in 82,968 tracts the absolute difference in average household size is less than 0.1. One potential reason for the remaining differences is that we treat group quarters as households in our method (see Step 1 in the Methods section). This means that in some tracts, for example those containing universities, there might be 5,000 people grouped into one household, which potentially increases the average household size when scaling to the whole of the US.

Fig. 6 Validation of Average Household Size: Synthetic Population and Census Data on a Logarithmic Scale (ln) where each blue dot represents a census tract.

After comparing our synthetic population to the input census data (i.e., at the tract level), we also conducted two external validation experiments using the American Community Survey Public Use Microdata Sample (ACS PUMS) and the census data at the block group level. In the first external validation experiment, we compared our synthetic population to ACS PUMS, which contains a sample of five percent of individuals who have been surveyed and recorded in the census. Each individual in ACS PUMS has the same attributes of age and gender as our resulting synthetic population. Thus, we can check whether the male and female populations in the different age groups from ACS PUMS and from our resulting synthetic population share similar distributions. For this comparison, we collected the 2022 ACS PUMS data30, because this is the only release that uses the latest 2020 Public Use Microdata Areas (PUMAs) as its geographical boundaries. Each PUMA contains several census tracts from the latest 2020 census, which allows us to aggregate our resulting synthetic population into PUMAs. In contrast, the 2020 and 2021 ACS PUMS use 2010 PUMAs, so it is not possible to conduct the same aggregation with them. Following the external validation approach of ref. 16, which calculates the cosine similarity for each PUMA, we aggregated the resulting synthetic population and the 2022 ACS PUMS into 2,462 PUMAs to make comparisons. The cosine similarity ranges from −1 to 1, where a similarity of 1 means the two age distributions have identical proportions. Figure 7 shows the distribution of cosine similarity for the 2,462 PUMAs: 96% of PUMAs have a cosine similarity greater than 0.95. This indicates that the synthetic population generated using our method shares a similar age distribution with ACS PUMS.
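The per-PUMA comparison can be illustrated with a small cosine similarity function. This is a generic sketch rather than the authors' validation script; the age-group count vectors are hypothetical, and the ordering of the groups within each vector is assumed to be the same for both data sources.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical age-group counts for one PUMA (same group order in both).
synthetic_puma = [5200, 6100, 7300, 4800]
pums_puma = [260, 305, 365, 240]   # a 5% sample with the same proportions
skewed_puma = [7300, 4800, 5200, 6100]

same_shape = cosine_similarity(synthetic_puma, pums_puma)
different_shape = cosine_similarity(synthetic_puma, skewed_puma)
```

Because the measure depends only on proportions, a small sample and a full population can be compared directly without rescaling either one.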

Fig. 7 The Distribution of Cosine Similarity between Synthetic Population and the American Community Survey Public Use Microdata Sample (ACS PUMS) for the 2462 Public Use Microdata Areas (PUMA).

Turning to the block-group-level census data, we calculated the Spearman’s rank correlation for all census tracts in the whole U.S. Specifically, we counted our synthetic population in each census block group using individuals’ latitude and longitude information, where a census block group is one of multiple subdivisions of a census tract. Next, for each tract, we calculated the Spearman’s rank correlation between our synthetic population and the real population across its census block groups. A Spearman’s rank correlation value in the range of 0.5 to 1 indicates a moderate-to-strong positive correlation between our synthetic population and the census record at the census block group level. As Table 8 shows, the percentage of tracts with a Spearman’s rank correlation value in the range of 0.5 to 1 is 54% for the whole US. In this sense, our method can capture the population’s spatial distribution at a finer scale to some extent, and it works better in states with high population density, such as New Jersey (NJ) with 59.09%, Rhode Island (RI) with 61.13% and Massachusetts (MA) with 59.5%. However, it should be noted that our method was intentionally not designed to capture exact geographic locations using data such as building footprints, in order to avoid privacy issues.
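The per-tract rank correlation can be sketched in plain Python as follows. This is a generic implementation of Spearman's coefficient (Pearson correlation of average ranks), not the authors' script, and the block group counts are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical population counts for the four block groups of one tract.
synthetic_counts = [820, 1430, 610, 990]
census_counts = [900, 1380, 700, 950]
rho = spearman(synthetic_counts, census_counts)
```

In this example the two sources rank the block groups identically, so the coefficient is 1 even though the counts themselves differ; tracts with only one block group, or with constant counts, yield an undefined value.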

Table 8 Spearman’s Rank Correlation between Synthetic Population and the Ground-truth Data at Census Block Group Level Aggregated to Census Tract Level by State.

To summarize, our method generates the exact number of individuals for Good Census Tracts (see Table 7), and the resulting population age distribution aligns with the census records and ACS PUMS. In addition, the Spearman’s rank correlation calculated from the block-group-level census data suggests that our method can capture the population’s spatial distribution at a finer scale. Beyond this, the method groups individuals into 12 household types (shown in Table 2), and the distribution of the number of households of each type also corresponds with census records. For example, with respect to the average household size, approximately 99.5% of census tracts have absolute differences of less than 0.1. Thus, our method can generate a baseline synthetic population dataset with stylized social networks.

Usage Notes

The resulting dataset with geographical information can be loaded using GIS software (e.g., ArcGIS) and R or Python data management packages (e.g., pandas, geopandas). For geo-simulation modeling purposes, these data can be used to initialize microsimulations and agent-based models within various platforms (e.g., MATSim, NetLogo and GAMA) and programming languages (e.g., Python, Java and R). The social network datasets can be loaded using the Python networkx package or Gephi for further analysis and visualization.

Potential use cases for this data include agent-based modeling, where the data could be used to model the emergence of phenomena through individual interactions. Such applications could fall into urban planning (e.g., ref. 31), transportation (e.g., ref. 32), and public health research (e.g., refs. 2,33). In addition, the method and the data from this work could potentially address the concern that urban digital twins often lack demographic, economic, and social processes34, by providing agents to populate such worlds.

However, as with all work, there is always room for improvement, and we would like to point out the use cases where this data might not be applicable. First, because the synthetic data is only a snapshot in time (i.e., 2020), it cannot be used directly to study the long-term evolution of the population, such as long-term migration and population aging. However, approaches such as dynamic microsimulation could be applied to the data to extend its capability to explore such topics (e.g., ref. 17). Furthermore, the dataset was not designed to account for different modes of transportation (e.g., taking public transportation, carpooling, driving, walking) or for the fine-scale movement of individuals, as needed in building evacuation models. We chose to omit building footprints to avoid potential privacy issues or misrepresentation. However, the method presented here could be extended by incorporating such data (e.g., high-resolution building footprint data, land use data, or travel surveys for modeling commuting) to guide the geographic location assignments4 and commute type assignments for the synthetic population, which may allow the resulting data to be used to study finer-scale dynamics such as detailed individual-level mobility and building evacuation. Even with these limitations, the baseline geographically explicit synthetic population and the estimated social networks can be utilized to explore various topics in America’s 50 states and Washington D.C. We look forward to learning how researchers will utilize this data.