To make Facebook as open and connected as possible for everyone, one of our goals is to understand how different populations of users join and use the service. With that objective in mind, the Facebook Data team recently sought to answer the question, “How diverse are the ethnic backgrounds of the people using Facebook?” This is a tough question to answer because, unlike information such as gender or age, Facebook does not ask users to share their ethnicity or race on their profiles. To answer it, we focused on a single country with a large and diverse population: the United States. By comparing people’s surnames on Facebook with data collected by the U.S. Census Bureau, we can estimate the racial breakdown of Facebook users over the history of the site.
We discovered that Facebook has always been diverse and that the diversity has increased significantly over the past year to the point where U.S. Facebook users nearly mirror the diversity of the overall population of the country. The graph above shows the proportion of the three largest minorities on Facebook over time as predicted by our model, while the dashed lines show the proportion of the Internet population for the same ethnicities.
In this report, we’ll discuss how we are able to measure diversity without user-supplied race or ethnicity. We’ll also explain how race and ethnicity have varied over the course of Facebook’s history and explore future research for understanding friendship diversity on the site.
The U.S. Census Bureau’s Genealogy Project publishes a data set containing the frequency of popular surnames along with a breakdown by race and ethnicity. These data are the key to our analysis, so we describe them in some detail. An example of the raw data is shown below for the three most frequent surnames in the census: Smith, Johnson and Williams. For each surname, the data provide its rank in the population, the total count of people with the name, their proportion per 100,000 Americans, and the percentage of people with the name in each of six categories: White, Black, Asian/Pacific Islander, American-Indian/Alaskan Native, two or more races, and Hispanic.
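To make the shape of one row concrete, here is a minimal Python sketch of a surname record. The field names, the `estimated_counts` helper, and the example values are all illustrative assumptions, not the Census Bureau's actual column names or figures. Multiplying the total count by each percentage gives a rough estimate of how many people with that surname fall into each category.

```python
from dataclasses import dataclass

# Category keys are hypothetical labels for the six groups named in the text.
GROUPS = (
    "white",
    "black",
    "asian_pacific_islander",
    "american_indian_alaskan_native",
    "two_or_more",
    "hispanic",
)


@dataclass(frozen=True)
class SurnameRecord:
    """One row of the surname data set (field names are assumptions)."""
    name: str
    rank: int            # rank of the surname in the population
    count: int           # total number of people with the surname
    per_100k: float      # proportion per 100,000 Americans
    pct_by_group: dict   # category -> percent of people with this surname


def estimated_counts(record):
    """Estimate how many people with this surname fall in each category."""
    return {g: record.count * record.pct_by_group[g] / 100.0 for g in GROUPS}


# Made-up values for illustration only; this is NOT real census data.
example = SurnameRecord(
    name="EXAMPLESON",
    rank=1234,
    count=20000,
    per_100k=7.4,
    pct_by_group={
        "white": 60.0,
        "black": 20.0,
        "asian_pacific_islander": 5.0,
        "american_indian_alaskan_native": 1.0,
        "two_or_more": 2.0,
        "hispanic": 12.0,
    },
)

for group, n in estimated_counts(example).items():
    print(f"{group}: {n:.0f}")
```

Because the six percentages in a row should account for everyone with that surname, the estimated counts sum back to the row's total count, which is a quick sanity check when loading the data.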
This data set allows us to predict a person’s race based solely on his or her surname. While these predictions will often be wrong for any one individual, in aggregate they will be correct. For example, suppose you select 10,000 people with the name Smith from the U.S. population at random. The data above s