Group members: Huanghe YaoJing, Minxue Gu, Jinpu Cao
We discussed and implemented the first two parts together. In addition, I completed the percent method in the data equity part and added some insights.

Here are our dashboard links:
Geographic Equity
https://yaojinghuanghe.shinyapps.io/dashboard_pm25/
PM2.5 Equity among Different Groups
https://yaojinghuanghe.shinyapps.io/dashboard_pm25_equity/
Data Equity - Percent Score
https://yaojinghuanghe.shinyapps.io/dashboard_data_equity_score_perc/
Data Equity - Rank Score
https://yaojinghuanghe.shinyapps.io/dashboard_data_equity_score/

This report assesses the equity implications of air quality in San Mateo County, using PurpleAir sensor data. It takes two cities (Menlo Park and Redwood City) as examples and analyzes their geographic equity by mapping the Air Quality Index (AQI) of each block group and plotting the average PM2.5 in February. Population equity is assessed by comparing the PM2.5 distribution across income groups and racial groups. Finally, in the data equity part, the report offers suggestions to suppliers of PM2.5 sensors (probes) by proposing two sets of metrics for identifying the block groups with the greatest need for additional sensors.

Geographic Equity

Raw sensor data are extracted from PurpleAir and converted into PM2.5 concentrations and AQI values. There are 1055 sensors in San Mateo County in total. The following map shows the relative AQI of each sensor in the county. From the map we can see that the AQI near East Palo Alto and Redwood City is worse than elsewhere. In absolute terms, however, the AQI of most block groups in the county is Good.
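The report does not spell out how PM2.5 is converted to AQI. As a rough illustration only, the sketch below applies the standard US EPA piecewise-linear formula with the PM2.5 breakpoints that were in force before the 2024 revision. Whether the dashboard used exactly this table, or applied any PurpleAir correction first, is an assumption; the names `pm25_to_aqi` and `aqi_category` are hypothetical.

```python
# Sketch only: US EPA PM2.5 breakpoints in force before the 2024 revision.
# That the report used this exact table is an assumption.
PM25_BREAKPOINTS = [
    # (C_low, C_high, I_low, I_high)
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(pm25):
    """Convert a PM2.5 concentration (ug/m3) to an AQI value."""
    if pm25 < 0:
        raise ValueError("PM2.5 concentration cannot be negative")
    # Truncate to one decimal place; the small epsilon guards against
    # values such as 35.4 being stored as 35.3999... in floating point.
    c = int(pm25 * 10 + 1e-9) / 10
    for c_low, c_high, i_low, i_high in PM25_BREAKPOINTS:
        if c <= c_high:
            # Linear interpolation inside the matching breakpoint band.
            aqi = (i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low
            return round(aqi)
    return 500  # concentrations above the table are capped


def aqi_category(aqi):
    """Return the EPA category name for an AQI value."""
    if aqi <= 50:
        return "Good"
    if aqi <= 100:
        return "Moderate"
    if aqi <= 150:
        return "Unhealthy for Sensitive Groups"
    if aqi <= 200:
        return "Unhealthy"
    if aqi <= 300:
        return "Very Unhealthy"
    return "Hazardous"


print(pm25_to_aqi(8.0), aqi_category(pm25_to_aqi(8.0)))
```

Under these breakpoints, any reading at or below 12.0 µg/m³ falls in the Good category, which matches the observation that most block groups are Good in absolute terms.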

Menlo Park City

We use Voronoi polygons to transfer point estimates of outdoor air quality to census block groups. The following map shows the result of the Voronoi splitting.

The following map shows the result of the Voronoi interpolation at the block group level. From the map we can see that places near the bay tend to have higher PM2.5.
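To make the Voronoi step concrete, here is a minimal sketch in plain Python. It relies on the fact that a point lies in the Voronoi cell of its nearest sensor, and it approximates a block group's value by averaging over sample points inside the block group. The sensor coordinates, readings, sample points, and the use of a simple average as a stand-in for area weighting are all illustrative assumptions, not the procedure behind the dashboard.

```python
import math


def nearest_sensor_pm25(x, y, sensors):
    """Return the PM2.5 reading of the sensor closest to (x, y).

    A point belongs to the Voronoi cell of its nearest sensor, so this is
    the value that Voronoi splitting assigns to that point.
    """
    closest = min(sensors, key=lambda s: math.hypot(s["x"] - x, s["y"] - y))
    return closest["pm25"]


def block_group_pm25(sample_points, sensors):
    """Approximate a block group's PM2.5 by averaging the Voronoi value
    over evenly spaced sample points that fall inside the block group."""
    values = [nearest_sensor_pm25(x, y, sensors) for x, y in sample_points]
    return sum(values) / len(values)


# Hypothetical sensors in projected coordinates (arbitrary units).
sensors = [
    {"x": 0.0, "y": 0.0, "pm25": 8.0},
    {"x": 10.0, "y": 0.0, "pm25": 12.0},
]
# Hypothetical block group represented by four sample points;
# three are closer to the first sensor, one to the second.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 1.0), (8.0, 1.0)]
print(block_group_pm25(points, sensors))  # (8 + 8 + 8 + 12) / 4 = 9.0
```

With real geometries, the same idea is usually implemented by intersecting Voronoi polygons with block group polygons and weighting by the intersected area.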

The following chart shows the outdoor PM2.5 level in Menlo Park in February 2022. The PM2.5 level fluctuates over the month.

Redwood City

Similarly, the following map shows the result of the Voronoi splitting in Redwood City.

The following map shows the result of the Voronoi interpolation at the block group level. Judging by the PM2.5 level alone, air quality in Redwood City appears slightly worse than in Menlo Park. In addition, the PM2.5 level is higher in the east and in the center of the city than elsewhere.

The following chart shows the outdoor PM2.5 level in Redwood City in February 2022. As in Menlo Park, PM2.5 fluctuates over the month. In general, we find that PM2.5 tends to be relatively high on weekends and low on weekdays in both cities. One plausible explanation is that more people go out on weekends.

Population Equity

Comparison among Income Groups

We collect income data for San Mateo County from the ACS 5-year dataset (2019) at the block group level and divide income into four levels: Less than $24,999 (low), $25,000 to $44,999 (median low), $45,000 to $99,999 (median high), and $100,000 or more (high). We split PM2.5 into five levels. The following equity analysis figure shows that PM2.5 exposure is unequal across income groups. High income groups are less exposed to bad air quality (in terms of PM2.5) than they ‘should’ be based on their share of the population, whereas the low and median low income groups are more exposed than they ‘should’ be.
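The comparison behind "more exposed than they ‘should’ be" can be sketched as follows: for each PM2.5 level, compute each group's share of the people living at that level and compare it with the group's share of the whole county. The counts below are made up for illustration, and the function name `exposure_vs_population_share` is hypothetical.

```python
def exposure_vs_population_share(counts):
    """counts[level][group] = people of `group` living at PM2.5 `level`.

    Returns (county_share, share_by_level), where county_share[group] is
    the group's share of the whole population and share_by_level[level]
    [group] is its share among people living at that PM2.5 level.
    """
    groups = sorted({g for level in counts.values() for g in level})
    total = sum(sum(level.values()) for level in counts.values())
    county_share = {
        g: sum(level.get(g, 0) for level in counts.values()) / total
        for g in groups
    }
    share_by_level = {}
    for name, level in counts.items():
        level_total = sum(level.values())
        share_by_level[name] = {g: level.get(g, 0) / level_total for g in groups}
    return county_share, share_by_level


# Hypothetical counts, two income groups and two PM2.5 levels only.
counts = {
    "lowest PM2.5": {"low": 100, "high": 400},
    "highest PM2.5": {"low": 300, "high": 200},
}
county_share, share_by_level = exposure_vs_population_share(counts)
for group, share in county_share.items():
    worst = share_by_level["highest PM2.5"][group]
    label = "over" if worst > share else "under"
    print(f"{group}: county {share:.2f}, highest level {worst:.2f} ({label}-represented)")
```

In this toy example the low income group makes up 40% of the population but 60% of the people at the highest PM2.5 level, which is the pattern the figure is read for.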

Comparison among Races

We collect census race data for San Mateo County from the decennial census (2010-2020) at the block level and divide race into six categories: American Indian and Alaska Native alone, Asian alone, Black or African American alone, Native Hawaiian and Other Pacific Islander alone, Two or more races, and White alone. The following equity analysis figure shows that PM2.5 exposure is more clearly unequal across races than across income groups. White residents are less exposed to bad air quality (in terms of PM2.5) than they ‘should’ be based on their share of the population, and the other groups are correspondingly more exposed.

Data Equity

The population equity analysis above assumes that our PM2.5 data are collected evenly across different groups. In reality this is unlikely to hold. For example, suppliers may be unwilling to install sensors in places with small populations or lower economic levels, because doing so brings little economic benefit. Even so, we need data collection to be as equal as possible, since that is the premise of any further analysis.

We therefore design scores for the county that communicate the degree to which information on the air quality of different population groups is disproportionately available because of where sensors are located. In this section, we propose a set of score metrics at the block group level that show the neediness of different races, income groups, and areas (some block groups still have no sensor). We present two quantitative models for calculating the neediness score of each block group. The main idea is that if a group or area already has more sensors than it ‘should’ have (in terms of race, income group, or area coverage), its neediness score should be low.

First, we need to define a reasonable coverage area for one sensor. Taking each outdoor PurpleAir sensor as the center, we draw a circle with a radius of 1/8 mile (about 200 meters). We assume that the air quality within 1/8 mile can be represented by a single sensor. The union of these circles is therefore the area covered by all air monitoring points in San Mateo County.
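As a rough sketch of this coverage rule, the code below checks whether a location lies within 1/8 mile of any sensor using the haversine distance, and estimates the covered fraction of an area from sample points. The sensor location and sample points are hypothetical, and the sampling approach is an assumption standing in for a proper polygon buffer and intersection.

```python
import math

EARTH_RADIUS_M = 6_371_000      # mean Earth radius in meters
COVER_RADIUS_M = 1609.344 / 8   # 1/8 mile, about 201 m


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_covered(lat, lon, sensors, radius_m=COVER_RADIUS_M):
    """True if (lat, lon) lies within radius_m of at least one sensor."""
    return any(haversine_m(lat, lon, s_lat, s_lon) <= radius_m for s_lat, s_lon in sensors)


def covered_fraction(sample_points, sensors):
    """Share of sample points that fall inside some sensor's circle."""
    covered = sum(is_covered(lat, lon, sensors) for lat, lon in sample_points)
    return covered / len(sample_points)


# Hypothetical sensor location and sample points.
sensors = [(37.4530, -122.1817)]
sample_points = [
    (37.4530, -122.1817),  # at the sensor
    (37.4540, -122.1817),  # about 111 m north
    (37.4630, -122.1817),  # about 1.1 km north
]
print(covered_fraction(sample_points, sensors))  # 2 of 3 points are covered
```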

Then we look at all census block groups in San Mateo County to assess how much each one needs additional monitoring sensors, and design a scoring rule to assign scores. The higher the score, the more vulnerable the area and the more monitoring sensors it needs.

We want to collect data equally across races. For example, suppose the county population of a certain race is \(p\) and the population of this race living inside the monitored area is \(p_s\). Ideally, the percent with data, \(p_w=p_s/p\), would be the same for every race. In practice it is not, as mentioned above. To balance the groups, we assign a low score to races with high \(p_w\) and a high score to races with low \(p_w\). The same principle applies to collecting data equally across income groups and across areas.

For races, income groups, and cover areas, this step gives the following percent-with-data tables.

Race Coverage

| race | pop_withdata | pop | perc_withdata |
| --- | --- | --- | --- |
| American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 |
| Asian alone | 27173.845 | 230242 | 0.1180230 |
| Black or African American alone | 1534.139 | 15707 | 0.0976723 |
| Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 |
| Some Other Race alone | 8778.023 | 107924 | 0.0813352 |
| Two or more races | 13181.564 | 94267 | 0.1398322 |
| White alone | 55477.500 | 300188 | 0.1848092 |

Cover Area Coverage (first 5 rows)

| cbg | perc_area |
| --- | --- |
| 060816001001 | 0.0290757 |
| 060816001002 | 0.2605928 |
| 060816001003 | 0.8335262 |
| 060816003002 | 0.0413046 |
| 060816004011 | 0.1475284 |

Income Coverage

| income | pop_withdata | pop | perc_withdata |
| --- | --- | --- | --- |
| $100,000 or more | 30716.023 | 154403 | 0.1989341 |
| $25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 |
| $45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 |
| Less than $24,999 | 4080.153 | 23999 | 0.1700135 |
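The perc_withdata column is simply pop_withdata divided by pop. The short check below recomputes it from the race figures in the table above; the dictionary layout and the function name `percent_with_data` are illustrative choices, not the report's code.

```python
# (pop_withdata, pop) per race, copied from the Race Coverage table.
race_coverage = {
    "American Indian and Alaska Native alone": (581.191, 6812),
    "Asian alone": (27173.845, 230242),
    "Black or African American alone": (1534.139, 15707),
    "Native Hawaiian and Other Pacific Islander alone": (833.622, 9302),
    "Some Other Race alone": (8778.023, 107924),
    "Two or more races": (13181.564, 94267),
    "White alone": (55477.500, 300188),
}


def percent_with_data(pop_withdata, pop):
    """Share of a group's population living inside the monitored area."""
    return pop_withdata / pop


race_perc = {
    race: percent_with_data(with_data, pop)
    for race, (with_data, pop) in race_coverage.items()
}
for race, perc in race_perc.items():
    print(f"{race}: {perc:.7f}")
```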

Based on this principle, we propose two quantitative methods for assigning neediness scores.

Percent Method

This method calculates the score by comparing the percent with data of each member of a category. For example, for race coverage we compare the different races. Specifically, we apply min-max normalization to map the percent with data onto the range [0, 1]. Because the score should be low (like a penalty) when the percent with data is high, we subtract the normalized value from 1, which gives the following equation.

\[score=1-\frac{p_w-\min(p_w)}{\max(p_w)-\min(p_w)}\]
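A minimal sketch of this formula, applied to the race percent-with-data values from the coverage table, is given below. The function name `percent_scores` and the handling of the degenerate case where all groups have the same coverage are assumptions made for this sketch.

```python
def percent_scores(perc_withdata):
    """Flipped min-max normalization: the group with the lowest percent
    with data scores 1, the group with the highest scores 0."""
    lo = min(perc_withdata.values())
    hi = max(perc_withdata.values())
    if hi == lo:
        # All groups equally covered, so the formula is 0/0.
        # Returning 0 for everyone is a choice made for this sketch.
        return {g: 0.0 for g in perc_withdata}
    return {g: 1 - (p - lo) / (hi - lo) for g, p in perc_withdata.items()}


# perc_withdata per race, copied from the Race Coverage table.
race_perc = {
    "American Indian and Alaska Native alone": 0.0853187,
    "Asian alone": 0.1180230,
    "Black or African American alone": 0.0976723,
    "Native Hawaiian and Other Pacific Islander alone": 0.0896175,
    "Some Other Race alone": 0.0813352,
    "Two or more races": 0.1398322,
    "White alone": 0.1848092,
}
race_scores = percent_scores(race_perc)
for race, score in sorted(race_scores.items(), key=lambda kv: kv[1]):
    print(f"{race}: {score:.7f}")
```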

The following tables show the scores we get with this method.

Race Coverage

| race | pop_withdata | pop | perc_withdata | score |
| --- | --- | --- | --- | --- |
| White alone | 55477.500 | 300188 | 0.1848092 | 0.0000000 |
| Two or more races | 13181.564 | 94267 | 0.1398322 | 0.4346694 |
| Asian alone | 27173.845 | 230242 | 0.1180230 | 0.6454398 |
| Black or African American alone | 1534.139 | 15707 | 0.0976723 | 0.8421137 |
| Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 | 0.9199578 |
| American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 | 0.9615026 |
| Some Other Race alone | 8778.023 | 107924 | 0.0813352 | 1.0000000 |

Income Coverage

| income | pop_withdata | pop | perc_withdata | score |
| --- | --- | --- | --- | --- |
| $100,000 or more | 30716.023 | 154403 | 0.1989341 | 0.0000000 |
| $25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 | 0.9043295 |
| $45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 | 0.9517088 |
| Less than $24,999 | 4080.153 | 23999 | 0.1700135 | 1.0000000 |

Cover Area Coverage (first 5 rows)

| cbg | perc_area | score |
| --- | --- | --- |
| 060816001001 | 0.0290757 | 0.9661246 |
| 060816001002 | 0.2605928 | 0.6963898 |
| 060816001003 | 0.8335262 | 0.0288795 |
| 060816003002 | 0.0413046 | 0.9518771 |
| 060816004011 | 0.1475284 | 0.8281183 |

After we get the score for each item, we can calculate the final neediness score for each block group. For example, to calculate the race score for a specific block group, we multiply the score of each race by the population of that race in the block group, sum the products, and divide by the total population of the block group (a weighted average). This gives the following map, with three scores for each block group.

We can select the block groups most in need of additional sensors based on different scores. For example, if we only want to make data collection more equal across races, we can select the places with high scores in the Race Score layer, such as some block groups near East Palo Alto. Alternatively, one can weight the three scores according to one's own concerns and obtain a new combined score.
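The population-weighted block group score and the optional combination of the three scores can be sketched as below. The race scores come from the percent-method table; the block group populations, the income and cover area scores, and the weights are hypothetical, and both function names are illustrative.

```python
def block_group_score(group_scores, group_pop):
    """Population-weighted average of group scores within one block group."""
    total = sum(group_pop.values())
    if total == 0:
        return None  # no residents, so the weighted average is undefined
    return sum(group_scores[g] * pop for g, pop in group_pop.items()) / total


def combined_score(scores, weights):
    """Weighted average of the race, income, and cover area scores."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total_weight


# Percent-method race scores, copied from the table (three races shown).
race_scores = {
    "White alone": 0.0,
    "Asian alone": 0.6454398,
    "Some Other Race alone": 1.0,
}
# Hypothetical population of one block group.
pop = {"White alone": 600, "Asian alone": 300, "Some Other Race alone": 100}
race_score = block_group_score(race_scores, pop)
print(round(race_score, 4))

# Hypothetical income and cover area scores and user-chosen weights.
scores = {"race": race_score, "income": 0.5, "area": 0.9}
weights = {"race": 2, "income": 1, "area": 1}
print(round(combined_score(scores, weights), 4))
```

Normalizing by the sum of the weights keeps the combined score on the same 0 to 1 scale as its components, so block groups remain comparable whatever weights are chosen.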

Rank Method

Another method is the rank method with exponential decay, which we have used before:

\[score=e^{-\lambda (Rank(p_w)-1)}\]

Take the race score as an example. We give the highest score (1) to the race whose \(p_w\) is the minimum (in our case, Some Other Race alone), and the scores of the remaining races decrease exponentially, down to roughly half of the top score for the race whose \(p_w\) is the maximum (in our case, White alone). The tabulated scores are consistent with \(\lambda=\ln 2/n\), where \(n\) is the number of groups being ranked.

Next, we give scores according to the area covered by air quality sensors. Since many block groups have the same coverage rate (many have no monitored area at all), they are tied and share the same rank. The higher the coverage, the lower the score, reflecting that the block group already has enough coverage.

As with race, we examine whether the distribution of air quality sensors differs by income. Coverage is lowest for the lowest income group (Less than $24,999) and highest for the highest income group ($100,000 or more). We therefore score each income group by its sensor coverage and calculate a weighted average for each census block group.
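The rank method can be sketched as follows. Two points here are assumptions inferred from the tabulated scores rather than stated in the text: the group ranked first receives a score of exactly 1 (so the exponent uses the rank minus one), and the decay rate is \(\lambda=\ln 2/n\) for \(n\) ranked groups. Ties share a rank and the next rank skips ahead. The function names are hypothetical.

```python
import math


def competition_rank(values):
    """Rank 1 = smallest value. Ties share a rank and the next rank
    skips ahead, e.g. [0, 0, 0, 0.2] gives ranks [1, 1, 1, 4]."""
    return [1 + sum(other < v for other in values) for v in values]


def rank_scores(values, lam=None):
    """Exponentially decaying score by rank; rank 1 scores exactly 1."""
    if lam is None:
        # Assumption inferred from the tabulated scores: lambda = ln(2) / n.
        lam = math.log(2) / len(values)
    return [math.exp(-lam * (rank - 1)) for rank in competition_rank(values)]


# perc_withdata per income group, copied from the Income Coverage table.
income_perc = {
    "$100,000 or more": 0.1989341,
    "$25,000 to $44,999": 0.1727803,
    "$45,000 to $99,999": 0.1714101,
    "Less than $24,999": 0.1700135,
}
income_scores = dict(zip(income_perc, rank_scores(list(income_perc.values()))))
for income, score in income_scores.items():
    print(f"{income}: {score:.7f}")
```

With this choice of \(\lambda\) the last-ranked group scores \(2^{-(n-1)/n}\), slightly above one half.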

The following tables show the scores we get with this method.

Race Coverage

| race | pop_withdata | pop | perc_withdata | rank | score |
| --- | --- | --- | --- | --- | --- |
| Some Other Race alone | 8778.023 | 107924 | 0.0813352 | 1 | 1.0000000 |
| American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 | 2 | 0.9057237 |
| Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 | 3 | 0.8203354 |
| Black or African American alone | 1534.139 | 15707 | 0.0976723 | 4 | 0.7429971 |
| Asian alone | 27173.845 | 230242 | 0.1180230 | 5 | 0.6729501 |
| Two or more races | 13181.564 | 94267 | 0.1398322 | 6 | 0.6095068 |
| White alone | 55477.500 | 300188 | 0.1848092 | 7 | 0.5520448 |

Income Coverage

| income | pop_withdata | pop | perc_withdata | rank | score |
| --- | --- | --- | --- | --- | --- |
| Less than $24,999 | 4080.153 | 23999 | 0.1700135 | 1 | 1.0000000 |
| $45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 | 2 | 0.8408964 |
| $25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 | 3 | 0.7071068 |
| $100,000 or more | 30716.023 | 154403 | 0.1989341 | 4 | 0.5946036 |

Cover Area Coverage (first 5 rows)

| cbg | score_cover_area |
| --- | --- |
| 060816001001 | 0.8184677 |
| 060816001002 | 0.6284733 |
| 060816001003 | 0.5017759 |
| 060816003002 | 0.8083739 |
| 060816004011 | 0.7039799 |

Similarly, we can also produce a score map using this method. Comparing the two, we find that the main results of the two scoring methods are similar. The rank method is more intuitive for the race score and income score, whereas the percent method seems to work better for the cover area score. This may be because many block groups have no monitored area at all, so too many of them tie at rank 1 and the next rank might be 40 rather than 2 or 3. These places should score high, but their scores should not be too far from those of places with a small amount of sensor coverage.

Summary

Equity analysis is an interesting and critical topic, and a fitting one for our last assignment. We learn something new every time we do it. Geographic equity is fairly intuitive: we can often deduce its distribution from basic geographical knowledge (for example, air circulation may be poor in some places, which leads to higher PM2.5 there). For population equity, experience also gives us a rough picture. For example, high income often brings a higher quality of life, such as living in a place with low PM2.5. Data equity, however, is an essential issue that we tend to ignore and cannot easily picture. Even if we develop a method to promote data equity, that is, to collect data equally across areas and groups, there will always be obstacles to applying it, such as economic considerations. But maybe our job is to overcome these obstacles!