Group members: Huanghe YaoJing, Minxue Gu, Jinpu Cao
We discussed and implemented the first two parts together. In addition, I completed the percent method in the data equity part and contributed some insights.
Here are our dashboard links:
Geographic Equity
https://yaojinghuanghe.shinyapps.io/dashboard_pm25/
PM2.5 Equity among different groups:
https://yaojinghuanghe.shinyapps.io/dashboard_pm25_equity/
Data Equity - Percent Score
https://yaojinghuanghe.shinyapps.io/dashboard_data_equity_score_perc/
Data Equity - Rank Score
https://yaojinghuanghe.shinyapps.io/dashboard_data_equity_score/
The report assesses the equity implications of air quality in San Mateo County, using PurpleAir sensor data. It takes two cities (Menlo Park and Redwood City) as examples and analyzes their geographic equity by mapping the Air Quality Index (AQI) of each block group and plotting the average PM2.5 in February. Population equity is assessed by comparing the PM2.5 distribution across income groups and racial groups. Finally, in the data equity part, the report offers suggestions to suppliers of PM2.5 sensors by proposing two sets of metrics for identifying which block groups have the greatest need for additional sensors.
Raw sensor data is extracted from PurpleAir and then converted into PM2.5 and AQI values. There are 1055 sensors in San Mateo County in total. The following map shows the relative AQI of each sensor in the county. From the map we can see that the AQI near East Palo Alto and Redwood City is not as good as in other places. In absolute terms, however, the AQI of most block groups in the county is "Good".
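The report does not say how PM2.5 is converted to AQI. As a minimal sketch, the conversion could follow the piecewise-linear EPA formula; the breakpoints below are the 24-hour PM2.5 breakpoints that applied in 2022, and the function names `pm25_to_aqi` and `aqi_category` are hypothetical.

```python
# EPA 24-hour PM2.5 breakpoints in effect in 2022 (before the 2024 revision).
# Rows are (conc_low, conc_high, aqi_low, aqi_high); assumed, not from the report.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]

# Upper AQI bound of each category.
AQI_CATEGORIES = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
    (500, "Hazardous"),
]


def pm25_to_aqi(pm25):
    """Piecewise-linear AQI for a PM2.5 concentration in ug/m3."""
    # Simplification: round to one decimal (EPA guidance truncates instead).
    conc = round(max(pm25, 0.0), 1)
    for c_lo, c_hi, a_lo, a_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((a_hi - a_lo) / (c_hi - c_lo) * (conc - c_lo) + a_lo)
    return 500  # beyond the top breakpoint; capped here for simplicity


def aqi_category(aqi):
    """Name of the AQI category that contains the given AQI value."""
    for upper, name in AQI_CATEGORIES:
        if aqi <= upper:
            return name
    return "Hazardous"
```

Under these assumptions, any sensor reporting at most 12.0 µg/m³ falls in the "Good" category, which matches the observation that most block groups are "Good" in absolute terms.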
We use Voronoi tessellation to transform point estimates of outdoor air quality into census block group values. The following map shows the result of the Voronoi split.
The following map shows the result of the Voronoi interpolation at the block group level. From the map we can see that places near the bay tend to have higher PM2.5.
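One common way to go from Voronoi cells to a block group value is an area-weighted average of the cells that overlap the block group. The sketch below assumes the intersection areas have already been computed with a GIS tool; the function name and the input layout are hypothetical, and the report may have used a different weighting.

```python
def area_weighted_pm25(pieces):
    """Area-weighted PM2.5 for one block group.

    pieces: list of (intersection_area, pm25) pairs, one per Voronoi cell
    that overlaps the block group. Areas can be in any consistent unit.
    """
    total_area = sum(area for area, _ in pieces)
    if total_area == 0:
        return None  # no Voronoi cell overlaps this block group
    return sum(area * pm25 for area, pm25 in pieces) / total_area


# Made-up example: two cells cover the block group, one twice as large.
example = [(2.0, 10.0), (1.0, 4.0)]
block_group_pm25 = area_weighted_pm25(example)
```

In the made-up example, the larger cell dominates, so the block group value lies closer to 10 than to 4.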
The following chart shows the outdoor PM2.5 level in Menlo Park in February 2022. The PM2.5 level in Menlo Park fluctuates over the month.
Similarly, the following map shows the result of the Voronoi split in Redwood City.
The following map shows the result of the Voronoi interpolation at the block group level. Judging by PM2.5 level alone, Redwood City's air quality appears slightly worse than Menlo Park's. In addition, the PM2.5 level is higher in the east and in the center of the city than elsewhere.
The following chart shows the outdoor PM2.5 level in Redwood City in February 2022. As in Menlo Park, the PM2.5 level fluctuates over the month. In general, we find that PM2.5 tends to be relatively high on weekends and low on weekdays in both cities, which may be because more people go out on weekends.
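The weekend and weekday comparison can be checked by grouping daily means by day of week. This is a minimal sketch with made-up values; `weekend_weekday_means` is a hypothetical helper, not part of the report's code.

```python
from datetime import date
from statistics import mean


def weekend_weekday_means(daily_pm25):
    """Return (weekend mean, weekday mean) from a {date: PM2.5} mapping."""
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    weekend = [v for d, v in daily_pm25.items() if d.weekday() >= 5]
    weekday = [v for d, v in daily_pm25.items() if d.weekday() < 5]
    return mean(weekend), mean(weekday)


# Made-up daily means for four days in February 2022.
sample = {
    date(2022, 2, 4): 6.0,   # Friday
    date(2022, 2, 5): 9.0,   # Saturday
    date(2022, 2, 6): 11.0,  # Sunday
    date(2022, 2, 7): 5.0,   # Monday
}
weekend_avg, weekday_avg = weekend_weekday_means(sample)
```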
We collect income data for San Mateo County from the ACS 5-year dataset (2019) at the block group level and divide income into four levels: Less than $24,999 (low), $25,000 to $44,999 (median low), $45,000 to $99,999 (median high), and $100,000 or more (high). We split PM2.5 into five levels. The following equity analysis figure shows that PM2.5 exposure is unequal across income groups. High income groups are less exposed to bad air quality (in terms of PM2.5) than they ‘should’ be based on their share of the population, whereas low income and median low income groups are more exposed than they ‘should’ be.
We collect census race data for San Mateo County from decennial data (2010-2020) at the block level and divide race into six categories: American Indian and Alaska Native alone, Asian alone, Black or African American alone, Native Hawaiian and Other Pacific Islander alone, Two or more races, and White alone. The following equity analysis figure shows that PM2.5 exposure is more clearly unequal across races than across income groups. White people are less exposed to bad air quality (in terms of PM2.5) than they ‘should’ be based on their share of the population, and vice versa for the other groups.
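The comparison between actual exposure and what a group ‘should’ experience can be expressed as the gap between a group's share of the population within one PM2.5 level and its share of the total population. The sketch below uses made-up counts; the function name and the input layout are hypothetical.

```python
def exposure_vs_share(pop_by_level):
    """Gap between a group's share within each PM2.5 level and its overall share.

    pop_by_level: {pm25_level: {group: population}}.
    A positive gap means the group is over-represented in that level.
    """
    groups = {g for counts in pop_by_level.values() for g in counts}
    total = sum(sum(counts.values()) for counts in pop_by_level.values())
    overall = {
        g: sum(counts.get(g, 0) for counts in pop_by_level.values()) / total
        for g in groups
    }
    gaps = {}
    for level, counts in pop_by_level.items():
        level_total = sum(counts.values())
        if level_total == 0:
            continue  # nobody lives in this PM2.5 level
        gaps[level] = {
            g: counts.get(g, 0) / level_total - overall[g] for g in groups
        }
    return gaps


# Made-up counts for two PM2.5 levels and two income groups.
sample = {
    "low PM2.5": {"high income": 80, "low income": 20},
    "high PM2.5": {"high income": 40, "low income": 60},
}
gaps = exposure_vs_share(sample)
```

In the made-up data, the low income group makes up 40% of the population but 60% of the people in the high PM2.5 level, a gap of +0.2.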
The population equity analysis above assumes that our PM2.5 data is collected evenly across different groups. In reality this rarely holds. For example, suppliers may be unwilling to install sensors in places with relatively small populations or lower economic levels, because doing so yields little economic benefit. But we still need that data: making data collection as equal as possible is the premise of any further analysis. We therefore try to design a score (or scores) for the county that communicates the degree to which information on the air quality of different population groups is disproportionately available, due to the availability of sensors. In this section, we propose a set of score metrics at the block group level that show the neediness of different races, income groups, and areas (since some block groups still have no sensor). Two different quantitative models are presented for calculating each block group's neediness score. The main idea is that if a group or area already has more sensors than it ‘should’ have (in terms of race, income group, or area coverage), its neediness score should be low.
First, we need to identify a reasonable coverage area for one sensor. Taking each outdoor PurpleAir sensor as the center, we draw a circle with a radius of 1/8 mile (about 200 meters). We assume that air quality within 1/8 mile can be represented by one sensor. The union of these circles is therefore the area covered by all air monitoring points in San Mateo County.
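The 1/8 mile coverage rule can be illustrated with a point-in-buffer check based on great-circle distance. This is only a sketch: the sensor coordinates are made up, the helper names are hypothetical, and the report's actual buffers were presumably drawn with GIS tooling.

```python
import math

EARTH_RADIUS_M = 6_371_000.0
BUFFER_RADIUS_M = 1609.344 / 8  # 1/8 mile, about 201 meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_covered(lat, lon, sensors, radius_m=BUFFER_RADIUS_M):
    """True if the point lies within radius_m of at least one sensor."""
    return any(
        haversine_m(lat, lon, s_lat, s_lon) <= radius_m
        for s_lat, s_lon in sensors
    )


# Made-up sensor location and two test points due north of it.
sensors = [(37.4530, -122.1817)]
near = is_covered(37.4540, -122.1817, sensors)  # roughly 111 m away
far = is_covered(37.4560, -122.1817, sensors)   # roughly 334 m away
```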
Then we look at all census block groups in San Mateo County to assess each block group's demand for additional monitoring sensors, and design a scoring rule to assign scores. The higher the score, the more vulnerable the area is and the more monitoring sensors it needs.
We want to collect data equally across races. For example, let \(p\) be the population of a certain race in the county and \(p_s\) the population of that race living inside the monitored area. Ideally, the percent with data, \(p_w=p_s/p\), should be the same across races. As mentioned above, this is rarely the case. To balance this, we assign a low score to races with high \(p_w\) and a high score to races with low \(p_w\). The same principle applies to collecting data equally across income groups and across areas.
For races, income groups, and coverage areas, this step gives the following percent with data tables.
race | pop_withdata | pop | perc_withdata |
---|---|---|---|
American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 |
Asian alone | 27173.845 | 230242 | 0.1180230 |
Black or African American alone | 1534.139 | 15707 | 0.0976723 |
Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 |
Some Other Race alone | 8778.023 | 107924 | 0.0813352 |
Two or more races | 13181.564 | 94267 | 0.1398322 |
White alone | 55477.500 | 300188 | 0.1848092 |

cbg | perc_area |
---|---|
060816001001 | 0.0290757 |
060816001002 | 0.2605928 |
060816001003 | 0.8335262 |
060816003002 | 0.0413046 |
060816004011 | 0.1475284 |

income | pop_withdata | pop | perc_withdata |
---|---|---|---|
$100,000 or more | 30716.023 | 154403 | 0.1989341 |
$25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 |
$45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 |
Less than $24,999 | 4080.153 | 23999 | 0.1700135 |
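The perc_withdata column is the monitored population divided by the total population of each group. A minimal sketch that reproduces the race column from the pop_withdata and pop values in the table (the variable and function names are hypothetical):

```python
# (pop_withdata, pop) per race, copied from the table.
race_counts = {
    "American Indian and Alaska Native alone": (581.191, 6812),
    "Asian alone": (27173.845, 230242),
    "Black or African American alone": (1534.139, 15707),
    "Native Hawaiian and Other Pacific Islander alone": (833.622, 9302),
    "Some Other Race alone": (8778.023, 107924),
    "Two or more races": (13181.564, 94267),
    "White alone": (55477.500, 300188),
}


def percent_with_data(pop_withdata, pop):
    """Share of a group's population that lives inside the monitored area."""
    return pop_withdata / pop


perc_withdata = {
    race: percent_with_data(covered, total)
    for race, (covered, total) in race_counts.items()
}
```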
Based on this principle, we have two quantitative methods for assigning the neediness scores: a percent method and a rank method.
The percent method calculates the score by comparing the percent with data of each member. For example, for race coverage, we compare the different races. Specifically, we rescale the percent with data to the range [0, 1] (min-max normalization) and use it as the basis for the score. Because the score should be low (like a penalty) when the percent with data is high, we subtract the rescaled value from 1, as in the following equation.
\[score=1-\frac{p_w-\min(p_w)}{\max(p_w)-\min(p_w)}\]
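As a sketch of the equation above, the following applies the inverted min-max scaling to the race perc_withdata values from the earlier table. The function name `percent_scores` is hypothetical, and the inputs are the rounded table values, so results match the published scores only to about six decimals.

```python
def percent_scores(p_w):
    """Inverted min-max scaling: lowest percent with data gets 1, highest gets 0."""
    lo, hi = min(p_w.values()), max(p_w.values())
    if hi == lo:
        # All groups are equally covered; no group is needier than another.
        return {name: 0.0 for name in p_w}
    return {name: 1 - (v - lo) / (hi - lo) for name, v in p_w.items()}


# perc_withdata per race, copied from the table.
race_p_w = {
    "American Indian and Alaska Native alone": 0.0853187,
    "Asian alone": 0.1180230,
    "Black or African American alone": 0.0976723,
    "Native Hawaiian and Other Pacific Islander alone": 0.0896175,
    "Some Other Race alone": 0.0813352,
    "Two or more races": 0.1398322,
    "White alone": 0.1848092,
}
race_scores = percent_scores(race_p_w)
```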
The following table shows the scores we get with this method.
race | pop_withdata | pop | perc_withdata | score |
---|---|---|---|---|
White alone | 55477.500 | 300188 | 0.1848092 | 0.0000000 |
Two or more races | 13181.564 | 94267 | 0.1398322 | 0.4346694 |
Asian alone | 27173.845 | 230242 | 0.1180230 | 0.6454398 |
Black or African American alone | 1534.139 | 15707 | 0.0976723 | 0.8421137 |
Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 | 0.9199578 |
American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 | 0.9615026 |
Some Other Race alone | 8778.023 | 107924 | 0.0813352 | 1.0000000 |

income | pop_withdata | pop | perc_withdata | score |
---|---|---|---|---|
$100,000 or more | 30716.023 | 154403 | 0.1989341 | 0.0000000 |
$25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 | 0.9043295 |
$45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 | 0.9517088 |
Less than $24,999 | 4080.153 | 23999 | 0.1700135 | 1.0000000 |

cbg | perc_area | score |
---|---|---|
060816001001 | 0.0290757 | 0.9661246 |
060816001002 | 0.2605928 | 0.6963898 |
060816001003 | 0.8335262 | 0.0288795 |
060816003002 | 0.0413046 | 0.9518771 |
060816004011 | 0.1475284 | 0.8281183 |
After we get the score for each item, we can calculate the final neediness score for each block group. For example, to calculate the race score for a specific block group, we multiply the score of each race by the population of that race in the block group, sum the products, and divide by the total population of the block group (a weighted average). This gives the following map, with three scores for each block group.
We can select the block groups that most need additional sensors based on the different scores. For example, if we only want to make data collection more equal across races, we can select places with high scores in the Race Score layer, such as some block groups near East Palo Alto. Alternatively, one can weight the three scores according to one's own concerns to obtain a new combined score.
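The population-weighted block group score and the optional combination of the three scores can be sketched as follows. The block group population, the income and area scores, and the weights are all made up for illustration; only the three race scores come from the percent-method table, and the function names are hypothetical.

```python
def block_group_score(group_scores, group_pop):
    """Population-weighted average of group scores within one block group."""
    total = sum(group_pop.values())
    if total == 0:
        return None  # unpopulated block group
    return sum(group_scores[g] * n for g, n in group_pop.items()) / total


def combined_score(scores, weights):
    """Weighted average of the race, income, and area scores."""
    weight_total = sum(weights.values())
    return sum(scores[k] * w for k, w in weights.items()) / weight_total


# Race scores from the percent-method table.
race_scores = {
    "White alone": 0.0,
    "Asian alone": 0.6454398,
    "Some Other Race alone": 1.0,
}
# Made-up population of one block group.
example_pop = {"White alone": 500, "Asian alone": 300, "Some Other Race alone": 200}
race_score = block_group_score(race_scores, example_pop)

# Made-up income and area scores, and made-up weights reflecting one's priorities.
overall = combined_score(
    {"race": race_score, "income": 0.8, "area": 1.0},
    {"race": 0.5, "income": 0.3, "area": 0.2},
)
```

Dividing by the sum of the weights keeps the combined score in the same 0 to 1 range as its inputs, whatever weights are chosen.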
The other method is the rank method with exponential decay, which we have used before: \[score=e^{-\lambda (Rank(p_w)-1)}\] where \(Rank(p_w)=1\) for the smallest \(p_w\). The race and income scores in the tables that follow correspond to \(\lambda=\ln 2/n\), with \(n\) the number of groups being ranked. Taking the race score as an example, we give the highest score (1) to the race whose \(p_w\) is minimum (in our case, Some Other Race alone), and the rest decrease exponentially, with the race whose \(p_w\) is maximum (in our case, White alone) receiving roughly half of the top score (about 0.55). Next, we assign scores according to the area covered by air quality sensors. Because many block groups have no sensor coverage at all, we treat them as tied for first place. In this ranking, the higher the coverage, the lower the score, reflecting that the block group already has enough coverage. As with race, we also examined whether the distribution of air quality sensors differs by income. We found that coverage was lowest for the lowest income group and highest for the highest income group. We therefore scored sensor coverage for each income group and calculated a weighted average for each census block group.
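As a sketch of the rank method, the code below ranks groups by percent with data in ascending order and applies a decay of the form \(2^{-(rank-1)/n}\), where \(n\) is the number of groups. This particular decay is an assumption chosen because it reproduces the race and income scores tabulated in this report; the area scores were not checked against it. Ties share the lowest rank (competition ranking), and the function name is hypothetical.

```python
def rank_decay_scores(p_w):
    """Exponential-decay score by ascending rank of percent with data.

    Rank 1 (lowest percent with data) scores 1.0. Tied values share the
    lowest rank, so the rank after k tied first places is k + 1.
    """
    n = len(p_w)
    ordered = sorted(p_w.values())
    scores = {}
    for name, value in p_w.items():
        rank = ordered.index(value) + 1  # first position of this value
        scores[name] = 2 ** (-(rank - 1) / n)
    return scores


# perc_withdata per income group, copied from the table.
income_p_w = {
    "$100,000 or more": 0.1989341,
    "$25,000 to $44,999": 0.1727803,
    "$45,000 to $99,999": 0.1714101,
    "Less than $24,999": 0.1700135,
}
income_scores = rank_decay_scores(income_p_w)
```

Because the score depends only on rank, two groups with nearly identical coverage (such as the three lower income groups here) can still receive clearly different scores, which is one practical difference from the percent method.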
The following table shows the scores we get with this method.
race | pop_withdata | pop | perc_withdata | rank | score |
---|---|---|---|---|---|
Some Other Race alone | 8778.023 | 107924 | 0.0813352 | 1 | 1.0000000 |
American Indian and Alaska Native alone | 581.191 | 6812 | 0.0853187 | 2 | 0.9057237 |
Native Hawaiian and Other Pacific Islander alone | 833.622 | 9302 | 0.0896175 | 3 | 0.8203354 |
Black or African American alone | 1534.139 | 15707 | 0.0976723 | 4 | 0.7429971 |
Asian alone | 27173.845 | 230242 | 0.1180230 | 5 | 0.6729501 |
Two or more races | 13181.564 | 94267 | 0.1398322 | 6 | 0.6095068 |
White alone | 55477.500 | 300188 | 0.1848092 | 7 | 0.5520448 |

income | pop_withdata | pop | perc_withdata | rank | score |
---|---|---|---|---|---|
Less than $24,999 | 4080.153 | 23999 | 0.1700135 | 1 | 1.0000000 |
$45,000 to $99,999 | 10563.832 | 61629 | 0.1714101 | 2 | 0.8408964 |
$25,000 to $44,999 | 4062.411 | 23512 | 0.1727803 | 3 | 0.7071068 |
$100,000 or more | 30716.023 | 154403 | 0.1989341 | 4 | 0.5946036 |

cbg | score_cover_area |
---|---|
060816001001 | 0.8184677 |
060816001002 | 0.6284733 |
060816001003 | 0.5017759 |
060816003002 | 0.8083739 |
060816004011 | 0.7039799 |
Similarly, we can also produce a score map with this method. Comparing the two, we find that the main results of the two scoring methods are similar. Intuitively, the rank method is more natural for the race and income scores. However, the percent method seems more sensitive for the coverage area score. This may be because many block groups have no sensor monitoring area (too many share rank 1, so the next rank might be 40 rather than 2 or 3). These places should score high, but not too far above places with a small amount of sensor coverage.
Equity analysis is an interesting and critical topic, especially in this last assignment, and we always learn something new when doing it. Geographic equity is common and intuitive: we might deduce its distribution from geographical knowledge alone (for example, air circulation may be poor in some places, which causes higher PM2.5 than elsewhere). For population equity, we might also form an overall picture from experience. For example, high income often brings a higher quality of life, such as living in a place with low PM2.5. Data equity, however, is an essential issue that we often ignore and cannot easily picture. Even if we develop a method to promote data equity, that is, to collect data equally across areas and groups, there will always be obstacles to applying it, such as economic considerations. But maybe our job is to overcome these obstacles!