Using the OpenAQ API¶

The openaq package is an easy-to-use Python wrapper built around the OpenAQ API. Complete API documentation can be found on the OpenAQ website.

There are no keys or rate limits (as of March 2017), so working with the API is straightforward. If you are building a website or app, you may want to just use the Python wrapper and interact with the data in JSON format. However, the rest of this tutorial assumes you are interested in analyzing the data. To get more out of it, I recommend installing seaborn for controlling the aesthetics of plots and pandas for working with data as DataFrames. For more information on these, check out the installation section of this documentation.

From this point forward, I assume you have at least a basic knowledge of Python and matplotlib. This documentation was built using the following versions of all packages:

import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import openaq
import warnings

warnings.simplefilter('ignore')

%matplotlib inline

# Set major seaborn aesthetics
sns.set("notebook", style='ticks', font_scale=1.0)

# Increase the quality of inline plots
mpl.rcParams['figure.dpi'] = 500

print("pandas v{}".format(pd.__version__))
print("matplotlib v{}".format(mpl.__version__))
print("seaborn v{}".format(sns.__version__))
print("openaq v{}".format(openaq.__version__))

pandas v0.21.0
matplotlib v2.1.0
seaborn v0.8.1
openaq v1.1.0

OpenAQ API¶

The OpenAQ API has eight endpoints that we are interested in:

cities: provides a simple listing of cities within the platform
countries: provides a simple listing of countries within the platform
fetches: provides data about individual fetch operations that are used to populate data in the platform
latest: provides the latest value of each available parameter for every location in the system
locations: provides a list of measurement locations and their meta data
measurements: provides data about individual measurements
parameters: provides a simple listing of parameters within the platform
sources: provides a list of data sources

For detailed documentation about each one in the context of this API wrapper, please check out the API documentation.
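Each endpoint above corresponds to an HTTP GET request whose keyword arguments become query-string parameters. The sketch below only builds such a URL with the standard library and never contacts the network. The base URL `https://api.openaq.org/v1` and the helper name `build_url` are assumptions made for illustration, not taken from this page, so check the API documentation for the current address.

```python
from urllib.parse import urlencode

# Assumed base URL, for illustration only (not taken from this page)
BASE_URL = "https://api.openaq.org/v1"

# The eight endpoints listed above
ENDPOINTS = {"cities", "countries", "fetches", "latest",
             "locations", "measurements", "parameters", "sources"}


def build_url(endpoint, **params):
    """Return the GET URL for an endpoint (hypothetical helper)."""
    if endpoint not in ENDPOINTS:
        raise ValueError("unknown endpoint: {}".format(endpoint))
    # Drop unset parameters and sort them for a stable query string
    query = urlencode(sorted((k, v) for k, v in params.items() if v is not None))
    url = "{}/{}".format(BASE_URL, endpoint)
    return "{}?{}".format(url, query) if query else url


print(build_url("measurements", city="Delhi", parameter="pm25", limit=100))
```

The wrapper hides this step, which is why the rest of the tutorial only ever passes keyword arguments such as `city`, `parameter` and `limit`.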

Your First Request¶

Real quick, let’s go ahead and initiate an instance of the openaq.OpenAQ class so we can begin looking at data:

api = openaq.OpenAQ()

Cities¶

The cities API endpoint lists the cities available within the platform. Results can be subselected by country and paginated to retrieve all results in the database. Let’s start by performing a basic query with an increased limit (so we can get all of them) and return it as a DataFrame:

resp = api.cities(df=True, limit=10000)

# summarize the DataFrame (row count, columns, dtypes)
resp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2020 entries, 0 to 2019
Data columns (total 4 columns):
city         2020 non-null object
count        2020 non-null int64
country      2020 non-null object
locations    2020 non-null int64
dtypes: int64(2), object(2)
memory usage: 63.2+ KB

So we retrieved 2,020 entries from the database. We can then take a look at the first ten:

print (resp.head(10))

                                           city   count country  locations
                          Escaldes-Engordany   13324      AD          2
                                      unused     314      AD          1
                                   Abu Dhabi     477      AE          1
                                Buenos Aires   14976      AR          4
            Amt der K�rntner Landesregierung  104663      AT         16
                                     Austria  121987      AT        174
Amt der Ober�sterreichischen Landesregierung  154329      AT         16
    Amt der Burgenl�ndischen Landesregierung   35608      AT          3
          Amt der Salzburger Landesregierung   94197      AT         12
    Amt der Burgenländischen Landesregierung     471      AT          1

Let’s try to find out which ones are in India:

print (resp.query("country == 'IN'"))

                   city    count country  locations
            Mumbai   309477      IN          3
              Kota    23264      IN          2
             Delhi  1140259      IN         35
         Ghaziabad    99087      IN          2
        Barddhaman     2470      IN          3
           Asansol     1590      IN          2
           Lucknow   271912      IN          5
       Muzaffarpur   116841      IN          1
         Hyderabad   465962      IN         15
     Visakhapatnam   208237      IN          8
          Amritsar    77849      IN          1
           Bhiwadi    20846      IN          1
           Kolkata   168542      IN          7
         Bengaluru   371649      IN          8
            Howrah    50263      IN          4
       Navi Mumbai     7725      IN          1
         Ahmedabad    57714      IN          2
         Faridabad   113546      IN          2
            Nashik    76217      IN          4
            Haldia   115284      IN          2
Thiruvananthapuram    46124      IN          2
            Rohtak    95050      IN          1
             Medak     2671      IN          1
              Pune   145450      IN          1
          Tirupati   159117      IN          4
             Ajmer    25266      IN          2
        Vijayawara    34902      IN          1
          Durgapur    78761      IN          2
        Jorapokhar    35558      IN          1
             NOIDA    12061      IN          1
..                  ...      ...     ...        ...
              Gaya    76810      IN          1
        Chandrapur   232244      IN          2
           Chennai   290415      IN          4
          Siliguri       30      IN          2
             Thane   130025      IN          3
            Nagpur    72328      IN          5
         Mandideep     7847      IN          1
             Patna    75338      IN          1
           Dhanbad        3      IN          1
        Aurangabad   113529      IN          1
            Kanpur   159678      IN          2
         Moradabad    24595      IN          1
          Chittoor     2013      IN          1
             Alwar    14017      IN          1
        Vijayawada    11624      IN          2
           Udaipur    25699      IN          1
            Jaipur   190441      IN          6
          Ludhiana    72308      IN          1
          Varanasi   181673      IN          1
              Pali    23627      IN          2
            Ujjain    16924      IN          1
         Singrauli    14636      IN          1
           Jodhpur   151172      IN          1
              Agra    84301      IN          1
             Dewas    11671      IN          1
         Amaravati    11432      IN          1
  Mandi Gobindgarh    48579      IN          1
           Solapur   253940      IN          1
         Pithampur    12173      IN          1
           Gurgaon   147910      IN          1

[62 rows x 4 columns]

Great! For the rest of the tutorial, we are going to focus on Delhi, India. Why? Well, because there are over 1.1 million data points and my personal research is primarily in India. We will also take a look at some \(SO_2\) data from Hawai’i later on (another great research locale).

Countries¶

Similar to the cities endpoint, the countries endpoint lists the countries available. The only parameters we have to play with are the limit and page number. If we want to grab them all, we can just up the limit to the maximum (10000).

res = api.countries(limit=10000, df=True)

print (res.head())

   cities code    count  locations       name
     2   AD    13638          3    Andorra
     1   AR    14976          4  Argentina
    18   AU  3248142         99  Australia
    16   AT  1521351        306    Austria
     1   BH    13820          1    Bahrain

Fetches¶

If you are interested in getting information pertaining to the individual data fetch operations, go ahead and use this endpoint. Most people won’t need to use this. This API method does not allow the df parameter; if you would like it to be added, drop me a message.

Otherwise, here is how you can access the JSON-formatted data:

status, resp = api.fetches(limit=1)

# Print out the meta info
resp['meta']

{'found': 92506,
 'license': 'CC BY 4.0',
 'limit': 1,
 'name': 'openaq-api',
 'page': 1,
 'pages': 92506,
 'website': 'https://docs.openaq.org/'}
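The meta block carries what you need to page through a large result set: `found` is the total number of records and `limit` is the page size. Below is a minimal sketch of that arithmetic using the values printed above. `fetch_page` is a hypothetical stand-in for any call that accepts `limit` and `page`, and it is stubbed out here so the example runs offline.

```python
import math


def page_count(found, limit):
    """Number of pages needed to retrieve `found` records, `limit` at a time."""
    return math.ceil(found / limit)


def fetch_all(fetch_page, limit):
    """Collect the results from every page.

    `fetch_page(page, limit)` is a hypothetical callable returning a dict
    with 'meta' and 'results' keys, mirroring the JSON layout shown above.
    """
    first = fetch_page(1, limit)
    results = list(first["results"])
    total_pages = page_count(first["meta"]["found"], limit)
    for page in range(2, total_pages + 1):
        results.extend(fetch_page(page, limit)["results"])
    return results


# Offline stub standing in for the real API: 25 fake records
FAKE = list(range(25))


def fake_fetch(page, limit):
    start = (page - 1) * limit
    return {"meta": {"found": len(FAKE), "limit": limit, "page": page},
            "results": FAKE[start:start + limit]}


print(page_count(92506, 1))  # same as 'pages' in the meta block above
print(len(fetch_all(fake_fetch, limit=10)))
```

With `limit=1` every record is its own page, which is why `pages` equals `found` in the output above. A larger limit cuts the number of requests proportionally.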

Parameters¶

The parameters endpoint will provide a listing of all the parameters available:

res = api.parameters(df=True)

print (res)

                                         description    id   name
0                                       Black Carbon    bc     BC
1                                    Carbon Monoxide    co     CO
2                                   Nitrogen Dioxide   no2    NO2
3                                              Ozone    o3     O3
4  Particulate matter less than 10 micrometers in...  pm10   PM10
5  Particulate matter less than 2.5 micrometers i...  pm25  PM2.5
6                                     Sulfur Dioxide   so2    SO2

  preferredUnit
0         µg/m³
1           ppm
2           ppm
3           ppm
4         µg/m³
5         µg/m³
6           ppm

Sources¶

The sources endpoint will provide a list of the sources where the raw data came from.

res = api.sources(df=True)

# Print out the first one (.iloc replaces the deprecated .ix indexer)
res.iloc[0]

active                                                       True
adapter                                                 arpalazio
city                                                          NaN
contacts                                        [info@openaq.org]
country                                                        IT
description             Air quality data from Lazio region, Italy
location                                                      NaN
name                                                    ARPALAZIO
organization                                                  NaN
region                                                      Lazio
resolution                                                    NaN
sourceURL                               http://www.arpalazio.net/
timezone                                                      NaN
url             http://www.arpalazio.net/main/aria/sci/annoinc...
Name: 0, dtype: object

Locations¶

The locations endpoint will return the list of measurement locations and their metadata. We can do quite a bit of querying with this one.

Let’s see what the data looks like:

res = api.locations(df=True)

res.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
city                     100 non-null object
coordinates.latitude     100 non-null float64
coordinates.longitude    100 non-null float64
count                    100 non-null int64
country                  100 non-null object
firstUpdated             100 non-null datetime64[ns]
lastUpdated              100 non-null datetime64[ns]
location                 100 non-null object
parameters               100 non-null object
sourceName               100 non-null object
sourceNames              100 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(6)
memory usage: 8.7+ KB

# print out the first one
res.iloc[0]

city                                  Ulaanbaatar
coordinates.latitude                      47.9329
coordinates.longitude                     106.921
count                                      294682
country                                        MN
firstUpdated                  2015-09-01 00:00:00
lastUpdated                   2018-01-24 13:15:00
location                                  100 ail
parameters               [pm10, no2, so2, o3, co]
sourceName                               Agaar.mn
sourceNames                            [Agaar.mn]
Name: 0, dtype: object

What if we just want to grab the locations in Delhi?

res = api.locations(city='Delhi', df=True)

res.iloc[0]

city                                              Delhi
coordinates.latitude                            28.6508
coordinates.longitude                           77.3152
count                                            102326
country                                              IN
distance                                    6.32199e+06
firstUpdated                        2015-06-29 14:30:00
lastUpdated                         2017-11-28 10:15:00
location                                    Anand Vihar
parameters               [pm10, pm25, so2, o3, co, no2]
sourceName                                         CPCB
sourceNames                         [Anand Vihar, CPCB]
Name: 0, dtype: object

What about just figuring out which locations in Delhi have \(PM_{2.5}\) data?

res = api.locations(city='Delhi', parameter='pm25', df=True)

res.iloc[0]

city                                   Delhi
coordinates.latitude                 28.6508
coordinates.longitude                77.3152
count                                  23891
country                                   IN
firstUpdated             2015-06-29 14:30:00
lastUpdated              2017-11-28 10:15:00
location                         Anand Vihar
parameters                            [pm25]
sourceName                              CPCB
sourceNames              [CPCB, Anand Vihar]
Name: 0, dtype: object
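Conceptually, the `parameter` filter keeps only the locations whose `parameters` list contains the requested pollutant. Here is a rough client-side equivalent in plain Python, using two records copied from the output earlier in this section. The helper name `measures` is made up for this sketch, and the real server-side logic may differ.

```python
# Two location records copied from the output above (trimmed to a few fields)
locations = [
    {"city": "Ulaanbaatar", "location": "100 ail",
     "parameters": ["pm10", "no2", "so2", "o3", "co"]},
    {"city": "Delhi", "location": "Anand Vihar",
     "parameters": ["pm10", "pm25", "so2", "o3", "co", "no2"]},
]


def measures(records, parameter):
    """Return the names of the locations that report `parameter`."""
    return [r["location"] for r in records if parameter in r["parameters"]]


print(measures(locations, "pm25"))
print(measures(locations, "so2"))
```

Note that the API response also narrows `count` and `parameters` to the requested pollutant (compare the two Anand Vihar records above), which this sketch does not reproduce.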

Latest¶

Grab the latest data from a location or locations.

What was the most recent \(PM_{2.5}\) data in Delhi?

res = api.latest(city='Delhi', parameter='pm25', df=True)

res.head()

	averagingPeriod.unit	averagingPeriod.value	city	country	location	parameter	sourceName	unit	value
lastUpdated
2017-11-28 10:15:00	hours	0.25	Delhi	IN	Anand Vihar	pm25	CPCB	b'\xc2\xb5g/m\xc2\xb3'	70.00
2018-01-24 09:45:00	hours	0.25	Delhi	IN	Anand Vihar, Delhi - DPCC	pm25	CPCB	b'\xc2\xb5g/m\xc2\xb3'	160.00
2018-01-24 01:15:00	hours	0.25	Delhi	IN	Aya Nagar, Delhi - IMD	pm25	CPCB	b'\xc2\xb5g/m\xc2\xb3'	192.84
2018-01-24 01:15:00	hours	0.25	Delhi	IN	Burari Crossing, Delhi - IMD	pm25	CPCB	b'\xc2\xb5g/m\xc2\xb3'	53.45
2018-01-24 01:15:00	hours	0.25	Delhi	IN	CRRI Mathura Road, Delhi - IMD	pm25	CPCB	b'\xc2\xb5g/m\xc2\xb3'	185.60

What about the most recent \(SO_2\) data in Hawai’i?

res = api.latest(city='Hilo', parameter='so2', df=True)

res

	averagingPeriod.unit	averagingPeriod.value	city	country	location	parameter	sourceName	unit	value
lastUpdated
2018-01-24 06:00:00	hours	1	Hilo	US	Hawaii Volcanoes NP	so2	AirNow	ppm	0.000
2018-01-24 06:00:00	hours	1	Hilo	US	Hilo	so2	AirNow	ppm	0.001
2018-01-24 06:00:00	hours	1	Hilo	US	Kona	so2	AirNow	ppm	0.003
2018-01-24 06:00:00	hours	1	Hilo	US	Ocean View	so2	AirNow	ppm	0.002
2018-01-24 06:00:00	hours	1	Hilo	US	Pahala	so2	AirNow	ppm	0.021
2017-01-26 17:00:00	hours	1	Hilo	US	Puna E Station	so2	AirNow	ppm	0.002
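Keep in mind that "latest" does not necessarily mean "recent": in the table above, Puna E Station last reported almost a year before the other sites. Here is a small sketch for flagging such stale locations, using timestamps copied from that output. The 7-day threshold and the function name `stale` are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

# (location, lastUpdated) pairs copied from the table above
latest = [
    ("Hawaii Volcanoes NP", "2018-01-24 06:00:00"),
    ("Hilo", "2018-01-24 06:00:00"),
    ("Pahala", "2018-01-24 06:00:00"),
    ("Puna E Station", "2017-01-26 17:00:00"),
]


def stale(rows, max_age=timedelta(days=7)):
    """Return locations whose last update lags the newest one by more than max_age."""
    parsed = [(name, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
              for name, ts in rows]
    newest = max(ts for _, ts in parsed)
    return [name for name, ts in parsed if newest - ts > max_age]


print(stale(latest))
```

Comparing against the newest timestamp in the result, rather than the current time, keeps the check meaningful even when the whole response is old.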

Measurements¶

Finally, the endpoint we’ve all been waiting for! Measurements allows you to grab all of the dataz! You can query on a whole bunch of parameters listed in the API documentation. Let’s dive in.

Let’s grab the past 10000 data points for \(PM_{2.5}\) in Delhi:

res = api.measurements(city='Delhi', parameter='pm25', limit=10000, df=True)

# Print out the statistics on a per-location basis
res.groupby(['location'])['value'].describe()

	count	mean	std	min	25%	50%	75%	max
location
Anand Vihar, Delhi - DPCC	430.0	259.625581	147.207770	49.00	127.0000	217.000	345.0000	644.00
Aya Nagar, Delhi - IMD	372.0	149.792312	63.647824	3.00	106.8725	133.675	185.5775	362.89
Burari Crossing, Delhi - IMD	363.0	130.473774	54.497222	41.82	92.0150	119.260	157.5000	333.67
CRRI Mathura Road, Delhi - IMD	361.0	154.678227	106.200845	2.22	80.3300	141.950	200.2300	842.68
Delhi Technological University, Delhi - CPCB	1095.0	297.557991	154.132360	59.00	167.5000	286.000	401.0000	764.00
IGI Airport Terminal-3, Delhi - IMD	361.0	151.609806	73.086274	5.62	95.0700	143.700	200.3100	375.96
IHBAS, Delhi - CPCB	922.0	122.903471	45.207670	0.00	88.9000	118.400	147.4000	308.30
Income Tax Office, Delhi - CPCB	1095.0	210.021918	94.767962	0.00	141.5000	190.000	277.5000	477.00
Lodhi Road, Delhi - IMD	339.0	160.147699	72.395593	9.69	107.4550	151.380	208.2750	383.10
Mandir Marg, Delhi - DPCC	264.0	198.306818	87.983323	42.00	132.0000	186.000	254.0000	443.00
NSIT Dwarka, Delhi - CPCB	1030.0	217.039612	82.230388	0.00	155.9250	197.600	262.2000	527.10
North Campus, Delhi - IMD	368.0	212.198560	112.838311	0.41	123.1100	190.870	275.0175	633.12
Punjabi Bagh, Delhi - DPCC	318.0	227.411950	116.249854	47.00	125.0000	208.500	314.0000	559.00
Pusa, Delhi - IMD	382.0	125.018351	58.187861	28.51	77.6475	114.430	163.2325	320.82
R K Puram, Delhi - DPCC	400.0	208.190000	98.169711	64.00	139.0000	188.000	245.7500	593.00
Shadipur, Delhi - CPCB	1055.0	165.118104	104.136933	0.20	94.5000	139.000	214.8500	798.70
Sirifort, Delhi - CPCB	439.0	213.398633	98.462672	0.00	145.5000	195.000	277.5000	979.00
US Diplomatic Post: New Delhi	406.0	213.339901	139.081139	-999.00	144.0000	202.000	269.7500	1985.00

Clearly, we should be doing some serious data cleaning ;) Why don’t we go ahead and plot all of these locations on one figure?

fig, ax = plt.subplots(1, figsize=(10, 6))

for group, df in res.groupby('location'):
    # Keep only non-negative values, select the numeric column, and resample to hourly means
    _df = df.query("value >= 0.0")[['value']].resample('1h').mean()

    _df.value.plot(ax=ax, label=group)

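# Raw string so that \mu is passed through to mathtext untouched
ax.set_ylabel(r"$PM_{2.5}$  [$\mu g m^{-3}$]", fontsize=20)
ax.set_xlabel("")
sns.despine(offset=5)

# Place the legend outside the axes, to the right of the plot
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)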

plt.show()
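The `value >= 0.0` filter used for the plot removes placeholder readings such as the -999 minimum reported for the US Diplomatic Post, but it leaves implausibly large values (for example the 1985 maximum at the same site) untouched. Here is a minimal sketch of a slightly stricter filter in plain Python. The upper bound of 1000 µg/m³ is an arbitrary illustrative cutoff, not an OpenAQ recommendation, and the sample readings are made up apart from the two extremes taken from the table above.

```python
def clean(values, lower=0.0, upper=1000.0):
    """Keep readings inside [lower, upper]; both bounds are illustrative."""
    return [v for v in values if lower <= v <= upper]


# Made-up PM2.5 readings plus the -999 and 1985 extremes from the table
readings = [144.0, -999.0, 202.0, 1985.0, 269.75]

cleaned = clean(readings)
mean_clean = sum(cleaned) / len(cleaned)

print(cleaned)
print(mean_clean)
```

On the DataFrame used in this tutorial, the same idea could be written as `res[res['value'].between(0, 1000)]`. Which bounds are sensible depends on the pollutant and the instrument.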

Don’t worry too much about how ugly and uninteresting the plot above is…we’ll take care of that in the next tutorial! Let’s go ahead and look at the distribution of \(PM_{2.5}\) values seen in Delhi by various sensors. This is the same data as above, but viewed in a different way.

fig, ax = plt.subplots(1, figsize=(14,7))

ax = sns.boxplot(
    x='location',
    y='value',
    data=res.query("value >= 0.0"),
    fliersize=0,
    palette='deep',
    ax=ax)

ax.set_ylim([0, 750])
ax.set_ylabel(r"$PM_{2.5}\;[\mu gm^{-3}]$", fontsize=18)
ax.set_xlabel("")

sns.despine(offset=10)

plt.xticks(rotation=90)
plt.show()

If we remember from above, there was at least one location where many parameters were measured. Let’s go ahead and look at that location and see if there is any correlation among parameters!

res = api.measurements(city='Delhi', location='Anand Vihar', limit=1000, df=True)

# Which params do we have?
res.parameter.unique()

array(['o3', 'no2', 'so2', 'pm10', 'pm25'], dtype=object)

df = pd.DataFrame()

for u in res.parameter.unique():
    _df = res[res['parameter'] == u][['value']]
    _df.columns = [u]

    # Merge the dataframes together
    df = pd.merge(df, _df, left_index=True, right_index=True, how='outer')

# Get rid of rows where not all exist
df.dropna(how='any', inplace=True)

g = sns.PairGrid(df, diag_sharey=False)

g.map_lower(sns.kdeplot, cmap='Blues_d')
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)

plt.show()
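The merge loop above reshapes the long-format measurements (one row per timestamp and parameter) into a wide table with one column per parameter, then drops every timestamp where any parameter is missing. The same logic in plain Python, on a few made-up readings, might look like this. The function name `to_wide` is hypothetical.

```python
from collections import defaultdict

# Made-up long-format rows: (timestamp, parameter, value)
rows = [
    ("2018-01-24 01:00", "pm25", 180.0),
    ("2018-01-24 01:00", "no2", 40.0),
    ("2018-01-24 02:00", "pm25", 175.0),  # no matching no2 reading
    ("2018-01-24 03:00", "pm25", 160.0),
    ("2018-01-24 03:00", "no2", 35.0),
]


def to_wide(records):
    """Pivot long rows to {timestamp: {parameter: value}}, keeping complete rows only."""
    params = {param for _, param, _ in records}
    wide = defaultdict(dict)
    for ts, param, value in records:
        wide[ts][param] = value
    # Equivalent of dropna(how='any'): require every parameter at a timestamp
    return {ts: vals for ts, vals in sorted(wide.items())
            if set(vals) == params}


for ts, vals in to_wide(rows).items():
    print(ts, vals)
```

Only timestamps with a reading for every parameter survive, which is what makes the pairwise scatter and density plots comparable across panels.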

For kicks, let’s go ahead and look at a timeseries of \(SO_2\) data in Hawai’i. Quiz: What do you expect? Did you know that Hawai’i has a huge \(SO_2\) problem?

res = api.measurements(city='Hilo', parameter='so2', limit=10000, df=True)

# Print out the statistics on a per-location basis
res.groupby(['location'])['value'].describe()

	count	mean	std	min	25%	50%	75%	max
location
Hawaii Volcanoes NP	355.0	0.007392	0.035338	0.000	0.000	0.000	0.000	0.408
Hilo	438.0	0.002486	0.004487	0.000	0.001	0.001	0.002	0.050
Kona	450.0	0.003273	0.004264	0.001	0.001	0.002	0.004	0.039
Ocean View	455.0	0.011490	0.021762	0.000	0.001	0.004	0.011	0.182
Pahala	422.0	0.039166	0.070299	0.000	0.004	0.010	0.040	0.554

fig, ax = plt.subplots(1, figsize=(10, 5))

for group, df in res.groupby('location'):
    # Keep only non-negative values, select the numeric column, and resample to 6-hour means
    _df = df.query("value >= 0.0")[['value']].resample('6h').mean()

    # Convert from ppm to ppb (multiply by 1000)
    _df['value'] *= 1e3

    _df.value.plot(ax=ax, label=group)

ax.legend(loc='best')
ax.set_ylabel(r"$SO_2 \; [ppb]$", fontsize=18)
ax.set_xlabel("")

sns.despine(offset=5)

plt.show()

NOTE: These values are 6-hour means. The local readings can actually get much, much higher (>5 ppm!) when looking at 1-minute data.
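To see why averaging hides short spikes, consider a synthetic example (the numbers below are invented, not OpenAQ data): six hours of 1-minute readings at a 0.01 ppm background with a 10-minute plume at 5 ppm.

```python
# Synthetic 1-minute SO2 series in ppm: 360 minutes of background ...
series_ppm = [0.01] * 360

# ... with a 10-minute plume at 5 ppm starting at minute 100
for minute in range(100, 110):
    series_ppm[minute] = 5.0

peak_ppm = max(series_ppm)
mean_ppm = sum(series_ppm) / len(series_ppm)

# Convert from ppm to ppb (1 ppm = 1000 ppb) for display
print("1-min peak: {:.0f} ppb".format(peak_ppm * 1e3))
print("6-h mean:   {:.0f} ppb".format(mean_ppm * 1e3))
```

In this made-up case the 6-hour mean is more than 30 times lower than the 1-minute peak, so a resampled time series can look benign even when short-term exposure is severe.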