Artificial Intelligence | Lesson 3.3

Open Data and Nonprofits

The previous section, on sources of data, noted that an organization can use existing, publicly available datasets.

If this topic is of interest, the Policy short course within LEAP for PIT also offers a detailed and practical explanation of how to access publicly available datasets. The following are some examples of open data sources:

World Bank Open Data

World Bank Open Data is a massive source of data. It has 3,000 datasets and 14,000 indicators encompassing microdata, time-series statistics, and geospatial data. The site provides a search engine for finding a dataset, along with data filtering that speeds up data preparation because you download only what you need. It also offers tools for visualization and data analysis. The data is free of charge and can be downloaded in multiple formats. For more information, go to World Bank Open Datasets.

WHO (World Health Organization)

WHO’s Open Data repository is where WHO shares health-specific statistics of its 194 Member States for researchers to use. The repository is systematically organized so that it can be accessed according to different needs. For instance, whether you are interested in mortality or the burden of disease, you can access data classified under 100 or more categories, such as the Millennium Development Goals (child nutrition, child health, maternal and reproductive health, immunization, HIV/AIDS, tuberculosis, malaria, neglected diseases, water and sanitation), non-communicable diseases and risk factors, epidemic-prone diseases, health systems, environmental health, violence and injuries, and equity. WHO also provides tools to visualize dataset information online before you download the dataset.

European Union Open Data Portal

The European Union (EU) Open Data Portal is home to vital open data pertaining to EU policy domains. These policy domains include economy, employment, science, environment, and education. You can access everything that EU institutions, agencies, and organizations publish on a single platform.

UNICEF DATA – Child Statistics

UNICEF’s Data & Analytics (D&A) team is the global go-to for data on children. It leads the collection, validation, analysis, use and communication of the most statistically sound, internationally comparable data on the situation of children and women around the world. D&A upholds the quality, integrity and organization of these data and makes them accessible as a global public good on its website. The team generates datasets, ranging from short brochures to in-depth analyses like major reports, that inform UNICEF’s evidence-based program strategy and advocacy and help identify emerging areas where children are in need. They also detail important progress in actions to support children. In parallel, the team’s work arms governments, other UN agencies, international NGOs, think tanks and academics, media and individuals with the necessary insight to prompt action to improve the lives of women and children.

UNESCO Data

The UNESCO Institute for Statistics (UIS) is the official and trusted source of internationally comparable data on education, science, culture and communication. As the official statistical agency of UNESCO, the UIS produces a wide range of state-of-the-art databases to fuel the policies and investments needed to transform lives and propel the world towards its development goals. The UIS provides free access to data for all UNESCO countries and regional groupings from 1970 to the most recent year available. The UIS encourages developers and researchers to build websites and applications that make rich use of UIS dissemination data. (“UIS Statistics”)

UNdata

UNdata is a web-based data service for the global user community. It brings international statistical databases within easy reach of users through a single entry point. Users can search and download a variety of statistical resources compiled by the United Nations (UN) statistical system and other international agencies. The numerous databases or tables, collectively known as “datamarts,” contain over 60 million data points and cover a wide range of statistical themes, including agriculture, crime, communication, development assistance, education, energy, environment, finance, gender, health, labor market, manufacturing, national accounts, population and migration, science and technology, tourism, transport and trade.

Google Public Data Explorer

Launched in 2010, Google Public Data Explorer helps you explore a large number of public-interest datasets. Like the Google search engine, it directs you to the data source you are looking for. You can then visualize and communicate the data for your own purposes.

It makes data from different agencies and sources available. For instance, you can access data from the World Bank, the U.S. Bureau of Labor Statistics, the U.S. Census Bureau, the OECD, the IMF, and others.

Uses of Open Data

Many real-world nonprofit initiatives rely on open data sources for their main activity. The following list includes examples of applications built upon Open Data sources. Many were developed for competitions, hackathons or other demand-side activities to promote the use of Open Data for specific purposes (examples are taken from “Open Data Essentials | Data – World Bank”).

  • mWater: A free suite of tools that use GPS, cloud-based computing and mobile technology to create an integrated approach to managing water and sanitation services and preventing waterborne diseases and their impacts on communities.
  • Save the Rain: This app shows users how they can help reduce the impact of the alarming worldwide drop in annual rainfall predicted by the World Bank. Using maps and a novel “rooftop rainwater harvesting” approach, users estimate the amount of rainwater they can potentially save each year.
  • Ecofacts: A small app that provides information about energy consumption and climate change, and how their effects on the environment can be influenced by citizens and communities.
  • CheckMySchool: In an effort to improve public education services, this monitoring program combines digital technology and community mobilization to promote accountability and transparency. It provides easy access to information and a platform for feedback and helps citizens and government officials collaboratively resolve education issues.
  • Merge of HealthFacility: A health facilities location application from the Ghana Open Data Initiative (GODI) at the National Information Technology Agency (NITA).

When using third-party datasets, you should be concerned about the quality of the data and legal issues. Many of these datasets are produced on a “best effort” basis and collected as needed to support new algorithms or fields of application; often, data quality is not guaranteed. Also, many publicly available datasets are released with a noncommercial license, meaning that they cannot be used for business purposes; always check that you can use an open dataset for the purposes you intend.

Combining internal and external data is often a clever idea. In fact, it is worth drawing on every kind of data available to you.
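As a minimal sketch of what combining internal and external data can look like, the example below attaches an external indicator to an organization's own records by matching on a country code. All names (internal_records, external_indicator, water_access_rate) and all values are hypothetical placeholders, not real statistics or a real portal's format.

```python
# Internal data: hypothetical program records your organization already keeps.
internal_records = [
    {"country": "GHA", "beneficiaries": 1200},
    {"country": "KEN", "beneficiaries": 800},
    {"country": "PER", "beneficiaries": 450},
]

# External data: placeholder values standing in for an indicator downloaded
# from an open data portal. These numbers are made up, not real statistics.
external_indicator = {
    "GHA": 0.71,
    "KEN": 0.64,
}


def combine(records, indicator, field_name):
    """Attach an external indicator to each internal record by country code."""
    combined = []
    for record in records:
        merged = dict(record)  # copy so the original record is untouched
        # Countries absent from the external source get None rather than a guess.
        merged[field_name] = indicator.get(record["country"])
        combined.append(merged)
    return combined


combined = combine(internal_records, external_indicator, "water_access_rate")
for row in combined:
    print(row)
```

Leaving unmatched countries as None keeps the gap visible, which matters given that the quality and coverage of third-party datasets are not guaranteed.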

Regardless of how you get your data, a piece of information can make or break your project: labels.

Remember that if you are training a supervised learning algorithm, the computer needs to learn to produce a number or class (label) based on other numbers (features). Most of the time, you can work around missing features, but you may be in big trouble if you do not have labels. In other words, when it comes to data, labels are far more important than features. For instance, if you develop a home-price predictor, then the label is the price of a house and the features are square footage, number of rooms, presence of a garden, and so forth. You may still develop an acceptable AI model if you are missing some of these features. However, if you do not have the labels, there is no way to develop a supervised AI model, regardless of how many features you have.
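The home-price example can be sketched in a few lines. The listings, field names, and the split_training_rows helper below are all hypothetical and exist only to illustrate the difference between a missing feature and a missing label; this is not a complete training pipeline.

```python
# Hypothetical listings; every value here is made up for illustration.
listings = [
    {"square_feet": 1200, "rooms": 3, "has_garden": True, "price": 250_000},
    # Missing feature (rooms): the row is still usable for training.
    {"square_feet": 900, "rooms": None, "has_garden": False, "price": 180_000},
    # Missing label (price): the row cannot teach a supervised model anything.
    {"square_feet": 1500, "rooms": 4, "has_garden": True, "price": None},
]

FEATURES = ["square_feet", "rooms", "has_garden"]
LABEL = "price"


def split_training_rows(rows):
    """Return (features, labels) for the rows that have a label."""
    X, y = [], []
    for row in rows:
        if row[LABEL] is None:
            continue  # no label, so there is nothing to learn from
        X.append([row[name] for name in FEATURES])
        y.append(row[LABEL])
    return X, y


X, y = split_training_rows(listings)
print(f"{len(X)} of {len(listings)} listings can be used for training")
```

The second listing survives even with a missing room count, because a missing feature can be filled in or dropped later. The third listing is discarded: without a price, it offers nothing for a supervised model to learn.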

Labels can be collected in three ways:

Naturally

Natural labels are generated by your business processes. For instance, if your real estate platform asks clients to input their home sales price as they delete their listing, you will naturally get the label. Similarly, Amazon stores everything you have bought in a database. All this information is stored to make the business run and can be used as labels if needed.

Hacking

Sometimes, labels are not as easy to get, but you can still find clever hacks to get them. An example is what Amazon does with product reviews. When you write a review about your love for your new vacuum cleaner, you also add a star rating (say, from 1 to 5). The score can be used as a label for a sentiment analysis system. Basically, you are giving Amazon, for free, both the input (the text review) and the label (the star score) it could use to build sentiment analysis technology.
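A rough sketch of this hack is shown below. The reviews are invented, and the thresholds (4 stars and above is positive, 2 and below is negative) are an assumption chosen for illustration, not a description of how Amazon or any other company actually maps ratings to labels.

```python
def star_to_sentiment(stars):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


# Hypothetical reviews: (review text, star rating) pairs.
reviews = [
    ("I love my new vacuum cleaner", 5),
    ("Stopped working after a week", 1),
    ("It does the job", 3),
]

# The review text is the input; the star rating supplies the label for free.
labeled = [(text, star_to_sentiment(stars)) for text, stars in reviews]
for text, label in labeled:
    print(f"{label}: {text}")
```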

Paying

In some cases, your only option is to pay people to label examples. A common pattern for labeling data is to use a crowdsourcing platform such as Amazon Mechanical Turk, which gives you on-demand, paid-by-the-minute access to a temporary workforce distributed across the globe.

Figure 3.3 shows an example of two labeling interfaces for collecting labels for image classification and object localization tasks.

Figure 3.3 | Labeling interfaces for image recognition and object localization. Choices can be made using keyboard shortcuts to increase data entry speed.
Attribution: Zero to AI, Figure 8.3. Nicolò Valigi and Gianluca Mauro. Link to source. All rights reserved.

In general, crowdsourcing platforms are good for labeling tasks that do not need much training. If you are working on a project requiring high-level human reasoning (say, finding cancer cells on microscope scans), you would be better off putting together your own highly skilled workforce. Table 3.1 summarizes the cost and time required to collect labels for the three labeling strategies.

Table 3.1: Cost and time requirements for the three labeling strategies.
Labeling strategy | Cost | Time required
Natural (free) labels | Zero: you are collecting these labels already. | Zero: you already have them.
Hacked labels | Low and fixed: you just need to set up new data collection processes. | Depends on your traffic.
Paid labels | High and variable: you pay based on the number of labels you want. | Depends on the time needed to label an example and the number of labelers you have.
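
The paid-labels row of the table can be turned into a back-of-the-envelope estimate. The sketch below assumes labelers are paid by the hour at a flat rate and work in parallel at the same speed. The function name and all the numbers in the example are hypothetical, so substitute the real rates of whichever platform you use.

```python
def estimate_paid_labeling(num_examples, seconds_per_label, num_labelers, hourly_rate):
    """Estimate total cost and elapsed time for a paid labeling job.

    Assumes a flat hourly rate and labelers who work in parallel at equal speed.
    """
    total_hours = num_examples * seconds_per_label / 3600  # total labeling effort
    cost = total_hours * hourly_rate                       # grows with the number of labels
    elapsed_hours = total_hours / num_labelers             # shrinks with more labelers
    return cost, elapsed_hours


# Hypothetical job: 10,000 images, 18 seconds each, 5 labelers, 12 currency units per hour.
cost, elapsed_hours = estimate_paid_labeling(
    num_examples=10_000,
    seconds_per_label=18,
    num_labelers=5,
    hourly_rate=12.0,
)
print(f"Estimated cost: {cost:.2f}, elapsed time: {elapsed_hours:.1f} hours")
```

Adding labelers shortens the elapsed time but leaves the cost unchanged under these assumptions, which matches the table: cost depends on the number of labels, while time depends on labeling speed and the number of labelers.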