
Scraping Data

 

Scraping data from the United Nations and WHO Photo Libraries online was the single most difficult task of the entire process. Both websites use outdated archival formats, and the accompanying photograph metadata is often not standardized. As a result, multiple scraping attempts were needed before readable CSV files and full thumbnail downloads of the images could be achieved.


UN Photo Library


The UN Photo Library’s search results are displayed as tiles of image thumbnails and associated titles. Clicking on a tile leads to a page with more detailed information on the selected photograph. These individual pages are generated dynamically by JavaScript.
I sought to scrape the following information from each photograph:

  1. Title of Photo
  2. Description (Caption)
  3. Date
  4. Location
  5. Photo Number
  6. Credits
  7. Link

The UN Photo Library website’s search function is limited and offers little control over how results are organized before scraping. Results can be sorted only by relevance or by ascending and descending date, and there is no option to view more than 18 photographs per page. A scraping job therefore needed to account for pagination and know to continue to the following pages.


This was where my first attempt, using Grepsr, a simple Chrome extension, failed. While Grepsr could recognize the text fields (e.g. title, description, date), it could not recognize the “next” button on either the main search page or the individual photo pages. As a result, the scraper never covered all of the search results, and ended either after the first 18 results or after extracting the same page multiple times.



 

Web Scraper, another Chrome extension, proved a more powerful tool for the project. Most importantly, it allowed me to build a sitemap. By manipulating the start URL, I could direct the scrape to account for pagination while navigating the site, and define the total number of results and the number of photos per page. The following start URL was used for the search term “disability”:

http://www.unmultimedia.org/photo/gallery.jsp?query=disability&startat=[0-663:18]

There were 663 results, with 18 images per page, and the scrape was defined to start at 0. Following this first step, I used different selectors to extract multiple types of data: text and links into a CSV file, and images into a single folder.
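For readers who prefer a script to a browser extension, the pagination logic that this start URL encodes can be sketched in Python. This is only a sketch: the CSS selector and output columns below are assumptions rather than the site’s actual markup, which Web Scraper handles through its own point-and-click selectors.

```python
# Sketch of the pagination expressed by the [0-663:18] range notation.
# The CSS selector and column names are illustrative assumptions.
import csv
import requests
from bs4 import BeautifulSoup

BASE = "http://www.unmultimedia.org/photo/gallery.jsp"
QUERY, TOTAL, PER_PAGE = "disability", 663, 18

with open("un_disability.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link"])
    for start in range(0, TOTAL, PER_PAGE):            # 0, 18, 36, ... as in [0-663:18]
        page = requests.get(BASE, params={"query": QUERY, "startat": start})
        soup = BeautifulSoup(page.text, "html.parser")
        for tile in soup.select("a.gallery-item"):      # hypothetical selector
            writer.writerow([tile.get_text(strip=True), tile.get("href")])
```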

I repeated this process three more times to scrape data from the other search terms. As of writing, there were 663 results for “disability,” 62 for “handicap,” 17 for “cripple,” and 8 for “retard.” As I was dealing with a few hundred search results and photo downloads, I ran the scraper on a virtual machine, which allowed me to leave the scrape running and come back to it later.

The HTML code for my four searches is available on GitHub: "disability"; "handicap"; "cripple"; "retard".

 

Organizing, Cleaning, and Manipulating Data


I cleaned the data to showcase the photographs in as many layers and perspectives as possible, which meant each column of the data needed to be as specific as possible. For example, I sought to organize the dataset not only by location, but by city and country. Furthermore, I had four UN Photo Library search-term scrapes and the WHO disability scrape, and ultimately needed to consolidate them into a single CSV file, taking into account overlapping search results.

 

OpenRefine

OpenRefine provided a quick way to clean messy data, and was used mainly to split columns that held multiple pieces of data (e.g. location).

In order for a mapping system like Carto and a future visualization to work, I needed distinct columns listing locations that could be geotagged. The raw CSV file had a single location column, which usually listed both a city and a country, separated by a comma. Because the locations on the UN Photo Library pages were consistently listed in the “City, Country” format, and a number of photos also listed “United Nations” as a location marker, I set out to reorganize each location into a comma-separated entry: city, country, and location 3. Location 3 would provide a space to note locations like “United Nations” that are neither a city nor a country.

Since OpenRefine would separate the location entries into three columns based on commas, I used a text editor to ensure each entry went in the right column. “United Nations, New York” was edited to read “New York, United States, United Nations” for the sake of geotagging. “Nairobi, Kenya” became “Nairobi, Kenya,” with an extra comma to account for what would later be the third location column. Locations that listed only a city or a country had commas placed accordingly: “Lebanon” became “, Lebanon,” and “Jerusalem” became “Jerusalem, ,”.
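The target three-column format can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions: it handles only the regular “City, Country” pattern plus the city-only and country-only cases, against a hypothetical country list; irregular entries such as “United Nations, New York” were corrected by hand.

```python
# Minimal sketch of splitting a raw location string into city, country,
# and a third location column. The country set is an illustrative subset.
def split_location(raw, countries):
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) == 2:                          # typical "City, Country"
        return parts[0], parts[1], ""
    if len(parts) == 1:
        if parts[0] in countries:                # country only, e.g. "Lebanon"
            return "", parts[0], ""
        return parts[0], "", ""                  # city only, e.g. "Jerusalem"
    return "", "", raw                           # irregular: review by hand

countries = {"Lebanon", "Kenya", "United States"}
print(split_location("Nairobi, Kenya", countries))   # ('Nairobi', 'Kenya', '')
print(split_location("Lebanon", countries))          # ('', 'Lebanon', '')
print(split_location("Jerusalem", countries))        # ('Jerusalem', '', '')
```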

The same was done for the photograph credits. The UN Photo Library credit line usually listed two to three credits, each separated by a slash: “Primary Organization/Secondary Organization/Photographer Name.” I standardized each entry to three credits, adding additional slashes for empty data points.
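Standardizing the credit line follows the same idea; a minimal sketch, assuming slashes as the only separator:

```python
# Pad a credit string to exactly three slash-separated fields.
def pad_credits(raw, slots=3):
    parts = [p.strip() for p in raw.split("/")]
    parts += [""] * (slots - len(parts))         # add empty trailing fields
    return "/".join(parts[:slots])

print(pad_credits("UN Photo/John Isaac"))        # "UN Photo/John Isaac/"
```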

 

FileMaker Pro

To account for overlapping search results from the four scrapes, I added a separate “keyword” column that listed one of the four search terms (disability, handicap, retard, cripple). This ensured that when the CSVs were consolidated into a single database in FileMaker and repeated photograph entries were deleted, I could keep track of which keyword searches each photograph appeared in. Furthermore, to differentiate between the WHO and UN scrapes, another column was added to indicate the originating organization. Once this process was complete, I imported all the files into FileMaker Pro.
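For readers who do not use FileMaker, the same tagging step can be sketched in pandas; the file names and column labels here are hypothetical.

```python
# Sketch of adding keyword and organization columns before consolidation.
# File names and column labels are illustrative assumptions.
import pandas as pd

frames = []
for keyword in ["disability", "handicap", "cripple", "retard"]:
    df = pd.read_csv(f"un_{keyword}.csv")
    df["keyword"] = keyword           # which search term produced this row
    df["organization"] = "UN"         # distinguish UN rows from WHO rows
    frames.append(df)

who = pd.read_csv("who_disability.csv")
who["keyword"] = "disability"
who["organization"] = "WHO"
frames.append(who)

combined = pd.concat(frames, ignore_index=True)
```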

FileMaker allowed me to define relationships between images and keywords, and between images and WHO archival files on photographers and subjects. Furthermore, I could consolidate all the different CSV files in one spot without having to shift between programs to organize and clean data.

The biggest problem I encountered was that not all of the records from the imported CSV files were coming through; the import had done something unexpected. To remedy this, I set a validation on the “Photo #” field so that it had to be unique and was always validated during import, allowing existing records to be rejected on import. Moreover, by setting up a keyword field, I could ensure that photos had one-to-many relationships with keywords, with the connection keyed on Photo #.

Ultimately, I was left with a keyword field listing the different keyword searches each photograph appeared in. The keyword formatting did pose a slight problem when the consolidated dataset was exported as a CSV: keywords were listed individually on separate rows rather than in a single comma-separated cell under the keyword column. A new comma-separated field was created in FileMaker to fix this issue.
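Outside FileMaker, the same deduplication and comma-separated keyword field could be sketched with a pandas groupby, continuing the hypothetical column names from the sketch above:

```python
# Collapse one-row-per-keyword into one row per photo, joining keywords
# into a single comma-separated cell. Assumes the 'combined' frame above
# and hypothetical column names.
deduped = combined.groupby("Photo #", as_index=False).agg({
    "keyword": lambda ks: ", ".join(sorted(set(ks))),   # e.g. "cripple, disability"
    "Title": "first",
    "Date": "first",
})
```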


FileMaker layout for organizing data.


Final Clean

After an initial look through the consolidated data, encoding issues were the most visually apparent problem. I made sure to export the CSV from FileMaker in UTF-8, yet special characters such as accents and em-dashes were still rendered incorrectly. As a precaution, I opened the data in Google Sheets, since character encoding in Excel is known to be unreliable. The same issues persisted, however, indicating that the encoding errors had occurred during the scrape, when Web Scraper encountered inconsistent character encodings on the website. I made the necessary edits with regular-expression search and replace in a text editor, and copied the cleaned entries back into the CSV files.
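A scripted alternative to manual search and replace is a small repair pass over the exported CSV; the character substitutions below are common UTF-8/Latin-1 mix-ups, shown only as examples rather than the exact errors in this dataset, and the file names are hypothetical.

```python
# Sketch of a mojibake repair pass over an exported CSV.
import csv

# Example substitutions for common mis-encoded characters (illustrative only).
FIXES = {"Ã©": "é", "Ã¨": "è", "Ã±": "ñ", "â€”": "—", "â€™": "'"}

def repair(text):
    for bad, good in FIXES.items():
        text = text.replace(bad, good)
    return text

with open("consolidated.csv", encoding="utf-8") as src, \
     open("consolidated_clean.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([repair(cell) for cell in row])
```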

There were also a few instances of bookkeeping errors on the WHO’s part, where certain names were misspelled and stray spaces were inserted between letters. For these, I used OpenRefine again, using cluster and edit on the cells to keep names and spellings consistent.

Cleaned CSV datasets are provided here for download.