Hacking the News: The NYTimes API
Why just read the news when you can hack it?
The above question posed by the New York Times makes an important point: data is not merely raw material to be finessed into content, but content in its own right. The New York Times goes on to say that “when you build applications, create mashups and otherwise reveal the potential of our data, we learn more about what our readers want and gain insight into how news and information can be reimagined.” The following tutorial demonstrates one possible reimagining of New York Times content for use in research and digital scholarship.
In order to access the API, you will need to request an API key from the New York Times developer website. This will provide access with certain limitations (most APIs have usage limits in order to prevent individual users from overburdening the server). With that in mind, please also note that data retrieved from the New York Times API is not exhaustive. This is because intellectual property considerations often prevent certain data from being made publicly available in the first place.
There are multiple ways of querying the API, but this particular example will use Python to make the calls. You will write a script in your text editor of choice that can communicate with the API, and then run it locally from your computer. Copy the following code into your text editor and save it as nytimes.py. You can run the script with the command python nytimes.py.
There are three import statements here, json, nytimes, and time. These are references to specific Python packages (prebuilt collections of code). In the case of nytimes, you will need to install this package using pip. For more information on how to install specific Python packages, including pip, check out the following documentation.
The nytimes package provides the tools for querying the New York Times API, the json package will convert the results into JSON format for ease of use, and the time package allows us to set a small delay in between API calls. There are limits on how many times you can query the API per second, and adding a one second delay prevents any connection errors.
In the variable search_obj, you will need to change YOUR_API_KEY_HERE to the API key that you received from the New York Times developer page. Make sure you preserve the single quotations inside the parentheses.
In the above example, we are searching for the term ‘cybersecurity’ beginning after the date listed, in this case, 20000101 (January 1, 2000). This date can be modified to a date of your choosing, or removed completely to search for all instances across time. There are other parameters available as well, for more information please see the API documentation.
The API returns query results in the form of pages. Each page contains ten results—in this case ten unique articles. At present, the maximum amount of pages the API will return is 100. This script will therefore iterate 100 times before stopping. In the case that there are more than 100 pages of content (greater than 1000 individual results), you may need to come up with a workaround. Perhaps the easiest way to manage this would be to change the parameters in the initial query to pull all results one year at a time using begin_date and end_date parameters. After gathering results from individual years you can combine them into one large dataset.
When you run the script you’ve created, it will dump all of the data it retrieves into a file called results.json. The following is an example of an individual article’s metadata:
New York Times article metadata does not provide the complete text of an article, but it does provide useful information including headlines, abstracts, author information, word count, keywords, and more. The above example represents one single article, and therefore it’s likely that your script will return a substantial amount of content in one single file.
Using the following JSON data, you can create all manner of data visualizations. A few basic examples are included below.
The next two examples, “Number of Articles Published from 2000-now” and “Network Graph of Keywords,” relied on Google Fusion tables, a product that was deprecated in 2020.