Augmenta: A new tool for AI classification and research

Methodology

We are making the tool behind some of our investigations available for public use

Part of our work at Global Witness involves uncovering the ways in which the fossil fuels industry is trying to shape and weaken environmental policy.

One of the industry’s favourite means of doing this is to infiltrate its lobbyists into climate negotiations, political party conferences or meetings with government ministers.

Oil and gas companies also use their vast wealth to fund politicians and their parties. Our recent analysis found that fossil fuel donors contributed a staggering $19 million to organise Donald Trump’s second inauguration ceremony.

Action against the presence of fossil fuel lobbyists at COP28. Jasmin Qureshi / Global Witness

The data challenge

Doing this type of analysis normally requires downloading data published by official sources or compiled by third parties, and then checking each entry against lists of known fossil fuels industry representatives. This approach raises several issues.

First, records can consist of hundreds of thousands of rows, many of which contain errors such as spelling mistakes, missing values and poorly formatted entries. For example, the list of COP29 participants published by the UNFCCC contains several of these issues, including misspelled company names, incorrectly completed forms (contact details entered instead of names) and incomplete entries.

Second, the names of organisations and individuals in these datasets are often not consistent. For example, the same table could include entries attributed to “Shell”, “Shell Plc”, “Royal Dutch Shell”, “Shell Group of Companies” and “Shell International”.

While we use tools that match and standardise some of these duplicates, they may sometimes miss important entities, and making sure the deduplication is accurate involves a lot of manual work.
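
The idea of rule-based standardisation, and its limits, can be illustrated with a minimal Python sketch. This is not the matching tooling referred to above: the `normalise` and `group_names` helpers and the `GENERIC_TERMS` list are assumptions invented for this example. It groups the Shell variants mentioned earlier under a simplified key.

```python
import re
from collections import defaultdict

# Assumed list of generic corporate terms; a real list would be much longer.
GENERIC_TERMS = {"plc", "ltd", "inc", "group", "of", "companies", "international"}


def normalise(name: str) -> str:
    """Lowercase the name, strip punctuation and drop generic corporate terms."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in GENERIC_TERMS)


def group_names(names):
    """Group raw names that share the same normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return dict(groups)


names = [
    "Shell",
    "Shell Plc",
    "Royal Dutch Shell",
    "Shell Group of Companies",
    "Shell International",
]
groups = group_names(names)
for key, members in groups.items():
    print(f"{key!r}: {members}")
```

Four of the five variants collapse onto the key "shell", but "Royal Dutch Shell" ends up in a group of its own. Catching that kind of miss is where the manual checking comes in.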

Finally, our own knowledge of which fossil fuels companies and organisations operate in the world isn’t exhaustive. We combine lists of lobbyists we manually identified at COP with lists published by other organisations to get as broad a picture of the industry as we can. However, companies often merge, split subsidiaries off, rebrand, and set up trade groups to represent their interests, making it difficult to track their activities.

To avoid some of these issues, we can process smaller datasets by hand, going through them row by row, or use a mix of manual and automated processing. However, this becomes far more time-consuming and error-prone as the data grows, meaning we need a different approach.

Graphic collage showing methods for searching for data

Our AI-based solution

In recent months, we have started using an AI-based tool to classify records at scale. This approach has helped us test hypotheses more quickly and speeded up classification.

We are making the tool we built for this purpose publicly available so that other journalists, researchers and campaigners can use it. We’re calling this tool Augmenta.

Augmenta is an AI agent that takes a dataset and performs basic research on each entry in it, using a search engine and the ability to read web pages.

We use it to identify fossil fuels interests in places they don’t belong, such as COP or political donations. Given a list of diverse donors, Augmenta searches the internet for each one, extracts relevant information and optionally repeats the process with different keywords. It then makes a call on whether that entity represents oil, gas or coal interests, and provides an explanation and references to the information it found online. We then manually review all positive classifications to ensure they are accurate.
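
The workflow described above can be sketched in a few lines of Python. This is an illustration only and not Augmenta's actual code or API: `Verdict`, `research_entity`, `fake_search`, `fake_classify` and the `FAKE_WEB` data are all hypothetical names invented here. A canned dictionary stands in for the search engine and a keyword check stands in for the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Evidence = tuple[str, str]  # (url, snippet)


@dataclass
class Verdict:
    entity: str
    is_fossil_fuel: bool
    explanation: str
    references: list[str] = field(default_factory=list)


def research_entity(
    entity: str,
    search: Callable[[str], list[Evidence]],
    classify: Callable[[str, list[Evidence]], Optional[Verdict]],
    keyword_sets: list[str],
) -> Verdict:
    """Search with each keyword set in turn until the classifier reaches a verdict."""
    evidence: list[Evidence] = []
    for keywords in keyword_sets:
        evidence.extend(search(f"{entity} {keywords}"))
        verdict = classify(entity, evidence)
        if verdict is not None:
            return verdict
    # Nothing conclusive after all keyword sets: default to a negative call.
    return Verdict(entity, False, "No conclusive evidence found",
                   [url for url, _ in evidence])


# Hypothetical canned results standing in for a real search engine.
FAKE_WEB = {
    "Acme Energy company profile": [
        ("https://example.org/acme", "Acme Energy operates offshore oil and gas fields."),
    ],
}


def fake_search(query: str) -> list[Evidence]:
    return FAKE_WEB.get(query, [])


def fake_classify(entity: str, evidence: list[Evidence]) -> Optional[Verdict]:
    """Crude keyword check standing in for a language model's judgement."""
    for url, snippet in evidence:
        if any(term in snippet.lower() for term in ("oil", "gas", "coal")):
            return Verdict(entity, True, f"Source mentions fossil fuels: {snippet}", [url])
    return None


keyword_sets = ["company profile", "industry sector"]
verdicts = [
    research_entity(name, fake_search, fake_classify, keyword_sets)
    for name in ["Acme Energy", "Green Bakery"]
]
# Only positive classifications are queued for manual review.
needs_review = [v for v in verdicts if v.is_fossil_fuel]
```

Passing the search and classification steps in as functions keeps the loop itself independent of any particular search engine or model, which is one plausible way to support swapping them.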

Access to Augmenta

You can find instructions on how to install and use Augmenta on the project’s GitHub page.

In addition to the basic workflow described above, Augmenta includes a few useful features:

No code: While some technical knowledge is needed to install and run Augmenta, you don’t need to write any code.
Search engine and model selection: Multiple search engines and Large Language Models are supported out of the box.
Output validation: Augmenta checks that the AI’s response matches the format you need.
Caching: Progress is saved so that the classification can be resumed if it was interrupted.
Asynchronicity: Augmenta classifies multiple records at the same time to make the entire process speedier.
Third-party tools: Support for MCP servers to extend the agent’s functionality.
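
To make the output validation, caching and asynchronicity features more concrete, here is a minimal Python sketch of how the three ideas can fit together. It is not Augmenta's implementation: `validate`, `fake_model`, `classify_all`, the `ALLOWED_LABELS` schema and the JSON cache file are assumptions made up for this example, and the "model" is a keyword check that sleeps briefly to imitate a slow network call.

```python
import asyncio
import json
import tempfile
from pathlib import Path

ALLOWED_LABELS = {"yes", "no", "unsure"}  # assumed output schema


def validate(raw: str) -> dict:
    """Parse a model response and reject anything outside the expected format."""
    data = json.loads(raw)
    if not isinstance(data, dict) or data.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected response: {raw!r}")
    if not isinstance(data.get("explanation"), str):
        raise ValueError("explanation must be a string")
    return data


async def fake_model(record: str) -> str:
    """Stand-in for a slow language model call; returns a JSON string."""
    await asyncio.sleep(0.01)
    label = "yes" if "oil" in record.lower() else "no"
    return json.dumps({"label": label, "explanation": f"keyword check on {record!r}"})


async def classify_all(records, cache_path: Path, max_concurrency: int = 3) -> dict:
    """Classify records concurrently, skipping any already saved in the cache."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    limit = asyncio.Semaphore(max_concurrency)  # cap simultaneous model calls

    async def worker(record: str) -> None:
        if record in cache:  # finished in an earlier run
            return
        async with limit:
            result = validate(await fake_model(record))
        cache[record] = result
        cache_path.write_text(json.dumps(cache))  # save progress after every record

    await asyncio.gather(*(worker(r) for r in records))
    return cache


cache_file = Path(tempfile.mkdtemp()) / "progress.json"
results = asyncio.run(classify_all(["Acme Oil Ltd", "Green Bakery"], cache_file))
```

Running `classify_all` again with the same cache file only processes records that are not already stored, so an interrupted run can pick up where it left off.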

Augmenta is a work in progress, and you might encounter a bug here or there. We welcome both code contributions and issue reports.