Exsclaim

EXtraction, Separation, and Caption-based natural Language Annotation of IMages from scientific figures [wiki] [paper]

Getting started

This repo is currently private in order to protect the MaterialEyes© source code while under development. Contact me if you are a potential user or developer, and I will set you up with the permissions needed to view or contribute to the code.

Example Usage

Formulate a JSON search query

A JSON search query is the single point of entry to the EXSCLAIM! search and retrieval tools. Here we query Nature journals for figures related to HAADF-STEM images of gold nanoparticles. Limited to the 5 most relevant hits, the query might look something like this:

{   
    "name": "nature-haadf-gold-NP",
    "journal_family": "nature",
    "maximum_scraped": 5,
    "sortby": "relevant",
    "query":
    {
        "search_field_1":
        {
            "term":"gold nanoparticle",
            "synonyms":["Au nanoparticle","Au NP"]
        },
        "search_field_2": 
        {
            "term":"HAADF-STEM",
            "synonyms":["HAADF image"]
        }
    },
    "results_dir": "extracted/nature-haadf-gold-NP/"
}

Saving the query avoids having to rebuild its structure for each new search and establishes provenance for the extraction results. Additional JSON search query examples can be found in the test folder in the root directory.
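
A saved query can also be generated programmatically rather than written by hand. The following is a minimal sketch using only Python's standard library. The helper names build_query and save_query are hypothetical and not part of EXSCLAIM!, and the rule that results_dir is derived from the query name is an assumption taken from the single example above.

```python
import json
from pathlib import Path


def build_query(name, journal_family, terms, maximum_scraped=5, sortby="relevant"):
    """Assemble a query dict with the same keys as the example above.

    `terms` is a list of (term, synonyms) pairs; each pair becomes one
    "search_field_N" entry. Illustrative helper, not an EXSCLAIM! API.
    """
    search_fields = {
        f"search_field_{i}": {"term": term, "synonyms": list(synonyms)}
        for i, (term, synonyms) in enumerate(terms, start=1)
    }
    return {
        "name": name,
        "journal_family": journal_family,
        "maximum_scraped": maximum_scraped,
        "sortby": sortby,
        "query": search_fields,
        # Assumption: results are stored under extracted/<name>/ as in the example.
        "results_dir": f"extracted/{name}/",
    }


def save_query(query, directory="query"):
    """Write the query to <directory>/<name>.json and return the file path."""
    path = Path(directory) / f"{query['name']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(query, indent=4), encoding="utf-8")
    return path


# Rebuild the gold nanoparticle query shown above (nothing is written to disk here).
gold_haadf_query = build_query(
    name="nature-haadf-gold-NP",
    journal_family="nature",
    terms=[
        ("gold nanoparticle", ["Au nanoparticle", "Au NP"]),
        ("HAADF-STEM", ["HAADF image"]),
    ],
)
```

Calling save_query(gold_haadf_query) would then write query/nature-haadf-gold-NP.json, which keeps each search on record alongside its results.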

Create an annotated materials imaging dataset from extracted figures

With the query from above (nature-haadf-gold-NP.json), we extract relevant figures and construct a dataset of annotated images using a Pipeline class that runs the JournalScraper, CaptionSeparator, and FigureSeparator tools, in that order.

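from exsclaim.pipeline import Pipeline
# Import the tools by name instead of using a wildcard import
from exsclaim.tool import JournalScraper, CaptionSeparator, FigureSeparator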

# Set query path
query_path = "query/nature-haadf-gold-NP.json"

# Initialize EXSCLAIM! tool(s)
js = JournalScraper()
cs = CaptionSeparator()
fs = FigureSeparator()

# Define run order in a tools list
tools = [js, cs, fs]

# Initialize EXSCLAIM! pipeline
exsclaim_pipeline = Pipeline(query_path=query_path)

# Run the tools through the pipeline
exsclaim_pipeline.run(tools) 

# Save image and label (.csv) results to file
exsclaim_pipeline.to_file()

Calling to_file() saves the extracted images into a folder created for each figure and writes a ‘labels.csv’ file with an entry for each image and its annotation. For a more concise record of the search results, an ‘exsclaim.json’ file is also created, which records the URLs of the extracted figures along with the bounding box information and associated caption text for each detected image.
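
To check what to_file() produced, the output can be inspected with the standard library alone. This is a rough sketch under stated assumptions: it assumes that labels.csv and exsclaim.json sit directly inside the results_dir named in the query, and it makes no assumption about CSV column names or JSON keys. The helper summarize_results is hypothetical and not part of EXSCLAIM!.

```python
import csv
import json
from pathlib import Path


def summarize_results(results_dir):
    """Count figure folders, label rows, and top-level JSON entries.

    Assumes (not confirmed by the source) that labels.csv and exsclaim.json
    are written directly inside `results_dir`. Missing files yield None.
    """
    root = Path(results_dir)
    summary = {
        "figure_folders": sum(1 for p in root.iterdir() if p.is_dir()),
        "label_rows": None,
        "json_entries": None,
    }

    labels_path = root / "labels.csv"
    if labels_path.is_file():
        with labels_path.open(newline="", encoding="utf-8") as handle:
            # Count every non-empty row; a header row, if present, is included.
            summary["label_rows"] = sum(1 for row in csv.reader(handle) if row)

    json_path = root / "exsclaim.json"
    if json_path.is_file():
        with json_path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Only containers have a meaningful entry count.
        if isinstance(data, (dict, list)):
            summary["json_entries"] = len(data)

    return summary
```

For the query above, summarize_results("extracted/nature-haadf-gold-NP/") would report how many figure folders and label rows were written, which gives a quick sanity check against the maximum_scraped limit.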

Example Output

Acknowledgements