EXtraction, Separation, and Caption-based natural Language Annotation of IMages from scientific figures [wiki] [paper]
This repo is currently private in order to protect the MaterialEyes© source code while it is under development. Contact me if you are a potential user or developer and I will set you up with the proper permissions to view/contribute to the code.
A JSON search query is the single point of entry for using the EXSCLAIM! search and retrieval tools. Here we query Nature journals for figures related to HAADF-STEM images of gold nanoparticles. Limiting the results to the 5 most relevant hits, the query might look something like:
{
  "name": "nature-haadf-gold-NP",
  "journal_family": "nature",
  "maximum_scraped": 5,
  "sortby": "relevant",
  "query": {
    "search_field_1": {
      "term": "gold nanoparticle",
      "synonyms": ["Au nanoparticle", "Au NP"]
    },
    "search_field_2": {
      "term": "HAADF-STEM",
      "synonyms": ["HAADF image"]
    }
  },
  "results_dir": "extracted/nature-haadf-gold-NP/"
}
Saving the query avoids having to reformulate the structure for each new search and establishes provenance for the extraction results. Additional JSON search query examples can be found in the test directory at the repository root.
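As a minimal sketch of saving a query, the example above can be built as a Python dict and written to disk with the standard library. The helper names `save_query` and `load_query` are hypothetical and not part of EXSCLAIM!; the default `query` directory is an assumption chosen to match the path used in the pipeline example.

```python
import json
from pathlib import Path

# The example query from above, expressed as a Python dict.
query = {
    "name": "nature-haadf-gold-NP",
    "journal_family": "nature",
    "maximum_scraped": 5,
    "sortby": "relevant",
    "query": {
        "search_field_1": {
            "term": "gold nanoparticle",
            "synonyms": ["Au nanoparticle", "Au NP"],
        },
        "search_field_2": {
            "term": "HAADF-STEM",
            "synonyms": ["HAADF image"],
        },
    },
    "results_dir": "extracted/nature-haadf-gold-NP/",
}


def save_query(query, directory="query"):
    """Hypothetical helper: write the query to <directory>/<name>.json."""
    path = Path(directory) / f"{query['name']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(query, indent=2), encoding="utf-8")
    return path


def load_query(path):
    """Hypothetical helper: read a saved query back into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `save_query(query)` would produce `query/nature-haadf-gold-NP.json`, and `load_query` on that path returns the same dict, so a saved query can be reused or edited for a later search.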
With the query from above (nature-haadf-gold-NP.json), we extract relevant figures and construct a dataset of annotated images by using a Pipeline class to implement a JournalScraper, CaptionSeparator, and FigureSeparator interface (in this order).
from exsclaim.pipeline import Pipeline
from exsclaim.tool import JournalScraper, CaptionSeparator, FigureSeparator

# Set query path
query_path = "query/nature-haadf-gold-NP.json"

# Initialize EXSCLAIM! tool(s)
js = JournalScraper()
cs = CaptionSeparator()
fs = FigureSeparator()

# Define run order in a tools list
tools = [js, cs, fs]

# Initialize EXSCLAIM! pipeline
exsclaim_pipeline = Pipeline(query_path=query_path)

# Run the tools through the pipeline
exsclaim_pipeline.run(tools)

# Save image and label (.csv) results to file
exsclaim_pipeline.to_file()
Calling to_file() saves extracted images into folders created for each figure and writes a ‘labels.csv’ file that includes an entry for each image and its respective annotation. For a more concise record of the search results, an ‘exsclaim.json’ file is also created, which records URLs for the extracted figures, bounding box information, and the associated caption text for each detected image.
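To get a quick overview of what was written, the per-figure folders can be counted with the standard library. This is only a sketch: it assumes the output lands under the query's `results_dir` with one subfolder per figure, which the description above suggests but does not spell out, and `summarize_results` is a hypothetical helper, not an EXSCLAIM! function.

```python
from pathlib import Path


def summarize_results(results_dir):
    """Hypothetical helper: map each subfolder name to its file count.

    Assumes one subfolder per extracted figure under results_dir.
    """
    root = Path(results_dir)
    summary = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            # Count every regular file below this figure folder.
            summary[entry.name] = sum(1 for f in entry.rglob("*") if f.is_file())
    return summary
```

For the example query, `summarize_results("extracted/nature-haadf-gold-NP/")` would return a dict keyed by folder name, which makes it easy to spot figures for which no images were extracted.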