Notes on Chemical Database Web Crawler

July 4, 2025

While I was leisurely slacking off in the lab, a WeChat message from my advisor suddenly broke my peace. Turns out he sent me a Selenium tutorial from a WeChat public account, asking me to use web scraping to get a reference table of chemical constituents for the mass spec data of medicinal herbs. That’s how this set of notes came about.

PubChem

I had just learned Selenium, so I jumped straight in without much thought, only to realize afterward that this was definitely a mistake. My advisor's research focuses on Dendrobium nobile, so I searched for it on PubChem and, sure enough, found its Taxonomy page. Scrolling down, I saw that the Chemicals and Bioactivities list can be downloaded. Click, download, done; back to slacking off.

But it wasn’t that simple. The file pubchem_taxid_xxxxx_consolidatedcompoundtaxonomy.csv only contains the compound names, IDs, and links to the source databases. For mass spec data analysis, I at least need the chemical formula and molecular weight. Checking the links, I realized the Dendrobium data mainly comes from the following three databases:

  1. KNApSAcK Species-Metabolite Database
  2. Natural Product Activity and Species Source (NPASS)
  3. Wikidata

This is the perfect use case for a web scraper. For each compound, all I need to do is visit the corresponding database entry sequentially and record its formula and molecular weight. Since PubChem exports Metabolites and Natural Products as two separate .csv files, I wrote a simple script to merge them:

#!/usr/bin/env python3
import os
import sys

import pandas as pd


def main():
    input_files = ['metabolites.csv', 'natural_products.csv']
    dfs = []
    for fname in input_files:
        # Load and validate each input file
        if not os.path.isfile(fname):
            sys.exit(f"Input file not found: {fname}")
        # Read only the necessary columns from the CSV
        df = pd.read_csv(
            fname,
            usecols=['Compound_CID', 'Compound', 'Source_Chemical', 'Source_Chemical_URL']
        )
        dfs.append(df)
    # Concatenate both DataFrames into a single one
    combined = pd.concat(dfs, ignore_index=True)
    # Remove duplicate entries based on Compound_CID, keeping the first occurrence
    combined = combined.drop_duplicates(subset=['Compound_CID'], keep='first')
    output_file = 'pubchem_combined.csv'
    combined.to_csv(output_file, index=False)
    print(f"Merged CSV written to: {output_file}")


if __name__ == "__main__":
    main()

Selenium

The logic for the web scraper script is fairly straightforward:

  1. Use the merged .csv file from above as the input, with each row containing a Source_Chemical_URL column pointing to the respective database entry.
  2. For each URL, use XPath and a site-specific parser to extract the required data from the page’s HTML structure.
    1. Use modular design, with a dispatcher to call different parsers based on the domain (NPASS, Knapsack, or Wikidata).
  3. After parsing, write the results back to the .csv file, adding two columns: molecular weight and chemical formula.

Basic Configuration

First, configure ChromeDriver:

# Module-level imports used by the scraper snippets below
import argparse
import shutil
import sys
from urllib.parse import urlparse

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


def setup_driver(headless=False):
    # Find the chromedriver executable on the system PATH
    chromedriver_path = shutil.which("chromedriver")
    if not chromedriver_path:
        sys.exit("ERROR: chromedriver executable not found in PATH.")
    # Set up Chrome options (headless if requested)
    options = webdriver.ChromeOptions()
    options.page_load_strategy = 'eager'
    if headless:
        options.add_argument('--headless=new')
        options.add_argument('--disable-gpu')
    # Return a webdriver.Chrome instance
    service = Service(chromedriver_path)
    driver = webdriver.Chrome(service=service, options=options)
    # driver.set_page_load_timeout(PAGE_LOAD_TIMEOUT)  # NPASS pages are really slow
    return driver

Next, set the parameters. Here the page-load timeout is 30 seconds and the interval between fetches is one second. Since NPASS loads slowly, an even longer timeout would actually be needed; in practice I stopped enforcing the page-load timeout altogether (hence the commented-out set_page_load_timeout call above), so slow pages no longer raise a TimeoutException.

PAGE_LOAD_TIMEOUT = 30 # how long to wait for a page to load
DEFAULT_PAUSE = 1.0 # seconds to wait between scrapes to avoid overloading servers

Parsers

Knapsack

[Screenshot: KNApSAcK compound entry page]

Knapsack stores data in this structure:

<tr>
  <th class="inf">Formula</th>
  <td colspan="4">C15H14O3</td>
</tr>
<tr>
  <th class="inf">Mw</th>
  <td colspan="4">242.09429431</td>
</tr>

The data is stored as “structured rows”: in each <tr> element, <th> is the label and <td> is the value. So I wrote a simple get_text_label_in_table helper to reliably match the text in <th> (such as “Formula” or “Mw”) and then obtain the adjacent <td>.

Here’s the code:

# Looks in a table for a row with the given label and gets the corresponding value from the same row
# Find any table row <tr> where the first cell (whether it's a <th> or <td>) exactly matches the label.
def get_text_label_in_table(driver, label):
    try:
        row = driver.find_element(
            By.XPATH,
            f"//table//tr[normalize-space(.//th[1] | .//td[1])='{label}']"
        )
        return row.find_element(By.XPATH, './td[1]').text.strip()
    except NoSuchElementException:
        return None


# Parser for Knapsack
# Extracts Formula and Mw from chemical entry pages on knapsackfamily.com using table-based scraping
def parse_knapsack(driver):
    formula = get_text_label_in_table(driver, 'Formula')
    weight = get_text_label_in_table(driver, 'Mw')
    if weight is None:
        weight = get_text_label_in_table(driver, 'Molecular weight')
    return weight, formula

NPASS

[Screenshot: NPASS compound entry page]

<tr>
  <td width="70%" align="right">Molecular Weight: &nbsp;</td>
  <td width="30%" align="center">154.03</td>
</tr>

NPASS is a bit trickier. Sometimes it uses <dt>/<dd> as definition tags, but in other cases it uses regular tables without <th>, placing the text directly in <td>. So I needed more logic:

  1. First, try to extract from <dt>/<dd> (preferred method).
  2. If not found, search for td[1][contains(normalize-space(.),'Molecular Weight')] and then get the value in the next <td>.

Here’s the code:

# Parser for NPASS
# Extracts Formula and Molecular Weight from npass.bidd.group using <dt>/<dd> tags and falls back to table parsing if needed
def parse_npass(driver):
    # Try extracting formula from <dt>/<dd>
    try:
        formula = driver.find_element(
            By.XPATH,
            "//dt[contains(normalize-space(),'Molecular Formula')]/following-sibling::dd[1]"
        ).text.strip()
    except NoSuchElementException:
        formula = None
    # Try extracting weight from <dt>/<dd>
    try:
        weight = driver.find_element(
            By.XPATH,
            "//dt[contains(normalize-space(),'Molecular Weight')]/following-sibling::dd[1]"
        ).text.strip()
    except NoSuchElementException:
        weight = None
    # NPASS often uses a <table class="table_with_border">…</table> for Mw;
    # if the dt/dd lookup failed or returned '0', fall back to grabbing from the table.
    if not weight or weight == '0':
        try:
            weight = driver.find_element(
                By.XPATH,
                "//table[contains(@class,'table_with_border')]"
                "//tr[td[1][contains(normalize-space(.),'Molecular Weight')]]/td[2]"
            ).text.strip()
        except NoSuchElementException:
            weight = None
    return weight, formula

Wikidata

When I wrote a parser for Wikidata, I realized that Wikidata provides an API with a wbgetentities action, so I can directly get a clean JSON response from https://www.wikidata.org/w/api.php?action=wbgetentities&ids=QXXX&props=claims&format=json:

{
  "entities": {
    "Qxxx": {
      "claims": {
        "P274": [...],   // Formula
        "P2067": [...]   // Molecular weight
      }
    }
  }
}

All I need is to read the values from the JSON file:

# Parser for Wikidata
# Uses the Wikidata API to extract:
#   P274:  chemical formula
#   P2067: molecular weight
# Handles potential nested dictionary responses
def parse_wikidata(entity_id):
    import urllib.request, json
    api_url = (
        'https://www.wikidata.org/w/api.php'
        '?action=wbgetentities&ids=%s&props=claims&format=json' % entity_id
    )
    try:
        with urllib.request.urlopen(api_url, timeout=PAGE_LOAD_TIMEOUT) as f:
            data = json.load(f)
        claims = data['entities'][entity_id]['claims']
        formula = None
        weight = None
        if 'P274' in claims:
            formula = claims['P274'][0]['mainsnak']['datavalue']['value']
        if 'P2067' in claims:
            weight = claims['P2067'][0]['mainsnak']['datavalue']['value']
        # Wikidata returns a dict {'amount': '+<value>', 'unit': ...}; extract the numeric amount
        if isinstance(weight, dict):
            raw_amount = weight.get('amount')
            weight = raw_amount.lstrip('+') if raw_amount is not None else None
        return weight, formula
    except Exception as e:
        print(f"WARNING: failed to fetch Wikidata {entity_id}: {e}", file=sys.stderr)
        return None, None

Here, I realized that scraping content directly from the UI using a crawler might be a very poor choice, which I’ll revisit later in this article.

Dispatcher

After implementing all the parsers, I also needed a dispatcher to select the correct parser based on the domain in the URL. The structure is very simple:

# Dispatcher: Determine Which Parser to Use
def dispatch_parse(driver, url):
    hostname = urlparse(url).hostname or ''
    if 'knapsackfamily.com' in hostname:
        driver.get(url)
        return parse_knapsack(driver)
    if 'bidd.group' in hostname:
        driver.get(url)
        return parse_npass(driver)
    if 'wikidata.org' in hostname:
        entity_id = url.rstrip('/').rsplit('/', 1)[-1]
        return parse_wikidata(entity_id)
    print(f"WARNING: no parser available for {url}", file=sys.stderr)
    return None, None

CLI

I wanted my script to work as a general tool, so I wrote a simple command-line interface with four main parameters: input_csv, output_csv, --headless, and --pause.

# Command-line Interface
if __name__ == '__main__':
    p = argparse.ArgumentParser(
        description='Scrape molecular weight and formula for pubchem chemicals.'
    )
    p.add_argument('input_csv', help='Input CSV (final_pubchem.csv)')
    p.add_argument('output_csv', help='Output CSV with Mw and Formula')
    p.add_argument('--headless', action='store_true', help='Run Chrome in headless mode')
    p.add_argument('--pause', type=float, default=DEFAULT_PAUSE,
                   help='Seconds to pause between requests')
    args = p.parse_args()
    main(args.input_csv, args.output_csv, args.pause, args.headless)
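
The main function that this CLI hands off to isn't shown above. Here is a minimal sketch of what it could look like, reusing setup_driver and dispatch_parse and reading the Source_Chemical_URL column of the merged CSV; the output column names Molecular_Weight and Molecular_Formula are just placeholders I chose for illustration:

import time

import pandas as pd


def main(input_csv, output_csv, pause=DEFAULT_PAUSE, headless=False):
    df = pd.read_csv(input_csv)
    driver = setup_driver(headless=headless)
    weights, formulas = [], []
    try:
        for url in df['Source_Chemical_URL']:
            weight, formula = dispatch_parse(driver, url)
            weights.append(weight)
            formulas.append(formula)
            time.sleep(pause)  # be polite to the source databases
    finally:
        driver.quit()
    # Write the two new columns back next to the original data
    df['Molecular_Weight'] = weights
    df['Molecular_Formula'] = formulas
    df.to_csv(output_csv, index=False)
    print(f"Wrote {len(df)} rows to {output_csv}")

Invoked, for example, as python scraper.py pubchem_combined.csv pubchem_with_mw.csv --headless (the script name is whatever you saved it as).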

PUG REST API

As mentioned above, when writing the Wikidata scraper, I realized that web scraping was not the most elegant solution for “retrieving reference tables of chemical constituents.” I soon learned that PubChem actually provides an API platform: PUG REST, and in fact, bulk scraping of web pages is discouraged. I quickly decided to correct my mistake and rewrote the whole script.

[Diagram: structure of a PUG REST request URL]

PUG REST queries are all based on PubChem identifiers: SID represents substance IDs, CID stands for compound IDs, and AID for assay IDs. To query information related to a substance/compound/assay, you can use a URL structure like:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/vioxx/property/InChI/TXT

  prolog:    https://pubchem.ncbi.nlm.nih.gov/rest/pug
  input:     /compound/name/vioxx
  operation: /property/InChI
  output:    /TXT
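
Just to make the anatomy concrete, here is a throwaway one-off request for the example URL above (not part of the final script):

import urllib.request

url = ('https://pubchem.ncbi.nlm.nih.gov/rest/pug'
       '/compound/name/vioxx/property/InChI/TXT')
with urllib.request.urlopen(url, timeout=30) as resp:
    # Prints the InChI string for rofecoxib (the compound marketed as Vioxx)
    print(resp.read().decode().strip())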

It also supports a wide array of output formats:

Output Format | Description
------------- | -----------
XML           | standard XML, for which a schema is available
JSON          | JSON, JavaScript Object Notation
JSONP         | JSONP, like JSON but wrapped in a callback function
ASNB          | standard binary ASN.1, NCBI's native format in many cases
ASNT          | NCBI's human-readable text flavor of ASN.1
SDF           | chemical structure data
CSV           | comma-separated values, spreadsheet compatible
PNG           | standard PNG image data
TXT           | plain text

With this, I could quickly batch query a CID list via the API platform, while customizing what I wanted to retrieve. I need to call a URL like:

url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/{namespace}/property/{props}/JSON"

Here, namespace is the identifier type (cid in my case, or name when querying by compound name), the comma-separated list of IDs goes in the POST body rather than the URL, and props is one or more supported PUG REST property fields, e.g. MolecularFormula,MolecularWeight.

So, I can write a helper for querying:

import time

import requests

SESSION = requests.Session()


def pug_request(namespace, ids, props, retries=3):  # retries default is my own choice
    # The ID list is sent in the POST body; the property list appears both in the URL and the payload
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/{namespace}/property/{props}/JSON"
    payload = {namespace: ",".join(ids), "property": props}
    backoff = 1.0
    for attempt in range(retries):
        try:
            r = SESSION.post(url, data=payload, timeout=30)
            r.raise_for_status()
            data = r.json()["PropertyTable"]["Properties"]
            key_field = "CID" if namespace == "cid" else "Name"
            return {str(item[key_field]): item for item in data}
        except Exception as exc:
            if attempt == retries - 1:
                raise
            # Exponential backoff before the next attempt
            time.sleep(backoff)
            backoff *= 2
            continue
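
A quick sanity check of the helper (the CIDs are arbitrary examples; 2244 is aspirin, and the property names follow the PUG REST property table):

props = pug_request("cid", ["2244", "2519"], "MolecularFormula,MolecularWeight")
print(props["2244"]["MolecularFormula"], props["2244"]["MolecularWeight"])  # C9H8O4 plus its weight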

For a compound table with a column of CIDs (such as the merged table from before), I can quickly batch query their chemical formula and molecular weight, then add the information as new columns:

df = pd.read_csv(in_path)
if args.id_column not in df.columns:
    sys.exit(f"Column {args.id_column!r} not found in {in_path}")

# ensure CID keys are clean strings without '.0'
if args.cid:
    ids = df[args.id_column].astype(int).astype(str).tolist()
else:
    ids = df[args.id_column].astype(str).tolist()
namespace = "cid" if args.cid else "name"

# load cache if present
cache: Dict[str, Dict[str, str]] = {}
if cache_path:
    cache = load_cache(cache_path)

# figure out which IDs still need querying
to_query = [i for i in ids if i not in cache]
print(f"{len(ids)} total IDs / {len(to_query)} to query (cached {len(ids)-len(to_query)})")

# batch loop
for i in range(0, len(to_query), args.batch_size):
    batch = to_query[i : i + args.batch_size]
    print(f"Fetching batch {i // args.batch_size + 1} (size {len(batch)}) ...", end="", flush=True)
    try:
        props_dict = pug_request(namespace, batch, args.props)
        cache.update(props_dict)
        print(" done.")
    except Exception as exc:
        print(f" failed ({exc}).")
    time.sleep(args.sleep)

# save cache
if cache_path:
    save_cache(cache, cache_path)

# optionally remove cache file after run
if args.auto_delete_cache and cache_path.exists():
    cache_path.unlink()
    print("Deleted cache", cache_path)

# add columns back to DataFrame
prop_names = args.props.split(",")
for prop in prop_names:
    key_series = (
        df[args.id_column].astype(int).astype(str)
        if args.cid
        else df[args.id_column].astype(str)
    )
    df[prop] = key_series.map(lambda x: cache.get(x, {}).get(prop, ""))
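
The load_cache and save_cache helpers referenced above aren't shown in the snippet; a minimal sketch, assuming the cache is simply a JSON file on disk keyed by ID, could be:

import json
from pathlib import Path
from typing import Dict


def load_cache(cache_path: Path) -> Dict[str, Dict[str, str]]:
    # Return an empty cache if the file doesn't exist yet
    if not cache_path.exists():
        return {}
    with cache_path.open() as f:
        return json.load(f)


def save_cache(cache: Dict[str, Dict[str, str]], cache_path: Path) -> None:
    # Persist the cache as pretty-printed JSON
    with cache_path.open("w") as f:
        json.dump(cache, f, indent=2)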

Using the API is much faster than using a web crawler: in the time the scraper takes to load a single web page, the API can return properties for hundreds of compounds. Plus, I don't have to worry about triggering database rate limits, or about slow UI rendering on some databases causing timeouts.

After this, I wanted to further streamline the workflow, ideally automating the step of downloading a compound list from PubChem—so that given a Taxonomy ID, I could directly output a table including all necessary compound information. But there was a snag. PUG REST doesn’t directly link Taxonomy ID with CID; at best, I can go Taxonomy ID -> AID -> CID. The documentation explains:

Assays and Bioactivities

The following operation returns a list of assays (AIDs) associated with a given taxonomy. Valid output formats are XML, JSON(P), ASNT/B, and TXT.

https://pubchem.ncbi.nlm.nih.gov/rest/pug/taxonomy/taxid/2697049/aids/TXT

There is no operation available to directly retrieve the bioactivity data associated with a given taxonomy, as often the data volume is huge. However, one can first get the list of AIDs using the above link, and then aggregate the concise bioactivity data from each AID, e.g.:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1409578/concise/JSON

In practice, for many taxonomy entries—especially the natural herbal medicine taxa related to my project— /taxonomy/taxid/xxxxxxx/aids/ just returns 404, meaning there’s no corresponding Taxonomy ID -> AID mapping in the database. So I can’t get the CID list via AIDs.
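
For completeness, here is roughly what the Taxonomy ID -> AID -> CID detour looks like when it does work. This is only a sketch: instead of aggregating the concise bioactivity tables, it uses the simpler /assay/aid/{aid}/cids/TXT operation to collect CIDs, and for the herbal taxa mentioned above the very first request already fails with a 404.

def cids_for_taxid(taxid, timeout=30):
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    # Step 1: Taxonomy ID -> AIDs (this is the request that returns 404 for many plant taxa)
    r = SESSION.get(f"{base}/taxonomy/taxid/{taxid}/aids/TXT", timeout=timeout)
    r.raise_for_status()
    aid_list = r.text.split()

    # Step 2: AID -> CIDs, aggregated over every assay linked to the taxon
    cids = set()
    for aid in aid_list:
        r = SESSION.get(f"{base}/assay/aid/{aid}/cids/TXT", timeout=timeout)
        r.raise_for_status()
        cids.update(r.text.split())
    return sorted(cids, key=int)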

Mass Spectrometry Software

After spending lots of time, with my amateur programming skills, building parsers and trying to automate data collection from various databases, I started to wonder if someone else had done this sort of thing—only much better. A quick search yielded this list of mass spectrometry software. The field has a mature tool ecosystem; many tools are well-suited for my needs, and some even employ machine learning to directly predict peptide sequences from mass spec peaks.

For example, a friend at the University of Washington mentioned they use the Crux toolkit developed by their school, which covers all my needs much better and faster:

tide-index Create an index of all peptides in a fasta file, for use in subsequent calls to tide-search.

tide-search Search a collection of spectra against a sequence database, provided either as a FASTA file or an index, returning a collection of peptide-spectrum matches (PSMs). This is a fast search engine, but it runs most quickly if provided with a peptide index built with tide-index.

comet Search a collection of spectra against a sequence database, returning a collection of PSMs. This search engine runs directly on a protein database in FASTA format.

percolator Re-rank and assign confidence estimates to a collection of PSMs using the Percolator algorithm. Optionally, also produce protein rankings using the Fido algorithm.

kojak Search a collection of spectra against a sequence database, finding cross-linked peptide matches.

There are also many commercial tools with similar algorithms and full-featured frontends (including web UIs), such as InstaNovo. Just upload your spectral data and get Transformer model predictions right away.

Conclusion

After all my tinkering, it turns out I was just reinventing the wheel—a bit disappointing, but I don’t regret writing the scripts. It helped me understand how everything works behind the scenes; this was also my first real web scraping project. My biggest mistake was jumping straight into the “how” without considering the “why,” getting too hung up on the technique of “web scraping” rather than the goal of “analyzing MS data.” In scientific computing, standing on the shoulders of giants is often the way to go.