Notes on Scraping the PubChem Database

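I was happily slacking off in the lab when a WeChat message interrupted me: my advisor had sent a Selenium tutorial from a WeChat official account and asked me to write a scraper that builds a reference table of chemical constituents for the mass spectrometry data of medicinal herbs. Hence these notes.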

PubChem

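I had just learned Selenium, so I dove straight in without much thought; only in retrospect did I realize what a mistake that was. My advisor's research mainly concerns Dendrobium, so I searched PubChem for Dendrobium nobile and, sure enough, found an entry for it under Taxonomy. Scrolling down, I saw that the list under Chemicals and Bioactivities could be downloaded. Click download, call it a day, back to slacking off.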

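After downloading, it turned out not to be that simple. The file pubchem_taxid_xxxxx_consolidatedcompoundtaxonomy.csv contains only compound names, IDs, and links to the source databases. For mass spectrometry analysis I need at least each compound's molecular formula and molecular mass. A closer look at the links showed that the Dendrobium data come mainly from three databases: NPASS, Knapsack, and Wikidata.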

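This is exactly what a scraper is for. For each compound, I only need to visit the corresponding database entry and record its molecular formula and molecular mass. Because PubChem exports Metabolites and Natural Products as two separate .csv tables, I first wrote a simple script to merge them into one: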

#!/usr/bin/env python3
import os
import sys
import pandas as pd

def main():
    input_files = ['metabolites.csv', 'natural_products.csv']
    dfs = []

    for fname in input_files:

        # Load and validate each input file
        if not os.path.isfile(fname):
            sys.exit(f"Input file not found: {fname}")

        # Read only the necessary columns from the CSV
        df = pd.read_csv(
            fname,
            usecols=['Compound_CID', 'Compound', 'Source_Chemical', 'Source_Chemical_URL']
        )
        dfs.append(df)

    # Concatenate both DataFrames into a single one
    combined = pd.concat(dfs, ignore_index=True)
    # Remove duplicate entries based on Compound_CID, keeping the first occurrence
    combined = combined.drop_duplicates(subset=['Compound_CID'], keep='first')

    output_file = 'pubchem_combined.csv'
    combined.to_csv(output_file, index=False)
    print(f"Merged CSV written to: {output_file}")

if __name__ == "__main__":
    main()

Selenium

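The idea behind the scraper script is fairly simple:

It takes the .csv file prepared in the previous step as input; each row contains a Source_Chemical_URL column pointing to the corresponding database entry.
For each URL, a site-specific parser uses XPath to extract the required data from the page's HTML structure.
The design is modular: a dispatcher calls a different parser depending on the domain (NPASS, Knapsack, or Wikidata).
After parsing, the results are written back to the .csv table with two additional columns: molecular mass and molecular formula.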
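The steps above can be sketched as a small loop. This is a minimal sketch rather than the original script: enrich_rows and the stand-in fake_parse are hypothetical names, the output column names Mw and Formula are an assumption, the sample rows and values are made up, and the parser is passed in as a plain function so that the loop can be tried without a browser.

```python
import csv
import io
import time


def enrich_rows(rows, parse_fn, pause=0.0):
    """Append Mw and Formula to each row by calling parse_fn on its URL."""
    enriched = []
    for row in rows:
        url = row.get("Source_Chemical_URL", "")
        # parse_fn returns (weight, formula); either may be None
        weight, formula = parse_fn(url) if url else (None, None)
        new_row = dict(row)
        new_row["Mw"] = weight or ""
        new_row["Formula"] = formula or ""
        enriched.append(new_row)
        if pause:
            time.sleep(pause)  # be gentle with the source databases
    return enriched


def fake_parse(url):
    """Stand-in parser for illustration; a real one would load the page."""
    return ("242.09", "C15H14O3") if "knapsack" in url else (None, None)


# Made-up input rows shaped like the merged CSV
sample_csv = io.StringIO(
    "Compound_CID,Compound,Source_Chemical_URL\n"
    "1,Example A,https://example.org/knapsack/1\n"
    "2,Example B,https://example.org/other/2\n"
)
result = enrich_rows(list(csv.DictReader(sample_csv)), fake_parse)
```

Passing the parser in as an argument keeps the loop independent of Selenium, so the dispatcher described later could be plugged in as parse_fn.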

Basic Configuration

First, configure ChromeDriver:

import shutil
import sys

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def setup_driver(headless=False):

    # Finds the chromedriver executable in your system
    chromedriver_path = shutil.which("chromedriver")
    if not chromedriver_path:
        sys.exit("ERROR: chromedriver executable not found in PATH.")

    # Sets up Chrome options (headless if requested)
    options = webdriver.ChromeOptions()
    options.page_load_strategy = 'eager'
    if headless:
        options.add_argument('--headless=new')
        options.add_argument('--disable-gpu')

    # Returns a webdriver.Chrome instance
    service = Service(chromedriver_path)
    driver = webdriver.Chrome(service=service, options=options)
    # driver.set_page_load_timeout(PAGE_LOAD_TIMEOUT)   # npass websites are really slow
    return driver

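Next come the parameters. I set the timeout to 30 seconds and the interval between scrapes to 1 second. Because the NPASS site loads slowly, it actually needs a longer timeout; in practice I simply got rid of the TimeoutException altogether (the set_page_load_timeout call in setup_driver is commented out).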

PAGE_LOAD_TIMEOUT = 30  # how long to wait for a page to load
DEFAULT_PAUSE = 1.0   # seconds to wait between scrapes to avoid overloading servers
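One way to apply the pause between requests is a small throttle that sleeps only for whatever remains of the interval. This is a sketch under assumptions: Throttle is a hypothetical helper that is not part of the original script, and the clock and sleep functions are injectable only so that the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Ensure at least `interval` seconds pass between consecutive wait() calls."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval that has not elapsed yet
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraping loop, creating throttle = Throttle(1.0) once and calling throttle.wait() before each page load would play the role of the fixed pause, without adding extra delay when a slow page has already used up the interval.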

Parsers

Knapsack

[Figure: Knapsack entry page]

Knapsack stores its data in a structure like this:

<tr>
   <th class="inf">Formula</th>
   <td colspan="4">C15H14O3</td>
</tr>
<tr>
   <th class="inf">Mw</th>
   <td colspan="4">242.09429431</td>
</tr>

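The data are stored as structured rows: within each <tr> element, the <th> holds the label and the <td> holds the value. This lets me write a simple helper, get_text_label_in_table, that reliably matches the text in the <th> (such as "Formula" or "Mw") and then reads the adjacent <td>.

The code is as follows: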

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# Looks in a table for a row with the given label and gets the corresponding value from the same row
# Find any table row <tr> where the first cell (whether it's a <th> or <td>) exactly matches the label.
def get_text_label_in_table(driver, label):
    try:
        row = driver.find_element(
            By.XPATH,
            f"//table//tr[normalize-space(.//th[1] | .//td[1])='{label}']"
        )
        # The value is the <td> right after the label cell, so this works for
        # both <th>/<td> rows and <td>/<td> rows (where ./td[1] would be the label itself)
        return row.find_element(
            By.XPATH, './*[self::th or self::td][1]/following-sibling::td[1]'
        ).text.strip()
    except NoSuchElementException:
        return None

# Parser for Knapsack
# Extracts Formula and Mw from chemical entry pages on knapsackfamily.com using table-based scraping
def parse_knapsack(driver):
    formula = get_text_label_in_table(driver, 'Formula')
    weight = get_text_label_in_table(driver, 'Mw')
    if weight is None:
        weight = get_text_label_in_table(driver, 'Molecular weight')
    return weight, formula
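The label-to-value lookup described above can also be tried offline, without a browser. The following is a minimal sketch using only the standard library's html.parser; LabelValueParser is a hypothetical class, not part of the original script, and it assumes flat (non-nested) tables in which the first cell of a row is the label and the second is the value.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect {label: value} pairs from table rows whose first cell is the label."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._cells = None     # cell texts of the row being read, None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("th", "td") and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell and self._cells:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Collapse whitespace, then store first cell -> second cell
            cells = [" ".join(c.split()) for c in self._cells]
            if len(cells) >= 2:
                self.pairs[cells[0]] = cells[1]
            self._cells = None


html = """
<table>
<tr><th class="inf">Formula</th><td colspan="4">C15H14O3</td></tr>
<tr><th class="inf">Mw</th><td colspan="4">242.09429431</td></tr>
</table>
"""
parser = LabelValueParser()
parser.feed(html)
pairs = parser.pairs
```

Running a parser like this against a saved copy of a page is a cheap way to check the row structure before pointing Selenium at the live site.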

NPASS

[Figure: NPASS entry page]

<tr>
    <td width="70%" align="right">Molecular Weight: &nbsp;</td>
    <td width="30%" align="center">154.03</td>
</tr>

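NPASS is more complicated by comparison. Sometimes it uses definition-list markup such as <dt>/<dd>, but in other cases it uses a plain table with no <th> tags, where the label text sits directly in a <td>. I therefore need more branching logic:

First try the <dt>/<dd> lookup (the preferred method).
If nothing is found, look for the row matching td[1][contains(normalize-space(.),'Molecular Weight')] and read the value from the next <td>.

The code is as follows: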

# Parser for NPASS
# Extracts Formula and Molecular Weight from npass.bidd.group using <dt>/<dd> tags and fallbacks to table parsing if needed
def parse_npass(driver):

    # Try extracting formula from <dt>/<dd>
    try:
        formula = driver.find_element(
            By.XPATH,
            "//dt[contains(normalize-space(),'Molecular Formula')]/following-sibling::dd[1]"
        ).text.strip()
    except NoSuchElementException:
        formula = None

    # Try extracting weight from <dt>/<dd>
    try:
        weight = driver.find_element(
            By.XPATH,
            "//dt[contains(normalize-space(),'Molecular Weight')]/following-sibling::dd[1]"
        ).text.strip()
    except NoSuchElementException:
        weight = None

    # NPASS often uses a <table class="table_with_border">…</table> for Mw;
    # if the dt/dd lookup failed or returned '0', fall back to grabbing from the table.
    if not weight or weight == '0':
        try:
            weight = driver.find_element(
                By.XPATH,
                "//table[contains(@class,'table_with_border')]"
                "//tr[td[1][contains(normalize-space(.),'Molecular Weight')]]/td[2]"
            ).text.strip()
        except NoSuchElementException:
            weight = None

    return weight, formula
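The "try the preferred lookup, then fall back" logic can be factored into a small helper. This is only a sketch of the pattern: first_valid is a hypothetical name, LookupError stands in for Selenium's NoSuchElementException so the example runs without a browser, and treating '0' as a placeholder mirrors the check in parse_npass.

```python
def first_valid(extractors, invalid=(None, "", "0")):
    """Return the first extractor result that is not a placeholder, else None."""
    for extract in extractors:
        try:
            value = extract()
        except LookupError:
            # Treat "not found" as a miss and move on to the next strategy
            continue
        if value not in invalid:
            return value
    return None


def missing_dt_dd():
    # Simulates a lookup strategy that finds nothing on the page
    raise KeyError("no <dt> element")


weight = first_valid([missing_dt_dd, lambda: "0", lambda: "154.03"])
```

With Selenium, each extractor would wrap one find_element call and the except clause would catch NoSuchElementException instead.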

Wikidata

While writing the parser for Wikidata, I found that Wikidata provides an API with a wbgetentities action. A request to https://www.wikidata.org/w/api.php?action=wbgetentities&ids=QXXX&props=claims&format=json returns a clean JSON document directly:

{
  "entities": {
    "Qxxx": {
      "claims": {
        "P274": [...],  // Formula
        "P2067": [...]  // Molecular weight
      }
    }
  }
}

All that remains is to read the values out of the JSON. The code is as follows:

# Parser for Wikidata
# Uses the Wikidata API to extract:
#   P274: chemical formula
#   P2067: molecular weight
# Handles potential nested dictionary responses
def parse_wikidata(entity_id):
    import urllib.request, json

    api_url = (
        'https://www.wikidata.org/w/api.php'
        '?action=wbgetentities&ids=%s&props=claims&format=json' % entity_id
    )
    try:
        with urllib.request.urlopen(api_url, timeout=PAGE_LOAD_TIMEOUT) as f:
            data = json.load(f)
        claims = data['entities'][entity_id]['claims']
        formula = None
        weight = None
        if 'P274' in claims:
            formula = claims['P274'][0]['mainsnak']['datavalue']['value']
        if 'P2067' in claims:
            weight = claims['P2067'][0]['mainsnak']['datavalue']['value']
            # Wikidata returns a dict {'amount': '+<value>', 'unit': ...}; extract the numeric amount
            if isinstance(weight, dict):
                raw_amount = weight.get('amount')
                weight = raw_amount.lstrip('+') if raw_amount is not None else None
        return weight, formula
    except Exception as e:
        print(f"WARNING: failed to fetch Wikidata {entity_id}: {e}", file=sys.stderr)
        return None, None
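The claim-reading step can be exercised offline on a hand-written dictionary. The sketch below is an illustration, not the original code: first_claim_value and formula_and_mass are hypothetical names, the sample values are made up rather than taken from Wikidata, and unlike parse_wikidata it skips statements whose snak has no value instead of always reading the first one.

```python
def first_claim_value(claims, prop):
    """Return the first value stored under `prop`, or None if absent or valueless."""
    for statement in claims.get(prop, []):
        snak = statement.get("mainsnak", {})
        # Snaks of type "novalue" or "somevalue" carry no "datavalue" key
        if snak.get("snaktype") == "value" and "datavalue" in snak:
            return snak["datavalue"]["value"]
    return None


def formula_and_mass(claims):
    """Return (mass, formula) in the same order as the other parsers."""
    formula = first_claim_value(claims, "P274")
    mass = first_claim_value(claims, "P2067")
    if isinstance(mass, dict):
        # Quantities look like {"amount": "+242.09", "unit": ...}
        amount = mass.get("amount")
        mass = amount.lstrip("+") if amount is not None else None
    return mass, formula


# Hand-written sample shaped like the "claims" part of a response; values are made up.
sample_claims = {
    "P274": [{"mainsnak": {"snaktype": "value",
                           "datavalue": {"value": "C15H14O3", "type": "string"}}}],
    "P2067": [
        {"mainsnak": {"snaktype": "novalue"}},
        {"mainsnak": {"snaktype": "value",
                      "datavalue": {"value": {"amount": "+242.09", "unit": "1"},
                                    "type": "quantity"}}},
    ],
}
result = formula_and_mass(sample_claims)
```

Separating the network call from the dictionary handling also means a missing datavalue no longer has to be absorbed by a blanket except clause.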

This is where I realized that scraping content straight from the UI was probably a very poor choice; I return to this later in the post.

Dispatcher

With the parsers done, I still need a dispatcher that picks the right parser based on the domain of the site the URL points to. Its structure is very simple:

import sys
from urllib.parse import urlparse

# Dispatcher: Determine Which Parser to Use
def dispatch_parse(driver, url):
    hostname = urlparse(url).hostname or ''
    if 'knapsackfamily.com' in hostname:
        driver.get(url)
        return parse_knapsack(driver)
    if 'bidd.group' in hostname:
        driver.get(url)
        return parse_npass(driver)
    if 'wikidata.org' in hostname:
        # Wikidata goes through the API, so no browser page load is needed
        entity_id = url.rstrip('/').rsplit('/', 1)[-1]
        return parse_wikidata(entity_id)
    print(f"WARNING: no parser available for {url}", file=sys.stderr)
    return None, None
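The substring test on the hostname is permissive: a host such as wikidata.org.example.com would also match. A slightly stricter check could look like the sketch below, where host_matches and wikidata_entity_id are hypothetical helpers and Q42 is just an arbitrary ID used for illustration.

```python
from urllib.parse import urlparse


def host_matches(url, domain):
    """True if the URL's host is `domain` itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def wikidata_entity_id(url):
    """Take the last path segment of an entity URL, ignoring any query string."""
    path = urlparse(url).path
    return path.rstrip("/").rsplit("/", 1)[-1]


entity = wikidata_entity_id("https://www.wikidata.org/wiki/Q42?uselang=en")
```

Working on the parsed path rather than the raw URL keeps query strings and fragments out of the entity ID.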

CLI

I wanted the script to work as a general-purpose tool, so I wrote a simple command-line interface with four main arguments: input_csv, output_csv, --headless, and --pause.

import argparse

# Command-line Interface
# main() is the scraping entry point defined elsewhere in the script (not shown here)
if __name__ == '__main__':
    p = argparse.ArgumentParser(
        description='Scrape molecular weight and formula for pubchem chemicals.'
    )
    p.add_argument('input_csv', help='Input CSV (final_pubchem.csv)')
    p.add_argument('output_csv', help='Output CSV with Mw and Formula')
    p.add_argument('--headless', action='store_true', help='Run Chrome in headless mode')
    p.add_argument('--pause', type=float, default=DEFAULT_PAUSE,
                   help='Seconds to pause between requests')
    args = p.parse_args()
    main(args.input_csv, args.output_csv, args.pause, args.headless)

PUG REST API

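As mentioned above, while writing the Wikidata parser I realized that scraping is a rather inelegant way to "obtain a reference table of chemical constituents". I soon learned that PubChem offers an API, PUG REST, and discourages scraping its web pages in bulk. I immediately decided to make up for my mistake and rewrite the whole script.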

[Figure: PUG REST API structure]

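PUG REST queries are built around PubChem identifiers: SID for substances, CID for compounds, and AID for assays. To look up information about a given substance, compound, or assay, use a URL with the following structure: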

Part        Example
prolog      https://pubchem.ncbi.nlm.nih.gov/rest/pug
input       /compound/name/vioxx
operation   /property/InChI
output      /TXT
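The four-part structure can be made concrete with a small URL builder. This is a sketch only: build_pug_url is a hypothetical helper, not something PubChem provides, and it simply joins the parts in the order prolog, input, operation, output.

```python
from urllib.parse import quote

PUG_PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_url(domain, namespace, identifiers, operation, output):
    """Join prolog, input, operation and output into one PUG REST URL."""
    if isinstance(identifiers, str):
        identifiers = [identifiers]
    # Identifiers are comma-separated; quote() escapes spaces and slashes in names
    ids = ",".join(quote(str(i), safe="") for i in identifiers)
    return "/".join([PUG_PROLOG, domain, namespace, ids, operation, output])


url = build_pug_url("compound", "name", "vioxx", "property/InChI", "TXT")
```

For a batch of CIDs, build_pug_url("compound", "cid", [1, 2], "property/MolecularFormula,MolecularWeight", "JSON") would put the comma-separated IDs directly into the input part of the URL.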

It also supports a wide range of output formats:

Output Format   Description
XML             standard XML, for which a schema is available
JSON            JSON, JavaScript Object Notation
JSONP           JSONP, like JSON but wrapped in a callback function
ASNB            standard binary ASN.1, NCBI’s native format in many cases
ASNT            NCBI’s human-readable text flavor of ASN.1
SDF             chemical structure data
CSV             comma-separated values, spreadsheet compatible
PNG             standard PNG image data
TXT             plain text

For my purposes, this API lets me look up a whole list of CIDs quickly in batches and choose exactly which properties to retrieve. The request goes to a URL like this:

url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/{namespace}/property/{props}/JSON"

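Here namespace is the identifier type (cid or name), and props is one or more comma-separated property names supported by PUG REST. The comma-separated list of IDs itself is sent in the POST body rather than in the URL.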

With that, I can write a helper for the queries:

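import time
import requests

SESSION = requests.Session()

def pug_request(namespace, ids, props, retries=3):
    # namespace is "cid" or "name"; ids is a list of strings; props is a comma-separated string.
    # The default of 3 retries is an assumption; the original value is not shown.
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/{namespace}/property/{props}/JSON"
    # The ID list goes into the POST body, so long batches do not end up in the URL
    payload = {namespace: ",".join(ids), "property": props}
    backoff = 1.0

    for attempt in range(retries):
        try:
            r = SESSION.post(url, data=payload, timeout=30)
            r.raise_for_status()
            data = r.json()["PropertyTable"]["Properties"]
            key_field = "CID" if namespace == "cid" else "Name"
            return {str(item[key_field]): item for item in data}
        except Exception:
            # Give up after the last attempt; otherwise wait and double the delay
            if attempt == retries - 1:
                raise
            time.sleep(backoff)
            backoff *= 2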
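The retry loop in the helper is an instance of exponential backoff, and the pattern can be isolated so that it is testable without any network access. A minimal sketch, assuming a hypothetical retry_with_backoff wrapper whose sleep function is injectable:

```python
import time


def retry_with_backoff(func, retries=3, first_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait, double the delay, and try again."""
    delay = first_delay
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2


calls = []
delays = []


def flaky():
    # Fails twice, then succeeds, to simulate a temporarily unavailable server
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = retry_with_backoff(flaky, retries=4, sleep=delays.append)
```

Here the waits recorded in delays are 1.0 and then 2.0 seconds, after which the third call succeeds.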

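Given a compound table with a column of CIDs (such as the merged table produced by the script at the beginning), I can quickly fetch the molecular formulas and molecular masses in batches and add each new property back to the table as its own column: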

# Excerpt from the main routine: args, in_path, cache_path, load_cache and save_cache
# are defined elsewhere in the script
df = pd.read_csv(in_path)
if args.id_column not in df.columns:
    sys.exit(f"Column {args.id_column!r} not found in {in_path}")

# ensure CID keys are clean strings without '.0'
if args.cid:
    ids = df[args.id_column].astype(int).astype(str).tolist()
else:
    ids = df[args.id_column].astype(str).tolist()
namespace = "cid" if args.cid else "name"

# load cache if present
cache: Dict[str, Dict[str, str]] = {}
if cache_path:
    cache = load_cache(cache_path)

# figure out which IDs still need querying
to_query = [i for i in ids if i not in cache]
print(f"{len(ids)} total IDs  /  {len(to_query)} to query (cached {len(ids)-len(to_query)})")

# batch loop
for i in range(0, len(to_query), args.batch_size):
    batch = to_query[i : i + args.batch_size]
    print(f"Fetching batch {i // args.batch_size + 1}  (size {len(batch)}) ...", end="", flush=True)
    try:
        props_dict = pug_request(namespace, batch, args.props)
        cache.update(props_dict)
        print(" done.")
    except Exception as exc:
        print(f" failed ({exc}).")
    time.sleep(args.sleep)

# save cache
if cache_path:
    save_cache(cache, cache_path)
    # optionally remove cache file after run
    if args.auto_delete_cache and cache_path.exists():
        cache_path.unlink()
        print("Deleted cache", cache_path)

# add columns back to DataFrame
prop_names = args.props.split(",")
for prop in prop_names:
    key_series = (
        df[args.id_column].astype(int).astype(str)
        if args.cid
        else df[args.id_column].astype(str)
    )
    df[prop] = key_series.map(lambda x: cache.get(x, {}).get(prop, ""))
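Two small pieces of the batch logic are easy to get wrong: normalising CIDs that a CSV reader has turned into floats, and slicing the ID list into batches. The sketch below isolates both with the standard library only; clean_cid and chunked are hypothetical helpers and the IDs are made up.

```python
def clean_cid(value):
    """Normalise a CID read from CSV: '11.0' or 11.0 becomes '11'."""
    return str(int(float(value)))


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up IDs in the mixed forms a CSV column can produce
cids = [clean_cid(v) for v in ["11.0", 22, "33", 44.0, "55"]]
batches = list(chunked(cids, 2))
```

Keeping the cache keys and the lookup keys in the same normalised string form is what lets the final mapping step find every result.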

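Using the API is far faster than scraping: in the time the scraper takes to load a single page, the API can look up more than a hundred compounds, a huge gain in efficiency. I also no longer have to worry about bulk lookups triggering the database sites' rate limits, or about timeouts caused by slow-loading UIs on some of the databases.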

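After this I wanted to simplify the workflow further by automating the manual download of the compound list from PubChem as well, so that entering a Taxonomy ID would directly produce a compound table with all the required information. This ran into difficulties. PUG REST has no operation that links a Taxonomy ID directly to CIDs; the closest available route is Taxonomy ID -> AID -> CID. The documentation says: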

Assays and Bioactivities

The following operation returns a list of compounds involved in a given taxonomy. Valid output formats are XML, JSON(P), ASNT/B, and TXT.

https://pubchem.ncbi.nlm.nih.gov/rest/pug/taxonomy/taxid/2697049/aids/TXT

There is no operation available to directly retrieve the bioactivity data associated with a given taxonomy, as often the data volume is huge. However, one can first get the list of AIDs using the above link, and then aggregate the concise bioactivity data from each AID, e.g.:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1409578/concise/JSON

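In practice, I found that for many taxonomies, especially the natural Chinese medicinal herbs my project deals with, /taxonomy/taxid/xxxxxxx/aids/ simply returns 404, meaning the database has no Taxonomy ID -> AID record, so there is no way to go on from an AID list to CIDs.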
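A lookup that treats the 404 as "no assays recorded" rather than as a crash could be sketched as follows. This is an illustration under assumptions: fetch_taxonomy_aids is a hypothetical function, the opener argument exists only so the code can be tested without network access, and the TXT output is assumed to list one AID per line.

```python
import urllib.error
import urllib.request

PUG_PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def fetch_taxonomy_aids(taxid, opener=urllib.request.urlopen, timeout=30):
    """Return the AIDs linked to a taxonomy ID, or [] if PubChem answers 404."""
    url = f"{PUG_PROLOG}/taxonomy/taxid/{taxid}/aids/TXT"
    try:
        with opener(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return []  # no Taxonomy ID -> AID record for this taxid
        raise
    # Assumption: the TXT output lists one AID per line
    return [line.strip() for line in text.splitlines() if line.strip()]
```

An empty list then signals that the Taxonomy ID -> AID -> CID route is unavailable for that organism, and the manually downloaded compound list remains the fallback.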

Mass Spectrometry Software

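Only after spending a lot of time building parsers for various databases and trying to automate data collection with my bit of amateur programming knowledge did I start to wonder whether someone had already done something similar, and done it far better. A quick search turned up a list of mass spectrometry software. Mass spectrometry analysis already has a complete ecosystem of tools, many of which fit my needs very well; there are even more advanced tools that use machine learning to predict protein sequences directly from mass spectral peaks.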

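For example, a friend studying at the University of Washington told me about Crux, a toolkit developed at their university; many of its commands could do my job faster and better: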

tide-index Create an index of all peptides in a fasta file, for use in subsequent calls to tide-search.

tide-search Search a collection of spectra against a sequence database, provided either as a FASTA file or an index, returning a collection of peptide-spectrum matches (PSMs). This is a fast search engine, but it runs most quickly if provided with a peptide index built with tide-index.

comet Search a collection of spectra against a sequence database, returning a collection of PSMs. This search engine runs directly on a protein database in FASTA format.

percolator Re-rank and assign confidence estimates to a collection of PSMs using the Percolator algorithm. Optionally, also produce protein rankings using the Fido algorithm.

kojak Search a collection of spectra against a sequence database, finding cross-linked peptide matches.

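There are also many commercial programs built on similar algorithms that come with a full front end or even a web interface, such as InstaNovo: upload a spectrum file and you get the Transformer model's predictions directly.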

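Summary

Finding out after all that tinkering that I had been reinventing the wheel was honestly disappointing, but I don't regret writing the scripts myself: doing so taught me what happens behind the scenes, and it was my first real scraping project. My biggest mistake was jumping to "how" before thinking about "why", focusing on the technique of scraping rather than the goal of analyzing mass spectrometry data. In scientific computing, standing on the shoulders of giants is often the more effective approach.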