PepFold

Methodology

PepFold is a 5-stage computational pipeline that takes SNP variant identifiers (rsIDs) as input and produces annotated pharmacogenomic reports with ranked peptide candidates and Fmoc-SPPS synthesis protocols.

Pipeline Architecture

rsIDs → [1. ClinVar] → [2. UniProt] → [3. Evo 2 / Rational Design] → [4. ESMFold] → [5. Scoring] → Report

Stage 1: Variant Annotation

  • Source: NCBI ClinVar via EUtils API (esearch.fcgi + esummary.fcgi)
  • Input: List of rsIDs (e.g., rs429358)
  • Output: Clinical significance (pathogenic, benign, drug-response, etc.), associated gene symbol, review status
  • Filtering: Variants classified as purely "benign" are excluded from downstream analysis
  • Note: 12 common pharmacogenomic variants are cached locally to reduce API latency
  • Rate limiting: NCBI allows at most 3 requests/second without an API key
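
The lookup and throttling described above could be sketched as follows. This is a minimal illustration and not PepFold's actual client: the helper names (esearch_url, MinIntervalLimiter, is_purely_benign) are hypothetical, and the reading of "purely benign" as an exact match on the significance field is an assumption.

```python
import time
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(rsid: str) -> str:
    """Build an esearch.fcgi URL that searches ClinVar for one rsID."""
    params = {"db": "clinvar", "term": rsid, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


class MinIntervalLimiter:
    """Sleep as needed so calls to wait() are at least 1/max_per_second apart."""

    def __init__(self, max_per_second: float = 3.0) -> None:
        self.min_interval = 1.0 / max_per_second
        self._last = None  # monotonic time of the previous call, if any

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def is_purely_benign(clinical_significance: str) -> bool:
    """Assumed reading of 'purely benign': the field is exactly 'benign'."""
    return clinical_significance.strip().lower() == "benign"
```

A caller would invoke limiter.wait() before each HTTP request and drop any variant for which is_purely_benign returns True.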

Stage 2: Target Mapping

  • Source: UniProt REST API
  • Input: Gene symbols from Stage 1
  • Output: Protein sequence, annotated binding sites (if available), protein name, function
  • Binding region estimation: If UniProt provides binding site annotations, the first annotated site is used; otherwise, a 30-residue window centered on the sequence midpoint is used as the candidate interaction region
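
The binding region rule can be illustrated with a short sketch. It assumes annotated sites arrive as 1-based, inclusive (start, end) pairs, which is how UniProt reports feature positions; the function name and the handling of sequences shorter than the window are assumptions.

```python
from typing import Optional, Sequence, Tuple


def binding_region(
    sequence: str,
    sites: Optional[Sequence[Tuple[int, int]]] = None,
    window: int = 30,
) -> str:
    """Return the candidate interaction region of a protein sequence."""
    if sites:
        # Use the first annotated site (1-based, inclusive coordinates).
        start, end = sites[0]
        return sequence[start - 1:end]
    # No annotation: take a window centered on the sequence midpoint.
    mid = len(sequence) // 2
    start = max(0, mid - window // 2)
    # Slicing clips at the sequence end, so short proteins are returned whole.
    return sequence[start:start + window]
```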

Stage 3: Peptide Generation

Primary: NVIDIA BioNeMo Evo 2

  • 40B-parameter genomic foundation model
  • Endpoint: health.api.nvidia.com/v1/biology/arc/evo2-40b/forward
  • Method: Run a forward pass on the binding region sequence, then sample from the top-K amino acids at each position of the output logits
  • Candidates per target: Configurable (default 3)
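
Top-K sampling from per-position logits might look like the sketch below. The shape of the logits (one row of 20 values per position) and the alphabet ordering are assumptions made for illustration; the actual response format of the Evo 2 endpoint is not described here, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

# Assumed column order of each logits row (hypothetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_k_sample(logits_row: Sequence[float], k: int, rng: random.Random) -> int:
    """Sample one index from the k highest logits, weighted by softmax."""
    top = sorted(range(len(logits_row)), key=lambda i: logits_row[i], reverse=True)[:k]
    peak = max(logits_row[i] for i in top)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(logits_row[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def sample_candidates(
    logits: Sequence[Sequence[float]],
    k: int = 3,
    n_candidates: int = 3,
    seed: int = 0,
) -> List[str]:
    """Draw n_candidates peptides, one residue per logits row."""
    rng = random.Random(seed)
    return [
        "".join(AMINO_ACIDS[top_k_sample(row, k, rng)] for row in logits)
        for _ in range(n_candidates)
    ]
```

Fixing the seed keeps the sampling repeatable for a given set of logits, which fits the pipeline's stated goal of determinism given identical API responses.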

Fallback: Deterministic Rational Design

When the Evo 2 API is unavailable or returns no usable logits, the pipeline falls back to a deterministic, rule-based design heuristic. This fallback is NOT machine learning.

  • Method: Charge-complementarity mapping (acidic→basic, hydrophobic pairs, aromatic pairs)
  • Determinism: The choice among candidate residues at each position is driven by a SHA-256 hash of the position index, so repeated runs select the same residues
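
A sketch of the two rules above, under stated assumptions: the residue class memberships, the reverse basic→acidic mapping, and the pass-through of unclassified residues are all illustrative guesses, and the function names are hypothetical. Only the acidic→basic mapping, the hydrophobic and aromatic pairing, and the SHA-256 hash of the position index come from the description.

```python
import hashlib

ACIDIC, BASIC = "DE", "KR"
HYDROPHOBIC, AROMATIC = "AVLIM", "FWY"  # assumed class memberships


def partner_options(residue: str) -> str:
    """Return the residues considered complementary to a target residue."""
    if residue in ACIDIC:
        return BASIC
    if residue in BASIC:
        return ACIDIC  # assumed mirror of the acidic -> basic rule
    if residue in HYDROPHOBIC:
        return HYDROPHOBIC
    if residue in AROMATIC:
        return AROMATIC
    return residue  # assumption: unclassified residues are kept as-is


def pick(options: str, position: int) -> str:
    """Choose one option using a SHA-256 hash of the position index."""
    digest = hashlib.sha256(str(position).encode("ascii")).digest()
    return options[int.from_bytes(digest[:8], "big") % len(options)]


def design_peptide(region: str) -> str:
    """Map each target residue to a complementary peptide residue."""
    return "".join(pick(partner_options(r), i) for i, r in enumerate(region))
```

Because the hash depends only on the position, calling design_peptide twice on the same region always returns the same peptide.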

Note: The generation method (evo2 or rational_design) is recorded per candidate.
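
One way to record the generation method per candidate is sketched below. The Candidate type and the two callables are hypothetical stand-ins for the Evo 2 client and the rational design routine; treating any exception or an empty result as "unavailable" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

Generator = Callable[[str, int], List[str]]


@dataclass
class Candidate:
    sequence: str
    method: str  # "evo2" or "rational_design"


def generate_candidates(
    region: str, evo2: Generator, fallback: Generator, n: int = 3
) -> List[Candidate]:
    """Try the primary generator; fall back if it fails or returns nothing."""
    try:
        sequences = evo2(region, n)
    except Exception:
        sequences = []  # treat any API failure as "unavailable"
    if sequences:
        return [Candidate(s, "evo2") for s in sequences]
    return [Candidate(s, "rational_design") for s in fallback(region, n)]
```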

Stage 4: Structure Prediction

  • Source: ESMFold API (Meta, api.esmatlas.com/foldSequence/v1/pdb/)
  • Input: Peptide amino acid sequence (plain text)
  • Output: PDB-format 3D structure, per-residue pLDDT confidence scores
  • Rendering: Interactive py3Dmol viewers in HTML reports, colored by pLDDT
  • Failure mode: If ESMFold is unavailable, candidate receives pLDDT=0.0 and is flagged as "structure not predicted"
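
Per-residue confidence can be read back from the returned PDB text, since ESMFold writes pLDDT into the B-factor column. The sketch below assumes standard fixed-width PDB columns and takes one value per residue from the CA atom; the function name and the returned tuple layout are hypothetical, and the value scale is whatever the API emits.

```python
from statistics import mean
from typing import List, Optional, Tuple


def plddt_from_pdb(pdb_text: Optional[str]) -> Tuple[List[float], float, bool]:
    """Return (per-residue pLDDT, mean pLDDT, structure_predicted)."""
    if not pdb_text:
        # Failure mode: no structure, so pLDDT is 0.0 and the flag is False.
        return [], 0.0, False
    per_residue = [
        float(line[60:66])  # B-factor field, columns 61-66
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not per_residue:
        return [], 0.0, False
    return per_residue, mean(per_residue), True
```

A False flag corresponds to the "structure not predicted" label in the report.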

Stage 5: Heuristic Scoring

Scoring is NOT machine learning. Each candidate receives deterministic, rule-based scores on 4 dimensions, which are combined using fixed weights.

Dimension | Weight | Method
Binding affinity | 35% | Charge complementarity + hydrophobic matching + size compatibility between peptide and target binding region
Structural confidence | 30% | pLDDT score from ESMFold (0-100 normalized to 0-1) + peptide length optimality (8-25 aa = 1.0) + amino acid diversity
Clinical relevance | 20% | Keyword matching on ClinVar clinical_significance field. Pathogenic=1.0, Drug response=0.9, Risk factor=0.7, Uncertain=0.3, Benign=0.1. Review status weighted 30%, significance 70%.
Novelty | 15% | Pairwise sequence similarity against all other candidates. Score = 1 - average_similarity.
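
The clinical relevance and novelty rows can be sketched as below. Several details are assumptions: the review status is passed in as an already-normalized 0-1 score (its mapping is not specified), the highest matching keyword wins, unmatched text scores 0.0, and difflib's SequenceMatcher ratio stands in for the unspecified similarity measure.

```python
from difflib import SequenceMatcher
from typing import List

SIGNIFICANCE_SCORES = {
    "pathogenic": 1.0,
    "drug response": 0.9,
    "risk factor": 0.7,
    "uncertain": 0.3,
    "benign": 0.1,
}


def significance_score(clinical_significance: str) -> float:
    """Keyword match on the significance text; highest matching score wins."""
    text = clinical_significance.lower()
    hits = [score for keyword, score in SIGNIFICANCE_SCORES.items() if keyword in text]
    return max(hits, default=0.0)


def clinical_relevance(clinical_significance: str, review_score: float) -> float:
    """Combine significance (70%) with a 0-1 review status score (30%)."""
    return 0.7 * significance_score(clinical_significance) + 0.3 * review_score


def novelty_scores(sequences: List[str]) -> List[float]:
    """Score each sequence as 1 - mean similarity to all other sequences."""
    scores = []
    for i, seq in enumerate(sequences):
        others = [s for j, s in enumerate(sequences) if j != i]
        if not others:
            scores.append(1.0)  # assumption: a lone candidate is fully novel
            continue
        avg = sum(SequenceMatcher(None, seq, o).ratio() for o in others) / len(others)
        scores.append(1.0 - avg)
    return scores
```

Plain substring matching is coarse: a string such as "Likely benign" matches the "benign" keyword, which is one reason the Limitations section flags keyword matching.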

Overall = 0.35 × binding + 0.30 × structural + 0.20 × clinical + 0.15 × novelty

Candidates are ranked by overall score, and the top N per target are selected.
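
As a worked illustration of the weighted sum and the per-target ranking, consider the sketch below. The weights come from the formula above; the candidate dictionary layout (keys "target" and "scores") is a hypothetical structure chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List

WEIGHTS = {"binding": 0.35, "structural": 0.30, "clinical": 0.20, "novelty": 0.15}


def overall_score(scores: Dict[str, float]) -> float:
    """Weighted sum of the four dimension scores (each in 0-1)."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def top_n_per_target(candidates: List[dict], n: int) -> Dict[str, List[dict]]:
    """Group candidates by target and keep the n highest-scoring per group."""
    by_target = defaultdict(list)
    for cand in candidates:
        by_target[cand["target"]].append(cand)
    return {
        target: sorted(group, key=lambda c: overall_score(c["scores"]), reverse=True)[:n]
        for target, group in by_target.items()
    }
```

For example, scores of binding 0.8, structural 0.9, clinical 1.0 and novelty 0.5 give 0.28 + 0.27 + 0.20 + 0.075 = 0.825.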

Synthesis Protocol Generation

Method: Rule-based Fmoc-SPPS (solid-phase peptide synthesis) template.

  • Resin selection: Wang resin (C-terminal acid) by default; Rink Amide (C-terminal amide) when the C-terminal residue is K or R
  • Coupling reagents: HBTU/DIPEA standard; HATU for difficult residues (H, N, Q, R, W)
  • Coupling times: 30-60 min based on position and residue difficulty
  • Cleavage cocktail: Selected based on sequence composition:
    • Cys/Met present → Reagent K (TFA/phenol/water/thioanisole/EDT)
    • Trp/Arg present → TFA/TIS/water/EDT
    • Otherwise → TFA/TIS/water
  • Purification: RP-HPLC with C18 or C4 column based on hydrophobicity
  • QC: ESI-MS, RP-HPLC, amino acid analysis, LAL endotoxin test
  • Cost estimate: $15/residue base + surcharges for difficult residues and purity

IMPORTANT: These are TEMPLATES requiring laboratory optimization, not validated protocols.
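
The selection rules listed above could be encoded roughly as follows. This is a sketch of the template logic only: the function names are hypothetical, and evaluating the cleavage rules in the order listed (first match wins) is an assumption.

```python
DIFFICULT_RESIDUES = set("HNQRW")


def select_resin(sequence: str) -> str:
    """Rink Amide when the C-terminal residue is K or R, otherwise Wang."""
    return "Rink Amide" if sequence[-1] in "KR" else "Wang"


def coupling_reagent(residue: str) -> str:
    """HATU for residues flagged as difficult, HBTU/DIPEA otherwise."""
    return "HATU" if residue in DIFFICULT_RESIDUES else "HBTU/DIPEA"


def cleavage_cocktail(sequence: str) -> str:
    """Pick a cleavage cocktail from the sequence composition."""
    residues = set(sequence)
    if residues & set("CM"):
        return "Reagent K"
    if residues & set("WR"):
        return "TFA/TIS/water/EDT"
    return "TFA/TIS/water"
```

For a peptide containing both Cys and Trp, this ordering selects Reagent K.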

Report Format

  • HTML: Interactive (py3Dmol viewers, Plotly charts), self-contained
  • PDF: Generated via Playwright headless browser
  • Sections: Variant annotations table, target mapping, ranked candidates with scores, 3D viewers, synthesis protocol per candidate

Data Sources & External Dependencies

Dependency | Purpose | Rate Limits / Notes | Failure Handling
NCBI EUtils | ClinVar variant annotation | Max 3 req/sec without API key | Local cache fallback for 12 common variants
UniProt API | Protein sequence mapping | Standard fair use | Pipeline aborts target if mapping fails
BioNeMo Evo 2 | Peptide generation | NVIDIA API limits | Fallback to rational design heuristic
ESMFold API | Structure prediction | Meta API limits | Structure flagged as not predicted; pLDDT=0.0

Limitations

  • No molecular docking or free energy calculations
  • Binding scores are heuristic, not physics-based
  • ESMFold predictions are models, not experimental structures
  • The pipeline may fall back from Evo 2 to rational design without interrupting the run; the method actually used is flagged in the report
  • Synthesis protocols require wet-lab optimization
  • Not experimentally validated
  • Clinical relevance scoring uses keyword matching, not curated pharmacogenomic databases like PharmGKB

Reproducibility

  • Pipeline is deterministic given the same external API responses
  • Rational design fallback uses SHA-256 seeded by position for reproducibility
  • Job IDs used for report retrieval are 128-bit cryptographically random values
  • All external data sources are timestamped in reports
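
A 128-bit job ID and a per-source timestamp could be produced as sketched below. Using Python's secrets module and UTC ISO 8601 timestamps is an assumption about the implementation, and the function names and record layout are hypothetical.

```python
import secrets
from datetime import datetime, timezone
from typing import Dict


def new_job_id() -> str:
    """Return 16 random bytes (128 bits) as a 32-character hex string."""
    return secrets.token_hex(16)


def source_stamp(source: str) -> Dict[str, str]:
    """Record when an external data source was queried, in UTC."""
    return {
        "source": source,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
```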

Citation

If you use PepFold in your research, please cite:

PepFold: Pharmacogenomic Variant-to-Synthesis Pipeline. Olam Création, 2026. https://pepfold.com