Panama Scorer™ · pre-send deliverability AI · model card v3.4.2 · trained on 11M B2B emails

Pre-send deliverability scoring AI. Latin mailbox tuned. Trained on 11 million B2B emails.

Panama Scorer™ is EMP's in-house pre-send deliverability AI, integrated as a synchronous evaluation hook in the campaign send pipeline before SMTP queue assignment. The Spanish-language companion page at /panama-scorer.html covers the Scorer from the perspective of email marketers and demand gen leads in Panama. This page is the technical model card for the reviewers who sign off during procurement and engineering review: ML engineers evaluating model architecture and validation methodology, MarTech architects evaluating integration patterns, and RevOps platform managers evaluating AI tooling against the Google February 2026 AI spam update, which filters high-AI-similarity content at 2.4 times the baseline rate.

Architecture: hybrid ensemble of LightGBM v4.5 gradient boosted trees plus a 4-layer feed-forward neural network for content embeddings. Training corpus: 11 million B2B email events across the EMP portfolio since 2022, covering 1,847 distinct sending domains, 47 verticals, and 10 Latin countries plus US/EU/UK overlap segments. Ground truth: 4-week-forward bounce rate per campaign as the supervised label, with temporal hold-out validation. Performance: precision 94.7 percent, recall 91.2 percent, F1 0.929, AUC-ROC 0.962. Decision latency: under 800 ms median. The Scorer is bundled with EMP marketing platform tiers and not sold standalone; the Enterprise tier supports hybrid pattern API integration with the client's own MTA at $0.0008 per scored campaign with a $1,500 monthly minimum. This page documents the model card honestly because the audience that buys deliverability AI in 2026 reads model cards before signing.

Training corpus: 11M B2B email events since 2022
Validation precision: 94.7% (v3.4.2, May 2026)
Decision latency: <800 ms median pre-send eval
Risk dimensions: 8, per-dimension AUC published
Model card v3.4.2 · architecture, training corpus, validation methodology, calibration

Model card. The four specs that ML evaluation actually checks.

The Panama Scorer model card is published per ML transparency standards (Mitchell et al. 2019, Arnold et al. 2019). It is the artifact that procurement and engineering review checks before approving AI tooling for production use. Mainstream platform AI rarely publishes an equivalent card because the model is positioned as marketing magic rather than as an engineering artifact; EMP publishes one because the audience for deliverability AI in 2026 reads model cards before signing. Full model card v3.4.2 documentation is available under NDA on Pro and Enterprise tiers.

SPEC 1 · ARCHITECTURE

Hybrid LightGBM v4.5 + 4-layer feed-forward neural network ensemble

Gradient boosted decision trees (LightGBM v4.5) handle tabular signals: sender_reputation_drift, list_freshness, mailbox_provider_distribution, time_of_send_pattern. Shallow neural network (4-layer feed-forward, 256-128-64-32 hidden dimensions, ReLU activation, dropout 0.3) handles content embeddings: subject_line_pattern, content_structure_pattern, link_density_pattern.

The final score is a weighted ensemble of the two model outputs, with weights tuned per release version. Current weights (v3.4.2): 0.62 LightGBM + 0.38 neural network. Inference latency is under 800 ms median; full evaluation takes under 2 seconds for complex campaigns with 10K+ recipient list segmentation.
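
The weighted blend described above can be sketched in a few lines. This is a minimal illustration, not EMP's implementation: it assumes both sub-models emit a risk score on the same 0-100 scale (the page does not state the scale of the intermediate outputs), and `blend_scores` and its parameter names are hypothetical.

```python
# Illustrative only: the weights are the v3.4.2 values quoted above;
# the shared 0-100 scale for sub-model outputs is an assumption.
W_GBDT = 0.62  # LightGBM share of the final score
W_NN = 0.38    # neural network share of the final score


def blend_scores(gbdt_score: float, nn_score: float) -> float:
    """Return the weighted ensemble of the two sub-model risk scores."""
    for s in (gbdt_score, nn_score):
        if not 0.0 <= s <= 100.0:
            raise ValueError("scores are assumed to lie in [0, 100]")
    return W_GBDT * gbdt_score + W_NN * nn_score


print(round(blend_scores(80.0, 60.0), 1))  # 72.4
```

Because the weights sum to 1.0, the blended score stays on the same scale as its inputs.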

SPEC 2 · TRAINING CORPUS

11M B2B email events, 1,847 sending domains, 47 verticals, 10 Latin countries

Corpus composition: 11 million B2B email events across the EMP portfolio since 2022 with the following stratification. Geographic: 73 percent Latin (Mexico, Colombia, Peru, Chile, Argentina, Panama, Ecuador, Costa Rica, Dominican Republic, Uruguay), 18 percent US, 7 percent EU, 2 percent UK plus other. Vertical: 47 distinct verticals with no single vertical exceeding 14 percent of corpus. Sending domain: 1,847 distinct sending domains (de-duplicated by organizational ownership, not by literal domain string).

Label distribution: 23 percent labeled high-risk (4-week-forward bounce >5%), 41 percent labeled medium-risk (2-5%), 36 percent labeled low-risk (<2%). Class imbalance addressed via stratified sampling during training and class-weighted loss function.
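
A minimal sketch of the labeling bands and one common class-weighting scheme follows. The band boundaries come from the text above (exactly 2 percent and exactly 5 percent are treated as medium). The inverse-frequency weighting is an assumption: the page says the loss is class-weighted but does not say how, and the function names are hypothetical.

```python
from collections import Counter


def risk_band(bounce_rate: float) -> str:
    """Map a 4-week-forward bounce rate (fraction, e.g. 0.03) to a label."""
    if bounce_rate > 0.05:
        return "high"
    if bounce_rate >= 0.02:
        return "medium"
    return "low"


def inverse_frequency_weights(labels: list[str]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Shares quoted above: 23% high, 41% medium, 36% low.
labels = ["high"] * 23 + ["medium"] * 41 + ["low"] * 36
weights = inverse_frequency_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

Under this scheme the rarest class (high-risk) receives the largest weight, which is the usual intent of a class-weighted loss.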

SPEC 3 · VALIDATION METHODOLOGY

Temporal hold-out, 5-fold time-series CV, 4-week-forward bounce rate ground truth

Temporal hold-out: training data ends 6 weeks before validation window starts to prevent label leakage from rapid sender reputation changes. Validation window slides forward weekly during model evaluation. Cross-validation: 5-fold time-series CV with no shuffling (preserves temporal ordering); each fold uses earlier data for training and later data for validation, mimicking production deployment conditions.
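
The split logic above can be sketched as an expanding-window generator. This is an illustration under stated assumptions: weeks are integer indices, each validation window is 4 weeks long, and folds step forward by one full window rather than weekly (the page gives the 6-week gap and the fold count but not these details). `time_series_folds` is a hypothetical name.

```python
def time_series_folds(n_weeks: int, n_folds: int = 5, gap_weeks: int = 6,
                      val_weeks: int = 4):
    """Yield (train, val) week-index ranges in chronological order.

    Training always ends gap_weeks before the validation window starts,
    and each later fold validates on a strictly later window.
    """
    first_val_start = n_weeks - n_folds * val_weeks
    if first_val_start - gap_weeks < 1:
        raise ValueError("not enough weeks for the requested folds and gap")
    for i in range(n_folds):
        val_start = first_val_start + i * val_weeks
        train = range(0, val_start - gap_weeks)          # earlier data only
        val = range(val_start, val_start + val_weeks)    # later data only
        yield train, val


for train, val in time_series_folds(52):
    print(f"train weeks 0-{train.stop - 1}, validate weeks {val.start}-{val.stop - 1}")
```

No shuffling occurs anywhere, so every validation week is later than every training week in the same fold.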

Ground truth: 4-week-forward bounce rate per campaign as the supervised label. The 4-week window captures both immediate bounce (hard bounce within 24 hours) and delayed bounce (soft bounce escalation, recipient reputation degradation, mailbox provider rule changes). Ground truth measurement excludes campaigns where fewer than 80 percent of recipients had an observable engagement signal in the window (insufficient signal for label reliability).

SPEC 4 · CALIBRATION + PERFORMANCE

Precision 94.7%, recall 91.2%, F1 0.929, AUC-ROC 0.962, Brier 0.041

Headline metrics: precision 94.7%, recall 91.2%, F1 0.929, AUC-ROC 0.962. Calibration: Brier score 0.041 (lower is better; <0.10 is well-calibrated for risk scoring use cases). The Brier score matters more than raw accuracy for risk scoring because the score itself is the output, not a binary classification.
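
For readers who want to check the headline numbers, the sketch below recomputes F1 from the quoted precision and recall and shows how a Brier score is calculated. The toy probabilities in the usage line are made up for illustration; only the precision and recall values come from this page.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


print(round(f1_score(0.947, 0.912), 3))  # 0.929

# Toy example (hypothetical data): two confident, correct predictions.
print(round(brier_score([0.9, 0.1], [1, 0]), 4))  # 0.01
```

The Brier score penalizes confident wrong predictions quadratically, which is why it is a natural fit when the probability itself is the product.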

Per-dimension AUC published in next section. Performance baseline for comparison: the v3.0 baseline (March 2025 first production release) had precision 89.3 percent and recall 86.7 percent; the v3.4.2 current release improvements come primarily from the v3.3 Google Feb 2026 AI spam update feature additions (AI-similarity scoring, Google-aligned personalization heuristics) and the v3.4 Latin mailbox provider feature engineering refinements. The model retraining cadence is approximately 8 weeks; intermediate hotfixes for emerging mailbox provider behavior changes are deployed without full retrain when feature engineering can address the change.

8 risk dimensions · per-dimension AUC published · combined ensemble final score

Eight risk dimensions. Each scored independently, then ensembled.

The Scorer evaluates 8 distinct risk dimensions per campaign, each with its own scoring pipeline and per-dimension validation metric. The dimensions are designed for orthogonality: the 8 dimensions capture different aspects of deliverability risk that can fail independently, and the ensemble final score weights them based on their predictive contribution to 4-week-forward bounce rate ground truth. Per-dimension AUC published below; combined ensemble AUC-ROC 0.962 as documented in Spec 4 above.

DIM 1 · CONTENT

Subject line pattern

Spam trigger words, all-caps proportion, excessive punctuation, length, language detection, AI-similarity score (added in v3.3 in response to the Google February 2026 update).

AUC 0.91
DIM 2 · CONTENT

Content structure pattern

Image-to-text ratio, link density, suspicious anchor text, HTML structure validity, rendering consistency across mail clients.

AUC 0.88
DIM 3 · CONTENT

Link density pattern

Number of links, link target reputation against blocklist DBs, redirect chains, tracking domain alignment, link-to-text ratio.

AUC 0.93
DIM 4 · REPUTATION

Sender reputation drift

Recent IP reputation trend (7-day moving), domain reputation trend, complaint rate trajectory, blacklist appearance signals.

AUC 0.96
DIM 5 · LIST

List freshness

Recipient last-engagement distribution, suppression list overlap detection, recipient acquisition recency, hard-bounce history.

AUC 0.94
DIM 6 · RECIPIENT

Engagement profile prediction

Predicted engagement based on historical patterns of similar recipients (cohort matching), recipient lifecycle stage estimation.

AUC 0.89
DIM 7 · MAILBOX

MBP distribution

Concentration in mailbox providers with stricter filtering. Latin MBP weighting: Movistar throttling, Claro DKIM, Tigo greylisting.

AUC 0.92
DIM 8 · TIMING

Time-of-send pattern

Send time matching historical low-engagement windows for recipient profile, Latin business calendar integration.

AUC 0.85
Per-dimension interpretation: sender reputation drift (DIM 4) has the highest AUC at 0.96 because reputation signal is the strongest single predictor of forward bounce rate; the dimension captures observable degradation before it manifests in user-facing bounce. Time-of-send pattern (DIM 8) has the lowest AUC at 0.85 because time-of-send effects are smaller than content and reputation effects; the dimension still earns its place in the ensemble because the residual signal is orthogonal to the other dimensions. The ensemble AUC of 0.962 sits just above the best single-dimension AUC (0.96 for DIM 4), which is consistent with the dimensions contributing complementary rather than purely redundant signal, although the margin over DIM 4 alone is small.
Validation performance bars · v3.4.2 vs v3.0 baseline March 2025

Validation performance over time. v3.0 baseline to v3.4.2 current.

Performance progression from v3.0 baseline (March 2025 first production release) to v3.4.2 current (May 2026). The bars below show four headline metrics calibrated against the 4-week-forward bounce rate ground truth on temporal hold-out validation set. Improvements between releases come primarily from the v3.3 Google February 2026 AI spam update feature additions and the v3.4 Latin mailbox provider feature engineering refinements. The model retraining cadence is approximately 8 weeks; intermediate hotfixes for emerging mailbox provider behavior changes are deployed without full retrain.

PRECISION v3.4.2
94.7%
PRECISION v3.0 BASELINE
89.3%
RECALL v3.4.2
91.2%
RECALL v3.0 BASELINE
86.7%
0.929 · F1 score v3.4.2
Harmonic mean of precision and recall; F1 0.929 is strong performance for a risk scoring use case
0.962 · AUC-ROC v3.4.2
Area under ROC curve on temporal hold-out validation; 0.962 is well above the 0.90 robust threshold
0.041 · Brier score
Calibration error; lower is better; under 0.10 is well-calibrated for risk scoring outputs
8w · retrain cadence
Approximately 8 weeks between full retrains; hotfix feature engineering deployed faster as needed
Honest comparison · Panama Scorer vs HubSpot Breeze, Mailchimp send-time AI, Klaviyo predictive, ZoomInfo Copilot

Where Panama Scorer wins. And where mainstream AI wins.

Honest comparison against the four mainstream platform AI tools that ML procurement evaluation typically considers alongside Panama Scorer. The comparison surfaces a category distinction that matters: Panama Scorer is pre-send deliverability risk scoring AI; the four mainstream tools are different AI categories (generative content, send-time optimization, post-send predictive analytics, prospecting research). The category distinction is not a comparative weakness for either side; the tools solve different problems and an integrated email marketing operation typically uses multiple AI tools across categories.

AI category capability | Panama Scorer™ | HubSpot Breeze AI | Mailchimp send-time | Klaviyo Predictive | ZoomInfo Copilot
Pre-send deliverability risk scoring | Native primary function | Not the function | Not the function | Not the function | Not the function
Generative content (subject, body) | Not the function | Native primary function ★ | Limited content suggest | Limited subject suggest | Outreach drafting
Send-time optimization | Time-of-send dim 8 | Yes | Native primary function ★ | Yes | Not the function
Post-send engagement prediction | Not the function | Limited | Limited | Native primary function ★ | Not the function
Prospecting research + intent | Not the function | Limited via CRM | Not the function | Limited | Native primary function ★
Latin mailbox provider tuning | Movistar/Claro/Tigo features | Generic US/EU | Generic US/EU | Generic US/EU | Not applicable
Model card published | v3.4.2 under NDA Pro+ | Limited public info | Limited public info | Limited public info | Limited public info
Decision latency | <800ms median | ~1-3 seconds | Asynchronous | Asynchronous | ~2-5 seconds
Google Feb 2026 AI spam update aware | v3.3 added AI-similarity | Not documented | Not documented | Not documented | Not applicable
Pricing model | Bundled with platform tier | Marketing Hub seat | Plan tier | Per-contact + add-ons | Per-seat + credits

How to read the table: the rows where Panama Scorer wins (highlighted in amber) are pre-send deliverability risk scoring and Latin mailbox tuning specifically. The category itself is structurally different from generative content (HubSpot Breeze ★ for that), send-time optimization (Mailchimp send-time AI ★ for that), post-send engagement prediction (Klaviyo Predictive ★ for that), or prospecting research (ZoomInfo Copilot ★ for that). An integrated B2B email marketing operation in 2026 typically uses multiple AI tools across these categories; Panama Scorer fills the pre-send deliverability risk scoring slot and pairs well with generative content AI, send-time optimization AI, and post-send predictive AI. The honest framing: do not buy Panama Scorer expecting generative content AI; do not buy HubSpot Breeze expecting pre-send deliverability scoring. The categories solve different problems.
3 response modes · hard block, soft warning + override, advisory · calibrated by tier

Three response modes. From hard block to advisory.

The Scorer's risk score output gets translated into a response mode that determines whether the campaign proceeds, gets flagged for review, or simply gets logged. Three response modes are supported across tiers, with mode selection calibrated by tier defaults plus tenant configuration. The default mode per tier is documented below; Pro and Enterprise tiers can configure mode per tenant or per campaign type.

MODE 1 · STARTER DEFAULT

Hard block above threshold

Campaigns scoring above threshold (default 65 on Starter) are blocked from SMTP queue assignment until either the sender modifies the campaign per recommended actions and re-scores, or the sender escalates to EMP support for manual review.

When this mode fits: teams without manual override authority who need protective gating against reputation-damaging campaigns. Catches approximately 7 percent of campaigns at hard-block in measured production traffic across the EMP portfolio at default threshold.

MODE 2 · PRO DEFAULT

Soft warning + manual override

Campaigns scoring above threshold trigger warning notification with category breakdown and recommended actions; sender can either modify and re-score or override with explicit risk acknowledgment and proceed to SMTP queue. Override is logged to audit trail with sender identity and acknowledgment timestamp.

When this mode fits: teams with deliverability operations expertise who want the Scorer signal as informed warning rather than gating control. Triggers approximately 18 percent of campaigns at soft-warning in Pro tier; reputation impact from overridden campaigns counts against future Scorer threshold calibration.

MODE 3 · ENTERPRISE OPTIONAL

Advisory mode (no intervention)

Scorer runs and produces score plus recommendations but does not block or warn; sender treats the score as advisory input and decides independently. The score is logged for analysis and feeds the dashboard for retrospective review.

When this mode fits: clients with mature in-house deliverability operations who want the Scorer signal as one input among many rather than gating. About 22 percent of Enterprise tier clients select advisory mode; 78 percent use Mode 2 soft warning with override.
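
The three modes above reduce to a small decision function. The sketch below is illustrative: the `Mode` enum, the `decide` function, and its return strings are hypothetical names, and "above threshold" is read as strictly greater than the threshold (default 65, as quoted for Starter).

```python
from enum import Enum


class Mode(Enum):
    HARD_BLOCK = "hard_block"      # Mode 1, Starter default
    SOFT_WARNING = "soft_warning"  # Mode 2, Pro default
    ADVISORY = "advisory"          # Mode 3, Enterprise optional


def decide(score: int, mode: Mode, threshold: int = 65,
           override_acknowledged: bool = False) -> str:
    """Return 'queue', 'blocked', or 'needs_override' for one campaign."""
    if mode is Mode.ADVISORY or score <= threshold:
        return "queue"  # advisory never intervenes; low scores pass
    if mode is Mode.HARD_BLOCK:
        return "blocked"  # must be modified and re-scored, or escalated
    # Soft warning: proceeds only with an explicit risk acknowledgment.
    return "queue" if override_acknowledged else "needs_override"


print(decide(71, Mode.SOFT_WARNING))  # needs_override
```

In a real pipeline the override branch would also write the sender identity and acknowledgment timestamp to the audit trail described under Mode 2.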

API response example · Enterprise tier hybrid pattern

The JSON snippet below shows a representative Scorer API response for a campaign evaluation on the Enterprise tier hybrid pattern (own MTA infrastructure calling Panama Scorer API for pre-send scoring). The response contains the headline score, the per-dimension breakdown with confidence intervals, the recommended risk-reduction actions with priority ranking, and the response mode determination based on tenant configuration.

// POST https://api.emailmarketingpanama.com/scorer/v3/evaluate
// Authorization: Bearer <tenant_api_token>
// Response time: 643ms in this example
{
  "score": 71,
  "risk_band": "medium-high",
  "response_mode": "soft_warning",
  "latency_ms": 643,
  "model_version": "v3.4.2",
  "dimensions": {
    "subject_pattern": { "score": 82, "ci_low": 76, "ci_high": 88 },
    "content_structure": { "score": 68, "ci_low": 61, "ci_high": 75 },
    "link_density": { "score": 54, "ci_low": 48, "ci_high": 60 },
    "sender_reputation_drift": { "score": 88, "ci_low": 84, "ci_high": 92 },
    "list_freshness": { "score": 62, "ci_low": 56, "ci_high": 68 },
    "engagement_profile": { "score": 71, "ci_low": 63, "ci_high": 79 },
    "mbp_distribution": { "score": 76, "ci_low": 70, "ci_high": 82 },
    "time_of_send": { "score": 59, "ci_low": 51, "ci_high": 67 }
  },
  "recommendations": [
    { "priority": 1, "action": "reduce_subject_caps_proportion", "impact": "high" },
    { "priority": 2, "action": "add_personalization_signal", "impact": "medium" },
    { "priority": 3, "action": "reschedule_outside_movistar_throttle_window", "impact": "medium" }
  ],
  "override_token": "ovr_a8f3d2_expires_2026-05-09T18:00:00Z"
}

API integration notes: requests are rate-limited per tenant at 100 requests per second by default, increasable on Enterprise tier. Response time SLA: 95th percentile under 2 seconds, 99th percentile under 5 seconds. Override tokens expire 30 minutes after issuance and are single-use; a token is required to bypass soft warning mode on Pro and Enterprise tiers. Confidence intervals (ci_low, ci_high) represent the model's uncertainty about each dimension score; a wider interval means lower confidence in that dimension's score and should be read accordingly. Recommendations are ranked by predicted impact on overall score; priority 1 actions typically reduce overall score by 8-15 points when implemented.
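
A client on the hybrid pattern would typically want to pull two things out of a response: the riskiest dimension and the dimension the model is least sure about. The sketch below does that on a trimmed copy of the example response above (three of the eight dimensions). The `summarize` helper is hypothetical and is not part of any published EMP client library.

```python
import json

# Trimmed copy of the example response above (three of eight dimensions).
RAW = """
{"score": 71, "response_mode": "soft_warning",
 "dimensions": {
   "subject_pattern": {"score": 82, "ci_low": 76, "ci_high": 88},
   "sender_reputation_drift": {"score": 88, "ci_low": 84, "ci_high": 92},
   "engagement_profile": {"score": 71, "ci_low": 63, "ci_high": 79}}}
"""


def summarize(payload: dict) -> dict:
    """Pick the highest-scoring dimension and the one with the widest CI."""
    dims = payload["dimensions"]
    riskiest = max(dims, key=lambda name: dims[name]["score"])
    least_certain = max(
        dims, key=lambda name: dims[name]["ci_high"] - dims[name]["ci_low"]
    )
    return {
        "score": payload["score"],
        "riskiest_dimension": riskiest,
        "least_certain_dimension": least_certain,
    }


summary = summarize(json.loads(RAW))
print(summary)
```

Here the riskiest dimension is `sender_reputation_drift` (score 88) and the least certain is `engagement_profile` (interval width 16).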
Three platform tiers · Scorer access calibrated by tier · bundled with platform subscription

Scorer access by tier. Bundled, not sold standalone.

Three tiers calibrated by Scorer access depth and configuration scope. All tiers carry Scorer pre-send evaluation; the differences are in threshold configurability, manual override capability, advisory mode availability, custom rule sets per industry, and API access for hybrid pattern. Standalone Scorer access without platform subscription is intentionally not offered; the bundling decision is structural for model performance and incentive alignment reasons documented in the FAQ.

Latin Starter

Scorer at default threshold, hard-block mode.

$99 / month
  • Panama Scorer pre-send on every campaign
  • Default threshold 65 (non-configurable)
  • Hard-block mode only (no override)
  • 8 dimensions evaluated standard
  • Recommendations returned with score
  • Dashboard view of score history
  • No API access (platform-native only)
  • No advisory mode
  • No custom rule sets

Latin Enterprise

Custom rule sets, API hybrid, advisory mode option.

$1,890 / month base
  • Panama Scorer with custom rule sets per industry
  • Custom dimension weights per business unit
  • All 3 response modes available
  • Advisory mode for in-house deliverability ops
  • Full API access (read + write + hybrid)
  • Hybrid pattern with own MTA support
  • Hybrid pricing $0.0008/scored campaign
  • Hybrid minimum $1,500/month
  • Custom feature engineering on request
  • Dedicated ML engineer support channel
Why Scorer is bundled, not standalone: three structural reasons. First, model performance depends on EMP infrastructure delivery telemetry (bounce rates, complaint rates, engagement metrics, mailbox provider response codes per delivery attempt) which the Scorer training pipeline ingests; standalone Scorer access without delivery telemetry feedback would degrade model performance over time. Second, operational integration matters for the pre-send catching value proposition; standalone API access works for hybrid pattern but synchronous pipeline integration is the primary use case. Third, pricing model alignment: standalone AI tools typically charge per-evaluation (Mailchimp Intuit Verification, Klaviyo predictive scoring) which creates incentive misalignment where the vendor benefits from more evaluations regardless of outcome. EMP bundles Scorer with platform tiers because subscription captures ongoing value better than per-evaluation pricing. The hybrid pattern (Enterprise tier API integration with own MTA) is the closest thing to standalone access; pricing is $0.0008 per scored campaign with $1,500 monthly minimum.
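
The hybrid pricing quoted above implies a break-even volume, which the sketch below works out. It assumes the $1,500 minimum acts as a floor on usage charges (rather than an additional fee) and excludes the Enterprise base subscription; the page does not spell out either point, and the function name is hypothetical.

```python
from decimal import Decimal

PER_CAMPAIGN = Decimal("0.0008")   # USD per scored campaign (from this page)
MONTHLY_MINIMUM = Decimal("1500")  # USD monthly minimum (from this page)


def hybrid_monthly_charge(scored_campaigns: int) -> Decimal:
    """Usage charge for one month, assuming the minimum is a floor."""
    return max(MONTHLY_MINIMUM, PER_CAMPAIGN * scored_campaigns)


# Volume at which metered usage equals the minimum.
break_even = MONTHLY_MINIMUM / PER_CAMPAIGN
print(break_even)  # 1875000
```

Below 1,875,000 scored campaigns per month the minimum applies; above it the metered rate takes over.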
Hard questions from ML engineering review

What ML engineers ask before approving the model in production.

"What's the model card publication standard? Is it actually inspectable or marketing fluff?"

Model card v3.4.2 documentation follows Mitchell et al. 2019 model card standard plus Arnold et al. 2019 factsheet additions. The card covers nine sections per the standard: model details (architecture, version, owner, license, intended use), intended use cases plus out-of-scope use cases, factors evaluated (geographic, vertical, sender size cohort), metrics (precision, recall, F1, AUC-ROC, Brier score plus per-dimension breakdown, plus disaggregated metrics across factors), evaluation data (corpus composition, label distribution, validation methodology, temporal hold-out specification), training data (corpus stratification, class imbalance handling, feature engineering approach), quantitative analyses (calibration, fairness across cohorts, performance under distribution shift), ethical considerations (Latin mailbox provider feature engineering disclosure, English-language bias acknowledgment, Spanish dialect coverage, training data privacy posture), caveats and recommendations (limitations, when to use advisory mode, when to override, when to escalate to manual review). The card is available under NDA on Pro and Enterprise tiers because the corpus stratification details could permit adversarial gaming; the architecture, validation methodology, and calibration metrics are publicly documented in this page for evaluation purposes. Independent ML reviewer access is available on Enterprise tier with mutual NDA.

"How do you handle model drift? Mailbox provider behavior changes constantly."

Model drift management operates across three timescales. Timescale 1, within-week behavioral shifts: dimension-level feature engineering hotfixes are deployed without a full retrain when new mailbox provider behavior is detected; the v3.3 release in March 2026 was a hotfix-style deployment of the AI-similarity feature in response to the Google February 2026 AI spam update, detected within 6 weeks of the update going live. Timescale 2, model retraining cadence: a full retrain runs approximately every 8 weeks against the latest corpus window, with validation metrics published per release version. The 8-week cadence balances retraining cost against drift accumulation; a faster cadence increases compute cost without proportionate benefit, and a slower cadence lets drift accumulate beyond the acceptable threshold. Timescale 3, architecture refresh: architecture changes (e.g., a switch from LightGBM to XGBoost, addition of an attention mechanism in the neural network, ensemble weight rebalancing) are deployed approximately annually based on systematic A/B testing of architecture variants; v3.0 to v3.4 represents one architecture generation, with v4.0 currently in development for Q3 2026 deployment. Model performance monitoring operates continuously: a drift detection alert fires when validation precision drops below 92 percent or AUC-ROC drops below 0.94 on a rolling 4-week window, and each alert triggers emergency retraining outside the 8-week cadence. Two such alerts have fired since the v3.0 launch (March 2025): once in October 2025 due to a Microsoft Outlook reputation system update, and once in February 2026 due to the Google AI spam update.
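
The alert rule described above can be sketched as a rolling-window check. The thresholds (precision below 92 percent, AUC-ROC below 0.94, 4-week window) come from the text; treating the window as a simple mean of weekly measurements is an assumption, and `drift_alert` is a hypothetical name.

```python
PRECISION_FLOOR = 0.92
AUC_FLOOR = 0.94


def drift_alert(weekly_precision: list[float], weekly_auc: list[float],
                window: int = 4) -> bool:
    """True when the rolling mean of either metric falls below its floor."""
    if len(weekly_precision) < window or len(weekly_auc) < window:
        return False  # not enough history to evaluate the window
    mean_precision = sum(weekly_precision[-window:]) / window
    mean_auc = sum(weekly_auc[-window:]) / window
    return mean_precision < PRECISION_FLOOR or mean_auc < AUC_FLOOR


# Hypothetical weekly measurements: precision degrades over the last month.
print(drift_alert([0.947, 0.93, 0.92, 0.91, 0.90],
                  [0.962, 0.96, 0.95, 0.95, 0.95]))  # True
```

Only the most recent `window` weeks are considered, so a single bad week does not fire the alert unless it drags the rolling mean under a floor.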

"What's the false positive cost? If the Scorer hard-blocks a legitimate campaign, what happens?"

False positive cost is real and managed at three layers. Layer 1, calibrated threshold: the default threshold of 65 is calibrated to keep the false positive rate under 4 percent of all campaigns scored at that threshold; the 7 percent total catch rate at hard-block on Starter covers both true positives (campaigns that would have damaged reputation) and false positives (campaigns that would have been fine if sent). The two figures use different denominators: the 4 percent ceiling is a share of all campaigns scored, while 94.7 percent precision means 5.3 percent of flagged campaigns are false positives. If validation precision carries over to production, a 7 percent catch rate implies roughly 0.4 percent of all scored campaigns are false positives, well under the ceiling. Layer 2, manual override on Pro and Enterprise: senders can override a soft warning with explicit risk acknowledgment when they judge that the Scorer has false-positived their campaign. The override is logged to the audit trail and counts against future threshold calibration; the audit trail surfaces senders with high override rates for review. Layer 3, escalation to manual review on Starter: senders without manual override authority can escalate to EMP support for manual review of hard-blocked campaigns. Manual review turnaround is typically 4 business hours, 24 business hours maximum. About 12 percent of hard-blocked campaigns escalate to manual review; 73 percent of escalated campaigns are released after manual review, and 27 percent are confirmed as risk and the sender modifies before send. The false positive cost is the friction in the escalation workflow plus the time-to-send delay; the cost is meaningful but smaller than the cost of reputation damage from missed true positives.
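
The rates quoted above can be related with a little arithmetic. The sketch assumes the validation precision (94.7 percent) carries over to production traffic at the Starter threshold, which this page does not claim explicitly; function names are hypothetical.

```python
def false_positive_share(catch_rate: float, precision: float) -> float:
    """Fraction of ALL scored campaigns that are flagged but actually fine."""
    return catch_rate * (1.0 - precision)


def released_after_review_share(escalation_rate: float, release_rate: float) -> float:
    """Fraction of HARD-BLOCKED campaigns released after manual review."""
    return escalation_rate * release_rate


print(round(false_positive_share(0.07, 0.947), 4))        # 0.0037
print(round(released_after_review_share(0.12, 0.73), 4))  # 0.0876
```

Read together: about 0.4 percent of all scored campaigns would be false positives under this assumption, and about 8.8 percent of hard-blocked campaigns end up released after review. A released campaign is not necessarily a false positive, since reviewers may accept residual risk.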

"How does the Scorer handle multi-language campaigns? Mixed Spanish and English content?"

Multi-language handling operates at three levels. Level 1 language detection: every campaign gets language classification before content embedding, with confidence threshold 0.85 required for single-language treatment. Below threshold (multi-language content detected), the content gets segmented into language-homogeneous chunks for separate scoring then aggregated. Level 2 per-language model weighting: the content embedding network was trained with weighted Spanish-English dual-language corpus (47 percent Spanish, 38 percent English, 15 percent multi-language). Spanish content scoring uses Latin Spanish vocabulary patterns (73 percent Latin weighting in Spanish corpus); English content scoring uses standard US English patterns. Level 3 multi-language penalty: campaigns with significant code-switching (Spanish-English mixing within single message) receive a small additional risk score because mixed-language content has higher historical bounce rates in B2B Latin segments (recipients often filter mixed-language as automated translation spam). The penalty is calibrated at 3-5 points addition to base score; senders running intentional bilingual campaigns can override the penalty on Pro and Enterprise tiers. About 14 percent of EMP campaigns are detected as multi-language; mostly Spanish-English mixing in Latin B2B campaigns where technical terminology defaults to English (cloud, SaaS, API, dashboard, etc.).
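
A minimal sketch of the three levels described above, under stated assumptions: per-chunk scores are aggregated by a length-weighted mean (the aggregation method is not documented here), the penalty uses the midpoint of the quoted 3-5 point range, and every campaign below the 0.85 detection-confidence threshold is treated as code-switched. `multilang_score` is a hypothetical name.

```python
def multilang_score(chunks: list[tuple[str, int, float]],
                    detection_confidence: float,
                    penalty: float = 4.0,
                    threshold: float = 0.85) -> float:
    """Score a campaign from (language, length_in_chars, chunk_score) segments."""
    total_len = sum(length for _, length, _ in chunks)
    if total_len == 0:
        raise ValueError("campaign has no content")
    weighted = sum(length * score for _, length, score in chunks) / total_len
    if detection_confidence >= threshold:
        return weighted  # treated as single-language, no penalty
    return min(100.0, weighted + penalty)  # mixed-language penalty, capped


# Hypothetical campaign: mostly Spanish with an English technical section.
chunks = [("es", 300, 60.0), ("en", 100, 80.0)]
print(multilang_score(chunks, detection_confidence=0.70))  # 69.0
```

A tenant-level override of the penalty, as offered on Pro and Enterprise, would amount to passing `penalty=0.0`.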

"What's the data privacy posture for the training corpus? Are tenant emails in your training data?"

Training corpus privacy posture documented per ML transparency standards. Three categories of data in the corpus. Category 1 metadata-only signals: bounce rate, complaint rate, engagement rate, mailbox provider response codes per delivery, IP reputation trajectory, list freshness statistics, time-of-send distribution. Metadata-only signals are aggregated across campaigns and do not contain message content or recipient PII. Category 2 anonymized content embeddings: content from campaigns sent through EMP infrastructure is processed through embedding generation (subject line embeddings, content structure embeddings, link density patterns) and only the embeddings are retained for training. Original content text is not stored in training corpus after embedding generation. Tenant identification is removed from embeddings before corpus merge. Category 3 explicitly-opted-in benchmark data: a subset of tenants (currently 47 tenants representing 14 percent of corpus volume) have explicitly opted into benchmark data sharing as part of their tenant agreement, contributing labeled examples for model improvement in exchange for early access to new model versions and per-tenant performance reporting. Opt-in is reversible at any time; data contributed before opt-out remains in corpus per the agreement terms but no new data is contributed post opt-out. Tenants that have not explicitly opted in contribute only Category 1 metadata signals (no content, no PII). The corpus privacy posture is documented in the model card under NDA and reviewable by tenant compliance counsel during procurement.

"Can we run the Scorer on our own infrastructure? Self-hosted deployment?"

Self-hosted deployment is not supported. The decision is structural rather than commercial. Reason 1, model size and infrastructure dependency: the Scorer model plus feature engineering pipeline plus inference infrastructure exceeds 8 GB compressed and requires GPU acceleration for sub-second latency at production volume; the deployment footprint is non-trivial for clients without dedicated ML serving infrastructure. Reason 2, model maintenance lifecycle: the 8-week retraining cadence plus hotfix feature engineering deployments require continuous model versioning, deployment automation, rollback capability, and validation metrics monitoring; self-hosted deployment would require client-side investment in MLOps infrastructure that EMP centralizes for operational economy. Reason 3, training corpus locality: the model trains on EMP portfolio delivery telemetry; self-hosted deployment would require either shipping the corpus (unwieldy, and it creates corpus sprawl) or training a new model on client-side data alone (which degrades performance compared to the centralized corpus that aggregates 11M events across the portfolio). The closest pattern to self-hosted is the Enterprise tier hybrid pattern: own MTA infrastructure plus the EMP-hosted Scorer API. The API call adds round-trip latency, typically 200-400 ms on top of Scorer compute time; against the 643 ms in the API example above, that is roughly 850-1,050 ms total. The hybrid pattern preserves the centralized model benefits. About 3 percent of Enterprise tier clients use the hybrid pattern; the rest use platform-native deployment for the lower latency.

FAQ · ML procurement-grade questions on every Scorer evaluation

ML evaluation FAQ.

What does Panama Scorer™ actually do that mainstream platform AI doesn't?
  • Pre-send deliverability risk scoring with Latin mailbox feature engineering
  • Different category from generative content (HubSpot Breeze)
  • Different category from send-time optimization (Mailchimp send-time AI)
  • Different category from post-send predictive (Klaviyo predictive analytics)
  • Different category from prospecting research (ZoomInfo Copilot)
  • Synchronous pre-send hook before SMTP queue assignment
  • Catches reputation-damaging campaigns before remediation cost
  • Latin MBP features: Movistar throttling, Claro DKIM strict, Tigo greylisting
What's the model card? Architecture, training, validation?
  • Architecture: LightGBM v4.5 + 4-layer feed-forward NN ensemble
  • Training corpus: 11M B2B email events since 2022
  • 1,847 sending domains, 47 verticals, 10 Latin countries
  • 73% Latin / 18% US / 7% EU / 2% UK+other
  • Validation: temporal hold-out + 5-fold time-series CV
  • 4-week-forward bounce rate as supervised label
  • Performance v3.4.2: precision 94.7%, recall 91.2%, F1 0.929
  • AUC-ROC 0.962, Brier score 0.041
How does Latin mailbox provider feature engineering work?
  • Throttling: Movistar Mexico 200-400 msg/IP/hr (vs Gmail 1K-3K)
  • DKIM: Claro strict alignment (exact match, not relaxed)
  • Greylisting: Tigo 4-12h deferral on first-touch from new IP
  • Content language: 73% Latin Spanish weighting (vs 40-60% Castilian)
  • Calendar: Latin business holidays (Día de Muertos, Carnaval, etc.)
  • Per-MBP throttling thresholds encoded as features
  • Recipient distribution >8% threshold weights MBP-specific risk
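
One way per-MBP throttling thresholds could be encoded as features is sketched below. The throttling ranges and the 8% recipient-share threshold come from the bullets above; the feature names, the dictionary layout, and the choice to compare the planned send rate against the lower throttle bound are all assumptions for illustration, not the Scorer's actual feature schema.

```python
# Throttling ranges quoted in the FAQ, in messages per IP per hour.
THROTTLE_MSG_PER_IP_HR = {
    "movistar_mx": (200, 400),
    "gmail": (1000, 3000),
}
SHARE_THRESHOLD = 0.08  # recipient-distribution threshold from the FAQ


def mbp_throttle_features(
    recipient_share: dict[str, float],
    planned_rate: dict[str, float],
) -> dict[str, float]:
    """Build hypothetical per-MBP features for one campaign.

    recipient_share: fraction of the recipient list at each provider.
    planned_rate: planned messages per IP per hour toward each provider.
    """
    features: dict[str, float] = {}
    for mbp, (low, _high) in THROTTLE_MSG_PER_IP_HR.items():
        share = recipient_share.get(mbp, 0.0)
        active = share > SHARE_THRESHOLD  # MBP-specific risk only above 8%
        rate = planned_rate.get(mbp, 0.0)
        features[f"{mbp}_share"] = share
        features[f"{mbp}_active"] = 1.0 if active else 0.0
        # Ratio of planned rate to the conservative (lower) throttle bound.
        features[f"{mbp}_rate_over_low"] = rate / low if active else 0.0
    return features


feats = mbp_throttle_features(
    recipient_share={"movistar_mx": 0.12, "gmail": 0.05},
    planned_rate={"movistar_mx": 600, "gmail": 600},
)
print(feats["movistar_mx_rate_over_low"])  # 3.0
print(feats["gmail_active"])               # 0.0
```

In this example the same 600 msg/IP/hr plan is three times the lower Movistar Mexico bound, while the Gmail features stay inactive because only 5% of recipients sit there.
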
What's the integration pattern?
  • Pattern 1 platform-native (default): synchronous hook in EMP pipeline, no client work
  • Pattern 2 API hybrid (Enterprise): own MTA + Panama Scorer API call
  • Endpoint: api.emailmarketingpanama.com/scorer/v3 (under construction)
  • Rate limit: 100 req/sec default, increasable on Enterprise
  • Pattern 3 batch evaluation: up to 100 variants per request, 5min response
  • Hybrid pricing: $0.0008/scored campaign, $1,500/mo minimum
  • Standard tiers: bundled, no per-campaign pricing
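
The hybrid pricing can be worked through as arithmetic. A minimal sketch, assuming the $1,500 monthly minimum acts as a floor on the usage charge rather than an additional base fee (the page does not spell this out); the function names are illustrative.

```python
from decimal import Decimal

PRICE_PER_SCORED_CAMPAIGN = Decimal("0.0008")  # USD, from the pricing bullet
MONTHLY_MINIMUM = Decimal("1500")              # USD, from the pricing bullet


def hybrid_monthly_cost(scored_campaigns: int) -> Decimal:
    """Monthly charge, assuming the minimum is a floor on usage."""
    return max(MONTHLY_MINIMUM, scored_campaigns * PRICE_PER_SCORED_CAMPAIGN)


def breakeven_volume() -> int:
    """Scored campaigns per month at which usage reaches the minimum."""
    return int(MONTHLY_MINIMUM / PRICE_PER_SCORED_CAMPAIGN)


print(hybrid_monthly_cost(1_000_000))  # 1500
print(hybrid_monthly_cost(5_000_000))  # 4000.0000
print(breakeven_volume())              # 1875000
```

Under that assumption, the per-campaign rate only starts to matter above 1,875,000 scored campaigns per month; below that, the minimum is the whole bill.
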
What happens when the Scorer flags a campaign?
  • Mode 1 hard block (Starter default): blocked until modified or escalated, ~7% catch rate
  • Mode 2 soft warning + override (Pro default): ~18% trigger rate, audit trail logs override
  • Mode 3 advisory (Enterprise option): score logged but no intervention
  • ~22% of Enterprise clients select advisory mode; 78% use Mode 2
  • Default threshold 65 calibrated to FP rate <4% of all campaigns
  • Overrides logged in the audit trail feed into future threshold calibration
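
The three enforcement modes can be sketched as a small decision function. This is an illustration under stated assumptions: it treats the score as a 0-100 risk score where a value at or above the default threshold of 65 flags the campaign (the page does not state the direction), and the `Mode` enum, `decide` function, and return labels are hypothetical names, not the EMP API.

```python
from enum import Enum


class Mode(Enum):
    HARD_BLOCK = 1    # Starter default
    SOFT_WARNING = 2  # Pro default
    ADVISORY = 3      # Enterprise option


DEFAULT_THRESHOLD = 65


def decide(score: float, mode: Mode, override: bool = False,
           threshold: float = DEFAULT_THRESHOLD) -> str:
    """Map a risk score and enforcement mode to a pipeline action."""
    flagged = score >= threshold  # assumption: higher score means higher risk
    if not flagged or mode is Mode.ADVISORY:
        return "send"  # advisory mode logs the score but never intervenes
    if mode is Mode.HARD_BLOCK:
        return "block"  # stays blocked until modified or escalated
    # Soft warning: the sender may override, and the override is audit-logged.
    return "send_override_logged" if override else "warn"


print(decide(72, Mode.HARD_BLOCK))                   # block
print(decide(72, Mode.SOFT_WARNING, override=True))  # send_override_logged
print(decide(72, Mode.ADVISORY))                     # send
print(decide(40, Mode.HARD_BLOCK))                   # send
```
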
How does Google Feb 2026 AI spam update affect the Scorer?
  • Update introduced 2.4× filter rate for high-AI-similarity content without personalization
  • Scorer v3.3 (March 2026) added AI-similarity sub-feature in content structure dimension
  • Personalization signal evaluation updated with Google-aligned heuristics
  • Validation precision: 92.3% pre-v3.3 → 94.7% v3.4.2
  • Catching problems pre-send avoids 30-90 days of post-send reputation recovery
  • Personalized LLM-generated content not penalized; raw LLM output is
Is Panama Scorer sold standalone or only bundled?
  • Bundled with EMP marketing platform tiers only
  • Reason 1: model performance depends on EMP infrastructure delivery telemetry
  • Reason 2: synchronous pipeline integration is the primary use case
  • Reason 3: per-evaluation pricing creates incentive misalignment
  • Hybrid pattern (Enterprise tier) is closest to standalone
  • Hybrid pricing: $0.0008/scored campaign + $1,500/mo minimum
Honest comparison vs running our own deliverability AI in-house?
  • In-house build cost: $1.2M-$2.5M over first 24 months
  • Senior ML engineer + deliverability SME: $250K-$420K annual
  • Training corpus volume: 2-4 years of production sending to accumulate 11M events
  • Ground truth measurement infrastructure required
  • Ongoing maintenance: $80K-$140K annually post-build
  • EMP managed: $99-$1,890 monthly amortized
  • ~2% of EMP discovery calls end with "build in-house" recommendation
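
The build-versus-managed figures above can be put on the same 24-month footing. A rough sketch using only the quoted ranges; it ignores the post-build maintenance line, and the variable names are illustrative.

```python
MONTHS = 24
IN_HOUSE_BUILD_USD = (1_200_000, 2_500_000)  # over the first 24 months
EMP_MONTHLY_USD = (99, 1_890)                # managed tiers, per month

# Total managed spend over the same 24-month window.
emp_total = tuple(monthly * MONTHS for monthly in EMP_MONTHLY_USD)
print(emp_total)  # (2376, 45360)

# Most conservative comparison: cheapest build vs most expensive managed tier.
min_ratio = IN_HOUSE_BUILD_USD[0] / emp_total[1]
print(round(min_ratio, 1))  # 26.5
```

Even on the comparison least favorable to the managed option, the quoted build cost is roughly 26 times the 24-month managed spend, which is consistent with how rarely the "build in-house" recommendation is made.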

Technical evaluation: 60 minutes. Model card, integration scope, fit verdict.

Technical evaluation format: a 60-minute video call covering your current email infrastructure stack (EMP platform-native vs hybrid with own MTA), production volume and target market profile (Latin coverage scope, vertical mix, language mix), AI tooling already in the stack (HubSpot Breeze, Mailchimp send-time, Klaviyo predictive, ZoomInfo Copilot, in-house ML), specific pre-send deliverability risk scoring requirements, and the technical questions blocking your Scorer evaluation. Output: an explicit fit verdict (platform-native deployment at the appropriate tier, hybrid pattern at Enterprise with API integration, advisory mode for in-house deliverability ops, or build in-house when that honestly fits better), model card v3.4.2 documentation delivered under NDA within 5 business days once fit is confirmed, and a sample API integration scope when the hybrid pattern is discussed. A mutual NDA is signed before sensitive technical detail is exchanged. About 64 percent of technical evaluation calls convert to a platform subscription, 28 percent are redirected to an alternative or the hybrid pattern, and 8 percent defer based on their infrastructure roadmap. The technical evaluation is genuinely diagnostic: mainstream AI tools get recommended on this call when the use case fits their strengths better than Panama Scorer's pre-send deliverability scoring category.

60-min technical evaluation · Mutual NDA · Model card v3.4.2 (5 days under NDA) · Sample API integration scope · Honest "build in-house" verdict when that fits