Backup & Monitoring
Data protection and observability are not optional — they are core operational requirements. This chapter covers everything from basic backup commands to a complete disaster-recovery plan and a production monitoring stack for your ScoopJoy deployment.
Bench backup: what gets backed up
Run with --with-files, the bench backup command creates three files:
| File | Contents | Format |
|---|---|---|
| *-database.sql.gz | Full MariaDB database dump | Compressed SQL |
| *-files.tar | Public files (uploads, images) | Tar archive |
| *-private-files.tar | Private files (attachments, PDFs) | Tar archive |

    # Basic backup
    bench --site scoopjoy.com backup

    # Backup with files (recommended)
    bench --site scoopjoy.com backup --with-files

    # Backup all sites
    bench backup-all-sites

    # Backup specific site with compression
    bench --site scoopjoy.com backup --with-files --compress

Backup files are stored in sites/<site-name>/private/backups/ by default.
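
A dump that cannot be decompressed is of no use at restore time. As a minimal sketch, the following finds the newest database dump in a backup directory and tests the gzip stream with `gzip -t`. The function name `verify_latest_backup` is hypothetical and not part of bench.

```shell
#!/bin/bash
# Sketch: verify the newest *-database.sql.gz in a backup directory.
# verify_latest_backup is a hypothetical helper, not a bench command.
verify_latest_backup() {
    local backup_dir=$1
    local latest
    # Newest dump by modification time; empty when none exist
    latest=$(ls -t "${backup_dir}"/*-database.sql.gz 2>/dev/null | head -n 1 || true)
    if [ -z "$latest" ]; then
        echo "no database backup found in ${backup_dir}"
        return 1
    fi
    # gzip -t reads the whole stream and checks its CRC without extracting
    if gzip -t "$latest" 2>/dev/null; then
        echo "ok: $(basename "$latest")"
    else
        echo "corrupt: $(basename "$latest")"
        return 2
    fi
}
```

It could be called as `verify_latest_backup ~/frappe-bench/sites/scoopjoy.com/private/backups` right after a backup run. A passing gzip check only shows the archive is intact, so a periodic test restore is still needed.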
Automated backups
Using bench (built-in)

When you run bench setup production, a crontab entry is automatically added
that runs backups every 6 hours:

    # View the auto-configured backup schedule
    crontab -l | grep backup

Custom cron schedule

    # Edit crontab for the frappe user
    crontab -e

    # Backup every 6 hours with files
    0 */6 * * * cd ~/frappe-bench && ~/frappe-bench/env/bin/bench --site scoopjoy.com backup --with-files >> ~/frappe-bench/logs/backup.log 2>&1

    # Backup all sites at 2 AM daily
    0 2 * * * cd ~/frappe-bench && ~/frappe-bench/env/bin/bench backup-all-sites --with-files >> ~/frappe-bench/logs/backup-all.log 2>&1

Offsite backup to S3
Built-in S3 backup configuration

Frappe has built-in support for S3 backups via the S3 Backup Settings DocType. Configure it through the web UI or via bench:

    bench --site scoopjoy.com set-config backup_s3_bucket "scoopjoy-erp-backups"
    bench --site scoopjoy.com set-config backup_s3_region "us-east-1"
    bench --site scoopjoy.com set-config backup_s3_access_key "AKIA..."
    bench --site scoopjoy.com set-config backup_s3_secret_key "your-secret-key"

Automated S3 backup script with rotation

For complete control over multiple sites, local rotation, and S3 lifecycle, use a
custom backup script. It backs up each site, uploads to S3 (including
site_config.json), then prunes both local and remote copies past their
retention windows.

    #!/bin/bash
    # Automated backup with S3 upload and local rotation

    set -euo pipefail

    # ─── Configuration ───────────────────────────────────
    BENCH_PATH="$HOME/frappe-bench"
    SITES=("scoopjoy.com" "outlet1.scoopjoy.com" "outlet2.scoopjoy.com")
    S3_BUCKET="s3://scoopjoy-erp-backups"
    S3_REGION="us-east-1"
    LOCAL_RETENTION_DAYS=7
    S3_RETENTION_DAYS=90
    LOG_FILE="${BENCH_PATH}/logs/s3-backup.log"
    DATE=$(date +%Y-%m-%d_%H-%M-%S)

    # ─── Functions ───────────────────────────────────────
    log() {
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
    }

    backup_site() {
        local site=$1
        log "Starting backup for ${site}"

        cd "$BENCH_PATH"

        # Create the backup. Test the command directly: under `set -e` a
        # separate `$?` check would never run after a failure.
        if ! ./env/bin/bench --site "$site" backup --with-files 2>> "$LOG_FILE"; then
            log "ERROR: Backup failed for ${site}"
            return 1
        fi

        # Find the latest backup files
        local backup_dir="${BENCH_PATH}/sites/${site}/private/backups"
        local latest_db latest_files latest_private
        latest_db=$(ls -t "${backup_dir}"/*-database.sql.gz 2>/dev/null | head -1 || true)
        # *-files.tar also matches *-private-files.tar, so filter those out
        latest_files=$(ls -t "${backup_dir}"/*-files.tar 2>/dev/null \
            | grep -v -- '-private-files\.tar$' | head -1 || true)
        latest_private=$(ls -t "${backup_dir}"/*-private-files.tar 2>/dev/null | head -1 || true)

        # Upload to S3
        local s3_path="${S3_BUCKET}/${site}/${DATE}"

        if [ -n "$latest_db" ]; then
            aws s3 cp "$latest_db" "${s3_path}/" --region "$S3_REGION" 2>> "$LOG_FILE"
            log "Uploaded database backup: $(basename "$latest_db")"
        fi

        if [ -n "$latest_files" ]; then
            aws s3 cp "$latest_files" "${s3_path}/" --region "$S3_REGION" 2>> "$LOG_FILE"
            log "Uploaded files backup: $(basename "$latest_files")"
        fi

        if [ -n "$latest_private" ]; then
            aws s3 cp "$latest_private" "${s3_path}/" --region "$S3_REGION" 2>> "$LOG_FILE"
            log "Uploaded private files backup: $(basename "$latest_private")"
        fi

        # Also back up site_config.json (holds the encryption key)
        aws s3 cp "${BENCH_PATH}/sites/${site}/site_config.json" \
            "${s3_path}/site_config.json" --region "$S3_REGION" 2>> "$LOG_FILE"

        log "Backup complete for ${site}"
    }

    cleanup_local() {
        log "Cleaning up local backups older than ${LOCAL_RETENTION_DAYS} days"
        for site in "${SITES[@]}"; do
            find "${BENCH_PATH}/sites/${site}/private/backups" \
                -type f -mtime "+${LOCAL_RETENTION_DAYS}" -delete 2>> "$LOG_FILE"
        done
    }

    cleanup_s3() {
        log "Cleaning up S3 backups older than ${S3_RETENTION_DAYS} days"
        local cutoff_date
        cutoff_date=$(date -d "-${S3_RETENTION_DAYS} days" +%Y-%m-%d)
        for site in "${SITES[@]}"; do
            aws s3 ls "${S3_BUCKET}/${site}/" --region "$S3_REGION" | \
            while read -r line; do
                # Only "PRE <dir>/" lines are backup directories; skipping the
                # rest keeps an empty name from reaching `aws s3 rm --recursive`
                [[ "$line" == *PRE* ]] || continue
                local dir_name dir_date
                dir_name=$(echo "$line" | awk '{print $2}')
                dir_date=$(echo "$dir_name" | cut -d'_' -f1 | tr -d '/')
                if [[ -n "$dir_date" && "$dir_date" < "$cutoff_date" ]]; then
                    aws s3 rm "${S3_BUCKET}/${site}/${dir_name}" \
                        --recursive --region "$S3_REGION" 2>> "$LOG_FILE"
                    log "Removed old S3 backup: ${site}/${dir_name}"
                fi
            done
        done
    }

    # ─── Main ────────────────────────────────────────────
    log "========================================="
    log "Starting backup run"

    for site in "${SITES[@]}"; do
        backup_site "$site"
    done

    cleanup_local
    cleanup_s3

    log "Backup run complete"
    log "========================================="

Save the script as ~/scripts/backup-to-s3.sh, then make it executable and schedule it:

    # Make executable and schedule
    chmod +x ~/scripts/backup-to-s3.sh

    # Run every 6 hours
    crontab -e
    # Add:
    0 */6 * * * ~/scripts/backup-to-s3.sh

Restoring from backup

    # Basic restore (database only)
    bench --site scoopjoy.com --force restore \
        /path/to/20260320_120000-scoopjoy-database.sql.gz

    # Full restore with all files
    bench --site scoopjoy.com --force restore \
        /path/to/20260320_120000-scoopjoy-database.sql.gz \
        --with-public-files /path/to/20260320_120000-scoopjoy-files.tar \
        --with-private-files /path/to/20260320_120000-scoopjoy-private-files.tar

    # After restore: run migrations and clear cache
    bench --site scoopjoy.com migrate
    bench --site scoopjoy.com clear-cache
    bench build
    bench restart

Restoring from S3 in Docker

In a Docker deployment the restore flow is the same,
just driven through docker compose exec into the backend container.

    # Download backup files from S3. The restore paths below assume ./restore
    # is visible inside the container as sites/restore.
    aws s3 cp s3://scoopjoy-erp-backups/scoopjoy.com/2026-03-20_02-00-00/ ./restore/ --recursive

    # Restore inside the Docker container
    docker compose exec backend bench --site scoopjoy.com --force restore \
        /home/frappe/frappe-bench/sites/restore/20260320-database.sql.gz \
        --with-public-files /home/frappe/frappe-bench/sites/restore/20260320-files.tar \
        --with-private-files /home/frappe/frappe-bench/sites/restore/20260320-private-files.tar

    docker compose exec backend bench --site scoopjoy.com migrate
    docker compose exec backend bench --site scoopjoy.com clear-cache

Disaster recovery strategy
RPO and RTO planning

| Metric | Target | Implementation |
|---|---|---|
| RPO (Recovery Point Objective) | 6 hours | Backup every 6 hours to S3 |
| RTO (Recovery Time Objective) | 2 hours | Documented runbook + tested restore |
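
An RPO target only holds if something checks it. The sketch below succeeds only when the backup directory contains a database dump newer than the RPO window. The helper name `backup_within_rpo` is an assumption for illustration, and the 6-hour default mirrors the table above.

```shell
#!/bin/bash
# Sketch: succeed only if a database dump newer than the RPO window exists.
# backup_within_rpo is a hypothetical helper name.
backup_within_rpo() {
    local backup_dir=$1
    local rpo_hours=${2:-6}
    local fresh
    # -mmin -N matches files modified less than N minutes ago
    fresh=$(find "$backup_dir" -maxdepth 1 -type f -name '*-database.sql.gz' \
        -mmin "-$((rpo_hours * 60))" | head -n 1)
    [ -n "$fresh" ]
}
```

A cron job could run `backup_within_rpo ~/frappe-bench/sites/scoopjoy.com/private/backups 6 || echo "RPO breached"` and route the message to whatever alerting channel is in use.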
Database replication

For near-zero RPO, configure MariaDB replication. Start with the primary:

    [mysqld]
    server-id = 1
    log_bin = /var/log/mysql/mariadb-bin
    binlog_format = ROW
    expire_logs_days = 7
    max_binlog_size = 100M
    binlog_do_db = _scoopjoy_com
    binlog_do_db = _outlet1_scoopjoy_com

Create the replication user:

    -- On primary: create the replication user
    CREATE USER 'replication'@'%' IDENTIFIED BY 'secure-repl-password';
    GRANT REPLICATION SLAVE ON *.* TO 'replication'@'%';
    FLUSH PRIVILEGES;
    SHOW MASTER STATUS;

Then point the replica at it:

    [mysqld]
    server-id = 2
    relay_log = /var/log/mysql/relay-bin
    read_only = 1

Start replication using the file and position reported by SHOW MASTER STATUS:

    -- On replica: configure replication
    CHANGE MASTER TO
      MASTER_HOST='primary-db.example.com',
      MASTER_USER='replication',
      MASTER_PASSWORD='secure-repl-password',
      MASTER_LOG_FILE='mariadb-bin.000001',
      MASTER_LOG_POS=XXX;
    START SLAVE;
    SHOW SLAVE STATUS\G

Complete disaster-recovery runbook

When something breaks at 2 AM, you want a checklist, not a brainstorm. Keep this runbook printed and tested: an untested backup is just a hope.

    SCENARIO 1: Application server failure (OS/hardware)
    ─────────────────────────────────────────────────────
    1. Provision new Ubuntu 24.04 server
    2. Run production-setup.sh (see Chapter 27)
    3. Download latest S3 backup:
       aws s3 cp s3://scoopjoy-erp-backups/scoopjoy.com/LATEST/ ./restore/ --recursive
    4. Restore each site (pass exact file names; a glob such as *files.tar
       also matches the private-files archive):
       bench --site scoopjoy.com --force restore ./restore/<timestamp>-<site>-database.sql.gz \
         --with-public-files ./restore/<timestamp>-<site>-files.tar \
         --with-private-files ./restore/<timestamp>-<site>-private-files.tar
    5. Copy encryption_key from backed-up site_config.json
    6. bench --site scoopjoy.com migrate
    7. bench restart
    8. Update DNS to point to new server IP
    9. Setup SSL: sudo -H bench setup lets-encrypt scoopjoy.com

    ESTIMATED TIME: 1-2 hours

    SCENARIO 2: Database corruption
    ──────────────────────────────
    1. Stop all bench processes: sudo supervisorctl stop all
    2. Download latest S3 database backup
    3. bench --site scoopjoy.com --force restore ./restore/*database.sql.gz
    4. bench --site scoopjoy.com migrate
    5. bench --site scoopjoy.com clear-cache
    6. sudo supervisorctl start all
    7. Verify data integrity through spot checks

    ESTIMATED TIME: 30-60 minutes

    SCENARIO 3: Accidental data deletion by user
    ─────────────────────────────────────────────
    1. Identify the most recent backup BEFORE the deletion occurred
    2. Create a temporary recovery site:
       bench new-site recovery.localhost --mariadb-root-password XXX
    3. Restore the backup to the recovery site
    4. Use bench console to extract the deleted records
    5. Re-create the records on the production site
    6. Drop the recovery site: bench drop-site recovery.localhost --force

    ESTIMATED TIME: 1-3 hours depending on data volume

    SCENARIO 4: Complete infrastructure failure
    ──────────────────────────────────────────
    1. Follow Scenario 1 steps on a new provider
    2. If using K8s: redeploy Helm chart to new cluster
    3. Restore database from S3
    4. Restore files from S3
    5. Update DNS records (TTL should be low: 300s)
    6. Verify all sites and services

    ESTIMATED TIME: 2-4 hours

Monitoring
bench doctor: quick health check

    bench doctor

This checks that background workers are running, the scheduler is active, and there are no stuck jobs.
Prometheus + Grafana

Set up Prometheus to scrape the Frappe stack components. Use node_exporter for
system metrics and dedicated exporters for MariaDB, Redis, and Nginx, plus
Frappe’s own ping endpoint.
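
    global:
      scrape_interval: 15s

    scrape_configs:
      # Node-level metrics (CPU, memory, disk)
      - job_name: 'node'
        static_configs:
          - targets: ['localhost:9100']
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            replacement: 'scoopjoy-erp-server'

      # MariaDB metrics
      - job_name: 'mariadb'
        static_configs:
          - targets: ['localhost:9104']

      # Redis metrics
      - job_name: 'redis'
        static_configs:
          - targets: ['localhost:9121']

      # Nginx metrics
      - job_name: 'nginx'
        static_configs:
          - targets: ['localhost:9113']

      # Frappe health endpoint. /api/method/ping returns JSON rather than
      # Prometheus metrics, so a direct scrape would report the target as down.
      # Probe it through blackbox_exporter instead (assumed here to listen on
      # localhost:9115 with the stock http_2xx module).
      - job_name: 'frappe'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets: ['http://localhost:8000/api/method/ping']
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: 'localhost:9115'

Install the exporters: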
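
    # Node exporter (system metrics)
    sudo apt install prometheus-node-exporter

    # MariaDB exporter. Create a monitoring user first:
    #   CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'password';
    #   GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
    # --network host lets the container reach services on the host's localhost
    # and exposes the exporter port directly. DATA_SOURCE_NAME is read by
    # mysqld-exporter 0.14.x and earlier; newer releases take credentials
    # from a config file instead.
    docker run -d --name mariadb-exporter \
        --network host \
        -e DATA_SOURCE_NAME="exporter:password@(localhost:3306)/" \
        prom/mysqld-exporter:v0.14.0

    # Redis exporter. --check-single-keys makes it export redis_key_size
    # for the RQ queue lists.
    docker run -d --name redis-exporter \
        --network host \
        oliver006/redis_exporter \
        --redis.addr=redis://localhost:6379 \
        --check-single-keys=rq:queue:default,rq:queue:short,rq:queue:long

Once metrics are flowing, point Grafana at Prometheus and graph queue depth with a PromQL query like this:

    redis_key_size{key=~"rq:queue:.*"}

Key metrics to monitor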

| Metric | Source | Alert Threshold |
|---|---|---|
| Response time (P95) | Nginx access log | > 5 seconds |
| Worker queue depth | Redis rq:queue:default | > 100 pending jobs |
| Active users | Frappe session store | Informational |
| Database size | MariaDB information_schema | > 80% disk usage |
| CPU usage | node_exporter | > 85% sustained |
| Memory usage | node_exporter | > 90% |
| Disk I/O wait | node_exporter | > 20% |
| SSL certificate expiry | blackbox_exporter | < 14 days |
| Failed login attempts | Frappe Activity Log | > 50/hour |
| Background job failures | Redis / worker logs | > 10/hour |
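
The P95 response time in the table can be estimated straight from the Nginx access log. The sketch below assumes a custom log_format whose last field is `$request_time` (the default Nginx format does not include it), and the helper name `p95_request_time` is hypothetical. It uses the nearest-rank method: sort the values and take the one at position ceil(0.95 × N).

```shell
#!/bin/bash
# Sketch: nearest-rank P95 of the last field in an access log.
# Assumes the log_format ends with $request_time; p95_request_time is a
# hypothetical helper name.
p95_request_time() {
    local log_file=$1
    # Extract the last field, sort numerically, pick the ceil(0.95 * N)-th value
    awk '{print $NF}' "$log_file" | LC_ALL=C sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            idx = int((NR * 95 + 99) / 100)   # integer ceil(0.95 * NR)
            print v[idx]
        }'
}
```

The result can then be compared against the 5-second threshold, for example with `awk -v p="$(p95_request_time access.log)" 'BEGIN { exit !(p > 5) }'`.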
Log management

Frappe generates several log files. Knowing where each one lives turns a 3 AM
incident into a quick tail:

    frappe-bench/
    ├── logs/
    │   ├── web.log               Gunicorn access/error logs
    │   ├── web.error.log         Gunicorn errors
    │   ├── worker-short.log      short queue worker logs
    │   ├── worker-default.log    default queue worker logs
    │   ├── worker-long.log       long queue worker logs
    │   ├── schedule.log          scheduler logs
    │   ├── socketio.log          Socket.IO server logs
    │   └── backup.log            backup operation logs
    └── sites/
        └── <site-name>/
            └── logs/
                └── frappe.log    application-level logs (from frappe.log())

Centralize logs with a log shipper such as Filebeat into ELK/OpenSearch:

    filebeat.inputs:
      - type: log
        enabled: true
        paths:
          - /home/frappe/frappe-bench/logs/*.log
        fields:
          service: frappe-bench
        multiline:
          pattern: '^\['
          negate: true
          match: after

      - type: log
        enabled: true
        paths:
          - /home/frappe/frappe-bench/sites/*/logs/frappe.log
        fields:
          service: frappe-app
        multiline:
          pattern: '^Traceback'
          negate: true
          match: after

    # A custom index name needs a matching template name and pattern
    setup.template.name: "frappe-logs"
    setup.template.pattern: "frappe-logs-*"

    output.elasticsearch:
      hosts: ["https://elasticsearch.example.com:9200"]
      index: "frappe-logs-%{+yyyy.MM.dd}"

Sentry integration for error tracking

The frappe-sentry community app sends application errors to Sentry. Install and
configure it:

    # Install the Sentry integration app
    bench get-app https://github.com/ParsimonyGit/frappe-sentry
    bench --site scoopjoy.com install-app frappe_sentry

Configure it in site_config.json:

    {
      "sentry_dsn": "https://abc123@o123456.ingest.sentry.io/789",
      "sentry_environment": "production",
      "sentry_release": "scoopjoy-erp@v16.10.10",
      "sentry_traces_sample_rate": 0.1
    }

Health-check script for cron-based monitoring

If you don’t want a full Prometheus stack, a lightweight cron script covers the essentials: it pings the web server, checks Supervisor, queue depth, disk, MariaDB, Redis, and SSL expiry, then alerts to Slack on failure.

    #!/bin/bash
    # Lightweight health check with Slack alerting

    set -euo pipefail

    SITE="scoopjoy.com"
    BENCH_PATH="$HOME/frappe-bench"
    SLACK_WEBHOOK="${SLACK_WEBHOOK_URL:-}"
    ALERT_FILE="/tmp/frappe-alert-state"

    alert() {
        local level=$1
        local message=$2
        echo "[$(date)] ${level}: ${message}"

        if [ -n "$SLACK_WEBHOOK" ]; then
            local color="danger"
            [ "$level" = "WARNING" ] && color="warning"
            [ "$level" = "OK" ] && color="good"

            curl -s -X POST "$SLACK_WEBHOOK" \
                -H 'Content-type: application/json' \
                -d "{\"attachments\":[{\"color\":\"${color}\",\"title\":\"ERPNext Health: ${level}\",\"text\":\"${message}\",\"footer\":\"${SITE}\"}]}" \
                > /dev/null 2>&1
        fi
    }

    # Check 1: Web server responding
    # curl already prints 000 when it cannot connect, so only swallow the exit code
    HTTP_CODE=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:8000/api/method/ping" || true)
    if [ "$HTTP_CODE" != "200" ]; then
        alert "CRITICAL" "Web server not responding (HTTP ${HTTP_CODE})"
    fi

    # Check 2: Supervisor processes running
    # supervisorctl exits non-zero when a process is down, hence the `|| true`
    STOPPED=$(sudo supervisorctl status | grep -c "STOPPED\|FATAL\|EXITED" || true)
    if [ "$STOPPED" -gt 0 ]; then
        STOPPED_NAMES=$(sudo supervisorctl status | grep "STOPPED\|FATAL\|EXITED" | awk '{print $1}' || true)
        alert "CRITICAL" "Supervisor processes down: ${STOPPED_NAMES}"
    fi

    # Check 3: Worker queue depth
    QUEUE_DEPTH=$(cd "$BENCH_PATH" && ./env/bin/python -c "import redis
    r = redis.Redis()
    depth = r.llen('rq:queue:default') + r.llen('rq:queue:short') + r.llen('rq:queue:long')
    print(depth)" 2>/dev/null || echo "0")

    if [ "$QUEUE_DEPTH" -gt 100 ]; then
        alert "WARNING" "Worker queue depth is high: ${QUEUE_DEPTH} jobs pending"
    fi

    # Check 4: Disk usage
    DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | tr -d '%')
    if [ "$DISK_USAGE" -gt 85 ]; then
        alert "WARNING" "Disk usage at ${DISK_USAGE}%"
    fi

    # Check 5: MariaDB running
    if ! systemctl is-active --quiet mariadb; then
        alert "CRITICAL" "MariaDB is not running"
    fi

    # Check 6: Redis running
    if ! systemctl is-active --quiet redis-server; then
        alert "CRITICAL" "Redis is not running"
    fi

    # Check 7: SSL certificate expiry
    # `|| true` keeps a failed TLS handshake from aborting the script under pipefail
    CERT_EXPIRY=$(echo | openssl s_client -servername "$SITE" -connect "$SITE":443 2>/dev/null | \
        openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2 || true)
    if [ -n "$CERT_EXPIRY" ]; then
        DAYS_LEFT=$(( ( $(date -d "$CERT_EXPIRY" +%s) - $(date +%s) ) / 86400 ))
        if [ "$DAYS_LEFT" -lt 14 ]; then
            alert "WARNING" "SSL certificate expires in ${DAYS_LEFT} days"
        fi
    fi

Save the script as ~/scripts/health-check.sh and schedule it:

    # Run every 5 minutes
    chmod +x ~/scripts/health-check.sh
    crontab -e
    # Add:
    */5 * * * * SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ" ~/scripts/health-check.sh >> ~/frappe-bench/logs/health-check.log 2>&1
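
Run every 5 minutes, the script repeats the same Slack message for as long as a fault persists, and the ALERT_FILE variable it defines is never used. One possible use for it, sketched below, is to remember which alerts have already been sent. The helpers `should_alert` and `clear_alert` are hypothetical names, and the one-key-per-line state format is an assumption.

```shell
#!/bin/bash
# Sketch: suppress repeat alerts by recording each alert key in a state file.
# should_alert and clear_alert are hypothetical helpers; ALERT_FILE matches
# the variable the health-check script defines.
ALERT_FILE="${ALERT_FILE:-/tmp/frappe-alert-state}"

# Return 0 the first time a key is seen, 1 while it is still recorded
should_alert() {
    local key=$1    # e.g. "CRITICAL:mariadb"
    if [ -f "$ALERT_FILE" ] && grep -Fxq -- "$key" "$ALERT_FILE"; then
        return 1
    fi
    echo "$key" >> "$ALERT_FILE"
    return 0
}

# Forget a key once the check passes again, so the next failure alerts
clear_alert() {
    local key=$1
    [ -f "$ALERT_FILE" ] || return 0
    # grep exits 1 when nothing is left to print, which is fine here
    grep -Fxv -- "$key" "$ALERT_FILE" > "${ALERT_FILE}.tmp" || true
    mv "${ALERT_FILE}.tmp" "$ALERT_FILE"
}
```

Inside the health check, a failing branch would then read `if should_alert "CRITICAL:mariadb"; then alert "CRITICAL" "MariaDB is not running"; fi`, with `clear_alert "CRITICAL:mariadb"` in the passing branch.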