Fix
This commit is contained in:
307
docs/database-maintenance.md
Normal file
307
docs/database-maintenance.md
Normal file
@@ -0,0 +1,307 @@
|
||||
# Database Maintenance
|
||||
|
||||
## Overview
|
||||
|
||||
HLL Vietnam keeps database cleanup at the application level.
|
||||
|
||||
The current maintenance scope is intentionally narrow:
|
||||
|
||||
- old `server_snapshots`;
|
||||
- old non-critical `rcon_admin_log_events`;
|
||||
- old critical `rcon_admin_log_events` only after retention and protected-match checks;
|
||||
- old non-protected `rcon_materialized_matches`;
|
||||
- dependent `rcon_match_player_stats` for deleted matches.
|
||||
|
||||
The first maintenance pass does not routinely delete:
|
||||
|
||||
- `displayed_historical_snapshots`;
|
||||
- file-based snapshots under `backend/data/snapshots/`;
|
||||
- public-scoreboard `historical_*` fallback tables;
|
||||
- `player_event_raw_ledger` and its worker metadata;
|
||||
- Elo/MMR tables;
|
||||
- Comunidad Hispana #03 data reactivation or targets.
|
||||
|
||||
## Why Application-Level And Not `pg_cron`
|
||||
|
||||
Cleanup is versioned in backend code instead of delegated to `pg_cron`, host cron, or a separate container because the retention logic depends on product rules:
|
||||
|
||||
- keep the latest 100 closed materialized matches;
|
||||
- keep the current month;
|
||||
- keep the previous month during the first 7 days of a new month;
|
||||
- keep the current week;
|
||||
- keep the previous week when weekly fallback may still need it;
|
||||
- keep child stats for protected matches;
|
||||
- avoid breaking current/live pages that still read recent AdminLog data.
|
||||
|
||||
Those rules belong with the application’s read and write model, not inside database-only scheduling.
|
||||
|
||||
## Scheduled Cleanup Inside `historical-runner`
|
||||
|
||||
Database maintenance is scheduled inside `app.historical_runner`.
|
||||
|
||||
Behavior:
|
||||
|
||||
- disabled by default;
|
||||
- no extra Docker service is added for maintenance;
|
||||
- the runner checks whether maintenance is due;
|
||||
- when enabled and due, the runner invokes `python -m app.database_maintenance cleanup --apply` behavior through the shared Python function;
|
||||
- failures are logged and do not crash the historical runner loop;
|
||||
- cleanup runs under the same writer-lock coordination used by the historical writer flows.
|
||||
|
||||
Relevant structured log events:
|
||||
|
||||
- `database-maintenance-scheduler-skipped-disabled`
|
||||
- `database-maintenance-scheduler-skipped-not-due`
|
||||
- `database-maintenance-scheduler-started`
|
||||
- `database-maintenance-scheduler-completed`
|
||||
- `database-maintenance-scheduler-failed`
|
||||
|
||||
## Environment Variables
|
||||
|
||||
Required maintenance-related variables:
|
||||
|
||||
```text
|
||||
HLL_DB_MAINTENANCE_ENABLED=false
|
||||
HLL_DB_MAINTENANCE_INTERVAL_SECONDS=43200
|
||||
HLL_RECENT_MATCHES_KEEP=100
|
||||
HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS=30
|
||||
HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS=90
|
||||
HLL_SERVER_SNAPSHOT_RETENTION_DAYS=14
|
||||
HLL_DB_MAINTENANCE_BATCH_SIZE=5000
|
||||
```
|
||||
|
||||
Meaning:
|
||||
|
||||
- `HLL_DB_MAINTENANCE_ENABLED`
|
||||
Enables scheduled apply mode inside `historical-runner`.
|
||||
- `HLL_DB_MAINTENANCE_INTERVAL_SECONDS`
|
||||
Default scheduler interval. `43200` means every 12 hours.
|
||||
- `HLL_RECENT_MATCHES_KEEP`
|
||||
Number of latest closed materialized matches that must always be protected.
|
||||
- `HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS`
|
||||
Retention for non-critical AdminLog events such as chat/connect/disconnect.
|
||||
- `HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS`
|
||||
Retention for critical AdminLog events such as `kill`, `match_start`, `match_end`.
|
||||
- `HLL_SERVER_SNAPSHOT_RETENTION_DAYS`
|
||||
Retention for live server snapshots.
|
||||
- `HLL_DB_MAINTENANCE_BATCH_SIZE`
|
||||
Delete batch size for apply mode.
|
||||
|
||||
## Protected Data
|
||||
|
||||
The cleanup command protects:
|
||||
|
||||
- latest 100 closed materialized matches by default;
|
||||
- current month materialized matches;
|
||||
- previous month materialized matches when the current day is `1` through `7`;
|
||||
- current week materialized matches;
|
||||
- previous week materialized matches when weekly fallback may still need them;
|
||||
- `rcon_match_player_stats` belonging to protected matches;
|
||||
- current/live AdminLog data required for visible current-match surfaces;
|
||||
- `displayed_historical_snapshots`;
|
||||
- file snapshots in `backend/data/snapshots/`.
|
||||
|
||||
If a match timestamp cannot be interpreted safely, that match is skipped and protected instead of deleted.
|
||||
|
||||
## Deleted Data
|
||||
|
||||
Apply mode is currently allowed to delete:
|
||||
|
||||
- `server_snapshots` older than retention;
|
||||
- non-critical `rcon_admin_log_events` older than retention;
|
||||
- critical `rcon_admin_log_events` older than retention only when they are not required by protected materialized match ranges;
|
||||
- non-protected `rcon_materialized_matches`;
|
||||
- dependent `rcon_match_player_stats` for deleted matches.
|
||||
|
||||
Current critical AdminLog event types:
|
||||
|
||||
- `kill`
|
||||
- `match_start`
|
||||
- `match_end`
|
||||
|
||||
## Dry-Run Command
|
||||
|
||||
From `backend/`:
|
||||
|
||||
```powershell
|
||||
python -m app.database_maintenance cleanup --dry-run
|
||||
```
|
||||
|
||||
From the repository root with the backend package on `PYTHONPATH`:
|
||||
|
||||
```powershell
|
||||
$env:PYTHONPATH='backend'
|
||||
python -m app.database_maintenance cleanup --dry-run
|
||||
```
|
||||
|
||||
Inside Docker Compose:
|
||||
|
||||
```powershell
|
||||
docker compose exec backend python -m app.database_maintenance cleanup --dry-run
|
||||
```
|
||||
|
||||
Useful dry-run options:
|
||||
|
||||
```powershell
|
||||
docker compose exec backend python -m app.database_maintenance cleanup --dry-run `
|
||||
--recent-matches-keep 100 `
|
||||
--admin-log-noncritical-retention-days 30 `
|
||||
--admin-log-critical-retention-days 90 `
|
||||
--server-snapshot-retention-days 14 `
|
||||
--batch-size 5000
|
||||
```
|
||||
|
||||
Dry-run is the safe preview path and should be reviewed before any production apply.
|
||||
|
||||
## Apply Command
|
||||
|
||||
Local module execution:
|
||||
|
||||
```powershell
|
||||
python -m app.database_maintenance cleanup --apply
|
||||
```
|
||||
|
||||
Docker Compose:
|
||||
|
||||
```powershell
|
||||
docker compose exec backend python -m app.database_maintenance cleanup --apply
|
||||
```
|
||||
|
||||
One-off local validation with a fixed time anchor:
|
||||
|
||||
```powershell
|
||||
python -m app.database_maintenance cleanup --apply --now 2026-06-20T12:00:00Z
|
||||
```
|
||||
|
||||
Optional maintenance vacuum/analyze:
|
||||
|
||||
```powershell
|
||||
python -m app.database_maintenance cleanup --apply --vacuum-analyze
|
||||
```
|
||||
|
||||
## Table-Size Audit SQL
|
||||
|
||||
```sql
|
||||
select
|
||||
schemaname,
|
||||
relname as table_name,
|
||||
pg_size_pretty(pg_total_relation_size(relid)) as total_size,
|
||||
pg_size_pretty(pg_relation_size(relid)) as table_size,
|
||||
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as indexes_size,
|
||||
n_live_tup as estimated_rows,
|
||||
n_dead_tup as estimated_dead_rows
|
||||
from pg_stat_user_tables
|
||||
order by pg_total_relation_size(relid) desc;
|
||||
```
|
||||
|
||||
## Row-Count And Age Audit SQL
|
||||
|
||||
### AdminLog events by type/date
|
||||
|
||||
```sql
|
||||
select
|
||||
event_type,
|
||||
count(*) as row_count,
|
||||
min(event_timestamp) as first_event_timestamp,
|
||||
max(event_timestamp) as last_event_timestamp,
|
||||
min(server_time) as first_server_time,
|
||||
max(server_time) as last_server_time
|
||||
from rcon_admin_log_events
|
||||
group by event_type
|
||||
order by row_count desc, event_type asc;
|
||||
```
|
||||
|
||||
### Materialized matches by server/date
|
||||
|
||||
```sql
|
||||
select
|
||||
target_key,
|
||||
source_basis,
|
||||
count(*) as matches,
|
||||
min(coalesce(ended_at, started_at)) as first_closed_at,
|
||||
max(coalesce(ended_at, started_at)) as last_closed_at
|
||||
from rcon_materialized_matches
|
||||
group by target_key, source_basis
|
||||
order by target_key asc, source_basis asc;
|
||||
```
|
||||
|
||||
### Server snapshots by date
|
||||
|
||||
```sql
|
||||
select
|
||||
server_id,
|
||||
min(captured_at) as first_captured_at,
|
||||
max(captured_at) as last_captured_at,
|
||||
count(*) as snapshot_rows
|
||||
from server_snapshots
|
||||
group by server_id
|
||||
order by last_captured_at desc;
|
||||
```
|
||||
|
||||
### Displayed snapshots count
|
||||
|
||||
```sql
|
||||
select
|
||||
snapshot_type,
|
||||
metric,
|
||||
snapshot_window,
|
||||
count(*) as snapshot_rows,
|
||||
min(generated_at) as first_generated_at,
|
||||
max(generated_at) as last_generated_at
|
||||
from displayed_historical_snapshots
|
||||
group by snapshot_type, metric, snapshot_window
|
||||
order by snapshot_type asc, metric asc, snapshot_window asc;
|
||||
```
|
||||
|
||||
## Logs To Inspect
|
||||
|
||||
The cleanup command emits JSON logs. Minimum events to look for:
|
||||
|
||||
- `database-maintenance-started`
|
||||
- `database-maintenance-plan`
|
||||
- `database-maintenance-table-skipped`
|
||||
- `database-maintenance-delete-batch`
|
||||
- `database-maintenance-completed`
|
||||
- `database-maintenance-error`
|
||||
|
||||
Examples:
|
||||
|
||||
```powershell
|
||||
docker compose logs --tail=200 backend
|
||||
docker compose logs --tail=200 historical-runner
|
||||
```
|
||||
|
||||
If scheduled cleanup is enabled:
|
||||
|
||||
```powershell
|
||||
docker compose logs --tail=200 historical-runner
|
||||
```
|
||||
|
||||
## Docker And Portainer Warnings
|
||||
|
||||
- Never use `docker compose down -v` unless you intentionally want to delete PostgreSQL and mounted volume data.
|
||||
- Always review dry-run output before enabling apply in production.
|
||||
- Do not manually delete protected match or player-stat rows from PostgreSQL.
|
||||
- Keep backups before changing retention settings.
|
||||
- Do not add Comunidad Hispana #03 back into RCON targets in this task.
|
||||
- Do not add a separate maintenance container, host cron, or `pg_cron` job for this feature.
|
||||
|
||||
For Portainer-style operations the same warning applies:
|
||||
|
||||
- deleting volumes is destructive;
|
||||
- maintenance should run through the application command, not through manual table purges.
|
||||
|
||||
## Rollback And Restore Considerations
|
||||
|
||||
- Retention changes are destructive when apply mode runs.
|
||||
- Keep a PostgreSQL backup before enabling scheduled apply in production.
|
||||
- If cleanup removes too much data, recovery is restore-based, not “undo last delete.”
|
||||
- Favor dry-run, smaller batch sizes, and reviewed retention values before long-running scheduled apply.
|
||||
|
||||
## Safe Operator Flow
|
||||
|
||||
1. Audit table size and row ages with the SQL above.
|
||||
2. Run dry-run locally or in Compose.
|
||||
3. Review protected counts and candidate counts in JSON output.
|
||||
4. Enable `HLL_DB_MAINTENANCE_ENABLED=true` only after dry-run review.
|
||||
5. Monitor `historical-runner` logs for scheduler events and cleanup completion.
|
||||
Reference in New Issue
Block a user