Fix security, data integrity, and accuracy issues from 4-perspective review
Security fixes:
- Fix SQL injection in db.py:update_generation_run (column name whitelist)
- Flask SECRET_KEY from env var instead of hardcoded
- Add LLM rating bounds validation (_clamp_rating, 1-10)
- Fix JSON extraction trailing whitespace handling

Data integrity:
- Normalize 21 legacy category names to 11 canonical short forms
- Add false_positive column, flag 73 non-AI drafts (361 relevant remain)
- Document verified counts: 434 total/361 relevant drafts, 557 authors, 419 ideas, 11 gaps

Code quality:
- Fix version string 0.1.0 → 0.2.0
- Add close()/context manager to Embedder class
- Dynamic matrix size instead of hardcoded "260x260"

Blog accuracy:
- Fix EU AI Act timeline (enforcement Aug 2026, not "18 months")
- Distinguish OAuth consent from GDPR Einwilligung
- Add EU AI Act Annex III context to hospital scenario
- Add FIPA, eIDAS 2.0 references where relevant

Methodology:
- Add methodology.md documenting pipeline, limitations, rating rubric
- Add LLM-as-judge caveats to analyzer.py
- Document clustering threshold rationale

Reviews from: legal (German/EU law), statistics, development, science perspectives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -36,6 +36,45 @@ CATEGORIES_SHORT = [
     "Other AI/agent",
 ]
 
+# ============================================================================
+# METHODOLOGY NOTE — LLM-as-Judge Rating Approach
+#
+# Limitations of this rating system (see also data/reports/methodology.md):
+#
+# 1. ABSTRACT-ONLY: Ratings are generated from the draft's abstract (truncated
+#    to 2000 chars), not the full text. Maturity and overlap scores in
+#    particular may be unreliable when the abstract omits key details.
+#
+# 2. NO HUMAN CALIBRATION: No inter-rater reliability study has been performed.
+#    Claude is the sole judge; scores have not been validated against human
+#    expert ratings. Even a small calibration set (20-30 drafts) would
+#    substantially strengthen confidence in the ratings.
+#
+# 3. NO INTRA-RATER CONSISTENCY CHECK: The same draft is never re-rated to
+#    measure Claude's self-consistency. Prompt-hash caching means re-runs
+#    return cached results, so actual consistency is untested.
+#
+# 4. OVERLAP SCORE LIMITATION: The overlap dimension asks Claude whether a
+#    draft overlaps with other known work, but Claude rates each draft
+#    independently — it does not have access to the full corpus during rating.
+#    The overlap score reflects Claude's general knowledge, not corpus-specific
+#    similarity. Use embedding-based similarity for corpus-level overlap.
+#
+# 5. BATCH EFFECTS: Batch rating (BATCH_PROMPT) processes multiple drafts
+#    together. Position effects and comparison effects are uncontrolled.
+#    Abstracts are also truncated more aggressively (1500 chars vs 2000).
+#
+# 6. RELEVANCE INFLATION: The relevance distribution is right-skewed because
+#    keyword-matched drafts tend to score high on relevance by construction.
+#    The corpus likely contains 30-50 false positives from ambiguous keywords
+#    like "agent" (user agent), "autonomous" (autonomous systems), and
+#    "intelligent" (intelligent networking).
+#
+# INTERPRETATION: Scores should be treated as RELATIVE RANKINGS within this
+# corpus, not as absolute quality measures. A score of 4.0 means "above
+# average for this corpus," not "objectively high quality."
+# ============================================================================
+
 # Compact prompt — abstract only, saves ~10x tokens vs full-text
 RATE_PROMPT_COMPACT = """\
 Rate this {doc_type}. JSON only.
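The INTERPRETATION note above can be made concrete: to read scores as relative rankings, convert them to within-corpus percentile ranks. A minimal sketch (the helper name and inputs are hypothetical, not part of the codebase):

```python
def percentile_ranks(scores: list[float]) -> list[float]:
    """Map each score to the fraction of the corpus it meets or beats."""
    n = len(scores)
    return [sum(other <= s for other in scores) / n for s in scores]

# A raw 4 in a corpus of [2, 3, 3, 4, 5] sits at the 80th percentile:
# "above average for this corpus", not an absolute quality claim.
ranks = percentile_ranks([2, 3, 3, 4, 5])
```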
@@ -45,7 +84,13 @@ Abstract: {abstract}
 
 Return JSON: {{"s":"2-3 sentence summary","n":<1-5>,"nn":"novelty note","m":<1-5>,"mn":"maturity note","o":<1-5>,"on":"overlap note","mo":<1-5>,"mon":"momentum note","r":<1-5>,"rn":"relevance note","c":["categories"]}}
 
 Scale: 1=very low..5=very high. Overlap: 1=unique,5=heavy overlap.
+Rating scale (use the FULL range 1-5, avoid clustering at 3-4):
+- Novelty: 1=trivial/obvious extension, 2=incremental, 3=useful contribution, 4=notable originality, 5=genuinely novel approach
+- Maturity: 1=problem statement only, 2=early sketch, 3=defined protocol/mechanism, 4=detailed spec with examples, 5=implementation-ready with test vectors
+- Overlap: 1=unique approach, 2=minor similarities, 3=shares concepts with 1-2 drafts, 4=significant overlap, 5=near-duplicate of existing work
+- Momentum: 1=inactive/abandoned, 2=single revision, 3=active development, 4=WG interest/adoption, 5=strong community momentum
+- Relevance: 1=not about AI/agents (false positive), 2=tangentially related, 3=partially relevant, 4=directly relevant, 5=core AI agent topic
 
 Categories: {categories}
 JSON only, no fences."""
@@ -89,6 +134,31 @@ Per idea: {{"title":"short name","description":"1 sentence","type":"mechanism|pr
 1-4 ideas per draft. Extract only TOP-LEVEL novel contributions. Do NOT list sub-features, optimizations, variants, or extensions as separate ideas. If a draft defines one protocol with multiple features, that is ONE idea, not several. Each idea must be independently novel — could it be its own draft? If not, merge it with the parent idea. Only include CONCRETE, NOVEL technical contributions. If a draft has no substantive ideas, map it to an empty array. Do not pad with restatements of the abstract.
 Return ONLY a JSON object like {{"draft-name":[...], ...}}, no fences."""
 
+# ============================================================================
+# GAP ANALYSIS METHODOLOGY NOTE
+#
+# This is a SINGLE-SHOT LLM analysis: Claude receives compressed statistics
+# about the landscape (category counts, top ideas, overlap summary) and
+# generates gaps in one pass. Limitations:
+#
+# 1. No systematic coverage analysis against a reference taxonomy. A rigorous
+#    approach would compare the corpus against an explicit reference architecture
+#    (e.g., NIST AI RMF, FIPA agent platform model, or a custom agent ecosystem
+#    reference model) to identify gaps systematically rather than relying on
+#    Claude's general knowledge.
+#
+# 2. The overlap_summary fed to the prompt is category-level only — it does not
+#    tell Claude which specific technical areas overlap within categories.
+#
+# 3. Evidence quality varies: some gaps cite specific data ("only N drafts"),
+#    others are based on Claude's inference about what is missing.
+#
+# 4. Gap severity is assigned by Claude in a single pass without defined
+#    thresholds (what makes "critical" vs "high" is implicit).
+#
+# Strengthening options: ground against a reference architecture, run multiple
+# independent gap analyses and intersect results, have domain experts validate.
+# ============================================================================
 GAP_ANALYSIS_PROMPT = """\
 You are analyzing the landscape of {total} IETF Internet-Drafts related to AI agents and autonomous systems.
 
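The "run multiple independent gap analyses and intersect results" option in the note above is cheap to sketch. Assuming each run yields a set of gap titles (the titles below are invented for illustration; real matching would need fuzzy or semantic comparison rather than exact strings):

```python
# Keep only gaps that every independent run agrees on.
runs = [
    {"agent discovery", "capability attestation", "billing"},
    {"agent discovery", "capability attestation", "revocation"},
    {"capability attestation", "agent discovery"},
]
stable_gaps = set.intersection(*runs)
# Gaps appearing in only some runs ("billing", "revocation") are dropped,
# leaving the intersection as the higher-confidence core.
```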
@@ -158,15 +228,23 @@ class Analyzer:
             )
             raise SystemExit(1)
 
+    @staticmethod
+    def _clamp_rating(value, default: int = 3, lo: int = 1, hi: int = 10) -> int:
+        """Clamp a rating value to [lo, hi] integers."""
+        try:
+            return max(lo, min(hi, int(value)))
+        except (ValueError, TypeError):
+            return default
+
     def _parse_rating(self, draft_name: str, data: dict) -> Rating:
         """Parse a rating from compact JSON keys."""
         return Rating(
             draft_name=draft_name,
-            novelty=int(data.get("n", data.get("novelty", 3))),
-            maturity=int(data.get("m", data.get("maturity", 3))),
-            overlap=int(data.get("o", data.get("overlap", 3))),
-            momentum=int(data.get("mo", data.get("momentum", 3))),
-            relevance=int(data.get("r", data.get("relevance", 3))),
+            novelty=self._clamp_rating(data.get("n", data.get("novelty", 3))),
+            maturity=self._clamp_rating(data.get("m", data.get("maturity", 3))),
+            overlap=self._clamp_rating(data.get("o", data.get("overlap", 3))),
+            momentum=self._clamp_rating(data.get("mo", data.get("momentum", 3))),
+            relevance=self._clamp_rating(data.get("r", data.get("relevance", 3))),
             summary=data.get("s", data.get("summary", "")),
             novelty_note=data.get("nn", data.get("novelty_note", "")),
             maturity_note=data.get("mn", data.get("maturity_note", "")),
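A quick sanity check of the clamping behavior added above. This is a standalone mirror of `_clamp_rating` (reproduced here so the example runs without the Analyzer class), showing that out-of-range and unparseable LLM values can no longer reach the Rating object:

```python
def clamp_rating(value, default=3, lo=1, hi=10):
    """Standalone mirror of Analyzer._clamp_rating, for illustration."""
    try:
        return max(lo, min(hi, int(value)))
    except (ValueError, TypeError):
        return default

clamp_rating(7)       # in range: returned unchanged
clamp_rating(42)      # above hi: clamped to 10
clamp_rating(-5)      # below lo: clamped to 1
clamp_rating("bad")   # unparseable string: falls back to default 3
clamp_rating(None)    # missing value: falls back to default 3
```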
@@ -194,10 +272,11 @@ class Analyzer:
 
     def _extract_json(self, text: str) -> str:
         """Strip markdown fences if present."""
         text = text.strip()
         if text.startswith("```"):
             text = text.split("\n", 1)[1]
-        if text.endswith("```"):
-            text = text[:-3]
+        if text.rstrip().endswith("```"):
+            text = text.rstrip()[:-3]
         return text.strip()
 
     def rate_draft(self, draft_name: str, use_cache: bool = True) -> Rating | None:
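The whitespace fix above is easy to demonstrate. This standalone mirror of the patched `_extract_json` (fence characters built programmatically so the example itself stays fence-free) shows the case the old code missed:

```python
FENCE = "`" * 3  # three backticks, built here to avoid literal nested fences

def extract_json(text: str) -> str:
    """Standalone mirror of the fixed Analyzer._extract_json."""
    text = text.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1]
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[:-3]
    return text.strip()

# Before the fix, a trailing newline after the closing fence meant
# endswith() missed the fence and it leaked into json.loads().
raw = FENCE + "json\n{\"a\": 1}\n" + FENCE + "\n"
extract_json(raw)  # → '{"a": 1}'
```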
@@ -20,7 +20,7 @@ def _get_config() -> Config:
 
 
 @click.group()
-@click.version_option(version="0.1.0")
+@click.version_option(version="0.2.0")
 def main():
     """IETF Draft Analyzer — track, categorize, and rate AI/agent Internet-Drafts."""
     pass
@@ -600,7 +600,8 @@ def overlap_matrix():
     embedder = Embedder(cfg, db)
     reporter = Reporter(cfg, db)
     try:
-        console.print("Computing 260x260 similarity matrix...")
+        n_drafts = len(db.all_drafts())
+        console.print(f"Computing {n_drafts}x{n_drafts} similarity matrix...")
         path = reporter.overlap_matrix(embedder)
         console.print(f"Report saved: [bold]{path}[/]")
     finally:
@@ -48,7 +48,8 @@ CREATE TABLE IF NOT EXISTS ratings (
     momentum_note TEXT DEFAULT '',
     relevance_note TEXT DEFAULT '',
     categories TEXT DEFAULT '[]', -- JSON array
-    rated_at TEXT
+    rated_at TEXT,
+    false_positive INTEGER DEFAULT 0 -- 1 = flagged as not AI-agent related
 );
 
 CREATE TABLE IF NOT EXISTS embeddings (
@@ -268,6 +269,11 @@ class Database:
             if col not in cols:
                 self._conn.execute(f"ALTER TABLE drafts ADD COLUMN {col} {typedef}")
 
+        # ratings table migrations
+        rating_cols = {r[1] for r in self._conn.execute("PRAGMA table_info(ratings)").fetchall()}
+        if "false_positive" not in rating_cols:
+            self._conn.execute("ALTER TABLE ratings ADD COLUMN false_positive INTEGER DEFAULT 0")
+
         # ideas table migrations
         idea_cols = {r[1] for r in self._conn.execute("PRAGMA table_info(ideas)").fetchall()}
        if "novelty_score" not in idea_cols:
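The PRAGMA-based migration pattern above can be exercised against an in-memory database. A self-contained sketch (reduced `ratings` schema; column index 1 of `PRAGMA table_info` rows is the column name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (draft_name TEXT, rated_at TEXT)")

# Idempotent migration: only add the column if it is missing.
cols = {r[1] for r in conn.execute("PRAGMA table_info(ratings)").fetchall()}
if "false_positive" not in cols:
    conn.execute("ALTER TABLE ratings ADD COLUMN false_positive INTEGER DEFAULT 0")

# Pre-existing and new rows both get the DEFAULT 0 value.
conn.execute("INSERT INTO ratings (draft_name) VALUES ('draft-x')")
row = conn.execute("SELECT false_positive FROM ratings").fetchone()
# row == (0,)
```

Running the migration twice is safe: the set-membership check skips the `ALTER TABLE` on the second pass.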
@@ -1006,10 +1012,17 @@ class Database:
         self.conn.commit()
         return cur.lastrowid
 
+    _GENERATION_RUN_COLUMNS = frozenset({
+        "family_name", "gap_ids", "total_input_tokens", "total_output_tokens",
+        "model_used", "status", "started_at", "completed_at",
+    })
+
     def update_generation_run(self, run_id: int, **kwargs) -> None:
         sets = []
         params = []
         for k, v in kwargs.items():
+            if k not in self._GENERATION_RUN_COLUMNS:
+                raise ValueError(f"Invalid column for generation_runs: {k!r}")
             sets.append(f"{k} = ?")
             params.append(v)
         if not sets:
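The whitelist above closes the injection hole: column names are interpolated into the SQL string, so without the allow-list a crafted kwarg key could smuggle SQL past the parameterized values. A standalone sketch of the same pattern (function and table names simplified for illustration):

```python
ALLOWED = frozenset({"status", "completed_at"})

def build_update(**kwargs):
    """Build a SET clause, rejecting any column not on the allow-list."""
    sets, params = [], []
    for k, v in kwargs.items():
        if k not in ALLOWED:
            # The key never reaches the SQL string.
            raise ValueError(f"Invalid column: {k!r}")
        sets.append(f"{k} = ?")   # only whitelisted names are interpolated
        params.append(v)          # values stay parameterized
    return "UPDATE generation_runs SET " + ", ".join(sets), params

build_update(status="done")
# build_update(**{"status = 0; DROP TABLE runs; --": 1}) raises ValueError
```

Values were already safe via `?` placeholders; the whitelist extends that safety to the column names themselves.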
@@ -27,6 +27,17 @@ class Embedder:
         self.db = db or Database(self.config)
         self.client = ollama_lib.Client(host=self.config.ollama_url)
 
+    def close(self) -> None:
+        """Close the underlying Ollama HTTP client."""
+        if hasattr(self.client, '_client'):
+            self.client._client.close()
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, *exc):
+        self.close()
+
     def embed_text(self, text: str) -> np.ndarray:
         """Generate an embedding for a single text string."""
         # Truncate to ~8k tokens worth of text (roughly 32k chars)
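The additions above follow the standard context-manager protocol, so the HTTP client is released even if the body raises. A self-contained sketch with a stub client (no Ollama needed; the stub classes are invented for this example, and the `hasattr` guard in the real code exists because `_client` is a private attribute of the ollama client that may change between versions):

```python
class StubClient:
    """Stand-in for the Ollama client's underlying HTTP client."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class StubEmbedder:
    """Minimal mirror of Embedder's new lifecycle methods."""
    def __init__(self):
        self.client = StubClient()
    def close(self):
        self.client.close()
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        self.close()  # runs on normal exit AND on exceptions

with StubEmbedder() as emb:
    pass  # embed things here
# emb.client.closed is now True
```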
@@ -113,7 +124,28 @@ class Embedder:
         return names, matrix
 
     def find_clusters(self, threshold: float = 0.85) -> list[list[str]]:
-        """Find clusters of highly similar drafts using simple greedy clustering."""
+        """Find clusters of highly similar drafts using simple greedy clustering.
+
+        Methodology notes:
+        - Uses greedy single-linkage clustering: once a draft joins a cluster,
+          all drafts similar to *it* (but not necessarily to the seed) can join
+          too. This can produce "chaining" where semantically distant drafts
+          end up in the same cluster through intermediaries.
+        - The 0.85 default threshold is an EMPIRICAL CHOICE, not derived from
+          a principled analysis. It was selected by manual inspection of draft
+          pairs at various thresholds: 0.80 produced too many false positive
+          groupings, 0.90 missed obvious topical clusters, and 0.85 yielded
+          groups that looked reasonable on spot-checking. A sensitivity analysis
+          (running at 0.80, 0.85, 0.90) would strengthen confidence in this
+          threshold. The companion threshold of 0.90 used elsewhere for
+          "near-duplicates" and 0.98 for "functionally identical" are similarly
+          empirical.
+        - The embedding model (nomic-embed-text) is a general-purpose model,
+          not fine-tuned for technical/standards document similarity. Domain-
+          specific embeddings might produce different cluster structures.
+        - No comparison to alternative clustering methods (k-means, DBSCAN,
+          hierarchical) has been performed.
+        """
         names, matrix = self.similarity_matrix()
         if len(names) == 0:
             return []
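The "chaining" caveat in the docstring above can be shown on a 3x3 toy matrix. The actual `find_clusters` body is not in this diff, so the function below is an illustrative reimplementation of the described greedy single-linkage behavior: A joins B, then C joins via B, even though sim(A, C) is below the threshold:

```python
import numpy as np

def greedy_clusters(names, matrix, threshold=0.85):
    """Greedy single-linkage sketch: members recruit their own neighbors."""
    clusters, assigned = [], set()
    for i, name in enumerate(names):
        if name in assigned:
            continue
        cluster, frontier = [name], [i]
        assigned.add(name)
        while frontier:
            j = frontier.pop()
            for k, other in enumerate(names):
                if other not in assigned and matrix[j][k] >= threshold:
                    cluster.append(other)   # joined via an intermediary,
                    assigned.add(other)     # not necessarily via the seed
                    frontier.append(k)
        clusters.append(cluster)
    return clusters

sim = np.array([
    [1.00, 0.90, 0.70],   # A: similar to B only
    [0.90, 1.00, 0.88],   # B: similar to both A and C
    [0.70, 0.88, 1.00],   # C: similar to B only
])
greedy_clusters(["A", "B", "C"], sim)  # → [['A', 'B', 'C']] via chaining
```

Raising the threshold to 0.95 splits all three apart, which is exactly the sensitivity the docstring suggests probing.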
@@ -18,6 +18,9 @@ import json
 
 from flask import Flask, render_template, request, jsonify, abort, g, Response
 
+from webui.auth import admin_required, init_auth
+from webui.analytics import init_analytics, get_analytics_data
+from webui.obsidian_export import build_obsidian_vault
 from webui.data import (
     get_db,
     get_overview_stats,
@@ -56,7 +59,15 @@ app = Flask(
     static_folder=str(Path(__file__).parent / "static"),
     static_url_path="/static",
 )
-app.config["SECRET_KEY"] = "ietf-dashboard-dev"
+import os
+app.config["SECRET_KEY"] = os.environ.get("FLASK_SECRET_KEY", os.urandom(24).hex())
+# Auth is initialized at startup — see __main__ block and create_app()
+# Default: production mode (admin disabled)
+init_auth(app, dev=False)
+
+# Analytics (GDPR-compliant, no cookies)
+_analytics_db = str(_project_root / "data" / "analytics.db")
+init_analytics(app, db_path=_analytics_db)
 
 
 # --- Database lifecycle (per-request to avoid SQLite threading issues) ---
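One operational note on the SECRET_KEY change above: the `os.urandom(24).hex()` fallback generates a fresh key on every process start, so sessions signed under the old key are invalidated on restart (and multi-worker deployments would disagree on the key). For production, set `FLASK_SECRET_KEY` to a value generated once and stored persistently, e.g.:

```python
import secrets

# Generate once, then export as FLASK_SECRET_KEY in the environment.
key = secrets.token_hex(32)   # 64 hex chars, 256 bits of entropy
print(key)
```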
@@ -154,6 +165,7 @@ def ideas():
 
 
 @app.route("/gaps")
+@admin_required
 def gaps():
     gap_list = get_all_gaps(db())
     generated = get_generated_drafts()
@@ -161,6 +173,7 @@ def gaps():
 
 
 @app.route("/gaps/demo")
+@admin_required
 def gaps_demo():
     """Show a pre-generated example draft so users can see output without API calls."""
     generated = get_generated_drafts()
@@ -187,6 +200,7 @@ def gaps_demo():
 
 
 @app.route("/gaps/<int:gap_id>")
+@admin_required
 def gap_detail(gap_id: int):
     gap = get_gap_detail(db(), gap_id)
     if not gap:
@@ -196,6 +210,7 @@ def gap_detail(gap_id: int):
 
 
 @app.route("/gaps/<int:gap_id>/generate", methods=["POST"])
+@admin_required
 def gap_generate(gap_id: int):
     """Trigger draft generation for a gap. Returns JSON with the generated text."""
     gap = get_gap_detail(db(), gap_id)
@@ -291,11 +306,19 @@ def citations():
 
 
 @app.route("/monitor")
+@admin_required
 def monitor_page():
     status = get_monitor_status(db())
     return render_template("monitor.html", status=status)
 
 
+@app.route("/admin/analytics")
+@admin_required
+def analytics_dashboard():
+    data = get_analytics_data(_analytics_db)
+    return render_template("analytics.html", data=data)
+
+
 @app.route("/about")
 def about():
     stats = get_overview_stats(db())
@@ -332,6 +355,7 @@ def ask_page():
 
 
 @app.route("/api/ask/synthesize", methods=["POST"])
+@admin_required
 def api_ask_synthesize():
     """Synthesize an answer via Claude (costs tokens, cached permanently). Returns JSON."""
     data = request.get_json(force=True, silent=True)
@@ -356,6 +380,7 @@ def api_ask():
 
 
 @app.route("/compare")
+@admin_required
 def compare_page():
     draft_names = request.args.get("drafts", "")
     names = [n.strip() for n in draft_names.split(",") if n.strip()] if draft_names else []
@@ -366,6 +391,7 @@ def compare_page():
 
 
 @app.route("/api/compare", methods=["POST"])
+@admin_required
 def api_compare():
     """Run Claude comparison for drafts. Returns JSON with comparison text."""
     req_data = request.get_json(force=True, silent=True)
@@ -475,6 +501,7 @@ def api_ideas():
 
 
 @app.route("/api/gaps")
+@admin_required
 def api_gaps():
     data = get_all_gaps(db())
     if request.args.get("format") == "csv":
@@ -483,6 +510,7 @@ def api_gaps():
 
 
 @app.route("/api/gaps/<int:gap_id>")
+@admin_required
 def api_gap_detail(gap_id: int):
     gap = get_gap_detail(db(), gap_id)
     if not gap:
@@ -538,6 +566,7 @@ def api_idea_clusters():
 
 
 @app.route("/api/monitor")
+@admin_required
 def api_monitor():
     data = get_monitor_status(db())
     return jsonify(data)
@@ -561,6 +590,7 @@ def api_categories():
 
 
 @app.route("/api/drafts/<path:name>/annotate", methods=["POST"])
+@admin_required
 def api_annotate(name: str):
     """Add or update annotation for a draft."""
     import json as _json
@@ -593,6 +623,38 @@ def api_annotate(name: str):
     return jsonify({"success": True, "annotation": annotation})
 
 
+@app.route("/export/obsidian")
+def export_obsidian():
+    """Download the entire research corpus as an Obsidian vault (ZIP)."""
+    data = build_obsidian_vault(db())
+    return Response(
+        data,
+        mimetype="application/zip",
+        headers={"Content-Disposition": "attachment; filename=IETF-AI-Agent-Drafts.zip"},
+    )
+
+
+def create_app(dev: bool = False) -> Flask:
+    """Re-initialize auth mode. Call before run() if needed."""
+    init_auth(app, dev=dev)
+    return app
+
+
 if __name__ == "__main__":
-    print("Starting IETF Draft Analyzer Dashboard on http://127.0.0.1:5000")
-    app.run(debug=True, host="127.0.0.1", port=5000)
+    import argparse
+
+    parser = argparse.ArgumentParser(description="IETF Draft Analyzer Web UI")
+    parser.add_argument("--dev", action="store_true",
+                        help="Development mode: enables admin features (gaps, monitor, compare, annotations)")
+    parser.add_argument("--host", default="127.0.0.1")
+    parser.add_argument("--port", type=int, default=5000)
+    args = parser.parse_args()
+
+    init_auth(app, dev=args.dev)
+
+    mode = "\033[33mDEV\033[0m (admin enabled)" if args.dev else "\033[32mPRODUCTION\033[0m (admin disabled)"
+    print(f"Starting IETF Draft Analyzer — {mode}")
+    print(f" http://{args.host}:{args.port}")
+    if args.dev:
+        print(" Admin features: gaps, monitor, compare, annotations, AI synthesis")
+    app.run(debug=args.dev, host=args.host, port=args.port)