PyPI Security Scanner

The PyPI Security Scanner is an automated threat-intelligence dashboard that inspects newly released Python packages on the Python Package Index (PyPI). By combining static heuristic analysis with multi-stage LLM code review, the scanner acts as an early warning system against supply-chain attacks such as credential theft and malicious code insertion.

Scanner Architecture

The scanner pipeline is built on top of serverless Cloudflare infrastructure, operating in three distinct stages:

  • RSS Feed Monitor: A scheduled cron trigger executes every 15 minutes to crawl the official PyPI RSS feed (https://pypi.org/rss/packages.xml) for newly published packages and versions.
  • Task Backlog Queue: Newly discovered packages are added to a scanning queue table. Packages are downloaded, decompressed, and processed concurrently by workers.
  • D1 Database Storage: The analysis reports, containing risk scores, matched heuristics, and LLM text summaries, are stored in a Cloudflare D1 relational database, which backs this frontend dashboard.
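
The first two stages above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scanner's actual code: the sample feed is hand-written and only approximates the shape of an RSS 2.0 document, Python's sqlite3 module stands in for the D1 database, and the scan_queue table, its columns, and both function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hand-written sample; it only approximates the shape of an RSS 2.0 feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>example-pkg added</title>
      <link>https://pypi.org/project/example-pkg/</link>
    </item>
    <item>
      <title>another-pkg added</title>
      <link>https://pypi.org/project/another-pkg/</link>
    </item>
  </channel>
</rss>"""


def package_names(feed_xml: str) -> list[str]:
    """Return project names taken from the last path segment of each item's <link>."""
    root = ET.fromstring(feed_xml)
    names = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").rstrip("/")
        if "/project/" in link:
            names.append(link.rsplit("/", 1)[-1])
    return names


def enqueue(db: sqlite3.Connection, names: list[str]) -> int:
    """Insert names not yet in the queue table; return how many rows were new."""
    before = db.total_changes
    db.executemany(
        "INSERT OR IGNORE INTO scan_queue (name) VALUES (?)",
        [(name,) for name in names],
    )
    db.commit()
    return db.total_changes - before


# In-memory database as a stand-in for the real storage layer (assumption).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE scan_queue (name TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')"
)

new_first = enqueue(db, package_names(SAMPLE_FEED))   # both names are new
new_second = enqueue(db, package_names(SAMPLE_FEED))  # duplicates are ignored
```

Using the package name as the primary key means a package that appears in two consecutive feed polls is queued only once.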

Heuristic Checks

The static analysis engine inspects package files and metadata for 10 core threat indicators, including the following:

🛰️ Outbound Network Activity
Flags raw socket connections and HTTP client calls (requests, urllib, httpx, aiohttp) made from package setup scripts.
🧩 Obfuscation Signatures
Flags base64-decoded source text, ROT13-decoded strings, dynamic exec() or eval() calls, and compressed bytecode loaded via marshal.
💻 Subprocess & Shell Execution
Flags system commands run through subprocess.Popen or direct shell execution calls (os.system, os.popen).
🎯 Typosquatting Detection
Compares the package name, using Levenshtein distance, against the 5,000 most-downloaded packages on PyPI to catch names that imitate popular packages.
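
As a rough illustration of how checks like these can work, the sketch below walks a file's syntax tree for network imports, exec()/eval() calls, marshal imports, and shell calls, then computes Levenshtein distance for the typosquatting comparison. It is a sketch under assumptions, not the scanner's engine: the flag names, the function names, and the default distance cutoff of 2 are all hypothetical, and the source is only parsed, never executed.

```python
import ast

NETWORK_MODULES = {"socket", "requests", "urllib", "httpx", "aiohttp"}
SHELL_CALLS = {("subprocess", "Popen"), ("os", "system"), ("os", "popen")}
DYNAMIC_CALLS = {"exec", "eval"}


def scan_source(source: str) -> set[str]:
    """Return hypothetical flag names for constructs found in the source text."""
    flags = set()
    for node in ast.walk(ast.parse(source)):
        # Collect the top-level module name of every import statement.
        if isinstance(node, ast.Import):
            modules = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [(node.module or "").split(".")[0]]
        else:
            modules = []
        for module in modules:
            if module in NETWORK_MODULES:
                flags.add("network")
            if module == "marshal":
                flags.add("obfuscation")
        # Look for exec()/eval() and module.attribute shell calls.
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in DYNAMIC_CALLS:
                flags.add("obfuscation")
            elif (
                isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in SHELL_CALLS
            ):
                flags.add("shell")
    return flags


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def near_matches(name: str, popular: list[str], max_distance: int = 2) -> list[str]:
    """Popular names within max_distance edits of name, excluding exact matches."""
    return [p for p in popular if p != name and levenshtein(name, p) <= max_distance]


SAMPLE = "import os, requests\nos.system('id')\nexec('x = 1')\n"
sample_flags = scan_source(SAMPLE)
lookalikes = near_matches("requets", ["requests", "numpy"])
```

A name-based walk like this misses aliased imports (for example `import os as o`) and calls reached through `getattr`, so it should be read as a starting point rather than a complete detector.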

AI-Assisted Analysis

When heuristic flags exceed the risk threshold, the flagged source code and configuration snippets are routed to a multi-stage AI reasoning agent:

  1. Network & Syscall Audit: The model examines network parameters and execution targets to determine whether they are consistent with the package's legitimate functionality.
  2. Payload Deobfuscation: The model decodes base64 sequences and explains what the hidden setup code does.
  3. Security Synthesis: The final agent stage aggregates inputs from the previous phases to provide a cohesive safety verdict.
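
The three stages above could be chained roughly as follows. This is only a sketch of the control flow: the prompts, the numeric risk_score, the default threshold of 50, and the staged_review function are assumptions made for illustration, and the model is passed in as a plain callable so that no particular LLM API is implied.

```python
from typing import Callable, Optional

# Any function that maps a prompt string to a response string (assumption).
Model = Callable[[str], str]


def staged_review(
    snippets: str,
    risk_score: int,
    model: Model,
    threshold: int = 50,  # hypothetical cutoff, not taken from the scanner
) -> Optional[dict]:
    """Run the three review stages, or return None when below the threshold."""
    if risk_score <= threshold:
        return None
    audit = model("Stage 1 (network and syscall audit):\n" + snippets)
    payload = model("Stage 2 (payload deobfuscation):\n" + snippets)
    # The final stage sees only the outputs of the two earlier stages.
    verdict = model("Stage 3 (security synthesis):\n" + audit + "\n" + payload)
    return {"audit": audit, "payload": payload, "verdict": verdict}


def echo_model(prompt: str) -> str:
    """Stub model for demonstration: returns the first line of the prompt."""
    return prompt.splitlines()[0]


skipped = staged_review("import os", risk_score=10, model=echo_model)
reviewed = staged_review("import os", risk_score=90, model=echo_model)
```

Passing the model as a callable keeps the staging logic testable with a stub, independent of whichever LLM service is actually used.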

Understanding Verdicts

Each scanned package receives one of the following safety status verdicts:

  • Safe: No heuristics associated with malicious behavior were triggered. Considered safe for developers to install.
  • Suspicious: Contains unusual code constructs (e.g. calling subprocesses in setup) that are not definitively malicious but merit user caution.
  • Malicious: Code contains verified malware signatures, such as uploading credentials, stealing tokens, or running shell-command backdoors.
  • Quarantined: The package has been removed from the official PyPI repository by the index security team.
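
One way to model the four verdicts in code is an enumeration plus a small decision function. The sketch below is illustrative only: the input names and the precedence order (removal first, then confirmed malware, then any triggered flag) are assumptions, since the way the scanner actually combines its signals is not described here.

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "Safe"
    SUSPICIOUS = "Suspicious"
    MALICIOUS = "Malicious"
    QUARANTINED = "Quarantined"


def assign_verdict(
    flags: set[str],
    confirmed_malware: bool,
    removed_from_pypi: bool,
) -> Verdict:
    """Map scan results to a verdict using an assumed precedence order."""
    if removed_from_pypi:
        return Verdict.QUARANTINED
    if confirmed_malware:
        return Verdict.MALICIOUS
    if flags:
        # Unusual constructs were seen but nothing was confirmed as malware.
        return Verdict.SUSPICIOUS
    return Verdict.SAFE


clean = assign_verdict(set(), confirmed_malware=False, removed_from_pypi=False)
flagged = assign_verdict({"shell"}, confirmed_malware=False, removed_from_pypi=False)
```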