PyPI Security Scanner
The PyPI Security Scanner is an automated threat-intelligence dashboard that inspects newly released Python packages on the Python Package Index (PyPI). By combining static heuristic analysis with multi-stage LLM code review, the scanner acts as an early-warning system against supply-chain attacks such as credential theft and malicious code insertion.
Scanner Architecture
The scanner pipeline is built on top of serverless Cloudflare infrastructure, operating in three distinct stages:
- RSS Feed Monitor: A scheduled cron trigger executes every 15 minutes to crawl the official PyPI RSS feed (https://pypi.org/rss/packages.xml) for newly published packages and versions.
- Task Backlog Queue: Newly discovered packages are added to a scanning queue table. Packages are downloaded, decompressed, and processed concurrently by workers.
- D1 Database Storage: The analysis report, containing risk scores, matched heuristics, and LLM text summaries, is stored in a Cloudflare D1 relational database, which serves this frontend dashboard.
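The first two stages can be illustrated with a minimal Python sketch. It is not the scanner's actual code: the feed excerpt is made up, the scan_queue table and its columns are hypothetical, the package name is assumed to be the last path segment of each item's link, and the standard-library sqlite3 module stands in for D1 (which is SQLite-based).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up excerpt in RSS 2.0 shape; real items would come from the PyPI feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>example-pkg</title>
      <link>https://pypi.org/project/example-pkg/</link>
    </item>
    <item>
      <title>another-pkg</title>
      <link>https://pypi.org/project/another-pkg/</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Extract one {name, link} entry per <item> in the feed."""
    entries = []
    for item in ET.fromstring(xml_text).iter("item"):
        link = item.findtext("link", default="").strip()
        # Assumption: the package name is the last path segment of the link.
        name = link.rstrip("/").rsplit("/", 1)[-1]
        if name:
            entries.append({"name": name, "link": link})
    return entries


def enqueue(conn, entries):
    """Add unseen packages to a hypothetical queue table; return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scan_queue ("
        "name TEXT PRIMARY KEY, link TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )

    def row_count():
        return conn.execute("SELECT COUNT(*) FROM scan_queue").fetchone()[0]

    before = row_count()
    # INSERT OR IGNORE skips names already queued by an earlier cron run.
    conn.executemany(
        "INSERT OR IGNORE INTO scan_queue (name, link) VALUES (:name, :link)",
        entries,
    )
    conn.commit()
    return row_count() - before


conn = sqlite3.connect(":memory:")
new_rows = enqueue(conn, parse_feed(SAMPLE_FEED))
```

Keying the queue on the package name makes repeated polls idempotent: a package that is still present in the feed on the next run is not queued twice.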
Heuristic Checks
The static analysis engine inspects package files and metadata for 10 core threat indicators, including:
- 🛰️ Outbound Network Activity: Flags raw socket connections or HTTP clients (requests, urllib, httpx, aiohttp) invoked from package setup scripts.
- 🧩 Obfuscation Signatures: Flags base64-decoded source text, ROT13-decoded strings, dynamic exec() or eval() blocks, and compressed bytecode loaded via marshal.
- 💻 Subprocess & Shell Execution: Flags system command execution through subprocess.Popen or direct shell calls (os.system, os.popen).
- 🎯 Typosquatting Detection: Compares the package name, using Levenshtein distance, against the top 5,000 most-downloaded packages on PyPI to catch lookalike names used to phish developers.
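Two of these checks are easy to sketch in Python. The example below is an illustration under stated assumptions, not the scanner's engine: the SUSPICIOUS_CALLS table, the label names, and the max_distance default of 2 are all hypothetical. It walks the syntax tree with the standard ast module, so the inspected code is parsed but never executed.

```python
import ast

# Hypothetical rule table: dotted call name -> heuristic label.
SUSPICIOUS_CALLS = {
    "exec": "obfuscation",
    "eval": "obfuscation",
    "marshal.loads": "obfuscation",
    "base64.b64decode": "obfuscation",
    "os.system": "shell-execution",
    "os.popen": "shell-execution",
    "subprocess.Popen": "shell-execution",
    "socket.socket": "network",
    "urllib.request.urlopen": "network",
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_flags(source):
    """Return sorted (line, call, label) tuples for flagged calls in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                hits.append((node.lineno, name, SUSPICIOUS_CALLS[name]))
    return sorted(hits)


def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def nearest_popular(name, popular, max_distance=2):
    """Popular names within max_distance edits of name, excluding exact matches."""
    return [p for p in popular if 0 < levenshtein(name, p) <= max_distance]


flags = find_flags("import os\nos.system('id')\n")
lookalikes = nearest_popular("requets", ["requests", "numpy"])
```

A name-based matcher like this misses aliased imports (for example, import subprocess as sp), so a production rule set would need to resolve import aliases before matching.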
AI-Assisted Analysis
When a package's heuristic flags exceed the risk threshold, its source code and configuration snippets are routed to a multi-stage AI reasoning agent:
- Network & Syscall Audit: The model reviews network calls and command-execution targets to check whether they serve legitimate package functionality.
- Payload Deobfuscation: The model decodes base64 sequences and explains what the hidden setup code does.
- Security Synthesis: The final agent stage aggregates inputs from the previous phases to provide a cohesive safety verdict.
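The routing and staging described above can be sketched as follows. Everything here is an assumption made for illustration: the RISK_THRESHOLD value, the stage identifiers, and the ask_model callable (a placeholder for whatever LLM client is used) are hypothetical and do not describe the scanner's real interface.

```python
from typing import Callable, Dict, Optional

RISK_THRESHOLD = 50  # assumed value; the real threshold is not documented here

# The first two stages read the raw snippets independently of each other.
EVIDENCE_STAGES = ("network_syscall_audit", "payload_deobfuscation")


def review(risk_score: int,
           snippets: str,
           ask_model: Callable[[str, str], str]) -> Optional[Dict[str, str]]:
    """Run the staged review, or return None when the score is not above the threshold."""
    if risk_score <= RISK_THRESHOLD:
        return None
    findings = {stage: ask_model(stage, snippets) for stage in EVIDENCE_STAGES}
    # The synthesis stage sees the earlier findings, not the raw code.
    combined = "\n".join(f"{stage}: {text}" for stage, text in findings.items())
    findings["security_synthesis"] = ask_model("security_synthesis", combined)
    return findings


def echo_model(stage: str, text: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return f"{stage} reviewed {len(text)} characters"


report = review(90, "os.system('id')", echo_model)
```

Passing the model in as a callable keeps the control flow testable without network access.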
Understanding Verdicts
Each scanned package receives one of the following safety status verdicts:
- Safe: No heuristics associated with malicious behavior were triggered. Safe for developers to install.
- Suspicious: Contains unusual code constructs (e.g. calling subprocesses in setup) that are not definitively malicious but merit user caution.
- Malicious: Code contains verified malware signatures, such as uploading credentials, stealing tokens, or running shell-command backdoors.
- Quarantined: The package has been removed from the official PyPI repository by the index security team.