ScandEval

v17.3.0 suspicious
6.0
Medium Risk

The robust European language model benchmark.

🤖 AI Analysis

Final verdict: SUSPICIOUS

The package runs subprocesses and reads credentials, two behaviors the scanner rates as high risk because they could be misused. However, there are no definitive signs of malicious intent.

  • High shell risk indicating potential for running arbitrary commands
  • High credential risk suggesting possible unauthorized credential access

Per-check LLM notes
  • Network: The network calls are likely for checking internet availability and making HTTP requests, which may be part of the package's functionality.
  • Shell: The shell execution patterns suggest potential use of subprocesses to run external commands, possibly related to hardware detection or other system-specific tasks, but could also indicate risky behavior if not properly documented.
  • Obfuscation: The obfuscation pattern detected seems to be a result of code formatting issues rather than intentional malicious obfuscation.
  • Credentials: The code reads a GITHUB_TOKEN from environment variables and uses getpass for input. This is consistent with legitimate uses such as API authentication, but could indicate an attempt to capture credentials if it is not clearly documented.
  • Metadata: The package shows signs of low maintenance and low-effort metadata, but no clear indicators of malicious intent.

🔬 Heuristic Checks

Outbound Network Calls score 9.0

Found 6 network call pattern(s)

  • = False try: s = socket.create_connection(("1.1.1.1", 80)) s.close() internet_availabl
  • dy as text. """ req = urllib.request.Request(url, headers={"User-Agent": "EuroEval-CoreModels/1"}
  • Eval-CoreModels/1"}) with urllib.request.urlopen(req, timeout=timeout) as resp: return resp.r
  • try: req = urllib.request.Request(url, method="HEAD") with urllib.request.
  • thod="HEAD") with urllib.request.urlopen(req, timeout=10) as resp: length = i
  • not None else None req = urllib.request.Request( url, data=data, method=meth
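
The first pattern above resembles a plain connectivity probe: open a TCP connection to a public address and record whether it succeeds. A minimal sketch of that kind of check follows. The helper name `internet_available`, its parameters, and the timeout are assumptions for illustration; the excerpt is truncated and this is not the package's actual code.

```python
import socket


def internet_available(host: str = "1.1.1.1", port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: name, defaults, and timeout are illustrative assumptions.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A probe like this sends no payload; it only opens and closes a connection, which is why the per-check note reads it as an availability test rather than data exfiltration.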

Code Obfuscation score 2.0

Found 1 obfuscation pattern(s)

  • re[redundant-cast] model.eval() model.to(benchmark_config.device) # ty: ignore[invali

Shell / Subprocess Execution score 6.0

Found 3 shell execution pattern(s)

  • pty() try: proc = subprocess.Popen( # noqa: S603 cmd, stdin=slave_fd,
  • one try: result = subprocess.run( # noqa: S603, S607 [ "nvidia-s
  • n) master_fd, slave_fd = pty.openpty() try: proc = subprocess.Popen( # noqa: S603
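
When reviewing findings like these, the usual question is whether the command is passed as an argument list (no shell interpolation) and whether failures are handled. A minimal sketch of that lower-risk pattern follows. The helper name `run_probe` and its parameters are hypothetical and not taken from the package.

```python
import subprocess
from typing import List, Optional


def run_probe(cmd: List[str], timeout: float = 10.0) -> Optional[str]:
    """Run cmd without a shell; return its stdout, or None if it cannot run or fails.

    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            cmd,                  # argument list, so no shell parsing of the input
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output to str
            timeout=timeout,      # avoid hanging on an unresponsive tool
            check=True,           # raise CalledProcessError on a non-zero exit
        )
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, non-zero exit, or timeout.
        return None
    return result.stdout
```

Code that matches this shape, with a fixed command and no user-controlled strings, is typically benign; a `shell=True` call built from external input would deserve closer inspection.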

Credential Harvesting score 5.0

Found 2 credential access pattern(s)

  • variable. """ token = os.environ.get("GITHUB_TOKEN") if not token: logger.error("GITHUB_TOKEN env
  • ret: reader = lambda: getpass.getpass(f"{prompt_text}: ") # noqa: E731 else: reader =
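
Both patterns are common in tools that authenticate against an API: read a token from the environment, and otherwise prompt for it without echoing. A minimal sketch of that flow follows. The function name `read_token` and its parameters are assumptions for illustration, not the package's code.

```python
import os
from typing import Callable, Mapping, Optional


def read_token(
    env: Mapping[str, str] = os.environ,
    var: str = "GITHUB_TOKEN",
    prompt: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return the token from env, else from prompt, else None.

    Hypothetical helper: the token is only returned to the caller,
    never logged or written to disk.
    """
    token = env.get(var)
    if token:
        return token
    if prompt is None:
        # Non-interactive use: report absence instead of blocking on input.
        return None
    # An empty answer is treated the same as no token.
    return prompt(f"{var}: ") or None
```

For interactive use, `getpass.getpass` from the standard library could be passed as `prompt`. What separates legitimate use from harvesting is where the value goes afterwards, so a reviewer would trace the token to its network destination.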

Typosquatting

No typosquatting candidates detected

Registered Email Domain

Email domain looks legitimate: alexandra.dk

Suspicious Page Links

All external links appear legitimate

Git Repository History

Repository EuroEval/EuroEval appears legitimate

Maintainer History score 6.0

3 maintainer concern(s) found

  • Author name is missing or very short
  • Author "" appears to have only 1 package on PyPI (new or inactive account)
  • Package has no PyPI classifiers (low effort / metadata quality)

Known CVE Vulnerabilities

No known vulnerabilities found in OSV database.

💡 AI App Starter Prompt

Use this prompt to build a project with ScandEval
Create a web-based language evaluation tool using the ScandEval Python package. The tool will let users test and compare language models across European languages, with a focus on Nordic and other less commonly evaluated languages. Users should be able to upload their own datasets or select predefined ones, specify which language models to evaluate, and choose from tasks such as sentiment analysis, named entity recognition, and machine translation. The tool should also visualize the evaluation results so that models and tasks are easy to compare. Build the backend with Flask and the frontend with React, with a responsive and user-friendly interface. Use ScandEval's benchmarking capabilities to automate the evaluation process and include its scoring metrics in the application's output. Ensure that the application is scalable and can handle multiple concurrent evaluations.