RDTextract

v0.2.0 suspicious
4.0
Medium Risk

HTML→Markdown extractor optimized for LLM training corpora — zero noise artifacts, integrated quality scoring, low-value stub detection.

🤖 AI Analysis

Final verdict: SUSPICIOUS

The package scores low on network, shell, obfuscation, and credential risk, but its metadata raises concerns: the author details are incomplete and the account appears to be new or inactive. This combination warrants caution but does not conclusively indicate a supply-chain attack.

  • Low network, shell, obfuscation, and credential risks.
  • Metadata risk due to incomplete author details and potentially inactive account.

Per-check LLM notes
  • Network: No network calls detected, which is normal unless the package requires external services.
  • Shell: No shell execution patterns detected, indicating no immediate risk from command execution.
  • Obfuscation: No obfuscation patterns detected, indicating low risk of malicious activity.
  • Credentials: No credential harvesting patterns detected, suggesting no immediate threat to secrets or credentials.
  • Metadata: The author's details are incomplete and the account seems new or inactive, raising some suspicion but not conclusive evidence of malice.

🔬 Heuristic Checks

Outbound Network Calls

No suspicious network call patterns found

Code Obfuscation

No obfuscation patterns detected

Shell / Subprocess Execution

No shell execution patterns detected

Credential Harvesting

No credential harvesting patterns detected

Typosquatting

No typosquatting candidates detected

Registered Email Domain

Email domain looks legitimate: rdtvlokip.fr

Suspicious Page Links

All external links appear legitimate

Git Repository History

Repository RDTvlokip/RDTextract appears legitimate

Maintainer History score 4.0

2 maintainer concern(s) found

  • Author name is missing or very short
  • Author "" appears to have only 1 package on PyPI (new or inactive account)

Known CVE Vulnerabilities

No known vulnerabilities found in OSV database.

💡 AI App Starter Prompt

Use this prompt to build a project with RDTextract
Create a web-based utility that extracts and converts content from HTML files into clean Markdown format, optimized specifically for use in large language model (LLM) training datasets. The application should leverage the 'RDTextract' package to ensure the extracted content is free from noise artifacts and includes quality scores for each piece of extracted text. Additionally, the utility should be able to detect and exclude low-value content such as advertisements, navigation bars, and other non-essential elements.

Steps to build the utility:
1. Set up a Flask backend server that allows users to upload HTML files.
2. Integrate the 'RDTextract' package within your Flask app to process the uploaded HTML files.
3. Implement a feature that displays the extracted Markdown content on a separate page, alongside quality scores for each segment.
4. Add functionality to allow users to download the cleaned Markdown file directly from the application.
5. Include a feature that highlights or marks sections of the HTML that have been identified as low-value content.
6. Ensure the application has a user-friendly interface with clear instructions on how to use it.

Suggested Features:
- User authentication to track individual usage statistics.
- A history section where users can view their previously processed files.
- An option to manually review and adjust the quality scores of specific segments.
- Integration with popular cloud storage services for easy file retrieval.

How 'RDTextract' is Utilized:
- Use 'RDTextract' to process the uploaded HTML files, extracting only the valuable content and generating quality scores for each piece of text.
- Apply the low-value detection capabilities of 'RDTextract' to identify and exclude non-essential parts of the HTML.
- Display the extracted content in a structured Markdown format, ensuring the final output is ready for LLM training purposes.
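
The filtering described in the points above can be sketched independently of the extractor. The `Segment` type, the 0.0 to 1.0 score range, and the `min_quality` threshold are assumptions made for illustration. This page does not say how RDTextract represents segments or scores, so the real output would need to be mapped onto this shape first.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    markdown: str
    quality: float  # ASSUMED range 0.0-1.0; the real score scale is not documented here


def build_corpus_document(segments, min_quality=0.5):
    """Split segments by score and join the kept ones into one Markdown document.

    Returns (document, dropped) so that low-value segments can still be
    shown to the user for review instead of vanishing silently.
    """
    kept = [s for s in segments if s.quality >= min_quality]
    dropped = [s for s in segments if s.quality < min_quality]
    document = "\n\n".join(s.markdown.strip() for s in kept)
    return document, dropped


if __name__ == "__main__":
    sample = [
        Segment("# Guide", 0.9),
        Segment("Click here to subscribe", 0.1),
        Segment("Body paragraph with real content.", 0.8),
    ]
    doc, low_value = build_corpus_document(sample)
    print(doc)
    print(f"{len(low_value)} low-value segment(s) excluded")
```

Returning the dropped segments alongside the document supports the suggested manual review feature, since a reviewer can inspect what was excluded and adjust the threshold.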