amazon-textract-haystack

v1.0.0 safe
3.0
Low Risk

Haystack integration for AWS Textract document text extraction and analysis

🤖 AI Analysis

Final verdict: SAFE

The package shows low risk indicators with no network or shell risks and minimal obfuscation. While there is a slight concern regarding metadata and credentials, these do not strongly suggest malicious intent.

  • No network or shell risks detected.
  • Low obfuscation risk.
Per-check LLM notes
  • Network: No network calls detected, which is normal for packages that don't require external services.
  • Shell: No shell execution patterns detected, indicating no direct system command execution.
  • Obfuscation: No obfuscation patterns detected.
  • Credentials: The observed patterns are likely for conditional skipping of tests based on environment variables, not for credential harvesting.
  • Metadata: The package is new and lacks maintainer history, which raises some concerns but does not definitively indicate malice.

📦 Package Quality Overall: Medium (6.2/10)

✦ High Test Suite 9.0

Test suite present: 3 test file(s) found

  • Test runner config found: pyproject.toml
  • 3 test file(s) detected (e.g. __init__.py)
◈ Medium Documentation 7.0

Some documentation present

  • Documentation URL: "Documentation" -> https://github.com/deepset-ai/haystack-core-integrations/tre
  • Detailed PyPI description (4145 chars)
○ Low Contributing Guide 4.0

No contributing guide or governance files found

  • Development Status classifier >= Beta
○ Low Type Annotations 1.0

No type annotations detected

  • No type annotations, py.typed marker, or stub files detected
✦ High Multiple Contributors 10.0

Active multi-contributor project

  • 16 unique contributor(s) across 100 commits in deepset-ai/haystack-core-integrations
  • Active community: 5 or more distinct contributors

🔬 Heuristic Checks

✓ Outbound Network Calls

No suspicious network call patterns found

✓ Code Obfuscation

No obfuscation patterns detected

✓ Shell / Subprocess Execution

No shell execution patterns detected

⚠ Credential Harvesting score 5.0

Found 2 credential access pattern(s)

  • @pytest.mark.skipif(not os.environ.get("AWS_ACCESS_KEY_ID"), reason=SKIP_REASON_NO_CREDENTIALS) @pyt
  • ) @pytest.mark.skipif(not os.environ.get("AWS_DEFAULT_REGION"), reason=SKIP_REASON_NO_REGION) def test
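
The two flagged patterns read environment variables only to decide whether an integration test should run. A minimal sketch of that gating logic follows, written as a plain function so it runs without pytest installed. The function name `skip_reason` and the reason strings are hypothetical; the report shows only the constant names `SKIP_REASON_NO_CREDENTIALS` and `SKIP_REASON_NO_REGION`, not their values.

```python
import os
from typing import Mapping, Optional

# Hypothetical reason strings: the report shows the constant names only,
# not the text the package's test suite actually uses.
SKIP_REASON_NO_CREDENTIALS = "AWS credentials are not set"
SKIP_REASON_NO_REGION = "AWS region is not set"


def skip_reason(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return why an integration test should be skipped, or None to run it.

    The environment is only inspected for presence; no value is stored,
    logged, or sent anywhere.
    """
    if not env.get("AWS_ACCESS_KEY_ID"):
        return SKIP_REASON_NO_CREDENTIALS
    if not env.get("AWS_DEFAULT_REGION"):
        return SKIP_REASON_NO_REGION
    return None
```

In a pytest suite the same condition would typically be passed to `pytest.mark.skipif`, as in the flagged snippets above.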
✓ Typosquatting

No typosquatting candidates detected

✓ Registered Email Domain

Email domain looks legitimate: deepset.ai

✓ Suspicious Page Links

All external links appear legitimate

✓ Git Repository History

Repository deepset-ai/haystack-core-integrations appears legitimate

⚠ Maintainer History score 6.0

3 maintainer concern(s) found

  • Only one version has ever been released (brand-new package)
  • Author name is missing or very short
  • Author "" appears to have only 1 package on PyPI (new or inactive account)
✓ Known CVE Vulnerabilities

No known vulnerabilities found in OSV database.

💡 AI App Starter Prompt

Use this prompt to build a project with amazon-textract-haystack
Create a Python-based mini-application called 'DocAnalyzer' that leverages the 'amazon-textract-haystack' package to analyze scanned documents. This application will allow users to upload PDF files containing scanned text and then perform various operations on the extracted data such as searching for specific keywords, extracting tables, and summarizing the content. Here's a detailed breakdown of the project requirements:

1. **User Interface**: Develop a simple command-line interface (CLI) where users can interact with the application. The CLI should provide options like uploading a document, searching for text, extracting tables, and generating summaries.
2. **Document Upload**: Implement functionality to accept PDF uploads from local storage or via a URL. Ensure that the application supports both single-page and multi-page PDFs.
3. **Text Extraction**: Utilize 'amazon-textract-haystack' to extract text from the uploaded documents. The package should handle the conversion of scanned text into searchable text using AWS Textract services.
4. **Keyword Search**: Allow users to search for specific keywords within the extracted text. Provide an option to display the sentences or paragraphs containing these keywords.
5. **Table Extraction**: Implement a feature to identify and extract tables from the document. Users should be able to view the extracted table data in a structured format (e.g., CSV).
6. **Content Summary**: Generate a summary of the document's content. Use natural language processing techniques to create concise summaries that capture the essence of the document.
7. **Error Handling**: Include robust error handling mechanisms to manage issues such as unsupported file formats, connection errors, and timeouts.
8. **Configuration Management**: Enable users to configure settings such as AWS credentials and region for the Textract service, and preferred output formats for extracted data.
9. **Testing and Documentation**: Write unit tests to ensure the reliability of each feature. Provide comprehensive documentation detailing how to install and use the application, along with examples.
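
As an illustration of requirement 4, the sketch below searches already-extracted text for a keyword and returns the sentences that contain it. It assumes the text has been obtained beforehand (for example from the extraction step) and is passed in as a plain string; the function name `find_sentences` and the naive punctuation-based sentence splitting are illustrative choices, not part of the package.

```python
import re
from typing import List


def find_sentences(text: str, keyword: str) -> List[str]:
    """Return the sentences of `text` that contain `keyword`, ignoring case."""
    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    # OCR output with abbreviations or missing punctuation needs a real tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = keyword.casefold()
    return [s for s in sentences if needle in s.casefold()]


if __name__ == "__main__":
    sample = "Invoice total is 42 USD. Payment due in 30 days. Contact billing about the invoice."
    for hit in find_sentences(sample, "invoice"):
        print(hit)
```

Returning whole sentences rather than character offsets keeps the CLI output readable; paragraph-level matching could be added by splitting on blank lines instead.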

The 'amazon-textract-haystack' package plays a crucial role in this project by providing the necessary tools to integrate AWS Textract functionalities into the application. It simplifies the process of text extraction from scanned documents, making it easier to implement advanced features like keyword search and table extraction.
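
For the CSV output mentioned in requirement 5, a minimal sketch using only the standard library is shown below. It assumes a table has already been reduced to a list of rows, each a list of cell strings; how Textract table results are mapped to that shape is not covered here, and `table_to_csv` is a hypothetical helper name.

```python
import csv
import io
from typing import Sequence


def table_to_csv(rows: Sequence[Sequence[str]]) -> str:
    """Serialize table rows to CSV text, quoting cells only where needed."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(table_to_csv([["Item", "Price"], ["Widget, large", "9.99"]]), end="")
```

Using `csv.writer` rather than joining cells by hand means cells that contain commas or quotes are escaped correctly.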