Spaces:

vikee
/

chagu-dev

talexm committed on Nov 30, 2024

Commit

e893d68

1 Parent(s): 0c3cda8

adding blockchain logger

Files changed (4)

rag_sec/README.md +251 -11
rag_sec/backup.py +79 -0
rag_sec/document_search_system.py +147 -22
screenshots/Screenshot from 2024-11-30 19-01-31.png +0 -0

rag_sec/README.md CHANGED

@@ -1,13 +1,43 @@
-## Workflow
 The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:
-### 1. **Input Query**
 - A user provides a query that can be a general question, ambiguous statement, or potentially malicious intent.
 ---
-### 2. **Detection Module**
 - **Purpose**: Classify the query as "bad" or "good."
 - **Steps**:
   1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
@@ -16,7 +46,7 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
 ---
-### 3. **Transformation Module**
 - **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
 - **Steps**:
   1. Identify missing context or ambiguous phrasing.
@@ -27,7 +57,7 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
 ---
-### 4. **RAG Pipeline**
 - **Purpose**: Retrieve relevant data and generate a context-aware response.
 - **Steps**:
   1. **Document Retrieval**:
@@ -40,7 +70,7 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
 ---
-### 5. **Semantic Response Generation**
 - **Purpose**: Provide a concise and meaningful answer.
 - **Steps**:
   1. Combine the retrieved documents into a coherent context.
@@ -49,9 +79,219 @@ The system follows a well-structured workflow to ensure accurate, secure, and co
 ---
-### End-to-End Example
-#### Input Query:
-```plaintext
-"How to improve acting skills?"
-````

+# **Document Search System**
+## **Overview**
+The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.
+---
+## **Features**
+1. **Query Classification:**
+   - Detects malicious or inappropriate queries using a sentiment analysis model.
+   - Blocks malicious queries and prevents them from further processing.
+2. **Query Transformation:**
+   - Rephrases or enhances ambiguous queries to improve retrieval accuracy.
+   - Uses rule-based transformations and advanced text-to-text models.
+3. **RAG Pipeline:**
+   - Retrieves top-k documents based on semantic similarity.
+   - Generates context-aware responses using generative models.
+4. **Blockchain Integration (Chagu):**
+   - Logs all stages of query processing into a blockchain for integrity and traceability.
+   - Validates blockchain integrity.
+5. **Neo4j Integration:**
+   - Stores and visualizes relationships between queries, responses, and documents.
+   - Allows detailed querying and visualization of the data flow.
+---
+## **Workflow**
 The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:
+### **1. Input Query**
 - A user provides a query that can be a general question, ambiguous statement, or potentially malicious intent.
 ---
+### **2. Detection Module**
 - **Purpose**: Classify the query as "bad" or "good."
 - **Steps**:
   1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
 ---
+### **3. Transformation Module**
 - **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
 - **Steps**:
   1. Identify missing context or ambiguous phrasing.
 ---
+### **4. RAG Pipeline**
 - **Purpose**: Retrieve relevant data and generate a context-aware response.
 - **Steps**:
   1. **Document Retrieval**:
 ---
+### **5. Semantic Response Generation**
 - **Purpose**: Provide a concise and meaningful answer.
 - **Steps**:
   1. Combine the retrieved documents into a coherent context.
 ---
+### **6. Logging and Storage**
+- **Blockchain Logging:**
+  - Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
+  - Ensures data integrity and tamper-proof records.
+- **Neo4j Storage:**
+  - Relationships between queries, responses, and retrieved documents are stored in Neo4j.
+  - Enables detailed analysis and graph-based visualization.
+---
+## **Neo4j Visualization**
+Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:
+![Neo4j Visualization](../screenshots/Screenshot%20from%202024-11-30%2019-01-31.png)
+- **Nodes**:
+  - Query: Represents the user query.
+  - TransformedQuery: Rephrased or improved query.
+  - Document: Relevant documents retrieved based on the query.
+  - Response: The generated response.
+- **Relationships**:
+  - `RETRIEVED`: Links the query to retrieved documents.
+  - `TRANSFORMED_TO`: Links the original query to the transformed query.
+  - `GENERATED`: Links the query to the generated response.
+---
+## **Setup Instructions**
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/your-repo/document-search-system.git
+   ```
+# **Document Search System**
+## **Overview**
+The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.
+---
+## **Features**
+1. **Query Classification:**
+   - Detects malicious or inappropriate queries using a sentiment analysis model.
+   - Blocks malicious queries and prevents them from further processing.
+2. **Query Transformation:**
+   - Rephrases or enhances ambiguous queries to improve retrieval accuracy.
+   - Uses rule-based transformations and advanced text-to-text models.
+3. **RAG Pipeline:**
+   - Retrieves top-k documents based on semantic similarity.
+   - Generates context-aware responses using generative models.
+4. **Blockchain Integration (Chagu):**
+   - Logs all stages of query processing into a blockchain for integrity and traceability.
+   - Validates blockchain integrity.
+5. **Neo4j Integration:**
+   - Stores and visualizes relationships between queries, responses, and documents.
+   - Allows detailed querying and visualization of the data flow.
+---
+## **Workflow**
+The system follows a well-structured workflow to ensure accurate, secure, and context-aware responses to user queries:
+### **1. Input Query**
+- A user provides a query that can be a general question, ambiguous statement, or potentially malicious intent.
+---
+### **2. Detection Module**
+- **Purpose**: Classify the query as "bad" or "good."
+- **Steps**:
+  1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
+  2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and provide a warning message.
+  3. If "good," proceed to the **Transformation Module**.
+---
+### **3. Transformation Module**
+- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
+- **Steps**:
+  1. Identify missing context or ambiguous phrasing.
+  2. Transform the query using:
+     - Rule-based transformations for simple fixes.
+     - Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
+  3. Pass the transformed query to the **RAG Pipeline**.
+---
+### **4. RAG Pipeline**
+- **Purpose**: Retrieve relevant data and generate a context-aware response.
+- **Steps**:
+  1. **Document Retrieval**:
+     - Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
+     - Compute semantic similarity between the query and stored documents.
+     - Retrieve the top-k documents relevant to the query.
+  2. **Response Generation**:
+     - Use the retrieved documents as context.
+     - Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.
+---
+### **5. Semantic Response Generation**
+- **Purpose**: Provide a concise and meaningful answer.
+- **Steps**:
+  1. Combine the retrieved documents into a coherent context.
+  2. Generate a response tailored to the query using the generative model.
+  3. Return the response to the user, ensuring clarity and relevance.
+---
+### **6. Logging and Storage**
+- **Blockchain Logging:**
+  - Each query, transformed query, response, and document retrieval stage is logged into the blockchain for traceability.
+  - Ensures data integrity and tamper-proof records.
+- **Neo4j Storage:**
+  - Relationships between queries, responses, and retrieved documents are stored in Neo4j.
+  - Enables detailed analysis and graph-based visualization.
+---
+## **Neo4j Visualization**
+Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:
+![Neo4j Visualization](../screenshots/Screenshot%20from%202024-11-30%2019-01-31.png)
+- **Nodes**:
+  - Query: Represents the user query.
+  - TransformedQuery: Rephrased or improved query.
+  - Document: Relevant documents retrieved based on the query.
+  - Response: The generated response.
+- **Relationships**:
+  - `RETRIEVED`: Links the query to retrieved documents.
+  - `TRANSFORMED_TO`: Links the original query to the transformed query.
+  - `GENERATED`: Links the query to the generated response.
+---
+## **Setup Instructions**
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/your-repo/document-search-system.git
+   ```
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+3. Initialize the Neo4j database:
+   - Connect to your Neo4j Aura instance.
+   - Set up credentials in the code.
+4. Load the dataset:
+   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
+5. Run the system:
+   ```bash
+   python document_search_system.py
+   ```
+### Neo4j Queries
+#### Retrieve All Queries Logged
+```cypher
+MATCH (q:Query)
+RETURN q.text AS query, q.timestamp AS timestamp
+ORDER BY timestamp DESC
+```
+#### Visualize Query Relationships
+```cypher
+MATCH (n)-[r]->(m)
+RETURN n, r, m
+```
+#### Find Documents for a Query
+```cypher
+MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
+RETURN d.name AS document_name
+```
+### Key Technologies
+- **Machine Learning Models:**
+  - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
+  - `google/flan-t5-small` for query transformation.
+  - `distilgpt2` for response generation.
+- **Vector Similarity Search:**
+  - `all-MiniLM-L6-v2` embeddings for document retrieval.
+- **Blockchain Logging:**
+  - Powered by `chainguard.blockchain_logger`.
+- **Graph-Based Storage:**
+  - Relationships visualized and queried via Neo4j.
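
The README credits `chainguard.blockchain_logger` with tamper-evident logging, but the diff does not show how that logger works. As a rough illustration only, and not chainguard's actual implementation, a hash-chained log could look like the sketch below. The class name `HashChainLog`, its internals, and the keys of the returned dictionary are hypothetical; only the method names `log_data` and `is_blockchain_valid` mirror the calls made in `document_search_system.py`.

```python
import hashlib
import json


class HashChainLog:
    """Hypothetical stand-in for a blockchain logger: each block stores the previous block's hash."""

    def __init__(self):
        self.chain = []

    @staticmethod
    def _digest(index, data, previous_hash):
        # Serialize deterministically so the same block always hashes the same way.
        payload = json.dumps(
            {"index": index, "data": data, "previous_hash": previous_hash},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def log_data(self, data):
        previous_hash = self.chain[-1]["hash"] if self.chain else "0" * 64
        index = len(self.chain)
        block = {
            "index": index,
            "data": data,
            "previous_hash": previous_hash,
            "hash": self._digest(index, data, previous_hash),
        }
        self.chain.append(block)
        # Return shape is an assumption, modeled on the DataTransformer docstring.
        return {"block_hash": block["hash"], "blockchain_length": len(self.chain)}

    def is_blockchain_valid(self):
        expected_previous = "0" * 64
        for block in self.chain:
            # Each block must point at the hash of the block before it.
            if block["previous_hash"] != expected_previous:
                return False
            # Recompute the hash to detect edits to the stored data.
            if block["hash"] != self._digest(block["index"], block["data"], block["previous_hash"]):
                return False
            expected_previous = block["hash"]
        return True


log = HashChainLog()
log.log_data({"type": "query", "content": "Tell me about great acting performances."})
log.log_data({"type": "response", "content": "..."})
print(log.is_blockchain_valid())

# Tampering with an earlier block breaks validation.
log.chain[0]["data"]["content"] = "edited"
print(log.is_blockchain_valid())
```

In this sketch, validation passes for the untouched chain and fails once the first block's data is edited, because the recomputed hash no longer matches the stored one.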

rag_sec/backup.py ADDED

	@@ -0,0 +1,79 @@

+import os
+from pathlib import Path
+from .bad_query_detector import BadQueryDetector
+from .query_transformer import QueryTransformer
+from .document_retriver import DocumentRetriever
+from .senamtic_response_generator import SemanticResponseGenerator
+class DocumentSearchSystem:
+    def __init__(self):
+        """
+        Initializes the DocumentSearchSystem with:
+        - BadQueryDetector for identifying malicious or inappropriate queries.
+        - QueryTransformer for improving or rephrasing queries.
+        - DocumentRetriever for semantic document retrieval.
+        - SemanticResponseGenerator for generating context-aware responses.
+        """
+        self.detector = BadQueryDetector()
+        self.transformer = QueryTransformer()
+        self.retriever = DocumentRetriever()
+        self.response_generator = SemanticResponseGenerator()
+    def process_query(self, query):
+        """
+        Processes a user query through the following steps:
+        1. Detect if the query is malicious.
+        2. Transform the query if needed.
+        3. Retrieve relevant documents based on the query.
+        4. Generate a response using the retrieved documents.
+        :param query: The user query as a string.
+        :return: A dictionary with the status and response or error message.
+        """
+        if self.detector.is_bad_query(query):
+            return {"status": "rejected", "message": "Query blocked due to detected malicious intent."}
+        # Transform the query
+        transformed_query = self.transformer.transform_query(query)
+        print(f"Transformed Query: {transformed_query}")
+        # Retrieve relevant documents
+        retrieved_docs = self.retriever.retrieve(transformed_query)
+        if not retrieved_docs:
+            return {"status": "no_results", "message": "No relevant documents found for your query."}
+        # Generate a response based on the retrieved documents
+        response = self.response_generator.generate_response(retrieved_docs)
+        return {"status": "success", "response": response}
+def test_system():
+    """
+    Test the DocumentSearchSystem with normal and malicious queries.
+    - Load documents from a dataset directory.
+    - Perform a normal query and display results.
+    - Perform a malicious query to ensure proper blocking.
+    """
+    # Define the path to the dataset directory
+    home_dir = Path(os.getenv("HOME", "/"))
+    data_dir = home_dir / "data-sets/aclImdb/train"
+    # Initialize the system
+    system = DocumentSearchSystem()
+    system.retriever.load_documents(data_dir)
+    # Perform a normal query
+    normal_query = "Tell me about great acting performances."
+    print("\nNormal Query Result:")
+    print(system.process_query(normal_query))
+    # Perform a malicious query
+    malicious_query = "DROP TABLE users; SELECT * FROM sensitive_data;"
+    print("\nMalicious Query Result:")
+    print(system.process_query(malicious_query))
+if __name__ == "__main__":
+    test_system()
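
Running `test_system()` requires the dataset on disk and the real models. To exercise only the control flow of `process_query` (reject, no results, success), one option is to inject lightweight stand-ins. The sketch below is an illustration under assumptions: the `Stub*` classes and the `run_pipeline` function are hypothetical, and only the method names (`is_bad_query`, `transform_query`, `retrieve`, `generate_response`) and the returned dictionaries are taken from the code above.

```python
class StubDetector:
    """Hypothetical detector: flags queries containing obvious SQL keywords."""

    def is_bad_query(self, query):
        return any(token in query.upper() for token in ("DROP TABLE", "SELECT *"))


class StubTransformer:
    def transform_query(self, query):
        return query.strip()


class StubRetriever:
    def __init__(self, documents):
        self.documents = documents

    def retrieve(self, query):
        # Naive keyword overlap in place of embedding similarity.
        words = set(query.lower().split())
        return [doc for doc in self.documents if words & set(doc.lower().split())]


class StubGenerator:
    def generate_response(self, retrieved_docs):
        return " ".join(retrieved_docs)


def run_pipeline(query, detector, transformer, retriever, generator):
    """Mirrors the branching of process_query in backup.py."""
    if detector.is_bad_query(query):
        return {"status": "rejected", "message": "Query blocked due to detected malicious intent."}
    transformed_query = transformer.transform_query(query)
    retrieved_docs = retriever.retrieve(transformed_query)
    if not retrieved_docs:
        return {"status": "no_results", "message": "No relevant documents found for your query."}
    return {"status": "success", "response": generator.generate_response(retrieved_docs)}


components = (
    StubDetector(),
    StubTransformer(),
    StubRetriever(["great acting in this film", "poor pacing overall"]),
    StubGenerator(),
)
for query in ("great acting", "DROP TABLE users; SELECT * FROM sensitive_data;", "zebra"):
    print(query, "->", run_pipeline(query, *components)["status"])
```

Passing the components in as arguments, rather than constructing them inside the class, is what makes this kind of substitution possible; the committed `DocumentSearchSystem` builds its own components in `__init__`.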

rag_sec/document_search_system.py CHANGED

@@ -1,25 +1,123 @@
 import os
 from pathlib import Path
-from .bad_query_detector import BadQueryDetector
-from .query_transformer import QueryTransformer
-from .document_retriver import DocumentRetriever
-from .senamtic_response_generator import SemanticResponseGenerator
-class DocumentSearchSystem:
     def __init__(self):
         """
         Initializes the DocumentSearchSystem with:
         - BadQueryDetector for identifying malicious or inappropriate queries.
         - QueryTransformer for improving or rephrasing queries.
         - DocumentRetriever for semantic document retrieval.
         - SemanticResponseGenerator for generating context-aware responses.
         """
         self.detector = BadQueryDetector()
         self.transformer = QueryTransformer()
         self.retriever = DocumentRetriever()
         self.response_generator = SemanticResponseGenerator()
     def process_query(self, query):
         """
@@ -28,6 +126,7 @@ class DocumentSearchSystem:
         2. Transform the query if needed.
         3. Retrieve relevant documents based on the query.
         4. Generate a response using the retrieved documents.
         :param query: The user query as a string.
         :return: A dictionary with the status and response or error message.
@@ -37,43 +136,69 @@ class DocumentSearchSystem:
         # Transform the query
         transformed_query = self.transformer.transform_query(query)
-        print(f"Transformed Query: {transformed_query}")
         # Retrieve relevant documents
         retrieved_docs = self.retriever.retrieve(transformed_query)
         if not retrieved_docs:
             return {"status": "no_results", "message": "No relevant documents found for your query."}
         # Generate a response based on the retrieved documents
         response = self.response_generator.generate_response(retrieved_docs)
-        return {"status": "success", "response": response}
-def test_system():
-    """
-    Test the DocumentSearchSystem with normal and malicious queries.
-    - Load documents from a dataset directory.
-    - Perform a normal query and display results.
-    - Perform a malicious query to ensure proper blocking.
-    """
-    # Define the path to the dataset directory
     home_dir = Path(os.getenv("HOME", "/"))
     data_dir = home_dir / "data-sets/aclImdb/train"
-    # Initialize the system
-    system = DocumentSearchSystem()
-    system.retriever.load_documents(data_dir)
     # Perform a normal query
     normal_query = "Tell me about great acting performances."
     print("\nNormal Query Result:")
-    print(system.process_query(normal_query))
     # Perform a malicious query
     malicious_query = "DROP TABLE users; SELECT * FROM sensitive_data;"
     print("\nMalicious Query Result:")
-    print(system.process_query(malicious_query))
-if __name__ == "__main__":
-    test_system()

 import os
 from pathlib import Path
+from chainguard.blockchain_logger import BlockchainLogger
+from neo4j import GraphDatabase
+import sys
+from os import path
+sys.path.append(path.dirname(path.dirname(path.abspath(__file__))))
+from bad_query_detector import BadQueryDetector
+from query_transformer import QueryTransformer
+from document_retriver import DocumentRetriever
+from senamtic_response_generator import SemanticResponseGenerator
+class DataTransformer:
     def __init__(self):
+        """
+        Initializes a DataTransformer with a blockchain logger instance.
+        """
+        self.blockchain_logger = BlockchainLogger()
+    def secure_transform(self, data):
+        """
+        Securely transforms the input data by logging it into the blockchain.
+        Args:
+            data (dict): The log data or any data to be securely transformed.
+        Returns:
+            dict: A dictionary containing the original data, block hash, and blockchain length.
+        """
+        # Log the data into the blockchain
+        block_details = self.blockchain_logger.log_data(data)
+        # Return the block details and blockchain status
+        return {
+            "data": data,
+            **block_details
+        }
+    def validate_blockchain(self):
+        """
+        Validates the integrity of the blockchain.
+        Returns:
+            bool: True if the blockchain is valid, False otherwise.
+        """
+        return self.blockchain_logger.is_blockchain_valid()
+class Neo4jHandler:
+    def __init__(self, uri, user, password):
+        """
+        Initializes a Neo4j handler for storing and querying relationships.
+        """
+        self.driver = GraphDatabase.driver(uri, auth=(user, password))
+    def close(self):
+        self.driver.close()
+    def log_relationships(self, query, transformed_query, response, documents):
+        """
+        Logs the relationships between queries, responses, and documents into Neo4j.
+        """
+        with self.driver.session() as session:
+            session.write_transaction(self._create_and_link_nodes, query, transformed_query, response, documents)
+    @staticmethod
+    def _create_and_link_nodes(tx, query, transformed_query, response, documents):
+        # Create Query node
+        tx.run("MERGE (q:Query {text: $query}) RETURN q", parameters={"query": query})
+        # Create TransformedQuery node
+        tx.run("MERGE (t:TransformedQuery {text: $transformed_query}) RETURN t",
+               parameters={"transformed_query": transformed_query})
+        # Create Response node
+        tx.run("MERGE (r:Response {text: $response}) RETURN r", parameters={"response": response})
+        # Link Query to TransformedQuery and Response
+        tx.run(
+            """
+            MATCH (q:Query {text: $query}), (t:TransformedQuery {text: $transformed_query})
+            MERGE (q)-[:TRANSFORMED_TO]->(t)
+            """, parameters={"query": query, "transformed_query": transformed_query}
+        )
+        tx.run(
+            """
+            MATCH (q:Query {text: $query}), (r:Response {text: $response})
+            MERGE (q)-[:GENERATED]->(r)
+            """, parameters={"query": query, "response": response}
+        )
+        # Create and link Document nodes
+        for doc in documents:
+            tx.run("MERGE (d:Document {name: $doc}) RETURN d", parameters={"doc": doc})
+            tx.run(
+                """
+                MATCH (q:Query {text: $query}), (d:Document {name: $doc})
+                MERGE (q)-[:RETRIEVED]->(d)
+                """, parameters={"query": query, "doc": doc}
+            )
+class DocumentSearchSystem:
+    def __init__(self, neo4j_uri, neo4j_user, neo4j_password):
         """
         Initializes the DocumentSearchSystem with:
         - BadQueryDetector for identifying malicious or inappropriate queries.
         - QueryTransformer for improving or rephrasing queries.
         - DocumentRetriever for semantic document retrieval.
         - SemanticResponseGenerator for generating context-aware responses.
+        - DataTransformer for blockchain logging of queries and responses.
+        - Neo4jHandler for relationship logging and visualization.
         """
         self.detector = BadQueryDetector()
         self.transformer = QueryTransformer()
         self.retriever = DocumentRetriever()
         self.response_generator = SemanticResponseGenerator()
+        self.data_transformer = DataTransformer()
+        self.neo4j_handler = Neo4jHandler(neo4j_uri, neo4j_user, neo4j_password)
     def process_query(self, query):
         """
         2. Transform the query if needed.
         3. Retrieve relevant documents based on the query.
         4. Generate a response using the retrieved documents.
+        5. Log all stages to the blockchain and Neo4j.
         :param query: The user query as a string.
         :return: A dictionary with the status and response or error message.
         # Transform the query
         transformed_query = self.transformer.transform_query(query)
+        # Log the original query to the blockchain
+        self.data_transformer.secure_transform({"type": "query", "content": query})
         # Retrieve relevant documents
         retrieved_docs = self.retriever.retrieve(transformed_query)
         if not retrieved_docs:
             return {"status": "no_results", "message": "No relevant documents found for your query."}
+        # Log the retrieved documents to the blockchain
+        self.data_transformer.secure_transform({"type": "documents", "content": retrieved_docs})
         # Generate a response based on the retrieved documents
         response = self.response_generator.generate_response(retrieved_docs)
+        # Log the response to the blockchain
+        blockchain_details = self.data_transformer.secure_transform({"type": "response", "content": response})
+        # Log relationships to Neo4j
+        self.neo4j_handler.log_relationships(query, transformed_query, response, retrieved_docs)
+        return {
+            "status": "success",
+            "response": response,
+            "retrieved_documents": retrieved_docs,
+            "blockchain_details": blockchain_details
+        }
+    def validate_system_integrity(self):
+        """
+        Validates the integrity of the blockchain.
+        """
+        return self.data_transformer.validate_blockchain()
+if __name__ == "__main__":
     home_dir = Path(os.getenv("HOME", "/"))
     data_dir = home_dir / "data-sets/aclImdb/train"
+    # Initialize system with Neo4j credentials
+    system = DocumentSearchSystem(
+        neo4j_uri="neo4j+s://0ca71b10.databases.neo4j.io",
+        neo4j_user="neo4j",
+        neo4j_password="<PINGME ill provide>"
+    )
+    system.retriever.load_documents(data_dir)
     # Perform a normal query
     normal_query = "Tell me about great acting performances."
     print("\nNormal Query Result:")
+    result = system.process_query(normal_query)
+    print("Status:", result["status"])
+    print("Response:", result["response"])
+    print("Retrieved Documents:", result["retrieved_documents"])
+    print("Blockchain Details:", result["blockchain_details"])
     # Perform a malicious query
     malicious_query = "DROP TABLE users; SELECT * FROM sensitive_data;"
     print("\nMalicious Query Result:")
+    result = system.process_query(malicious_query)
+    print("Status:", result["status"])
+    print("Message:", result.get("message"))
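
The `__main__` block above hardcodes the Neo4j URI and leaves a placeholder where the password should be. A common alternative is to read connection settings from environment variables so that no secret is committed. The following is a minimal sketch; the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD`, the helper `load_neo4j_settings`, and the example URI are assumptions, not something this repository defines.

```python
import os


def load_neo4j_settings(env=None):
    """Collect Neo4j connection settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "neo4j_uri": env["NEO4J_URI"],
        # Falls back to the user name used in the committed code.
        "neo4j_user": env.get("NEO4J_USER", "neo4j"),
        "neo4j_password": env["NEO4J_PASSWORD"],
    }


# The keys match the keyword arguments of DocumentSearchSystem.__init__, so the
# result could be passed as DocumentSearchSystem(**load_neo4j_settings()).
settings = load_neo4j_settings(
    {"NEO4J_URI": "neo4j+s://example.databases.neo4j.io", "NEO4J_PASSWORD": "secret"}
)
print(settings["neo4j_user"])
```

Failing early with a clear error when a variable is missing avoids a less obvious authentication failure later, when the driver first connects.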

screenshots/Screenshot from 2024-11-30 19-01-31.png ADDED