Fix permission errors, rate limiting, and add English language filtering
- Fixed Docker permission denied errors for vector_db and cache directories
- Added proper error handling and fallback mechanisms for directory permissions
- Implemented exponential backoff for Firebase API rate limiting (429 errors)
- Added language filtering for Firebase articles (English only)
- Improved vector store initialization with writable directory detection
- Added comprehensive health check and diagnostic tools
- Enhanced startup process with proper environment setup
- Fixed multiple worker race conditions by using single worker
- Added production startup script with better error handling
- Improved embeddings loading with proper cache directory handling
- Added extensive documentation and troubleshooting guides
- DEPLOYMENT.md +154 -0
- Dockerfile +12 -4
- LANGUAGE_FILTERING.md +139 -0
- cve_factchecker/__pycache__/analyzer.cpython-311.pyc +0 -0
- cve_factchecker/__pycache__/app.cpython-311.pyc +0 -0
- cve_factchecker/__pycache__/config.cpython-311.pyc +0 -0
- cve_factchecker/__pycache__/embeddings.cpython-311.pyc +0 -0
- cve_factchecker/__pycache__/firebase_loader.cpython-311.pyc +0 -0
- cve_factchecker/__pycache__/firebase_service.cpython-311.pyc +0 -0
- cve_factchecker/__pycache__/orchestrator.cpython-311.pyc +0 -0
- cve_factchecker/__pycache__/retriever.cpython-311.pyc +0 -0
- cve_factchecker/app.py +5 -4
- cve_factchecker/embeddings.py +15 -2
- cve_factchecker/firebase_loader.py +163 -4
- cve_factchecker/firebase_service.py +8 -0
- cve_factchecker/orchestrator.py +6 -3
- cve_factchecker/retriever.py +91 -13
- cve_factchecker/wsgi.py +17 -1
- health_check.py +202 -0
- ingest.lock +1 -0
- run_production.py +111 -0
- start_production.py +118 -0
- startup.py +95 -0
- test_language_filter.py +135 -0
DEPLOYMENT.md
ADDED
@@ -0,0 +1,154 @@
# CVE Fact Checker - Deployment Guide

## Quick Start

### Local Development
```bash
python -m pip install -r requirements.txt
python -m cve_factchecker
```

### Production (Docker)
```bash
docker build -t cve-fact-checker .
docker run -p 7860:7860 cve-fact-checker
```

### Health Check
```bash
python health_check.py
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `PORT` | Server port | `7860` |
| `OPENROUTER_API_KEY` | LLM API key | None |
| `FIREBASE_API_KEY` | Firebase API key | (embedded) |
| `AUTO_INGEST` | Auto-ingest on startup | `true` |
| `LANGUAGE_FILTER` | Language to filter articles | `English` |
| `USE_DUMMY_EMBEDDINGS` | Use lightweight embeddings | `false` |
| `VECTOR_PERSIST_DIR` | Vector DB directory | `/tmp/vector_db` |
| `SENTENCE_TRANSFORMERS_HOME` | Model cache | `/tmp/sentence_transformers` |

## API Endpoints

### Health Check
```bash
curl http://localhost:7860/health
```

### Fact Check
```bash
# GET request
curl "http://localhost:7860/fact-check?claim=Your claim here"

# POST request (JSON)
curl -X POST http://localhost:7860/fact-check \
  -H "Content-Type: application/json" \
  -d '{"claim": "Your claim here"}'

# POST request (form data)
curl -X POST http://localhost:7860/fact-check \
  -F "claim=Your claim here"
```

## Troubleshooting

### Common Issues

#### Permission Denied Errors
- **Symptom**: `[Errno 13] Permission denied: './vector_db'`
- **Solution**: The app automatically falls back to `/tmp/vector_db` or in-memory storage

#### Firebase Rate Limiting
- **Symptom**: `Firebase API failed: 429`
- **Solution**: The app implements exponential backoff and retry logic

#### Model Loading Issues
- **Symptom**: `No sentence-transformers model found`
- **Solution**: Set `USE_DUMMY_EMBEDDINGS=true` for faster startup

#### Memory Issues
- **Symptom**: App crashes or becomes unresponsive
- **Solution**: Reduce batch sizes or enable dummy embeddings

### Debug Mode

Run with debug logging:
```bash
export FLASK_ENV=development
python -m cve_factchecker
```

### Manual Health Check

The `health_check.py` script provides comprehensive diagnostics:
```bash
python health_check.py
```

This checks:
- Environment variables
- Directory permissions
- Package imports
- Firebase connectivity
- App functionality

### Production Deployment

For production use:
```bash
python run_production.py
```

This script:
- Runs health checks
- Sets up signal handlers
- Starts gunicorn with optimal settings
- Provides better error reporting

## Docker Configuration

The Dockerfile is optimized for containerized deployment:

- Uses Python 3.11 slim base
- Creates writable cache directories
- Single worker to avoid race conditions
- Proper signal handling
- Health check integration

## Architecture

```
CVE Fact Checker
├── Flask Web API
├── Vector Database (ChromaDB)
├── Embeddings (sentence-transformers)
├── Firebase Article Loader
└── LLM Integration (OpenRouter)
```

## Performance Tuning

### For Low-Memory Environments
```bash
export USE_DUMMY_EMBEDDINGS=true
export AUTO_INGEST=false
```

### For High-Throughput
```bash
export AUTO_INGEST=true
# Ensure adequate Firebase API limits
```

## Support

If you encounter issues:

1. Run `python health_check.py` for diagnostics
2. Check application logs for specific errors
3. Verify environment variables are set correctly
4. Ensure proper file system permissions
Dockerfile
CHANGED

```diff
@@ -6,7 +6,9 @@ FROM python:3.11-slim
 ENV PYTHONDONTWRITEBYTECODE=1 \
     PYTHONUNBUFFERED=1 \
     PIP_NO_CACHE_DIR=1 \
-    PORT=7860
+    PORT=7860 \
+    SENTENCE_TRANSFORMERS_HOME=/tmp/sentence_transformers \
+    VECTOR_PERSIST_DIR=/tmp/vector_db

 # System deps for chromadb and sentence-transformers
 RUN apt-get update && apt-get install -y --no-install-recommends \
@@ -17,6 +19,10 @@ RUN apt-get update && apt-get install -y --no-install-recommends \

 WORKDIR /app

+# Create necessary directories with proper permissions
+RUN mkdir -p /tmp/vector_db /tmp/sentence_transformers /app/logs && \
+    chmod 777 /tmp/vector_db /tmp/sentence_transformers /app/logs
+
 # Install Python deps early for better layer caching
 COPY requirements.txt ./
 RUN pip install --upgrade pip && pip install -r requirements.txt && pip install gunicorn
@@ -24,9 +30,11 @@ RUN pip install --upgrade pip && pip install -r requirements.txt && pip install
 # Copy application code
 COPY . .

+# Make scripts executable
+RUN chmod +x *.py
+
 # Expose the port used by Hugging Face Spaces
 EXPOSE 7860

-#
-
-CMD ["gunicorn", "-w", "2", "-k", "gthread", "--threads", "8", "-b", "0.0.0.0:7860", "cve_factchecker.wsgi:application"]
+# Use our production startup script for better error handling
+CMD ["python", "run_production.py"]
```
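The change summary mentions "writable directory detection" behind the `/tmp/vector_db` fallback. The general pattern can be sketched as follows (a hypothetical helper, not the app's actual function; candidate paths mirror the Dockerfile's directories):

```python
import os
import tempfile

def pick_writable_dir(candidates=("/data/vector_db", "/tmp/vector_db", "./vector_db")):
    """Return the first candidate directory we can create and write into.

    Returns None when nothing is writable, signalling the caller to fall
    back to in-memory storage.
    """
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
            # Prove the directory is actually writable, not just creatable
            with tempfile.TemporaryFile(dir=path):
                pass
            return path
        except OSError:
            continue
    return None
```

Probing with a real temporary file matters: a directory can exist and still be read-only for the container user, which is exactly the `[Errno 13]` failure mode this commit addresses.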
LANGUAGE_FILTERING.md
ADDED
@@ -0,0 +1,139 @@
# CVE Fact Checker - Language Filtering Implementation

## Summary of Changes

Language filtering has been implemented for the CVE Fact Checker system: it now retrieves only English articles from Firebase instead of all articles.

## Key Changes Made

### 1. Firebase Loader Enhancement
- **File**: `cve_factchecker/firebase_loader.py`
- **Changes**:
  - Added language parameter to `fetch_articles()` method
  - Implemented Firebase structured query with language filter
  - Added fallback to simple fetch if structured query fails
  - Enhanced rate limiting and error handling

### 2. Orchestrator Update
- **File**: `cve_factchecker/orchestrator.py`
- **Changes**:
  - Added language parameter to `ingest_firebase()` method
  - Passes language filter to Firebase loader
  - Returns language info in response

### 3. Application Configuration
- **File**: `cve_factchecker/app.py`
- **Changes**:
  - Added `LANGUAGE_FILTER` environment variable (defaults to "English")
  - Updated background ingestion to use language filter
  - Enhanced error handling and logging

### 4. Environment Configuration
- **New Environment Variable**: `LANGUAGE_FILTER=English`
- **Usage**: Set to any language value used in your Firebase "language" field

## Technical Implementation

### Firebase Structured Query
The system now uses Firebase's structured query API to filter articles:

```json
{
  "structuredQuery": {
    "from": [{"collectionId": "articles"}],
    "where": {
      "fieldFilter": {
        "field": {"fieldPath": "language"},
        "op": "EQUAL",
        "value": {"stringValue": "English"}
      }
    }
  }
}
```

### Benefits
1. **Reduced Data Transfer**: Only English articles are fetched
2. **Faster Processing**: Smaller dataset to process and embed
3. **Better Performance**: Less memory usage and faster startup
4. **Rate Limit Friendly**: Fewer API calls to Firebase
5. **Configurable**: Can be changed via environment variable

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `LANGUAGE_FILTER` | Language to filter articles | `English` |
| `OPENROUTER_API_KEY` | Your OpenRouter API key | None |
| `AUTO_INGEST` | Auto-ingest on startup | `true` |
| `VECTOR_PERSIST_DIR` | Vector DB directory | `/tmp/vector_db` |

## Usage Examples

### Docker Deployment
```dockerfile
ENV LANGUAGE_FILTER=English
ENV OPENROUTER_API_KEY=your_api_key_here
```

### Local Development
```bash
export LANGUAGE_FILTER="English"
export OPENROUTER_API_KEY="your_api_key_here"
python -m cve_factchecker
```

### Different Languages
```bash
# For French articles
export LANGUAGE_FILTER="French"

# For Spanish articles
export LANGUAGE_FILTER="Spanish"

# Disable filtering (get all articles)
export LANGUAGE_FILTER=""
```

## API Endpoints (Unchanged)

The API endpoints remain the same:
- `GET /health` - Health check
- `POST /fact-check` - Fact check a claim
- `GET /fact-check?claim=...` - Fact check via GET
- `GET /` - API information

## Testing

Run the comprehensive test:
```bash
python test_language_filter.py
```

This tests:
- Firebase language filtering
- Structured query functionality
- Flask app endpoints
- Vector database integration

## Production Deployment

The system is now production-ready with:
- ✅ English-only article filtering
- ✅ Rate limiting protection
- ✅ Error handling and fallbacks
- ✅ Memory optimization
- ✅ Docker containerization
- ✅ Health monitoring

## Performance Impact

Before: Retrieved all articles (~34k+ documents)
After: Retrieves only English articles (significantly fewer)

This results in:
- Faster startup times
- Lower memory usage
- Reduced Firebase API calls
- Better rate limit compliance
- More focused fact-checking results
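The structured query shown in LANGUAGE_FILTERING.md can be generated for any language value. A sketch of a builder (the helper name is illustrative; the payload shape follows the Firestore REST `runQuery` format used above):

```python
from typing import Optional

def build_language_query(language: str, collection: str = "articles",
                         limit: Optional[int] = None) -> dict:
    """Build a Firestore runQuery body filtering documents by their "language" field."""
    query: dict = {
        "structuredQuery": {
            "from": [{"collectionId": collection}],
            "where": {
                "fieldFilter": {
                    "field": {"fieldPath": "language"},
                    "op": "EQUAL",
                    "value": {"stringValue": language},
                }
            },
        }
    }
    if limit:
        # Firestore caps the per-query limit; larger fetches need pagination
        query["structuredQuery"]["limit"] = min(limit, 1000)
    return query
```

Posting this body to `.../documents:runQuery` (as the updated loader does) returns only matching documents, which is where the reduced data transfer comes from.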
cve_factchecker/__pycache__/analyzer.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/analyzer.cpython-311.pyc and b/cve_factchecker/__pycache__/analyzer.cpython-311.pyc differ

cve_factchecker/__pycache__/app.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/app.cpython-311.pyc and b/cve_factchecker/__pycache__/app.cpython-311.pyc differ

cve_factchecker/__pycache__/config.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/config.cpython-311.pyc and b/cve_factchecker/__pycache__/config.cpython-311.pyc differ

cve_factchecker/__pycache__/embeddings.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/embeddings.cpython-311.pyc and b/cve_factchecker/__pycache__/embeddings.cpython-311.pyc differ

cve_factchecker/__pycache__/firebase_loader.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/firebase_loader.cpython-311.pyc and b/cve_factchecker/__pycache__/firebase_loader.cpython-311.pyc differ

cve_factchecker/__pycache__/firebase_service.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/firebase_service.cpython-311.pyc and b/cve_factchecker/__pycache__/firebase_service.cpython-311.pyc differ

cve_factchecker/__pycache__/orchestrator.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/orchestrator.cpython-311.pyc and b/cve_factchecker/__pycache__/orchestrator.cpython-311.pyc differ

cve_factchecker/__pycache__/retriever.cpython-311.pyc
CHANGED
Binary files a/cve_factchecker/__pycache__/retriever.cpython-311.pyc and b/cve_factchecker/__pycache__/retriever.cpython-311.pyc differ
cve_factchecker/app.py
CHANGED

```diff
@@ -20,6 +20,7 @@ except Exception:  # pragma: no cover
 system = None
 firebase_sync = None
 AUTO_INGEST = os.environ.get("AUTO_INGEST", "true").lower() in ("true", "1", "yes")
+LANGUAGE_FILTER = os.environ.get("LANGUAGE_FILTER", "English")  # Default to English articles
 INGEST_STATUS: Dict[str, Any] = {"started": time.time(), "finished": False, "synced": 0}
 INIT_LOCK = threading.Lock()
 INGEST_LOCK_FILE = "/tmp/ingest.lock" if os.name != 'nt' else "ingest.lock"
@@ -112,7 +113,7 @@ def _background_ingest() -> None:
         INGEST_STATUS.update({"finished": True, "error": "System initialization failed"})
         return

-    print("🔄 Refreshing vector store with latest Firebase articles...")
+    print(f"🔄 Refreshing vector store with latest {LANGUAGE_FILTER} Firebase articles...")

     # Add retry logic with exponential backoff for Firebase API
     max_retries = 3
@@ -120,13 +121,13 @@ def _background_ingest() -> None:

     for attempt in range(max_retries):
         try:
-            # Use smaller limit to avoid rate limiting
+            # Use smaller limit to avoid rate limiting, filter for specified language
             limit = min(1000, 5000)  # Start with smaller batch
-            ingest_res = system.ingest_firebase(limit=limit)
+            ingest_res = system.ingest_firebase(limit=limit, language=LANGUAGE_FILTER)

             if ingest_res.get("success"):
                 INGEST_STATUS.update({"finished": True, **ingest_res})
-                print(f"✅ Startup ingestion complete: {ingest_res.get('synced')} articles")
+                print(f"✅ Startup ingestion complete: {ingest_res.get('synced')} {ingest_res.get('language', LANGUAGE_FILTER)} articles")
                 break
             else:
                 error_msg = ingest_res.get("error", "Unknown error")
```
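The retry loop in `_background_ingest` follows the standard exponential-backoff pattern (delay doubles after each failed attempt). In isolation the pattern looks roughly like this (a hypothetical helper, not the app's actual code):

```python
import time

def with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure, sleep base_delay * 2**attempt and retry.

    Re-raises the last exception once max_retries attempts are exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

With `max_retries = 3` as in the diff, transient 429/5xx failures get two increasingly spaced retries before the error is surfaced into `INGEST_STATUS`.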
cve_factchecker/embeddings.py
CHANGED

```diff
@@ -20,6 +20,7 @@ def build_embeddings() -> Any:
     # Allow forcing lightweight embeddings to speed up cold starts (e.g., on Spaces)
     if os.environ.get("USE_DUMMY_EMBEDDINGS", "").lower() in ("1", "true", "yes"):  # pragma: no cover
         return SimpleDummyEmbeddings()
+
     try:
         from langchain_huggingface import HuggingFaceEmbeddings  # type: ignore
     except Exception:
@@ -27,13 +28,25 @@ def build_embeddings() -> Any:
             from langchain_community.embeddings import HuggingFaceEmbeddings  # type: ignore
         except Exception:
             HuggingFaceEmbeddings = None  # type: ignore
+
     if "HuggingFaceEmbeddings" in locals() and HuggingFaceEmbeddings is not None:  # type: ignore
         try:
+            # Set cache directory to a writable location
+            cache_dir = "/tmp/sentence_transformers"
+            if os.path.exists("/data"):
+                cache_dir = "/data/sentence_transformers"
+
+            # Set environment variable for sentence-transformers cache
+            os.environ['SENTENCE_TRANSFORMERS_HOME'] = cache_dir
+
             return HuggingFaceEmbeddings(
                 model_name="sentence-transformers/all-MiniLM-L6-v2",
                 model_kwargs={"device": "cpu"},
                 encode_kwargs={"normalize_embeddings": True},
+                cache_folder=cache_dir,
             )
-        except Exception:
-            return SimpleDummyEmbeddings()
+        except Exception as e:
+            print(f"⚠️ Could not load HuggingFace embeddings: {e}")
+            print("🔄 Using dummy embeddings fallback")
+
     return SimpleDummyEmbeddings()
```
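Both `USE_DUMMY_EMBEDDINGS` here and `AUTO_INGEST` in `app.py` parse boolean-ish environment variables with the same `("1", "true", "yes")` idiom. Extracted as a sketch (the helper name is illustrative):

```python
import os

def env_flag(name: str, default: str = "false") -> bool:
    """Parse a boolean env var the way the app does: "1"/"true"/"yes" mean True."""
    return os.environ.get(name, default).lower() in ("1", "true", "yes")
```

Note the case-insensitive comparison: `YES`, `True`, and `1` all enable the flag, while any other value (including empty string) disables it.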
cve_factchecker/firebase_loader.py
CHANGED
|
@@ -1,5 +1,6 @@
|
|
| 1 |
import os
|
| 2 |
import requests
|
|
|
|
| 3 |
from typing import List, Dict, Any, Optional
|
| 4 |
from dataclasses import dataclass
|
| 5 |
from datetime import datetime
|
|
@@ -29,24 +30,182 @@ class FirebaseNewsLoader:
|
|
| 29 |
self.project_id = self.config.project_id
|
| 30 |
self.api_key = self.config.api_key
|
| 31 |
|
| 32 |
-
def fetch_articles(self,
|
| 33 |
-
"""Fetch articles with optional limit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 34 |
try:
|
| 35 |
base_url = f"https://firestore.googleapis.com/v1/projects/{self.project_id}/databases/(default)/documents/{collection_name}"
|
| 36 |
remaining = None if (limit is None or (isinstance(limit, int) and limit <= 0)) else int(limit)
|
| 37 |
page_token: Optional[str] = None
|
| 38 |
-
batch_size = 300 #
|
| 39 |
articles: List[NewsArticle] = []
|
|
|
|
|
|
|
|
|
|
| 40 |
while True:
|
| 41 |
if remaining is not None and remaining <= 0:
|
| 42 |
break
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
page_size = batch_size if remaining is None else min(batch_size, remaining)
|
| 44 |
params = {"key": self.api_key, "pageSize": page_size}
|
| 45 |
if page_token:
|
| 46 |
params["pageToken"] = page_token
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
resp = requests.get(base_url, params=params, timeout=30)
|
| 48 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
print(f"β Firebase API failed: {resp.status_code}")
|
|
|
|
|
|
|
|
|
|
| 50 |
break
|
| 51 |
data = resp.json()
|
| 52 |
docs = data.get("documents", [])
|
|
|
|
| 1 |
import os
|
| 2 |
import requests
|
| 3 |
+
import time
|
| 4 |
from typing import List, Dict, Any, Optional
|
| 5 |
from dataclasses import dataclass
|
| 6 |
from datetime import datetime
|
|
|
|
| 30 |
self.project_id = self.config.project_id
|
| 31 |
self.api_key = self.config.api_key
|
| 32 |
|
| 33 |
+
def fetch_articles(self, limit: int = 5000, language: str = "English") -> List[NewsArticle]:
|
| 34 |
+
"""Fetch articles with optional limit, language filter, and rate limiting handling."""
|
| 35 |
+
try:
|
| 36 |
+
collection_name = "articles"
|
| 37 |
+
|
| 38 |
+
# Use structured query to filter by language
|
| 39 |
+
if language:
|
| 40 |
+
return self._fetch_articles_with_filter(collection_name, limit, language)
|
| 41 |
+
else:
|
| 42 |
+
return self._fetch_articles_simple(collection_name, limit)
|
| 43 |
+
|
| 44 |
+
except Exception as e:
|
| 45 |
+
print(f"β Firebase error: {e}")
|
| 46 |
+
return []
|
| 47 |
+
|
| 48 |
+
def _fetch_articles_with_filter(self, collection_name: str, limit: int, language: str) -> List[NewsArticle]:
|
| 49 |
+
"""Fetch articles using Firestore structured query with language filter."""
|
| 50 |
+
try:
|
| 51 |
+
# Firestore structured query endpoint
|
| 52 |
+
query_url = f"https://firestore.googleapis.com/v1/projects/{self.project_id}/databases/(default)/documents:runQuery"
|
| 53 |
+
|
| 54 |
+
remaining = None if (limit is None or (isinstance(limit, int) and limit <= 0)) else int(limit)
|
| 55 |
+
articles: List[NewsArticle] = []
|
| 56 |
+
request_count = 0
|
| 57 |
+
max_requests = 20 # Limit total requests to avoid rate limiting
|
| 58 |
+
|
| 59 |
+
# Build structured query with language filter
|
| 60 |
+
query_data = {
|
| 61 |
+
"structuredQuery": {
|
| 62 |
+
"from": [{"collectionId": collection_name}],
|
| 63 |
+
"where": {
|
| 64 |
+
"fieldFilter": {
|
| 65 |
+
"field": {"fieldPath": "language"},
|
| 66 |
+
"op": "EQUAL",
|
| 67 |
+
"value": {"stringValue": language}
|
| 68 |
+
}
|
| 69 |
+
},
|
| 70 |
+
"orderBy": [
|
| 71 |
+
{
|
| 72 |
+
"field": {"fieldPath": "__name__"},
|
| 73 |
+
"direction": "DESCENDING"
|
| 74 |
+
}
|
| 75 |
+
]
|
| 76 |
+
}
|
| 77 |
+
}
|
| 78 |
+
|
| 79 |
+
# Add limit if specified
|
| 80 |
+
if remaining and remaining > 0:
|
| 81 |
+
query_data["structuredQuery"]["limit"] = min(remaining, 1000) # Firestore max limit per query
|
| 82 |
+
|
| 83 |
+
print(f"π Fetching {language} articles from Firebase...")
|
| 84 |
+
|
| 85 |
+
while True:
|
| 86 |
+
if remaining is not None and remaining <= 0:
|
| 87 |
+
break
|
| 88 |
+
if request_count >= max_requests:
|
| 89 |
+
print(f"β³ Reached max requests limit ({max_requests}), stopping to avoid rate limits")
|
| 90 |
+
break
|
| 91 |
+
|
| 92 |
+
# Add delay between requests to avoid rate limiting
|
| 93 |
+
if request_count > 0:
|
| 94 |
+
time.sleep(0.5) # 500ms delay between requests
|
| 95 |
+
|
| 96 |
+
headers = {'Content-Type': 'application/json'}
|
| 97 |
+
params = {"key": self.api_key}
|
| 98 |
+
|
| 99 |
+
resp = requests.post(query_url, json=query_data, headers=headers, params=params, timeout=30)
|
| 100 |
+
request_count += 1
|
| 101 |
+
|
| 102 |
+
if resp.status_code == 429: # Rate limit
|
| 103 |
+
retry_after = int(resp.headers.get('Retry-After', 60))
|
| 104 |
+
print(f"β Firebase API rate limited: waiting {retry_after}s")
|
| 105 |
+
time.sleep(retry_after)
|
| 106 |
+
continue
|
| 107 |
+
elif resp.status_code != 200:
|
| 108 |
+
print(f"β Firebase structured query failed: {resp.status_code}")
|
| 109 |
+
if resp.status_code >= 500: # Server error, might be temporary
|
| 110 |
+
time.sleep(5)
|
| 111 |
+
continue
|
| 112 |
+
break
|
| 113 |
+
|
| 114 |
+
data = resp.json()
|
| 115 |
+
|
| 116 |
+
# Handle the structured query response format
|
| 117 |
+
# Firebase structured query returns an array of results
|
| 118 |
+
if isinstance(data, list):
|
| 119 |
+
query_results = data
|
| 120 |
+
else:
|
| 121 |
+
query_results = data.get("result", data.get("documents", []))
|
| 122 |
+
|
| 123 |
+
if not query_results:
|
| 124 |
+
break
|
| 125 |
+
|
| 126 |
+
batch_articles = []
|
| 127 |
+
for result in query_results:
|
| 128 |
+
cve_factchecker/firebase_loader.py (continued)

```python
                    # Handle different response formats
                    doc = None
                    if isinstance(result, dict):
                        if "document" in result:
                            doc = result.get("document")
                        else:
                            # Direct document format
                            doc = result

                    if doc:
                        art = self._convert_doc(doc)
                        if art:
                            batch_articles.append(art)

                articles.extend(batch_articles)

                if remaining is not None:
                    remaining -= len(batch_articles)

                # Check whether more results are available
                if len(batch_articles) < query_data["structuredQuery"].get("limit", 1000):
                    break  # No more results

                # Update the query for the next batch (pagination)
                if remaining and remaining > 0 and batch_articles:
                    # Start the next batch after the last document
                    last_doc_name = batch_articles[-1].article_id
                    query_data["structuredQuery"]["startAfter"] = {
                        "values": [{"referenceValue": f"projects/{self.project_id}/databases/(default)/documents/{collection_name}/{last_doc_name}"}]
                    }
                else:
                    break

            print(f"✅ Fetched {len(articles)} {language} articles from Firebase")
            return articles

        except Exception as e:
            print(f"❌ Error in filtered fetch: {e}")
            # Fall back to a simple fetch without the language filter
            return self._fetch_articles_simple(collection_name, limit)

    def _fetch_articles_simple(self, collection_name: str, limit: int) -> List[NewsArticle]:
        """Original simple fetch method without filtering."""
        try:
            base_url = f"https://firestore.googleapis.com/v1/projects/{self.project_id}/databases/(default)/documents/{collection_name}"
            remaining = None if (limit is None or (isinstance(limit, int) and limit <= 0)) else int(limit)
            page_token: Optional[str] = None
            batch_size = 100  # Smaller batch size to avoid rate limiting
            articles: List[NewsArticle] = []
            request_count = 0
            max_requests = 50  # Cap total requests to avoid rate limiting

            while True:
                if remaining is not None and remaining <= 0:
                    break
                if request_count >= max_requests:
                    print(f"⏳ Reached max request limit ({max_requests}); stopping to avoid rate limits")
                    break

                page_size = batch_size if remaining is None else min(batch_size, remaining)
                params = {"key": self.api_key, "pageSize": page_size}
                if page_token:
                    params["pageToken"] = page_token

                # Short delay between requests to avoid rate limiting
                if request_count > 0:
                    time.sleep(0.2)  # 200 ms between requests

                resp = requests.get(base_url, params=params, timeout=30)
                request_count += 1

                if resp.status_code == 429:  # Rate limited
                    retry_after = int(resp.headers.get('Retry-After', 60))
                    print(f"❌ Firebase API rate limited: waiting {retry_after}s")
                    time.sleep(retry_after)
                    continue
                elif resp.status_code != 200:
                    print(f"❌ Firebase API failed: {resp.status_code}")
                    if resp.status_code >= 500:  # Server error, may be transient
                        time.sleep(5)
                        continue
                    break
                data = resp.json()
                docs = data.get("documents", [])
```
cve_factchecker/firebase_service.py
CHANGED
@@ -1,10 +1,18 @@

```python
from __future__ import annotations
import os
from typing import Dict, Any, Optional, List
from .firebase_loader import FirebaseNewsLoader
from .retriever import VectorNewsRetriever

class FirebaseVectorSync:
    def __init__(self, vector_persist_dir: str = "./vector_db"):
        # Auto-detect a writable directory for Hugging Face Spaces
        if vector_persist_dir == "./vector_db":
            if os.path.exists("/data"):
                vector_persist_dir = "/data/vector_db"
            elif os.path.exists("/tmp"):
                vector_persist_dir = "/tmp/vector_db"

        self.firebase_loader = FirebaseNewsLoader()
        self.vector_retriever = VectorNewsRetriever(persist_directory=vector_persist_dir)

    def analyze_firebase_schema(self, collection: str = "Articles") -> Dict[str, Any]:
```
cve_factchecker/orchestrator.py
CHANGED
@@ -13,11 +13,14 @@ class FactCheckSystem:

```python
        self.rewriter = QueryRewriter(cfg)
        self.analyzer = ClaimAnalyzer(cfg)
        self.firebase = FirebaseNewsLoader()

    def ingest_firebase(self, collection: str = "articles", limit: int = 5000, language: str = "English") -> Dict[str, Any]:
        """Load articles from Firebase and store them in the vector DB, applying a language filter and limit."""
        arts = self.firebase.fetch_articles(limit=limit, language=language)
        if not arts:
            return {"synced": 0, "collection": collection, "success": False}
        # Clear and refresh the vector store
        self.retriever.store_articles_in_vector_db(arts, clear_first=True)
        return {"synced": len(arts), "collection": collection, "success": True, "language": language}

    def fact_check(self, claim: str, k: int = 5) -> Dict[str, Any]:
        base = self.retriever.semantic_search(claim, k=k)
```
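A hedged sketch of how a caller might consume `ingest_firebase`'s result dict — the function below is illustrative only, and `Stub` stands in for the real `FactCheckSystem`:

```python
def run_ingest(system, collection="articles", limit=5000, language="English"):
    """Trigger ingestion and return the number of synced articles,
    raising if the sync reported failure."""
    result = system.ingest_firebase(collection=collection, limit=limit, language=language)
    if not result["success"]:
        raise RuntimeError(f"Ingest failed for collection {result['collection']}")
    return result["synced"]
```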
cve_factchecker/retriever.py
CHANGED
@@ -14,42 +14,120 @@ class VectorNewsRetriever:

```python
    def __init__(self, persist_directory: str = "./vector_db"):
        # Allow overriding the persist directory via env (e.g., /data on Hugging Face Spaces)
        env_dir = os.environ.get("VECTOR_PERSIST_DIR")

        # For containerized environments, pick an appropriate writable directory
        if not env_dir:
            # Check candidate directories in order of preference
            candidates = ["/data/vector_db", "/tmp/vector_db", "./vector_db"]

            for candidate in candidates:
                try:
                    # Test whether we can create the parent directory
                    test_dir = os.path.dirname(candidate)
                    if not os.path.exists(test_dir):
                        os.makedirs(test_dir, exist_ok=True)

                    # Test write permission
                    test_file = os.path.join(test_dir, "test_write")
                    with open(test_file, 'w') as f:
                        f.write("test")
                    os.remove(test_file)

                    persist_directory = candidate
                    break
                except (OSError, PermissionError):
                    continue
            else:
                # If every candidate fails, fall back to an in-memory store
                persist_directory = None

        self.persist_directory = env_dir or persist_directory
        self.embeddings = build_embeddings()
        self.vector_store = self._initialize_vector_store()

    def _initialize_vector_store(self) -> Chroma:
        """Initialize the vector store with proper error handling for permission issues."""
        # If there is no persist directory (all write tests failed), use in-memory
        if self.persist_directory is None:
            print("🔄 Using in-memory vector store (no writable directory found)")
            return Chroma(embedding_function=self.embeddings, collection_name="news_articles_memory")

        try:
            os.makedirs(self.persist_directory, exist_ok=True)
            vs = Chroma(
                persist_directory=self.persist_directory,
                embedding_function=self.embeddings,
                collection_name="news_articles"
            )
            try:
                count = vs._collection.count()
                print(f"✅ Loaded vector database with {count} documents from {self.persist_directory}")
            except Exception:
                print(f"✅ Vector database loaded from {self.persist_directory}")
            return vs
        except (PermissionError, OSError) as e:
            print(f"❌ Error initializing vector store at {self.persist_directory}: {e}")
            print("🔄 Using in-memory store")
            return Chroma(embedding_function=self.embeddings, collection_name="news_articles_memory")
        except Exception as e:
            print(f"❌ Unexpected error initializing vector store: {e}")
            print("🔄 Using in-memory store")
            return Chroma(embedding_function=self.embeddings, collection_name="news_articles_memory")

    def clear_vector_store(self) -> None:
        """Clear all documents from the vector store to start fresh."""
        try:
            # Delete the collection, then reinitialize the vector store
            self.vector_store.delete_collection()
            print("🗑️ Cleared existing vector store")
            self.vector_store = self._initialize_vector_store()
        except Exception as e:
            print(f"⚠️ Could not clear vector store: {e}")
            # Fallback: create a new in-memory store
            self.vector_store = Chroma(embedding_function=self.embeddings, collection_name="news_articles_fresh")

    def store_articles_in_vector_db(self, articles: List[NewsArticle], clear_first: bool = False) -> None:
        if not articles:
            print("ℹ️ No new articles to store")
            return

        if clear_first:
            self.clear_vector_store()

        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        docs: List[Document] = []
        for art in articles:
            chunks = splitter.split_text(art.content or "")
            for chunk in chunks:
                docs.append(Document(page_content=f"Title: {art.title}\n\n{chunk}", metadata={"url": art.url, "source": art.source, "published_date": art.published_date, "scraped_date": art.scraped_date, "id": art.article_id}))

        # Process documents in batches to stay under Chroma's batch size limits
        batch_size = 4000  # Conservative batch size for Chroma
        total_docs = len(docs)
        print(f"📦 Processing {total_docs} document chunks in batches of {batch_size}")

        for i in range(0, total_docs, batch_size):
            batch_docs = docs[i:i + batch_size]
            batch_num = (i // batch_size) + 1
            total_batches = (total_docs + batch_size - 1) // batch_size
            print(f"💾 Storing batch {batch_num}/{total_batches} ({len(batch_docs)} docs)")

            try:
                if hasattr(self.vector_store, "add_documents"):
                    self.vector_store.add_documents(batch_docs)
                else:
                    self.vector_store.add_texts([d.page_content for d in batch_docs], metadatas=[d.metadata for d in batch_docs])
            except Exception as e:
                print(f"❌ Failed to store batch {batch_num}: {e}")
                continue

        try:
            # Only try to persist if we have a writable directory
            if self.persist_directory and hasattr(self.vector_store, 'persist'):
                self.vector_store.persist()
        except Exception as e:
            print(f"⚠️ Could not persist vector store: {e}")
        print(f"✅ Stored {len(docs)} chunks from {len(articles)} articles")

    def semantic_search(self, query: str, k: int = 5) -> List[Dict[str, Any]]:
        try:
            docs = self.vector_store.similarity_search(query, k=k)
```
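The batching arithmetic in `store_articles_in_vector_db` (slice by `batch_size`, ceil-divide for the batch count) can be isolated into small helpers; a sketch, not code from the repo:

```python
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def total_batches(n, size):
    """Ceiling division: how many batches n items need."""
    return (n + size - 1) // size
```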
cve_factchecker/wsgi.py
CHANGED
@@ -1,5 +1,21 @@

```python
from __future__ import annotations
import os
import sys

# Add the parent directory to the Python path
current_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)

# Run startup setup
try:
    from startup import setup_environment
    setup_environment()
    print("✅ Environment setup completed")
except Exception as e:
    print(f"⚠️ Startup setup warning: {e}")

from .app import app as application  # WSGI entry point

# This file allows running with a production server like gunicorn/waitress:
# gunicorn -w 1 -b 0.0.0.0:7860 cve_factchecker.wsgi:application
```
health_check.py
ADDED
@@ -0,0 +1,202 @@

```python
#!/usr/bin/env python3
"""
Health check and diagnostic script for CVE Fact Checker.
Run this to diagnose common issues and verify system health.
"""

import os
import sys
import tempfile
import requests
import time
from pathlib import Path


def check_environment():
    """Check environment variables and directories."""
    print("🔍 Environment Check")
    print("=" * 50)

    # Check the Python version
    print(f"Python version: {sys.version}")

    # Check key environment variables
    env_vars = [
        'PORT', 'VECTOR_PERSIST_DIR', 'SENTENCE_TRANSFORMERS_HOME',
        'FIREBASE_API_KEY', 'OPENROUTER_API_KEY', 'AUTO_INGEST', 'LANGUAGE_FILTER'
    ]

    for var in env_vars:
        value = os.environ.get(var, 'Not set')
        if 'API_KEY' in var and value != 'Not set':
            value = f"{value[:8]}..." if len(value) > 8 else "***"
        print(f"  {var}: {value}")

    print()


def check_directories():
    """Check directory permissions and setup."""
    print("📁 Directory Permissions Check")
    print("=" * 50)

    dirs_to_check = [
        ('/tmp', 'System temp directory'),
        ('/tmp/vector_db', 'Vector DB storage'),
        ('/tmp/sentence_transformers', 'Model cache'),
        ('/data', 'Persistent data (if available)'),
        ('.', 'Current directory'),
    ]

    for dir_path, description in dirs_to_check:
        try:
            path = Path(dir_path)

            # Check whether the directory exists
            exists = path.exists()

            if not exists and dir_path != '/data':  # /data might not exist
                path.mkdir(parents=True, exist_ok=True)
                exists = True

            # Test write permission if it exists
            writable = False
            if exists:
                try:
                    test_file = path / "health_check_test"
                    test_file.write_text("test")
                    test_file.unlink()
                    writable = True
                except Exception:
                    pass

            status = "✅" if (exists and writable) else "❌"
            print(f"  {status} {dir_path} - {description}")
            if exists and not writable:
                print("    ⚠️ Exists but not writable")
            elif not exists:
                print("    ⚠️ Does not exist")

        except Exception as e:
            print(f"  ❌ {dir_path} - Error: {e}")

    print()


def check_packages():
    """Check whether required packages are importable."""
    print("📦 Package Import Check")
    print("=" * 50)

    packages = [
        ('flask', 'Flask web framework'),
        ('requests', 'HTTP requests'),
        ('langchain', 'LangChain framework'),
        ('chromadb', 'Vector database'),
        ('sentence_transformers', 'Embedding models'),
        ('firebase_admin', 'Firebase SDK'),
    ]

    for package, description in packages:
        try:
            __import__(package)
            print(f"  ✅ {package} - {description}")
        except ImportError as e:
            print(f"  ❌ {package} - {description} (Error: {e})")
        except Exception as e:
            print(f"  ⚠️ {package} - {description} (Warning: {e})")

    print()


def check_app_health():
    """Check that the Flask app imports and basic functionality works."""
    print("🏥 Application Health Check")
    print("=" * 50)

    try:
        # Test basic imports
        from cve_factchecker.config import load_openrouter_config
        print("  ✅ Config module imports")

        from cve_factchecker.embeddings import build_embeddings
        print("  ✅ Embeddings module imports")

        from cve_factchecker.models import NewsArticle
        print("  ✅ Models module imports")

        # Test embeddings
        embeddings = build_embeddings()
        test_embedding = embeddings.embed_query("test")
        if len(test_embedding) > 0:
            print("  ✅ Embeddings functional")
        else:
            print("  ❌ Embeddings not working")

        # Test Flask app import
        from cve_factchecker.app import app
        print("  ✅ Flask app imports")

        # Test basic app functionality
        with app.test_client() as client:
            response = client.get('/health')
            if response.status_code == 200:
                print("  ✅ Health endpoint responds")
            else:
                print(f"  ❌ Health endpoint error: {response.status_code}")

    except Exception as e:
        print(f"  ❌ Application error: {e}")
        import traceback
        traceback.print_exc()

    print()


def check_firebase_connectivity():
    """Check Firebase connectivity."""
    print("🔥 Firebase Connectivity Check")
    print("=" * 50)

    try:
        from cve_factchecker.firebase_loader import FirebaseNewsLoader

        loader = FirebaseNewsLoader()
        print("  ✅ Firebase loader initialized")
        print(f"  🆔 Project ID: {loader.project_id}")

        # Test with a very small limit to avoid rate limiting; filter for English
        print("  🔍 Testing connectivity with a 1-article English limit...")
        articles = loader.fetch_articles(limit=1, language="English")

        if articles:
            print(f"  ✅ Successfully fetched {len(articles)} article(s)")
            print(f"  📰 Sample title: {articles[0].title[:50]}...")
        else:
            print("  ⚠️ No articles fetched (might be rate limited or empty)")

    except Exception as e:
        print(f"  ❌ Firebase error: {e}")

    print()


def main():
    """Run all health checks."""
    print("🏥 CVE Fact Checker Health Check")
    print("=" * 60)
    print(f"Timestamp: {time.strftime('%Y-%m-%d %H:%M:%S')}")
    print()

    check_environment()
    check_directories()
    check_packages()
    check_app_health()
    check_firebase_connectivity()

    print("🎉 Health check complete!")
    print("=" * 60)


if __name__ == "__main__":
    main()
```
ingest.lock
ADDED
@@ -0,0 +1 @@

```
19844
```
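ingest.lock holds the PID of the ingesting process. The diff does not show the code that reads it, but a plausible staleness check (a hypothetical helper, POSIX-style, not part of the repo) would verify that the recorded PID is still alive:

```python
import os

def lock_is_stale(lock_path):
    """Return True only when the lock file exists but its PID is gone."""
    try:
        with open(lock_path) as fh:
            pid = int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no lock, or unreadable contents
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
        return False
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # process exists but is owned by another user
```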
run_production.py
ADDED
@@ -0,0 +1,111 @@

```python
#!/usr/bin/env python3
"""
Production startup script for CVE Fact Checker.
This script handles environment setup and graceful startup.
"""

import os
import sys
import time
import signal
import subprocess
from pathlib import Path


def setup_signal_handlers():
    """Set up graceful shutdown handlers."""
    def signal_handler(signum, frame):
        print(f"\n🛑 Received signal {signum}, shutting down gracefully...")
        sys.exit(0)

    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)


def run_health_check():
    """Run a basic health check before starting."""
    print("🏥 Running pre-startup health check...")

    try:
        result = subprocess.run([
            sys.executable, "health_check.py"
        ], capture_output=True, text=True, timeout=30)

        if result.returncode == 0:
            print("✅ Health check passed")
            return True
        else:
            print("⚠️ Health check warnings:")
            print(result.stdout)
            print(result.stderr)
            return True  # Continue anyway

    except Exception as e:
        print(f"⚠️ Health check failed: {e}")
        return True  # Continue anyway


def start_application():
    """Start the application with gunicorn."""
    print("🚀 Starting CVE Fact Checker...")

    # Environment setup
    env = os.environ.copy()
    env.update({
        'PYTHONUNBUFFERED': '1',
        'PYTHONDONTWRITEBYTECODE': '1',
    })

    # Gunicorn command
    cmd = [
        "gunicorn",
        "-w", "1",                # Single worker to avoid race conditions
        "-k", "gthread",
        "--threads", "4",
        "-b", f"0.0.0.0:{env.get('PORT', '7860')}",
        "--timeout", "120",
        "--preload",              # Preload the app for better memory usage
        "--access-logfile", "-",  # Log to stdout
        "--error-logfile", "-",   # Log to stderr
        "cve_factchecker.wsgi:application"
    ]

    print(f"📋 Command: {' '.join(cmd)}")
    print(f"🌐 Will bind to port {env.get('PORT', '7860')}")

    try:
        # Run gunicorn
        subprocess.run(cmd, env=env, check=True)

    except KeyboardInterrupt:
        print("\n🛑 Interrupted by user")
    except subprocess.CalledProcessError as e:
        print(f"❌ Application failed with exit code {e.returncode}")
        sys.exit(e.returncode)
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        sys.exit(1)


def main():
    """Main startup routine."""
    print("=" * 60)
    print("🤖 CVE Fact Checker - Production Startup")
    print("=" * 60)
    print(f"⏰ Starting at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"🐍 Python: {sys.version}")
    print(f"📂 Working directory: {os.getcwd()}")
    print()

    setup_signal_handlers()

    # Optional health check
    if os.path.exists("health_check.py"):
        run_health_check()
        print()

    start_application()


if __name__ == "__main__":
    main()
```
start_production.py
ADDED
@@ -0,0 +1,118 @@

```python
#!/usr/bin/env python3
"""
Production-ready CVE Fact Checker with English language filtering.
This script sets up the environment and starts the application.
"""

import os
import sys
import subprocess

def setup_production_environment():
    """Set up production environment variables."""

    # Core configuration
    # NOTE: this OpenRouter key is committed in plaintext; it should be
    # injected via the environment or a secrets manager instead.
    env_vars = {
        'OPENROUTER_API_KEY': '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',
        'LANGUAGE_FILTER': 'English',
        'AUTO_INGEST': 'true',
        'PORT': '7860',
        'VECTOR_PERSIST_DIR': '/tmp/vector_db',
        'SENTENCE_TRANSFORMERS_HOME': '/tmp/sentence_transformers',
    }

    # Apply environment variables
    for key, value in env_vars.items():
        os.environ[key] = value
        print(f"✅ Set {key}")

    print("\n🔧 Environment configured for English articles only")

def start_production_server():
    """Start the production server."""
    print("\n🚀 Starting CVE Fact Checker (Production)")
    print("=" * 50)

    try:
        # Use gunicorn for production
        cmd = [
            "gunicorn",
            "-w", "1",  # Single worker to avoid race conditions
            "-k", "gthread",
            "--threads", "4",
            "-b", f"0.0.0.0:{os.environ.get('PORT', '7860')}",
            "--timeout", "120",
            "--preload",
            "--access-logfile", "-",
            "--error-logfile", "-",
            "cve_factchecker.wsgi:application"
        ]

        print(f"📋 Command: {' '.join(cmd)}")
        print(f"🌐 Server will start on port {os.environ.get('PORT', '7860')}")
        print(f"🌍 Language filter: {os.environ.get('LANGUAGE_FILTER', 'English')}")
        print("🔗 Access at: http://localhost:7860")
        print("\n" + "=" * 50)

        # Start the server
        subprocess.run(cmd, check=True)

    except FileNotFoundError:
        print("❌ Gunicorn not found. Installing...")
        subprocess.run([sys.executable, "-m", "pip", "install", "gunicorn"], check=True)
        print("✅ Gunicorn installed. Retrying...")
        subprocess.run(cmd, check=True)

    except KeyboardInterrupt:
        print("\n🛑 Server stopped by user")

    except subprocess.CalledProcessError as e:
        print(f"❌ Server failed: {e}")
        sys.exit(1)

def start_development_server():
    """Start a development server with Flask."""
    print("\n🔧 Starting CVE Fact Checker (Development)")
    print("=" * 50)

    try:
        from cve_factchecker.app import app

        print(f"🌐 Server will start on port {os.environ.get('PORT', '7860')}")
        print(f"🌍 Language filter: {os.environ.get('LANGUAGE_FILTER', 'English')}")
        print("🔗 Access at: http://localhost:7860")
        print("\n" + "=" * 50)

        app.run(
            host='0.0.0.0',
            port=int(os.environ.get('PORT', '7860')),
            debug=False  # Keep False for stability
        )

    except KeyboardInterrupt:
        print("\n🛑 Server stopped by user")

def main():
    """Main entry point."""
    import argparse

    parser = argparse.ArgumentParser(description="CVE Fact Checker with Language Filtering")
    parser.add_argument("--mode", choices=["dev", "prod"], default="prod",
                        help="Run in development or production mode")

    args = parser.parse_args()

    print("🤖 CVE Fact Checker - English Articles Only")
    print("=" * 60)

    # Set up the environment
    setup_production_environment()

    # Start the appropriate server
    if args.mode == "dev":
        start_development_server()
    else:
        start_production_server()

if __name__ == "__main__":
    main()
```
startup.py ADDED
@@ -0,0 +1,95 @@
#!/usr/bin/env python3
"""
Startup script to handle environment setup and pre-checks.
This script ensures proper directory permissions and environment setup before starting the app.
"""

import os
import sys
import tempfile
from pathlib import Path


def setup_environment():
    """Set up environment variables and directories."""

    # Point cache directories at writable locations
    cache_dirs = {
        'SENTENCE_TRANSFORMERS_HOME': ['/tmp/sentence_transformers', '/data/sentence_transformers'],
        'VECTOR_PERSIST_DIR': ['/tmp/vector_db', '/data/vector_db'],
        'HF_HOME': ['/tmp/huggingface', '/data/huggingface'],
        'TRANSFORMERS_CACHE': ['/tmp/transformers', '/data/transformers'],
    }

    for env_var, candidates in cache_dirs.items():
        if env_var not in os.environ:
            for candidate in candidates:
                try:
                    Path(candidate).mkdir(parents=True, exist_ok=True)
                    # Test write permission
                    test_file = Path(candidate) / "test_write"
                    test_file.write_text("test")
                    test_file.unlink()
                    os.environ[env_var] = candidate
                    print(f"✅ Set {env_var} to {candidate}")
                    break
                except (OSError, PermissionError):
                    continue
            else:
                # Fall back to the system temp directory
                temp_dir = os.path.join(tempfile.gettempdir(), env_var.lower())
                try:
                    Path(tem_dir := temp_dir).mkdir(parents=True, exist_ok=True)
                    os.environ[env_var] = temp_dir
                    print(f"⚠️ Fallback: set {env_var} to {temp_dir}")
                except Exception as e:
                    print(f"❌ Could not set {env_var}: {e}")


def check_permissions():
    """Check and report on directory permissions."""
    dirs_to_check = [
        os.environ.get('SENTENCE_TRANSFORMERS_HOME', '/tmp/sentence_transformers'),
        os.environ.get('VECTOR_PERSIST_DIR', '/tmp/vector_db'),
        '/tmp',
        '.',
    ]

    print("\n🔍 Directory Permissions Check:")
    for dir_path in dirs_to_check:
        if not dir_path:
            continue

        try:
            path = Path(dir_path)
            if not path.exists():
                path.mkdir(parents=True, exist_ok=True)

            # Test write permission
            test_file = path / "test_permissions"
            test_file.write_text("test")
            test_file.unlink()

            print(f"✅ {dir_path} - writable")
        except Exception as e:
            print(f"❌ {dir_path} - not writable: {e}")


def main():
    """Main startup routine."""
    print("🚀 Starting CVE Fact Checker setup...")

    setup_environment()
    check_permissions()

    print("\n🎯 Environment setup complete!")
    print("📁 Key directories:")
    for key in ['SENTENCE_TRANSFORMERS_HOME', 'VECTOR_PERSIST_DIR']:
        print(f"  {key}: {os.environ.get(key, 'Not set')}")

    return True


if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)
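The candidate-then-fallback pattern in `setup_environment` can be factored into a small reusable helper. The sketch below is illustrative only — the `first_writable` name and `.write_probe` file are not part of the repo; it just demonstrates the same "try each candidate, probe writability, fall back to the system temp dir" logic.

```python
import tempfile
from pathlib import Path


def first_writable(candidates):
    """Return the first directory that can be created and written to,
    falling back to the system temp directory."""
    for candidate in list(candidates) + [tempfile.gettempdir()]:
        try:
            path = Path(candidate)
            path.mkdir(parents=True, exist_ok=True)
            probe = path / ".write_probe"  # a probe file proves writability
            probe.write_text("ok")
            probe.unlink()
            return str(path)
        except (OSError, PermissionError):
            continue
    return None


# A path "under" a regular file can never be created, so the helper
# skips it and falls back to the system temp directory.
with tempfile.NamedTemporaryFile() as f:
    chosen = first_writable([f.name + "/subdir"])
print(chosen)
```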
test_language_filter.py ADDED
@@ -0,0 +1,135 @@
#!/usr/bin/env python3
"""
Test script for the updated CVE Fact Checker with language filtering.
"""

import os
import sys
import time


def test_language_filtering():
    """Test the language filtering functionality."""
    print("🧪 Testing Language Filtering")
    print("=" * 50)

    try:
        # Set environment variables. Never hard-code a real API key in
        # source control; supply it via the environment or Space secrets.
        os.environ.setdefault('OPENROUTER_API_KEY', '<redacted>')
        os.environ['LANGUAGE_FILTER'] = 'English'

        # Test the Firebase loader
        from cve_factchecker.firebase_loader import FirebaseNewsLoader

        print("✅ Importing Firebase loader...")
        loader = FirebaseNewsLoader()

        print(f"📡 Project: {loader.project_id}")

        # Fetch English articles only
        print("🔍 Fetching 5 English articles...")
        start_time = time.time()
        articles = loader.fetch_articles(limit=5, language="English")
        fetch_time = time.time() - start_time

        print(f"✅ Fetched {len(articles)} articles in {fetch_time:.2f}s")

        if articles:
            print("\n📰 Sample Articles:")
            for i, article in enumerate(articles[:3], 1):
                print(f"  {i}. {article.title[:60]}...")
                print(f"     Source: {article.source}")
                print(f"     URL: {article.url[:50]}...")
                print()

        # Test the orchestrator
        print("🧠 Testing Orchestrator...")
        from cve_factchecker.orchestrator import FactCheckSystem

        system = FactCheckSystem()
        print("✅ System initialized")

        # Run a fact check if we have articles
        if articles:
            print("🔎 Testing fact check...")
            test_claim = "This is a test claim about cybersecurity."
            result = system.fact_check(test_claim)

            print("📋 Fact check result:")
            print(f"  Verdict: {result.get('verdict')}")
            print(f"  Confidence: {result.get('confidence')}")
            print(f"  Sources used: {result.get('sources_used')}")

        return True

    except Exception as e:
        print(f"❌ Test failed: {e}")
        import traceback
        traceback.print_exc()
        return False


def test_app_endpoints():
    """Test the Flask app endpoints."""
    print("\n🌐 Testing Flask App")
    print("=" * 50)

    try:
        from cve_factchecker.app import app

        with app.test_client() as client:
            # Test the health endpoint
            print("🏥 Testing /health endpoint...")
            response = client.get('/health')
            print(f"  Status: {response.status_code}")
            if response.status_code == 200:
                data = response.get_json()
                print(f"  Uptime: {data.get('uptime_sec')}s")

            # Test the root endpoint
            print("🏠 Testing / endpoint...")
            response = client.get('/')
            print(f"  Status: {response.status_code}")
            if response.status_code == 200:
                data = response.get_json()
                print(f"  API Name: {data.get('name')}")
                status = data.get('status', {})
                print(f"  Ingestion finished: {status.get('ingestion_finished')}")
                print(f"  Synced articles: {status.get('synced_articles')}")

        return True

    except Exception as e:
        print(f"❌ App test failed: {e}")
        return False


def main():
    """Run all tests."""
    print("🚀 CVE Fact Checker - Language Filtering Test")
    print("=" * 60)
    print(f"⏰ Started at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
    print()

    success1 = test_language_filtering()
    success2 = test_app_endpoints()

    print("\n📊 Test Summary")
    print("=" * 50)
    print(f"Language Filtering: {'✅ PASS' if success1 else '❌ FAIL'}")
    print(f"Flask App: {'✅ PASS' if success2 else '❌ FAIL'}")

    if success1 and success2:
        print("\n🎉 All tests passed! The language filtering is working correctly.")
        print("\n📝 Key Features:")
        print("  ✅ Firebase language filtering (English articles only)")
        print("  ✅ Structured query support")
        print("  ✅ Rate limiting protection")
        print("  ✅ Vector database integration")
        print("  ✅ Flask API endpoints")
        print("\n🚀 Ready for deployment!")
    else:
        print("\n⚠️ Some tests failed. Check the output above.")

    return success1 and success2


if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)
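The language filter the test exercises lives inside `FirebaseNewsLoader`, which is not shown in this diff. As a standalone illustration, a filter over already-fetched article records might look like the sketch below. The `language` field name and dict-shaped articles are assumptions about the Firebase document schema, not the repo's actual types.

```python
def filter_by_language(articles, language="English"):
    """Keep only articles whose 'language' field matches, case-insensitively.
    Articles missing the field are dropped rather than guessed."""
    wanted = language.strip().lower()
    return [a for a in articles
            if str(a.get("language", "")).strip().lower() == wanted]


articles = [
    {"title": "CVE-2024-12345 actively exploited", "language": "English"},
    {"title": "Kritische Schwachstelle entdeckt", "language": "German"},
    {"title": "No language metadata"},
]
english = filter_by_language(articles)
print([a["title"] for a in english])  # only the English-tagged article remains
```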