B2B SaaS system that automates invoice processing using AWS Textract + Computer Vision + ML for liquidation stores and small-to-medium retail businesses in Colombia.

Functional MVP: End-to-end invoice processing with manual pricing
In development: Anti-duplicates + POS integrations
Next milestone: Paying customer (2 months)

- Backend: FastAPI + PostgreSQL + SQLAlchemy
- AI/ML: AWS Textract + OpenCV + Transformers (Zero-shot classification)
- Cloud: AWS S3 + LocalStack (development)
- Database: PostgreSQL with Docker
- Computer Vision: OpenCV + img2pdf for mobile photos

Mobile Photo → OpenCV Enhancement → PDF → AWS Textract → Structured Data → PostgreSQL

- Invoice upload: Direct PDF + mobile photos with enhancement
- ML processing: AWS Textract + Computer Vision pipeline
- Manual pricing: Interface for setting sale prices with margin calculation
- Multi-tenant: Support for multiple clients
- Basic analytics: Processed invoice reports
- Product Classification: Zero-shot learning for product categorization
- Smart Pricing: Recommendations based on category + historical data
- Price Rounding: Intelligent rounding (10,800 → 11,000)
- Anti-Duplicates: Fuzzy matching detection system (90% threshold)
- Square POS: API integration (in development)
- Excel Export: Automatic file generation
- Mayasis POS: Client-specific integration (next)
- Webhook/API: Generic system for custom integrations
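
The price rounding and margin calculation features listed above can be illustrated with a minimal sketch. It assumes prices are rounded to the nearest 1,000 pesos (consistent with the 10,800 → 11,000 example) and that the margin is expressed as markup over cost; the function names `round_price` and `markup_percentage` are hypothetical and may not match the project's actual code.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_price(price, step=1000):
    """Round a price to the nearest multiple of `step` (halves round up).

    Assumption: a 1,000-peso step; the real rounding rules may differ.
    """
    step_d = Decimal(step)
    units = (Decimal(str(price)) / step_d).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP
    )
    return int(units * step_d)


def markup_percentage(cost_price, sale_price):
    """Return the markup over cost as a percentage with one decimal place."""
    cost = Decimal(str(cost_price))
    sale = Decimal(str(sale_price))
    if cost <= 0:
        raise ValueError("cost_price must be positive")
    pct = (sale - cost) / cost * Decimal(100)
    return float(pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


print(round_price(10800))               # 11000
print(markup_percentage(28000, 43000))  # 53.6
```

`Decimal` is used instead of floats so that half-way cases such as 10,500 round predictably.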
- Python 3.11+
- Docker & Docker Compose
- AWS Account (for Textract)
```bash
# 1. Clone repository
git clone https://github.com/EdwLearn/aws-document-processing.git
cd aws-document-processing

# 2. Setup environment
python -m venv venv
source venv/bin/activate  # Linux/Mac

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start database
docker-compose up -d postgres

# 5. Configure environment variables
cp .env.example .env
# Edit .env with your AWS credentials

# 6. Run migrations
alembic upgrade head

# 7. Start server
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000
```

```bash
# Test ML Classification
curl -X POST "http://localhost:8000/api/v1/invoices/test-ml-classification" \
  -H "x-tenant-id: test" \
  -H "Content-Type: application/json" \
  -d '["Nike shoes", "Cotton t-shirt", "Bluetooth headphones"]'

# Test price rounding
curl -X POST "http://localhost:8000/api/v1/invoices/test-price-rounding" \
  -H "x-tenant-id: test" \
  -H "Content-Type: application/json" \
  -d '[10800, 15300, 1250, 450]'
```

## API Endpoints
```
POST /api/v1/invoices/upload                      # Direct PDF
POST /api/v1/invoices/upload-photo                # Mobile photo + enhancement
GET  /api/v1/invoices/{id}/status                 # Processing status
GET  /api/v1/invoices/{id}/data                   # Extracted data

GET  /api/v1/invoices/{id}/pricing                # Data for manual pricing
POST /api/v1/invoices/{id}/pricing                # Set sale prices
POST /api/v1/invoices/{id}/confirm-pricing        # Confirm and update inventory

POST /api/v1/invoices/{id}/check-duplicates       # Detect similar products
POST /api/v1/invoices/{id}/resolve-duplicates     # Resolve conflicts

GET  /api/v1/integrations/                        # List integrations
POST /api/v1/integrations/sync-inventory          # Sync to external systems
```

## Database Structure

Main Tables:
- tenants - Multi-tenant clients
- processed_invoices - Processed invoices with metadata
- invoice_line_items - Invoice products with pricing
- products - Product catalog for inventory
- suppliers - Supplier directory
```sql
processed_invoices:
├── id: UUID (PK)
├── tenant_id: VARCHAR(100)
├── status: uploaded | processing | completed | failed
├── pricing_status: pending | partial | completed | confirmed
├── invoice_number, supplier_name, total_amount
└── textract_raw_response: JSONB

invoice_line_items:
├── id: UUID (PK)
├── invoice_id: UUID (FK)
├── product_code, description, quantity, unit_price
├── sale_price: NUMERIC(15,2) -- Manual price
├── markup_percentage: NUMERIC(5,2) -- Calculated margin
└── is_priced: BOOLEAN -- Pricing flag
```

```python
# Automatic product classification
# Input: "Nike Air Max shoes size 42"
classification = {
    'category': 'shoes',
    'confidence': 0.94,
    'margin_percentage': 55.0,
    'reasoning': 'ML classified as footwear'
}
```

```python
# Intelligent price recommendations
# Input: cost_price = 28000
recommendation = {
    'recommended_price': 43000,  # Colombian rounded
    'confidence': 0.89,
    'margin_percentage': 53.6,
    'reasoning': 'Footwear category + supplier pattern'
}
```

```python
# Similar product detection
# Compared: "Nike AirMax 42" vs "Nike Air Max shoes size 42"
duplicate_check = {
    'similarity_score': 0.92,
    'is_duplicate': True,
    'price_difference': -7000,  # New supplier 15% cheaper
    'recommendation': 'Better supplier found'
}
```

Value Proposition:
- Speed: 15 min → 2 min per invoice
- Accuracy: >95% with Colombian invoices
- ROI: 300%+ documented
- Integration: Connects with existing POS
This Week:
- Anti-duplicates with fuzzy matching
- Auto-update inventory on pricing confirmation
- UX panel for duplicate resolution
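
As an illustration of the fuzzy matching idea behind the anti-duplicates work, here is a minimal sketch using Python's standard `difflib`. The helper names (`normalize`, `similarity`, `is_duplicate`) and the normalization rules are assumptions for illustration only; the project may use a different matching library or scoring method, so scores from this sketch will not necessarily match the system's.

```python
import re
from difflib import SequenceMatcher

# 90% threshold, as stated for the anti-duplicates feature.
DUPLICATE_THRESHOLD = 0.90


def normalize(description):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", description.lower())
    return cleaned.strip()


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two product descriptions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(a, b, threshold=DUPLICATE_THRESHOLD):
    """Flag two descriptions as likely duplicates at or above the threshold."""
    return similarity(a, b) >= threshold


# Differences in case, punctuation, and spacing are ignored after normalization.
print(is_duplicate("Nike Air-Max 42", "nike  air max 42"))  # True
print(is_duplicate("Nike Air Max 42", "Cotton t-shirt"))    # False
```

A character-level ratio like this is sensitive to extra words in one description, so a production system would likely need token-based scoring or product-code checks on top of it.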
Next 2 Weeks:
- Mayasis integration (CSV upload)
- Improved pricing frontend
- Staging deployment for client
Month 2:
- Square POS integration
- Advanced analytics dashboard
- "Almacén Medellín JA" onboarding
Environment Variables (.env):
```bash
# Database
DB_HOST=localhost
DB_PORT=5432
DB_NAME=document_processing
DB_USER=postgres
DB_PASSWORD=postgres

# AWS
AWS_REGION=us-east-1
S3_DOCUMENT_BUCKET=invoice-saas-textract-dev

# API
API_HOST=0.0.0.0
API_PORT=8000
ENVIRONMENT=development
```

Docker Services:

```yaml
# docker-compose.yml includes:
- PostgreSQL 15 (port 5432)
- Redis (port 6379)
- LocalStack (AWS simulation, port 4566)
```

Known Issues

Dependencies:

```bash
# SQLAlchemy conflict - use specific versions:
pip install "sqlalchemy>=1.4.42,<1.5" "databases>=0.8.0" "alembic>=1.13.1"
```

- First ML load: 2-5 minutes (model download)
- After the first load: <1 second
- Textract: 15-30 seconds per invoice
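
The gap between the first ML load and later calls suggests the model is loaded once and then reused. One common way to get that behavior is to cache the loaded object, sketched below. `load_classifier` is a hypothetical stand-in (it sleeps briefly instead of downloading a model), and nothing here is taken from the project's code.

```python
import time
from functools import lru_cache

LOAD_CALLS = 0  # counts how often the expensive loader actually runs


def load_classifier():
    """Hypothetical stand-in for the slow model load (download + init)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    time.sleep(0.1)  # simulates the expensive first load
    # Returns a dummy callable; a real loader would return the ML model.
    return lambda text: {"text": text, "category": "unknown"}


@lru_cache(maxsize=1)
def get_classifier():
    """Load the classifier on first use and reuse the cached instance."""
    return load_classifier()


first = get_classifier()   # pays the loading cost
second = get_classifier()  # served from the cache
print(first is second, LOAD_CALLS)  # True 1
```

Calling `get_classifier()` once at application startup would move the one-time cost out of the first user request.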
Commit structure:
```
feat: new feature
fix: bug fix
docs: documentation
refactor: code refactoring
test: add tests
```

Development workflow:
- Fork the repo
- Feature branch: git checkout -b feature/new-feature
- Commit: git commit -m "feat: description"
- Push: git push origin feature/new-feature
- Pull Request
Technical:
- Accuracy: >95% Colombian invoices
- Processing time: <30s per invoice
- API response: <200ms query endpoints
- Uptime: 99.9% (target)