Gutenberg Digital Publishing is building an AI-native infrastructure layer that transforms large-scale Arabic heritage publications into machine-readable corpora, semantic indices, and knowledge graphs for research and next-generation AI systems.
Arabic intellectual heritage is largely trapped in scanned PDFs: fragmented, non-searchable, and unusable for modern AI workflows. Without structured metadata, entity linking, and semantic indexing, this knowledge cannot serve researchers or AI systems at scale.
We built a pipeline that converts historical publications into a structured, queryable dataset: metadata + entities + topic models + semantic retrieval + knowledge graph outputs.
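For illustration, one processed page could be represented roughly as the record below; the field names are hypothetical placeholders, not our production schema.

```python
from dataclasses import dataclass, field

# Hypothetical record shape for one processed page; field names are
# illustrative placeholders, not the production schema.
@dataclass
class ProcessedPage:
    page_id: str                       # stable identifier within the corpus
    publication: str                   # source publication title
    issue_date: str                    # ISO date of the original issue
    text: str                          # OCR-corrected Arabic text
    translation: str | None = None     # optional English translation
    entities: list[dict] = field(default_factory=list)    # e.g. {"name": ..., "type": "PERSON"}
    topics: list[str] = field(default_factory=list)        # topic-model labels
    embedding: list[float] = field(default_factory=list)   # vector for semantic retrieval
    kg_triples: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)
```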
The core innovation is a scalable pipeline that turns legacy Arabic heritage scans into structured, machine-readable knowledge. This is infrastructure, not content publishing.
Our platform produces structured assets that can be consumed by search systems, researchers, and AI pipelines. The live MVP demonstrates the browsing and reading layer, but the core value is the structured data layer.
Next step: stable API endpoints for researchers and developers to query the structured corpus, entity graphs, and semantic search layer.
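As a non-binding sketch of what those endpoints might look like (the routes, parameters, and response shapes below are assumptions, not a shipped contract):

```python
from fastapi import FastAPI, Query

app = FastAPI(title="Gutenberg Corpus API (sketch)")

# Illustrative endpoints only; route names and parameters are assumptions,
# not the final API contract.

@app.get("/v1/search")
def semantic_search(q: str = Query(..., description="Natural-language query"),
                    limit: int = 10):
    """Return the top-N passages ranked by semantic similarity."""
    return {"query": q, "limit": limit, "results": []}  # retrieval layer fills this in

@app.get("/v1/entities/{entity_id}/graph")
def entity_graph(entity_id: str, depth: int = 1):
    """Return the local knowledge-graph neighbourhood of an entity."""
    return {"entity": entity_id, "depth": depth, "edges": []}
```

A researcher could then issue a query such as `GET /v1/search?q=...&limit=5` and receive ranked passages that cite back to the scanned originals.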
We already process pages in batches. Scaling is primarily a function of cloud inference throughput and storage + orchestration reliability.
With cloud credits, the goal is predictable throughput and faster enrichment (entities/topics/translation), not just raw OCR.
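A minimal sketch of the batching pattern, assuming a placeholder `enrich_page` call; real throughput depends on inference quotas, rate limits, and orchestration.

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_page(page_path: str) -> dict:
    """Placeholder for OCR refinement + entity/topic extraction + translation."""
    return {"page": page_path, "status": "enriched"}

def run_batch(page_paths: list[str], workers: int = 16) -> list[dict]:
    # Throughput scales roughly with the number of concurrent inference calls,
    # up to the quota of the underlying cloud services.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(enrich_page, page_paths))
```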
Review-friendly estimate of how compute converts into processing throughput; a short arithmetic check follows the table.
| Scenario | Pages / day | Pages / month | 400,000 pages ETA | Notes |
|---|---|---|---|---|
| Constrained | 500 | 15,000 | ~27 months | Limited inference + slower enrichment cadence |
| With credits | 2,000 | 60,000 | ~7 months | Batch OCR + entity/topic enrichment + translation in parallel |
| Optimized | 4,000 | 120,000 | ~3.5 months | Higher concurrency, stronger caching, automated QA routing |
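A minimal check of the arithmetic behind the ETAs above, assuming 400,000 remaining pages and 30 processing days per month:

```python
BACKLOG_PAGES = 400_000

def eta_months(pages_per_day: int, days_per_month: int = 30) -> float:
    """Months to clear the backlog at a sustained daily rate."""
    return BACKLOG_PAGES / (pages_per_day * days_per_month)

for label, rate in [("Constrained", 500), ("With credits", 2_000), ("Optimized", 4_000)]:
    print(f"{label}: ~{eta_months(rate):.1f} months")
# Constrained: ~26.7 months, With credits: ~6.7 months, Optimized: ~3.3 months
```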
A predictable pipeline that outputs monthly dataset snapshots (corpus + index + KG) and keeps the public MVP fast via prebuilt, paginated data artifacts.
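A minimal sketch of how a monthly snapshot could be emitted as prebuilt, paginated JSON artifacts the MVP can serve statically; the output layout and page size are assumptions.

```python
import json
from pathlib import Path

def write_snapshot(records: list[dict], out_dir: str, page_size: int = 100) -> None:
    """Split the corpus snapshot into fixed-size JSON pages for static serving."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(records), page_size):
        chunk = records[i:i + page_size]
        (out / f"page-{i // page_size:05d}.json").write_text(
            json.dumps(chunk, ensure_ascii=False),  # keep Arabic text readable
            encoding="utf-8",
        )
```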
We’re applying to scale batch processing and structured indexing for the remaining pages without runaway costs. Google Cloud unlocks reliable storage, scalable inference, and analytics across a growing corpus; a brief sketch of how these pieces connect follows the table.
| Need | Google Cloud | Outcome |
|---|---|---|
| Batch enrichment & metadata generation | Vertex AI | Faster OCR refinement, entity/topic extraction, translation |
| Archive-scale scan storage | Cloud Storage | Durable storage, versioning, scalable access |
| Structured analytics | BigQuery | Corpus analytics, topic trends, entity networks |
| Automated workflows | Cloud Run / Functions | Ingestion + processing orchestration |
| Fast global access | CDN + Hosting | Lower latency for the public MVP and demos |
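As a rough sketch of how these pieces connect, an ingestion step could land a scan in Cloud Storage and register it in BigQuery for downstream enrichment; the bucket and table names below are placeholders, not our production resources.

```python
from google.cloud import bigquery, storage

# Placeholder resource names; not the production configuration.
BUCKET = "gutenberg-heritage-scans"
TABLE = "gutenberg.corpus.ingestion_log"   # project.dataset.table

def ingest_scan(local_path: str, corpus_key: str) -> None:
    """Upload a scanned page to Cloud Storage and log it in BigQuery for enrichment."""
    blob = storage.Client().bucket(BUCKET).blob(f"raw/{corpus_key}")
    blob.upload_from_filename(local_path)

    rows = [{
        "corpus_key": corpus_key,
        "gcs_uri": f"gs://{BUCKET}/raw/{corpus_key}",
        "status": "pending",
    }]
    errors = bigquery.Client().insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```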
We can’t scale to hundreds of thousands of pages efficiently without cloud batch processing and data infrastructure. Credits directly convert into throughput: more pages/day and more enriched structure per page.
The MVP is publicly accessible and demonstrates structured browsing, article pages, classifications, and dashboard-style insights from the current processed corpus.
MVP/Beta is live. The next phase is scaling the pipeline to process the remaining pages and ship API-ready datasets and knowledge graph outputs.
If you need any clarifications for the Google for Startups Cloud application, please reach out:
We are targeting a scalable institutional business model. The core product is the structured data layer + semantic search + API interfaces. Revenue comes from services and products built on this layer, while maintaining a free public interface as a value demonstration and citable reference.
Credits convert directly into throughput (pages/day) and deeper enrichment (entities/topics/translation), then into a sellable product: paid API + institutional subscriptions + dataset licensing.