Gutenberg Digital Publishing

Google for Startups Cloud • Application
AI Deep Tech Infrastructure

Arabic knowledge, structured for AI at scale.

Gutenberg Digital Publishing is building an AI-native infrastructure layer that transforms large-scale Arabic heritage publications into machine-readable corpora, semantic indices, and knowledge graphs for research and next-generation AI systems.

2,000,000+
Pages scanned
Raw heritage scans from Arabic periodicals (1887–1975).
8,000+
Indexed articles
Metadata, authors, topics, and connections.
99%
OCR accuracy
Validated quality across the current dataset.
450,000+
Pages — Phase 1
Phase 1 structuring target: publications through 1953, converted into searchable, AI-ready data.

The Problem → The Infrastructure Opportunity

Why this matters for cloud-scale AI

Problem

Arabic intellectual heritage is largely trapped in scanned PDFs: fragmented, non-searchable, and unusable for modern AI workflows. Without structured metadata, entity linking, and semantic indexing, this knowledge cannot serve researchers or AI systems at scale.

1. Unstructured scans: great content, but not machine-readable.
2. No semantic layer: missing entities, topics, timelines, and links.
3. AI data imbalance: Arabic remains underrepresented in high-quality corpora.

Solution

We built a pipeline that converts historical publications into a structured, queryable dataset: metadata + entities + topic models + semantic retrieval + knowledge graph outputs.

A. AI OCR + validation: high-fidelity extraction with quality checks.
B. Metadata + entity extraction: authors, persons, topics, time, context.
C. Semantic indexing + KG: cross-article links, clustering, and graph-ready output.

Architecture Overview

Simple pipeline view (review-friendly)
INGESTION → OCR → STRUCTURE → ENRICH → INDEX → GRAPH/API
Ingestion: scans → storage
AI OCR: extract text + QC
Structuring: metadata schema
Enrichment: entities + topics
Indexing: search + vectors
KG / API: graph-ready outputs
Output: machine-readable corpus • semantic search • knowledge graph • research/AI-ready datasets
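To make the hand-offs concrete, here is a minimal, hypothetical sketch of how the stages compose per article; the function names and record shape are illustrative placeholders, not the production implementation.

def ocr(scan_path: str) -> dict:
    """AI OCR + quality checks: scan image -> raw text (placeholder body)."""
    return {"article_id": scan_path, "text": "..."}

def structure(record: dict) -> dict:
    """Apply the metadata schema: title, author, date, pages (placeholder)."""
    return {**record, "metadata": {}}

def enrich(record: dict) -> dict:
    """Entity + topic extraction over the structured text (placeholder)."""
    return {**record, "entities": [], "topics": []}

def index(record: dict) -> dict:
    """Full-text + vector indexing for semantic retrieval (placeholder)."""
    return {**record, "embedding": []}

def to_graph(record: dict) -> list:
    """Emit graph-ready edges, e.g. author -> article -> topic (placeholder)."""
    return []

def process(scan_path: str) -> tuple[dict, list]:
    """Ingestion -> OCR -> structure -> enrich -> index -> graph output."""
    record = index(enrich(structure(ocr(scan_path))))
    return record, to_graph(record)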

What’s “deep tech” here?

The core innovation is a scalable pipeline that turns legacy Arabic heritage scans into structured, machine-readable knowledge. This is infrastructure, not content publishing.

OCR + QA · Entity linking · Semantic retrieval · Scale bottleneck

Research & Dataset Outputs

What we produce (not just pages)

Outputs are AI-ready artifacts

Our platform produces structured assets that can be consumed by search systems, researchers, and AI pipelines. The live MVP demonstrates the browse & reading layer, but the core value is the structured data layer.

DS · Structured corpus exports: article-level JSON with title, author, date, pages, summaries, topics, entities, and citations.
KG · Knowledge graph snapshots: author networks, entity links, cross-article relations, timeline edges.
VX · Semantic index (vector-ready): embeddings-ready records for semantic retrieval and research assistants.
EN · Internationalization layer: English metadata translation to support non-Arabic researchers and global discovery.
{ "article_id": "RISALA_1937_05_10_001", "title_ar": "عنوان المقال", "title_en": "Translated title", "author_ar": "اسم الكاتب", "date": "1937-05-10", "pages": [17,18,19], "topics": ["literature","criticism"], "entities": [{"type":"person","value":"طه حسين"}], "links": {"related_articles": ["..."], "issue_url": "..."} }

API direction

Next step: stable API endpoints for researchers and developers to query the structured corpus, entity graphs, and semantic search layer.

/articles: filter by year, author, topic, entity.
/graph: edges between authors, topics, and entities.
/search: full-text + semantic retrieval.
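A usage sketch of how these endpoints might be called once they ship; the base URL, parameters, and auth scheme below are assumptions, since the API is a next step rather than a published interface.

import requests

BASE = "https://api.example.com/v1"                 # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}     # placeholder auth scheme

# /articles: filter by year, author, topic, entity
articles = requests.get(f"{BASE}/articles",
                        params={"year": 1937, "topic": "criticism"},
                        headers=HEADERS).json()

# /search: full-text + semantic retrieval
hits = requests.get(f"{BASE}/search",
                    params={"q": "طه حسين", "mode": "semantic"},
                    headers=HEADERS).json()

# /graph: edges between authors, topics, and entities
edges = requests.get(f"{BASE}/graph",
                     params={"entity": "طه حسين", "depth": 1},
                     headers=HEADERS).json()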

Scale Plan (Non-Financial Math)

Throughput targets + cloud impact

Current baseline

We already process pages in batches. Scaling is primarily a function of cloud inference throughput and storage + orchestration reliability.

T0 · Today: batch processing with a constrained compute budget.
QC · Quality gate: automated checks + human spot QA for flagged pages (see the sketch after this list).
IO · Data IO: high-res inputs + versioned outputs per page/article.
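As a sketch of how the quality gate can route pages, assume each OCR'd page carries a confidence score; the thresholds and field names below are illustrative, not the production rules.

MIN_CONFIDENCE = 0.90      # below this, the page is flagged for human spot QA
MIN_ARABIC_RATIO = 0.60    # expected share of Arabic-script characters on a page

def needs_human_review(page: dict) -> bool:
    """Automated check: flag low OCR confidence or suspicious script statistics."""
    text = page.get("text", "")
    arabic_chars = sum("\u0600" <= ch <= "\u06FF" for ch in text)
    arabic_ratio = arabic_chars / max(len(text), 1)
    return page.get("ocr_confidence", 0.0) < MIN_CONFIDENCE or arabic_ratio < MIN_ARABIC_RATIO

# Example: a high-confidence, mostly-Arabic page skips human QA.
page = {"page_id": "RISALA_1937_05_10_p17", "ocr_confidence": 0.97, "text": "نص عربي " * 50}
print(needs_human_review(page))  # False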

Targets after credits

With cloud credits, the goal is predictable throughput and faster enrichment (entities/topics/translation), not just raw OCR.

P/D · Pages/day: increase batch throughput via Vertex AI + orchestration.
E2E · End-to-end completion: OCR → structure → enrichment → index per batch.
API · Deliverables: publish dataset snapshots and KG/API artifacts regularly.

Simple scaling math (illustrative)

A review-friendly estimate of how compute converts into throughput; the figures below are illustrative.

Scenario | Pages / day | Pages / month | 400,000-page ETA | Notes
Constrained | 500 | 15,000 | ~27 months | Limited inference + slower enrichment cadence
With credits | 2,000 | 60,000 | ~7 months | Batch OCR + entity/topic enrichment + translation in parallel
Optimized | 4,000 | 120,000 | ~3.5 months | Higher concurrency, stronger caching, automated QA routing
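The arithmetic behind the table is simply the remaining pages divided by monthly throughput; a quick check with the table's illustrative rates:

PAGES_REMAINING = 400_000

def eta_months(pages_per_day: int, days_per_month: int = 30) -> float:
    """Months needed to structure the remaining pages at a steady daily rate."""
    return PAGES_REMAINING / (pages_per_day * days_per_month)

for label, rate in [("Constrained", 500), ("With credits", 2_000), ("Optimized", 4_000)]:
    print(f"{label:12s} {eta_months(rate):4.1f} months")
# -> 26.7, 6.7, and 3.3 months, which the table rounds to ~27, ~7, and ~3.5.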

What “success” looks like

A predictable pipeline that outputs monthly dataset snapshots (corpus + index + KG) and keeps the public MVP fast via prebuilt, paginated data artifacts.

Monthly snapshots · Stable schema · KG growth · Latency down

Why Google Cloud

Credits → measurable throughput

Cloud resources map to pipeline stages

We’re applying to scale batch processing and structured indexing for the remaining pages without runaway costs. Google Cloud unlocks reliable storage, scalable inference, and analytics across a growing corpus.

Need | Google Cloud | Outcome
Batch enrichment & metadata generation | Vertex AI | Faster OCR refinement, entity/topic extraction, translation
Archive-scale scan storage | Cloud Storage | Durable storage, versioning, scalable access
Structured analytics | BigQuery | Corpus analytics, topic trends, entity networks
Automated workflows | Cloud Run / Functions | Ingestion + processing orchestration
Fast global access | CDN + Hosting | Lower latency for the public MVP and demos
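As one sketch of how the "Automated workflows" row might look in practice, the snippet below lists pending scans in a Cloud Storage bucket and hands them to the OCR stage in batches; the bucket name, prefix, and submit_ocr_batch() placeholder are assumptions, not the deployed service.

from google.cloud import storage  # pip install google-cloud-storage

RAW_BUCKET = "gutenberg-raw-scans"   # placeholder bucket name
PREFIX = "risala/1937/"              # placeholder issue prefix

def submit_ocr_batch(object_names: list[str]) -> None:
    """Placeholder for the real OCR submission, e.g. a Vertex AI batch job."""
    print(f"submitting {len(object_names)} pages for OCR")

def run_ingestion(batch_size: int = 200) -> None:
    """List unprocessed scans and enqueue them for OCR in fixed-size batches."""
    client = storage.Client()
    pending = [b.name for b in client.list_blobs(RAW_BUCKET, prefix=PREFIX)
               if b.name.endswith((".tif", ".jpg"))]
    for i in range(0, len(pending), batch_size):
        submit_ocr_batch(pending[i:i + batch_size])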

Current bottleneck

We can’t scale to hundreds of thousands of pages efficiently without cloud batch processing and data infrastructure. Credits directly convert into throughput: more pages/day and more enriched structure per page.

Compute: batch inference for OCR + enrichment.
Quality: validation loops + structured consistency.
Storage: high-res assets + versioned outputs.

Product Proof (Live MVP)

Publicly accessible

Live demo

The MVP is publicly accessible and demonstrates structured browsing, article pages, classifications, and dashboard-style insights from the current processed corpus.

Status

MVP/Beta is live. The next phase is scaling the pipeline to process the remaining pages and ship API-ready datasets and knowledge graph outputs.

Stage: MVP / Beta
Next milestone: scale processing + API outputs
Core asset: structured Arabic corpus

Contact

Get in touch

If you need any clarifications for the Google for Startups Cloud application, please reach out:

Business Model & Revenue

Revenue projections after platform completion

Where does revenue come from? (B2B first)

We are targeting a scalable institutional business model. The core product is the structured data layer + semantic search + API interfaces. Revenue comes from services and products built on this layer, while maintaining a free public interface as a value demonstration and citable reference.

API · API Access (SaaS): monthly/annual usage-based plans (limits + seats) for programmatic access to /articles, /search, /graph, and structured export operations.
UNI · University & Research Center Subscriptions: advanced analytics dashboards, classification and indexing tools, structured snapshot downloads, and institutional access control (SSO).
LIC · Structured Dataset Licensing: licensing of the structured corpus and knowledge graph snapshots for research or commercial use under clear agreements and a defined usage scope.
ENT · Enterprise Solutions & Execution Services: running the same processing pipeline for institutions with closed archives (digitization + OCR + structuring + enrichment), delivering structured data and internal search interfaces.

Why is this suited for Google Cloud Credits?

Credits convert directly into throughput (pages/day) and deeper enrichment (entities/topics/translation), then into a sellable product: paid API + institutional subscriptions + dataset licensing.

API Revenue · Institutional Subscriptions · Dataset Licensing · Enterprise Services

Team & Advisors

Key members, roles, short bios, and available LinkedIn profiles.

Leadership

Ahmed Elwakil

Founder & Managing Director
25+ years building Arabic digital content and knowledge systems. Founder of Arabia for Research & Information Systems; leading the transformation of heritage archives into structured, searchable knowledge.

Advisors

Ahmed El-Dakhakhny

Engineering Consultant · Technical Advisor
Senior software engineer & technical architect (9+ years). Built scalable mobile and distributed systems; contributed to scaling fintech products to 500K+ monthly active users.

Dr. Abdel-Razek Eissa

Research Advisor · Modern History
Supports scholarly verification, historical context, and research guidance for the archive outputs.

Ghareeb Qassem

Linguistics Advisor · PhD Researcher in Arabic Language
Supports linguistic analysis, terminology normalization, and quality control for entities and semantic linking across the pipeline.

Adel Naggaar

Project Supervisor · Sources & Acquisition
Oversees source acquisition and supply coordination, supporting project execution management and overall operations.

Operations Team

Imaging: Alaa Mahmoud, Hager Morshed, Noura El-Qabbani
Review: Rashid El-Khashab, Ali El-Helaly
Indexing: Mohamed Badran, Abdelrahman Sherif
Quality Control: Khadija Tamim, Youssef Elwakil