Gutenberg Digital Publishing is building an AI-native infrastructure layer that transforms large-scale Arabic heritage publications into machine-readable corpora, semantic indices, and knowledge graphs for research and next-generation AI systems.
Arabic intellectual heritage is largely trapped in scanned PDFs: fragmented, non-searchable, and unusable for modern AI workflows. Without structured metadata, entity linking, and semantic indexing, this knowledge cannot serve researchers or AI systems at scale.
We built a pipeline that converts historical publications into a structured, queryable dataset: metadata + entities + topic models + semantic retrieval + knowledge graph outputs.
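For illustration, one processed page could be represented roughly as the record below; the field names are hypothetical placeholders, not our production schema.

```python
from dataclasses import dataclass, field

# Hypothetical record shape for one processed page; field names are
# illustrative placeholders, not the production schema.
@dataclass
class ProcessedPage:
    page_id: str                       # stable identifier within the corpus
    publication: str                   # source publication title
    issue_date: str                    # ISO date of the original issue
    text: str                          # OCR-corrected Arabic text
    translation: str | None = None     # optional English translation
    entities: list[dict] = field(default_factory=list)    # e.g. {"name": ..., "type": "PERSON"}
    topics: list[str] = field(default_factory=list)        # topic-model labels
    embedding: list[float] = field(default_factory=list)   # vector for semantic retrieval
    kg_triples: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)
```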
The core innovation is a scalable pipeline that turns legacy Arabic heritage scans into structured, machine-readable knowledge. This is infrastructure, not content publishing.
Our platform produces structured assets that can be consumed by search systems, researchers, and AI pipelines. The live MVP demonstrates the browsing and reading layer, but the core value is the structured data layer.
Next step: stable API endpoints for researchers and developers to query the structured corpus, entity graphs, and semantic search layer.
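As a non-binding sketch of what those endpoints might look like (the routes, parameters, and response shapes below are assumptions, not a shipped contract):

```python
from fastapi import FastAPI, Query

app = FastAPI(title="Gutenberg Corpus API (sketch)")

# Illustrative endpoints only; route names and parameters are assumptions,
# not the final API contract.

@app.get("/v1/search")
def semantic_search(q: str = Query(..., description="Natural-language query"),
                    limit: int = 10):
    """Return the top-N passages ranked by semantic similarity."""
    return {"query": q, "limit": limit, "results": []}  # retrieval layer fills this in

@app.get("/v1/entities/{entity_id}/graph")
def entity_graph(entity_id: str, depth: int = 1):
    """Return the local knowledge-graph neighbourhood of an entity."""
    return {"entity": entity_id, "depth": depth, "edges": []}
```

A researcher could then issue a query such as `GET /v1/search?q=...&limit=5` and receive ranked passages that cite back to the scanned originals.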
We already process pages in batches. Scaling is primarily a function of cloud inference throughput and storage + orchestration reliability.
With cloud credits, the goal is predictable throughput and faster enrichment (entities/topics/translation), not just raw OCR.
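A minimal sketch of the batching pattern, assuming a placeholder `enrich_page` call; real throughput depends on inference quotas, rate limits, and orchestration.

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_page(page_path: str) -> dict:
    """Placeholder for OCR refinement + entity/topic extraction + translation."""
    return {"page": page_path, "status": "enriched"}

def run_batch(page_paths: list[str], workers: int = 16) -> list[dict]:
    # Throughput scales roughly with the number of concurrent inference calls,
    # up to the quota of the underlying cloud services.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(enrich_page, page_paths))
```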
Review-friendly estimate of how compute converts into processing throughput; a short arithmetic check follows the table.
| Scenario | Pages / day | Pages / month | 400,000 pages ETA | Notes |
|---|---|---|---|---|
| Constrained | 500 | 15,000 | ~27 months | Limited inference + slower enrichment cadence |
| With credits | 2,000 | 60,000 | ~7 months | Batch OCR + entity/topic enrichment + translation in parallel |
| Optimized | 4,000 | 120,000 | ~3.5 months | Higher concurrency, stronger caching, automated QA routing |
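A minimal check of the arithmetic behind the ETAs above, assuming 400,000 remaining pages and 30 processing days per month:

```python
BACKLOG_PAGES = 400_000

def eta_months(pages_per_day: int, days_per_month: int = 30) -> float:
    """Months to clear the backlog at a sustained daily rate."""
    return BACKLOG_PAGES / (pages_per_day * days_per_month)

for label, rate in [("Constrained", 500), ("With credits", 2_000), ("Optimized", 4_000)]:
    print(f"{label}: ~{eta_months(rate):.1f} months")
# Constrained: ~26.7 months, With credits: ~6.7 months, Optimized: ~3.3 months
```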
A predictable pipeline that outputs monthly dataset snapshots (corpus + index + KG) and keeps the public MVP fast via prebuilt, paginated data artifacts.
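A minimal sketch of how a monthly snapshot could be emitted as prebuilt, paginated JSON artifacts the MVP can serve statically; the output layout and page size are assumptions.

```python
import json
from pathlib import Path

def write_snapshot(records: list[dict], out_dir: str, page_size: int = 100) -> None:
    """Split the corpus snapshot into fixed-size JSON pages for static serving."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(records), page_size):
        chunk = records[i:i + page_size]
        (out / f"page-{i // page_size:05d}.json").write_text(
            json.dumps(chunk, ensure_ascii=False),  # keep Arabic text readable
            encoding="utf-8",
        )
```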
We’re applying to scale batch processing and structured indexing for the remaining pages without runaway costs. Google Cloud unlocks reliable storage, scalable inference, and analytics across a growing corpus; a brief sketch of how these pieces connect follows the table.
| Need | Google Cloud | Outcome |
|---|---|---|
| Batch enrichment & metadata generation | Vertex AI | Faster OCR refinement, entity/topic extraction, translation |
| Archive-scale scan storage | Cloud Storage | Durable storage, versioning, scalable access |
| Structured analytics | BigQuery | Corpus analytics, topic trends, entity networks |
| Automated workflows | Cloud Run / Functions | Ingestion + processing orchestration |
| Fast global access | CDN + Hosting | Lower latency for the public MVP and demos |
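As a rough sketch of how these pieces connect, an ingestion step could land a scan in Cloud Storage and register it in BigQuery for downstream enrichment; the bucket and table names below are placeholders, not our production resources.

```python
from google.cloud import bigquery, storage

# Placeholder resource names; not the production configuration.
BUCKET = "gutenberg-heritage-scans"
TABLE = "gutenberg.corpus.ingestion_log"   # project.dataset.table

def ingest_scan(local_path: str, corpus_key: str) -> None:
    """Upload a scanned page to Cloud Storage and log it in BigQuery for enrichment."""
    blob = storage.Client().bucket(BUCKET).blob(f"raw/{corpus_key}")
    blob.upload_from_filename(local_path)

    rows = [{
        "corpus_key": corpus_key,
        "gcs_uri": f"gs://{BUCKET}/raw/{corpus_key}",
        "status": "pending",
    }]
    errors = bigquery.Client().insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```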
We can’t scale to hundreds of thousands of pages efficiently without cloud batch processing and data infrastructure. Credits directly convert into throughput: more pages/day and more enriched structure per page.
The MVP is publicly accessible and demonstrates structured browsing, article pages, classifications, and dashboard-style insights from the current processed corpus.
MVP/Beta is live. The next phase is scaling the pipeline to process the remaining pages and ship API-ready datasets and knowledge graph outputs.
If you need any clarifications for the Google for Startups Cloud application, please reach out:
We are targeting a scalable institutional business model. The core product is the structured data layer + semantic search + API interfaces. Revenue comes from services and products built on this layer, while maintaining a free public interface as a value demonstration and citable reference.
Credits convert directly into throughput (pages/day) and deeper enrichment (entities/topics/translation), then into a sellable product: paid API + institutional subscriptions + dataset licensing.