chyyynh/newsence

newsence

A content discovery engine that helps LLMs understand your world


What is newsence?

newsence is a content discovery system. It continuously monitors sources across the web, extracts structured knowledge from every article, and makes it available for search, analysis, and AI-powered workflows.

Think of it as an always-on research assistant that reads everything, extracts who's involved, what technologies are mentioned, and what events are happening — then organizes it all into a searchable knowledge base.

Core loop:

Sources arrive (RSS, Twitter, YouTube, HN, Bilibili, Xiaohongshu, manual)
  → AI reads and analyzes each article
  → Extracts entities (people, orgs, products, tech, events)
  → Generates bilingual summaries (EN + Traditional Chinese)
  → Creates semantic embeddings for search
  → Links articles through shared entities

This repo is the engine: a single Cloudflare Worker that handles the full content pipeline.
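
As a rough illustration of the last step in the core loop, linking articles through shared entities can be pictured as an inverted index from entity name to article IDs. This is a minimal in-memory sketch; the `ArticleEntities` shape and both function names are hypothetical, and the actual engine stores entities in normalized database tables instead.

```typescript
// Hypothetical shape: the real tables and column names are not shown in this README.
interface ArticleEntities {
  articleId: string;
  entities: string[]; // entity names, e.g. "OpenAI"
}

// Case-insensitive key so "OpenAI" and "openai" land in the same bucket.
function normalize(name: string): string {
  return name.trim().toLowerCase();
}

// Build an inverted index: entity name -> IDs of the articles that mention it.
function indexByEntity(articles: ArticleEntities[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const article of articles) {
    for (const name of article.entities) {
      const key = normalize(name);
      const ids = index.get(key) ?? new Set<string>();
      ids.add(article.articleId);
      index.set(key, ids);
    }
  }
  return index;
}

// Other articles that share at least one entity, ranked by number of shared entities.
function relatedArticles(articleId: string, articles: ArticleEntities[]): Array<[string, number]> {
  const source = articles.find((a) => a.articleId === articleId);
  if (!source) return [];
  const index = indexByEntity(articles);
  const counts = new Map<string, number>();
  const names = new Set(source.entities.map(normalize));
  names.forEach((name) => {
    (index.get(name) ?? new Set<string>()).forEach((other) => {
      if (other !== articleId) counts.set(other, (counts.get(other) ?? 0) + 1);
    });
  });
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

An article that shares two entities with the source ranks ahead of one that shares only one.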

Supported Platforms

| Platform | Type | Schedule | What it does |
| --- | --- | --- | --- |
| RSS Feeds | Monitor | Every 5 min | Fetches feeds, deduplicates by URL, detects HN links |
| Twitter/X | Monitor | Every 6 hours | Tracks users via Kaito API: tweets, threads, articles, media |
| YouTube | Monitor | Every 30 min | Atom feed → video metadata, transcripts, chapters, AI highlights |
| Bilibili | Monitor | Every 30 min | gRPC mobile API → user dynamics, video cards |
| Xiaohongshu | Monitor | Every 30 min | Profile scraping → user notes, covers |
| Hacker News | Processor | Via RSS | Detects HN links → fetches comments via Algolia → generates editorial notes |
| Web | Scraper | On demand | Full content extraction (Readability + Cheerio), OG metadata |
| User Submissions | Ingestion | Real-time | POST /submit: full crawl + AI, sync response |
| Telegram Bot | Ingestion | Real-time | Send URL in chat → get bilingual summary back |

All platforms output a unified ScrapedContent shape → same AI pipeline.
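
To make the idea of a unified shape concrete, here is a sketch of what a `ScrapedContent` type and URL-based deduplication might look like. The README names the type but does not list its members, so every field below is an assumption. `canonicalUrl` and `dedupeByUrl` are hypothetical helpers, and stripping `utm_` parameters is one possible normalization rather than the project's documented behavior.

```typescript
// Hypothetical source types; only "web" appears in this README's response example.
type SourceType = "rss" | "twitter" | "youtube" | "hackernews" | "bilibili" | "xiaohongshu" | "web";

// Hypothetical fields: assumed for illustration, not taken from the repository.
interface ScrapedContent {
  url: string;
  title: string;
  content: string;
  sourceType: SourceType;
  publishedAt?: string; // ISO 8601 timestamp
  metadata?: Record<string, unknown>; // platform-specific extras
}

// Normalize a URL for comparison: drop the fragment and utm_* tracking parameters.
function canonicalUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = "";
  for (const key of Array.from(u.searchParams.keys())) {
    if (key.startsWith("utm_")) u.searchParams.delete(key);
  }
  return u.toString();
}

// Keep the first item seen for each canonical URL.
function dedupeByUrl(items: ScrapedContent[]): ScrapedContent[] {
  const seen = new Set<string>();
  return items.filter((item) => {
    const key = canonicalUrl(item.url);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Because every platform module would return the same shape, downstream steps never need to branch on where an article came from, except through `metadata`.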

How it works

Each article goes through an automated workflow with independent retries:

URL arrives (RSS cron / Twitter cron / user submit / Telegram bot)
  │
  ├─  1. Fetch Article ──── Load article from database
  ├─  2. AI Analysis ────── Gemini Flash → bilingual title, summary, tags, keywords, entities
  ├─  3. Fetch OG Image ─── Grab OG image if missing (lightweight, first 32KB)
  ├─  4. Translate Content ─ Full article → Traditional Chinese
  ├─  5. Save to DB ──────── Write all AI results in a single UPDATE
  ├─  5b. Sync Entities ─── Upsert entities to normalized tables, link to article
  ├─  6. Notify Telegram ─── Push results to Telegram bot (if triggered via bot)
  ├─  7. YouTube Highlights ─ Generate AI highlights from transcript (YouTube only)
  └─  8. Embed ───────────── BGE-M3 → 1024-dim vector from title + summary + content + entities

~30 seconds per article. Each step retries independently with exponential backoff.
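
The retry behavior can be illustrated with a small standalone helper. Cloudflare Workflows offers per-step retry settings, but the values this project uses are not shown here, so the base delay, cap, and retry count below are assumptions, and `withRetry` is a hypothetical name rather than the project's code.

```typescript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1));
}

// Run `step`; on failure, wait with exponential backoff and try again,
// up to `maxRetries` retries. `sleep` is injectable so tests need not wait.
async function withRetry<T>(
  step: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await step();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) await sleep(backoffDelayMs(attempt + 1));
    }
  }
  throw lastError;
}
```

Wrapping each step separately means a failed translation, for example, is retried without repeating the AI analysis that already succeeded.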

AI Pipeline

| Stage | Model | What it does |
| --- | --- | --- |
| Analysis | Gemini Flash Lite | Article → bilingual title, summary, tags, keywords, category |
| Entity Extraction | Gemini Flash Lite | Article → named entities (person, organization, product, technology, event) with EN + zh-TW names |
| Content Translation | Gemini Flash | Full article content → Traditional Chinese |
| Embedding | BGE-M3 (1024d) | Title + summary + content + entity names → dense vector (HNSW-indexed) |

Entity extraction happens in the same LLM call as analysis — zero extra API cost.
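
One way to picture a single call that returns both analysis and entities is a JSON response with an `entities` array alongside the summary fields. The schema below is hypothetical: the field names are illustrative and only the five entity types come from this README. `parseAnalysis` is a sketch of defensive parsing, not the repository's implementation.

```typescript
type EntityType = "person" | "organization" | "product" | "technology" | "event";

// Hypothetical field names, assumed for illustration.
interface Entity {
  type: EntityType;
  name: string;     // English name
  nameZhTw: string; // Traditional Chinese name
}

interface AnalysisResult {
  titleEn: string;
  titleZhTw: string;
  summaryEn: string;
  summaryZhTw: string;
  tags: string[];
  keywords: string[];
  category: string;
  entities: Entity[];
}

const ENTITY_TYPES: readonly string[] = ["person", "organization", "product", "technology", "event"];

// Accept only objects with a known entity type and both names present.
function isEntity(value: unknown): value is Entity {
  if (typeof value !== "object" || value === null) return false;
  const e = value as Record<string, unknown>;
  return (
    typeof e.type === "string" &&
    ENTITY_TYPES.indexOf(e.type) !== -1 &&
    typeof e.name === "string" &&
    typeof e.nameZhTw === "string"
  );
}

const str = (v: unknown): string => (typeof v === "string" ? v : "");
const strList = (v: unknown): string[] =>
  Array.isArray(v) ? v.filter((x): x is string => typeof x === "string") : [];

// Parse one model response carrying both analysis and entities;
// missing fields fall back to empty values and malformed entities are dropped.
function parseAnalysis(raw: string): AnalysisResult {
  const data = JSON.parse(raw) as Record<string, unknown>;
  const rawEntities: unknown[] = Array.isArray(data.entities) ? data.entities : [];
  return {
    titleEn: str(data.titleEn),
    titleZhTw: str(data.titleZhTw),
    summaryEn: str(data.summaryEn),
    summaryZhTw: str(data.summaryZhTw),
    tags: strList(data.tags),
    keywords: strList(data.keywords),
    category: str(data.category),
    entities: rawEntities.filter(isEntity),
  };
}
```

Validating the entity list separately keeps one malformed entity from discarding an otherwise usable analysis.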

Stack

| Layer | Technology |
| --- | --- |
| Runtime | Cloudflare Workers (V8 isolates) |
| Orchestration | Cloudflare Queues + Workflows |
| Database | Supabase PostgreSQL + pgvector |
| LLM | OpenRouter → Gemini Flash / Flash Lite |
| Embeddings | Cloudflare Workers AI → BGE-M3 |
| Twitter Data | Kaito API |
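
To give a feel for what the embeddings and pgvector layers do together, the sketch below ranks documents by cosine similarity with a brute-force scan. An HNSW index approximates this kind of nearest-neighbor search without comparing against every row. Which distance metric the project actually uses is not stated, so cosine similarity is an assumption here, and `topK` is a hypothetical helper.

```typescript
// Cosine similarity between two equal-length vectors; 1 means same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

interface EmbeddedDoc {
  id: string;
  vector: number[]; // 1024 numbers for BGE-M3; shorter vectors work for demos
}

// Exact top-k search: score every document, sort descending, keep the first k.
function topK(query: number[], docs: EmbeddedDoc[], k: number): Array<{ id: string; score: number }> {
  return docs
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A scan like this is linear in the number of articles, which is why an index becomes worthwhile as the knowledge base grows.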

Quick Start

pnpm install
cp wrangler.jsonc.example wrangler.jsonc   # add your API keys
pnpm dev                                    # local dev server
pnpm run deploy                             # deploy to Cloudflare

API

# Health check
curl https://your-worker.workers.dev/health

# Submit a URL
curl -X POST https://your-worker.workers.dev/submit \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

# Generate embeddings
curl -X POST https://your-worker.workers.dev/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "search query"}'

Response example:

{
  "success": true,
  "results": [{
    "articleId": "550e8400-e29b-41d4-a716-446655440000",
    "title": "Article Title",
    "sourceType": "web",
    "alreadyExists": false
  }]
}
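
For callers that prefer TypeScript over curl, the submit call can be sketched as follows. `buildSubmitRequest` and the `SubmitResponse` type are hypothetical helpers. The response type mirrors the JSON example above on the assumption that it belongs to POST /submit, and `token` maps to the `X-Internal-Token` header that this README lists as optional.

```typescript
interface SubmitRequest {
  endpoint: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Assumed to match the response example in this README.
interface SubmitResponse {
  success: boolean;
  results: Array<{
    articleId: string;
    title: string;
    sourceType: string;
    alreadyExists: boolean;
  }>;
}

// Build the arguments for fetch(); kept separate from the network call so it is easy to test.
function buildSubmitRequest(baseUrl: string, articleUrl: string, token?: string): SubmitRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) headers["X-Internal-Token"] = token;
  return {
    endpoint: `${baseUrl.replace(/\/+$/, "")}/submit`, // tolerate a trailing slash
    init: { method: "POST", headers, body: JSON.stringify({ url: articleUrl }) },
  };
}

// Usage (needs a runtime with fetch, such as Node 18+ or a Worker):
//   const { endpoint, init } = buildSubmitRequest(
//     "https://your-worker.workers.dev", "https://example.com/article");
//   const data = (await (await fetch(endpoint, init)).json()) as SubmitResponse;
```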

Optional auth: X-Internal-Token header. Rate limiting: 20 req/60s per key (configurable).
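
How the limit is enforced is not described here. As one simple possibility, a fixed-window counter keyed by caller would give the stated default of 20 requests per 60 seconds. The class below is a hypothetical in-memory sketch under that assumption, not the project's implementation.

```typescript
// Fixed-window limiter: each key gets `limit` requests per `windowMs` window.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 20, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed. `now` is injectable for testing.
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || now - current.start >= this.windowMs) {
      // No window yet, or the previous one expired: start a fresh window.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

In-memory state is not shared between Worker isolates, so a production limiter would need shared storage. The sketch only shows the counting logic.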

CLI & MCP

Also available as a CLI and MCP server:

npx newsence search "AI agents"       # search articles
npx newsence recent --hours 6         # recent articles

claude mcp add newsence -- npx newsence mcp   # Claude Code
# Remote MCP: https://www.newsence.app/api/mcp

Architecture

src/
├── index.ts                  # Entry — routes HTTP, Cron, Queue
├── platforms/                # Each platform is self-contained
│   ├── twitter/              # monitor, scraper, processor, metadata
│   ├── youtube/              # monitor, scraper, highlights, metadata
│   ├── hackernews/           # scraper, processor, metadata
│   ├── rss/                  # monitor, parser, feed-config
│   └── web/                  # scraper (shared web + OG extraction)
├── domain/
│   ├── workflow.ts           # Workflow orchestration
│   ├── processors.ts         # AI processor factory + DefaultProcessor
│   ├── ai-utils.ts           # Shared AI functions (Gemini, translation)
│   ├── entities.ts           # Entity sync to normalized tables
│   └── distribute.ts         # Subscription fan-out for non-default sources
├── infra/                    # OpenRouter, Workers AI, DB, HTTP utilities
├── models/                   # Types, platform metadata union
└── app/handlers/             # HTTP route handlers
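
The dispatching role of `index.ts` can be sketched as two small pure functions, one for HTTP routes and one for cron triggers. The route list comes from the API section. The cron expressions and monitor names are assumptions chosen to match the schedules in Supported Platforms, and both function names are hypothetical.

```typescript
// Hypothetical cron expressions and monitor names, not taken from the project's config.
const CRON_MONITORS: Record<string, string[]> = {
  "*/5 * * * *": ["rss"],                                 // every 5 min
  "*/30 * * * *": ["youtube", "bilibili", "xiaohongshu"], // every 30 min
  "0 */6 * * *": ["twitter"],                             // every 6 hours
};

// Which monitors should run for a given cron trigger.
function monitorsForCron(cron: string): string[] {
  return CRON_MONITORS[cron] ?? [];
}

type Route = "health" | "submit" | "embed" | "not_found";

// Map an HTTP method and path to a handler name.
function routeHttp(method: string, pathname: string): Route {
  if (method === "GET" && pathname === "/health") return "health";
  if (method === "POST" && pathname === "/submit") return "submit";
  if (method === "POST" && pathname === "/embed") return "embed";
  return "not_found";
}

// In a Worker these would be called from the exported handlers, for example:
//   fetch(request)        -> routeHttp(request.method, new URL(request.url).pathname)
//   scheduled(controller) -> monitorsForCron(controller.cron)
```

Keeping the routing decisions in pure functions lets them be tested without a Workers runtime.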

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| SUPABASE_URL | Yes | Supabase project URL |
| SUPABASE_SERVICE_ROLE_KEY | Yes | Supabase service role key |
| OPENROUTER_API_KEY | Yes | OpenRouter API key |
| CORE_WORKER_INTERNAL_TOKEN | No | Auth token for /submit |
| YOUTUBE_API_KEY | No | YouTube Data API |
| KAITO_API_KEY | No | Kaito API (Twitter) |
| TRANSCRIPT_API_KEY | No | YouTube transcript API |
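
A startup check for the required variables could look like the sketch below. The three names come from the variable list in this section. `missingEnvVars` is a hypothetical helper, and whether the project performs such a check is not stated.

```typescript
// Variables marked as required in this README.
const REQUIRED_VARS = ["SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY", "OPENROUTER_API_KEY"] as const;

// Return the names of required variables that are missing or empty.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((name) => !env[name]);
}

// Usage: fail fast with a readable message instead of a later opaque API error.
//   const missing = missingEnvVars(env);
//   if (missing.length > 0) throw new Error(`Missing env vars: ${missing.join(", ")}`);
```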

License

MIT
