Article Content Extraction API
Extract clean article body, title, author, and publish date from any blog or news page — without the ads and clutter.
The Agenty Content API extracts the main article body from any blog or news URL, automatically removing navigation, ads, sidebars, and footers. Get the title, author, publish date, hero image, and a clean HTML or plain-text body — ready for aggregators, newsletters, or LLM pipelines.
Features
- Auto content detection: Find the main article block on any page.
- Title & metadata: Title, author, publish date, and language.
- Clean HTML or text: Output as semantic HTML or plain text.
- Multi-language: Works on 50+ languages out of the box.
- Media URLs: Extract hero images and embedded videos.
- Tags & categories: Pull article tags when present in markup.
- Word count: Word count and estimated reading time.
- Paywall support: Pass cookies for subscriber content.
Use cases
- News and blog aggregation feeds
- Newsletter and content curation pipelines
- Competitive content and SEO analysis
- Building clean text corpora for LLM training
- Reader-mode features in apps and extensions
API examples
cURL

curl -X GET "https://api.agenty.ai/v1/content?url=https://example.com/blog/post" \
  -H "Authorization: Bearer YOUR_API_KEY"

JavaScript

const res = await fetch(
  'https://api.agenty.ai/v1/content?url=https://example.com/blog/post',
  { headers: { 'Authorization': 'Bearer YOUR_API_KEY' } },
);
const article = await res.json();
console.log(article.title, article.text);

Python

import requests

res = requests.get(
    "https://api.agenty.ai/v1/content",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"url": "https://example.com/blog/post"},
)
article = res.json()
print(article["title"], article["text"])

How Agenty compares
| Feature | Agenty | Readability | Mercury | Postlight |
|---|---|---|---|---|
| Automatic content detection | Yes | Yes | Yes | Yes |
| Author & date extraction | Yes | Limited | Yes | Yes |
| Multi-language (50+) | Yes | Limited | Yes | Yes |
| Image & video extraction | Yes | No | Yes | Yes |
| Hosted API + free tier | Yes | Self-host | Self-host | Yes |
Frequently asked questions
What is the Article Content Extraction API?
The Agenty Content API automatically identifies and extracts the main article on any web page. It returns clean structured data: title, author, publish date, article body, and embedded media URLs.
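As a rough illustration of consuming that structured data, the sketch below maps a decoded JSON response onto a small Python dataclass. Only the `title` and `text` keys appear in the API examples on this page; the `author`, `publishDate`, and `media` key names, and the `Article` and `parse_article` names themselves, are assumptions made for illustration and may not match the real response schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Article:
    """Hypothetical container for one extracted article."""
    title: Optional[str] = None
    text: Optional[str] = None
    author: Optional[str] = None        # assumed response key: "author"
    publish_date: Optional[str] = None  # assumed response key: "publishDate"
    media: List[str] = field(default_factory=list)  # assumed key: "media"


def parse_article(payload: Dict[str, Any]) -> Article:
    """Build an Article from a decoded JSON body, tolerating missing keys."""
    return Article(
        title=payload.get("title"),
        text=payload.get("text"),
        author=payload.get("author"),
        publish_date=payload.get("publishDate"),
        media=list(payload.get("media") or []),
    )
```

Using `.get()` throughout means a page with no detectable author or media still yields a usable object instead of raising `KeyError`.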
Can I get plain text instead of HTML?
Yes. Set outputFormat: "text" to receive plain text with paragraphs preserved. The default is "html", which returns clean semantic HTML.
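A minimal sketch of selecting the output format, assuming `outputFormat` is accepted as a query parameter next to `url` (this page does not say whether it belongs in the query string or in a request body). The helper name `build_content_url` is hypothetical. It also percent-encodes the target URL, which matters when the article URL carries its own query string.

```python
from urllib.parse import urlencode

# Endpoint taken from the API examples on this page.
API_ENDPOINT = "https://api.agenty.ai/v1/content"


def build_content_url(page_url: str, output_format: str = "html") -> str:
    """Return the request URL for one article.

    Assumption: outputFormat is sent as a query parameter.
    """
    if output_format not in ("html", "text"):
        raise ValueError('output_format must be "html" or "text"')
    # urlencode percent-encodes the nested URL, so its own "?" and "&"
    # cannot be mistaken for parameters of the API request.
    params = {"url": page_url, "outputFormat": output_format}
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

The resulting string can be passed to `requests.get` together with the `Authorization` header shown in the examples.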
Does it work with paywalled content?
Yes. Pass cookies or auth headers via the headers parameter. We also support session-based authentication for platforms like Medium and Substack.
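The wire format of the headers parameter is not documented on this page, so the following is only a sketch under a stated assumption: that `headers` is sent as a JSON-encoded object in the request parameters. The helper name `build_paywall_params` and the cookie names in the usage note are hypothetical.

```python
import json
from typing import Dict


def build_paywall_params(page_url: str, cookies: Dict[str, str]) -> Dict[str, str]:
    """Build request parameters that forward subscriber cookies.

    Assumption: the API reads a JSON-encoded "headers" parameter.
    """
    # Fold the cookie dict into a single Cookie header value.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return {
        "url": page_url,
        "headers": json.dumps({"Cookie": cookie_header}),
    }
```

For example, `build_paywall_params("https://example.com/members/post", {"session": "abc"})` produces a dict that could be passed as `params=` to `requests.get`, if the assumption about the parameter format holds.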
Is there a free tier?
Yes. All accounts include a free tier. Visit our pricing page for details.