30 changes: 18 additions & 12 deletions deploy/docker/c4ai-doc-context.md
@@ -4877,23 +4877,29 @@ By default, Crawl4AI automatically generates Markdown from each crawled page. Ho
### Example: Using a Filter with `DefaultMarkdownGenerator`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```

**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` may add around `50ms` of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
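If you want to verify the filter's overhead in your own environment rather than take the `50ms` figure on faith, a small wall-clock timer is enough. The sketch below is a generic stdlib helper (not part of the crawl4ai API); wrap a crawl with and without the `markdown_generator` configured and compare the printed timings:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Minimal wall-clock timer: prints elapsed milliseconds for the
    # wrapped block, so two configurations can be compared side by side.
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.1f} ms")

# Usage inside main(), assuming the crawler and config from the example above:
#     with timed("with PruningContentFilter"):
#         result = await crawler.arun("https://news.ycombinator.com", config=config)
```

Because network latency dominates a single crawl, average several runs before attributing any difference to the filter itself.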
29 changes: 17 additions & 12 deletions docs/md_v2/core/quickstart.md
@@ -97,23 +97,28 @@ By default, Crawl4AI automatically generates Markdown from each crawled page. Ho
### Example: Using a Filter with `DefaultMarkdownGenerator`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```
Comment on lines +100 to 122
Make the snippet resilient to both string and structured result.markdown shapes.

Elsewhere in the doc, result.markdown is treated like a string, but here it’s accessed as an object with raw_markdown/fit_markdown. To avoid version drift and keep this example runnable across releases, compute the lengths via a getattr fallback.

Apply this diff:

```diff
-        print("Raw Markdown length:", len(result.markdown.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+        md = result.markdown
+        raw = getattr(md, "raw_markdown", md)
+        fit = getattr(md, "fit_markdown", md)
+        print("Raw Markdown length:", len(raw))
+        print("Fit Markdown length:", len(fit))
```

Also consider aligning the earlier “Your First Crawl” example to use the same shape (or add a short note clarifying the return type), so readers don’t get conflicting guidance. I can open a follow-up PR if you’d like.
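The fallback pattern suggested above can be exercised without crawl4ai installed. In the sketch below, `FakeMarkdownResult` is a hypothetical stand-in for the library's `MarkdownGenerationResult` (the real class has more fields); it only demonstrates that the same `getattr` expression handles both the structured and the plain-string shape:

```python
from dataclasses import dataclass

# Hypothetical stand-in for crawl4ai's MarkdownGenerationResult,
# used only to illustrate the getattr fallback.
@dataclass
class FakeMarkdownResult:
    raw_markdown: str
    fit_markdown: str

def markdown_lengths(markdown):
    """Return (raw_len, fit_len) whether `markdown` is a plain string
    or an object exposing raw_markdown / fit_markdown attributes."""
    raw = getattr(markdown, "raw_markdown", markdown)
    fit = getattr(markdown, "fit_markdown", markdown)
    return len(raw), len(fit)

# Structured shape: the two attributes are read directly.
structured = FakeMarkdownResult(raw_markdown="# Title\n\nbody", fit_markdown="body")
print(markdown_lengths(structured))       # → (13, 4)

# Plain-string shape: getattr falls back to the string itself,
# so both lengths collapse to the same value.
print(markdown_lengths("just markdown"))  # → (13, 13)
```

Since `getattr` with a default never raises, the snippet stays correct on releases where `result.markdown` is a str subclass as well as those where it is a structured object.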


**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` may add around `50ms` of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.