Our link tracker will need to know how to read a page of HTML text and extract links. For example, the following HTML page has a single link to https://blog.boot.dev:
<html>
<body>
<a href="https://blog.boot.dev"><span>Go to Boot.dev</span></a>
</body>
</html>
We want to write a new function called getURLsFromHTML in the same crawl.js file. It takes a string of HTML as input and returns a list of all the link URLs. To do so, we'll use a third-party HTML parsing library called JSDOM:
npm install jsdom
This will install jsdom as a regular "dependency" (as opposed to jest, which is a "devDependency"). Dev dependencies are not required to run your application; they're only needed during development (for things like testing). Regular dependencies are required to run the program itself.
I'll try not to give too many hints: you should go read the JSDOM docs! That said, here are a few:
- const { JSDOM } = require('jsdom')
- new JSDOM(htmlBody) creates a new "document object model" from a string of HTML
- dom.window.document.querySelectorAll('a') returns a NodeList (array-like) of "a" tag elements
In HTML, "a" tags are links, e.g.:
<a href="https://boot.dev">Learn Backend Development</a>
getURLsFromHTML(htmlBody, baseURL) takes 2 arguments. The first is an HTML string, as we discussed earlier, and the second is the root URL of the website we're crawling. This will allow us to rewrite relative URLs into absolute URLs. It returns an un-normalized array of all the URLs found within the HTML.
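For the relative-to-absolute rewriting, one option (an assumption on my part, not the only way to do it) is Node's built-in URL constructor, which accepts an optional base URL as its second argument:

```js
const baseURL = 'https://blog.boot.dev'

// a relative path is resolved against the base URL
console.log(new URL('/path/', baseURL).href)
// https://blog.boot.dev/path/

// an already-absolute URL is left alone
console.log(new URL('https://blog.boot.dev/path/', baseURL).href)
// https://blog.boot.dev/path/
```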
- Test that relative URLs are converted to absolute URLs.
- Test that you find all the <a> tags in a body of HTML (a rough sketch of both tests follows below).
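If it helps to see the shape of those tests, here's a rough Jest sketch. It assumes getURLsFromHTML is exported from crawl.js and that your implementation preserves trailing slashes; adjust the fixtures and expected arrays to match your own design:

```js
const { getURLsFromHTML } = require('./crawl.js')
const { test, expect } = require('@jest/globals')

test('getURLsFromHTML converts relative URLs to absolute URLs', () => {
  const inputBody = '<html><body><a href="/path/">Boot.dev</a></body></html>'
  const inputBaseURL = 'https://blog.boot.dev'
  const actual = getURLsFromHTML(inputBody, inputBaseURL)
  const expected = ['https://blog.boot.dev/path/']
  expect(actual).toEqual(expected)
})

test('getURLsFromHTML finds every <a> tag in the body', () => {
  const inputBody = `<html><body>
    <a href="https://blog.boot.dev/path1/">One</a>
    <a href="https://blog.boot.dev/path2/">Two</a>
  </body></html>`
  const inputBaseURL = 'https://blog.boot.dev'
  const actual = getURLsFromHTML(inputBody, inputBaseURL)
  const expected = ['https://blog.boot.dev/path1/', 'https://blog.boot.dev/path2/']
  expect(actual).toEqual(expected)
})
```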
Then go ahead and implement the function itself!