You’ve successfully scraped some HTML from the internet, but when you look at it, it looks like a mess. There are tons of HTML elements here and there, thousands of attributes scattered around—and maybe there’s some JavaScript mixed in as well? It’s time to parse this lengthy code response with the help of Python to make it more accessible so you can pick out the data that you want.
Beautiful Soup is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you interact with a web page using developer tools. The library exposes intuitive methods that you can use to explore the HTML you received. To get started, use your terminal to install Beautiful Soup into your virtual environment:
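The install command, run inside your activated virtual environment, could look like this. Note that the package name on PyPI is beautifulsoup4, even though you import it as bs4:

```shell
python -m pip install beautifulsoup4
```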
Then, import the library in your Python script and create a BeautifulSoup object:
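Those lines could look like the following sketch. Because this example can’t fetch the live page, an inline bytestring stands in for the page.content that you’d get from Requests:

```python
from bs4 import BeautifulSoup

# Stand-in for page.content, the bytes of the HTML response that
# you scraped earlier with page = requests.get(URL):
page_content = b"<html><head><title>Fake Python</title></head><body></body></html>"

soup = BeautifulSoup(page_content, "html.parser")
print(soup.title.text)  # -> Fake Python
```

Here, "html.parser" selects Python’s built-in parser, which keeps the example free of additional dependencies.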
With the import and the constructor call in place, you create a BeautifulSoup object that takes page.content as input, which is the HTML content that you scraped earlier.
At this point, you’re set up with a BeautifulSoup object that you named soup. You can now run your script using Python’s interactive mode:
When you use the command-line option -i to run a script, Python executes the code and then drops you into a REPL environment. This can be a good way to continue exploring the scraped HTML through the user-friendly lens of Beautiful Soup.
Find Elements by ID
In an HTML web page, every element can have an id attribute assigned. As the name suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.
Switch back to developer tools and identify the HTML object that contains all the job postings. Explore by hovering over parts of the page and using right-click to Inspect.
Note: It helps to periodically switch back to your browser and explore the page interactively using developer tools. You’ll get a better idea of where and how to find the exact elements that you’re looking for.
In this case, the element that you’re looking for is a <div> with an id attribute that has the value "ResultsContainer". It has some other attributes as well, but the id is all you need to pick it out.
Beautiful Soup allows you to find that specific HTML element by its ID:
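A sketch of that lookup, using a trimmed-down stand-in for the page’s HTML in place of the full scraped response:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the scraped page's HTML:
html = """
<div id="ResultsContainer">
  <div class="card-content"><h2 class="title">Senior Python Developer</h2></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the one element whose id attribute is "ResultsContainer":
results = soup.find(id="ResultsContainer")
print(results.prettify())
```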
For easier viewing, you can prettify any BeautifulSoup object when you print it out. If you call .prettify() on the results variable that you assigned above, then you’ll see all the HTML contained within that <div>, neatly indented.
When you find an element by its ID, you can pick out one specific element from among the rest of the HTML, no matter how large the source code of the website is. Now you can focus on working with only this part of the page’s HTML. It looks like your soup just got a little thinner! Nevertheless, it’s still quite dense.
Find Elements by HTML Class Name
You’ve seen that every job posting is wrapped in a <div> element with the class card-content. Now you can work with your new object called results and select only the job postings in it. These are, after all, the parts of the HTML that you’re interested in! You can pick out all the job cards in a single line of code:
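That single line might look like this; the inline HTML below is a simplified stand-in for the results container you selected earlier:

```python
from bs4 import BeautifulSoup

html = """
<div id="ResultsContainer">
  <div class="card-content"><h2 class="title">Senior Python Developer</h2></div>
  <div class="card-content"><h2 class="title">Energy Engineer</h2></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="ResultsContainer")

# One line to select every job card inside the results container:
job_cards = results.find_all("div", class_="card-content")

# Take a look at all of them:
for job_card in job_cards:
    print(job_card, end="\n" * 2)
```

Note the trailing underscore in class_: because class is a reserved keyword in Python, Beautiful Soup uses class_ for the HTML class attribute.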
Here, you call .find_all() on results, which is a BeautifulSoup object. It returns an iterable containing all the HTML for all the job listings displayed on that page.
Take a look at all of them:
That’s pretty neat already, but there’s still a lot of HTML! You saw earlier that your page has descriptive class names on some elements. You can pick out those child elements from each job posting with .find():
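For example, assuming the descriptive class names from the Fake Python board (title, company, and location), the lookups could look like this:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one job card from the scraped page:
html = """
<div class="card-content">
  <h2 class="title">Senior Python Developer</h2>
  <h3 class="company">Payne, Roberts and Davis</h3>
  <p class="location">Stewartbury, AA</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for job_card in soup.find_all("div", class_="card-content"):
    # .find() returns the first matching child element of each card:
    title_element = job_card.find("h2", class_="title")
    company_element = job_card.find("h3", class_="company")
    location_element = job_card.find("p", class_="location")
    print(title_element)
    print(company_element)
    print(location_element)
```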
Each job_card is another BeautifulSoup object. Therefore, you can use the same methods on it as you did on its parent element, results.
With this code snippet, you’re getting closer and closer to the data that you’re actually interested in. Still, there’s a lot going on with all those HTML tags and attributes floating around:
Next, you’ll learn how to narrow down this output to access only the text content that you’re interested in.
You only want to see the title, company, and location of each job posting. And behold! Beautiful Soup has got you covered. You can add .text to a BeautifulSoup object to return only the text content of the HTML elements that the object contains:
Run the above code snippet, and you’ll see the text of each element displayed. However, you’ll also get some extra whitespace. No worries: because you’re working with Python strings, you can .strip() the superfluous whitespace. You can also apply any other familiar Python string methods to further clean up your text:
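Putting .text and .strip() together on a single stand-in card could look like this; the whitespace in the sample HTML mimics what a real page’s source carries along:

```python
from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <h2 class="title">  Senior Python Developer  </h2>
  <h3 class="company">Payne, Roberts and Davis</h3>
  <p class="location">  Stewartbury, AA  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
job_card = soup.find("div", class_="card-content")

# .text keeps only the human-readable content, and .strip()
# removes the surrounding whitespace:
print(job_card.find("h2", class_="title").text.strip())
print(job_card.find("h3", class_="company").text.strip())
print(job_card.find("p", class_="location").text.strip())
```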
The results finally look much better! You’ve now got a readable list of jobs, associated company names, and each job’s location. However, you’re specifically looking for a position as a software developer, and these results contain job postings in many other fields as well.
Find Elements by Class Name and Text Content
Not all of the job listings are developer jobs. Instead of printing out all the jobs listed on the website, you’ll first filter them using keywords.
You know that job titles on the page are kept within <h2> elements. To filter for only specific jobs, you can use the string argument:
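With a few sample titles standing in for the scraped page, the filter could look like this:

```python
from bs4 import BeautifulSoup

html = """
<h2 class="title">Senior Python Developer</h2>
<h2 class="title">Python Programmer (Entry-Level)</h2>
<h2 class="title">Energy Engineer</h2>
"""

results = BeautifulSoup(html, "html.parser")

# string= matches the element's text exactly, so none of the
# titles above qualify and the result comes back empty:
python_jobs = results.find_all("h2", string="Python")
print(python_jobs)  # -> []
```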
This code finds all <h2> elements where the contained string matches "Python" exactly. Note that you’re directly calling the method on your first results variable. If you go ahead and print() the output of the above code snippet to your console, then you might be disappointed, because it’ll be empty:
There was a Python job in the search results, so why isn’t it showing up?
When you use string as you did above, your program looks for that string exactly. Any variations in the spelling, capitalization, or whitespace will prevent the element from matching. In the next section, you’ll find a way to make your search string more general.
Pass a Function to a Beautiful Soup Method
In addition to strings, you can sometimes pass functions as arguments to Beautiful Soup methods. You can change the previous line of code to use a function instead:
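The function-based variant could look like this, again with sample titles standing in for the scraped page:

```python
from bs4 import BeautifulSoup

html = """
<h2 class="title">Senior Python Developer</h2>
<h2 class="title">Python Programmer (Entry-Level)</h2>
<h2 class="title">Energy Engineer</h2>
"""

results = BeautifulSoup(html, "html.parser")

# The lambda receives each element's text and does a
# case-insensitive substring check instead of an exact match:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

for job in python_jobs:
    print(job.text)
```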
Now you’re passing an anonymous function to the string argument. The lambda function looks at the text of each <h2> element, converts it to lowercase, and checks whether the substring "python" is found anywhere. You can check whether you managed to identify all the Python jobs with this approach:
Your program has found ten matching job posts that include the word "python" in their job title!
Finding elements based on their text content is a powerful way to filter your HTML response for specific information. Beautiful Soup allows you to use exact strings or functions as arguments for filtering text in BeautifulSoup objects.
However, when you try to print the information of the filtered Python jobs like you’ve done before, you run into an error:
This traceback message is a common error that you’ll run into a lot when you’re scraping information from the internet. Inspect the HTML of an element in your python_jobs list. What does it look like? Where do you think the error is coming from?
Identify Error Conditions
When you look at a single element in python_jobs, you’ll see that it consists of only the <h2> element that contains the job title:
When you revisit the code that you used to select the items, you’ll notice that that’s what you targeted. You filtered for only the <h2> title elements of the job postings that contain the word "python". As you can see, these elements don’t include the rest of the information about the job.
The error message you received earlier was related to this:
You tried to find the job title, the company name, and the job’s location in each element in python_jobs, but each element contains only the job title text.
Your diligent parsing library still looks for the other ones, too, and returns None because it can’t find them. Then, print() fails with the shown error message when you try to extract the .text attribute from one of these None objects.
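You can reproduce the failure in isolation. This sketch catches the exception so that it runs cleanly:

```python
from bs4 import BeautifulSoup

# python_jobs contains bare title elements, like this one:
h2 = BeautifulSoup(
    '<h2 class="title">Senior Python Developer</h2>', "html.parser"
).h2

# The company element isn't a child of the <h2>, so .find() returns None:
company_element = h2.find("h3", class_="company")
print(company_element)  # -> None

# Asking None for .text raises the AttributeError from the traceback:
try:
    print(company_element.text)
except AttributeError as error:
    print(error)
```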
The text you’re looking for is nested in sibling elements of the <h2> elements that your filter returned. Beautiful Soup can help you select sibling, child, and parent elements of each BeautifulSoup object.
Access Parent Elements
One way to get access to all the information for a job is to step up in the hierarchy of the DOM, starting from the <h2> elements that you identified. Take another look at the HTML of a single job posting, for example by using your developer tools. Then, find the <h2> element that contains the job title, as well as its closest parent element that contains all the information that you’re interested in.
The <div> element with the card-content class contains all the information you want. It’s a third-level parent of the <h2> title element that you found using your filter.
With this information in mind, you can now use the elements in python_jobs and fetch their great-grandparent elements to get access to all the information you want:
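A sketch of that step, using a stand-in snippet that mirrors the nesting on the real page, where each title sits three levels below its card-content <div>:

```python
from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title">Senior Python Developer</h2>
    </div>
  </div>
  <p class="location">Stewartbury, AA</p>
</div>
"""

results = BeautifulSoup(html, "html.parser")

# Step three generations up from each matching title element:
python_job_cards = [
    h2_element.parent.parent.parent
    for h2_element in results.find_all(
        "h2", string=lambda text: "python" in text.lower()
    )
]
print(python_job_cards[0]["class"])  # -> ['card-content']
```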
You added a list comprehension that operates on each of the <h2> title elements in python_jobs that you got by filtering with the lambda expression. You’re selecting the parent element of the parent element of the parent element of each <h2> title element. That’s three generations up!
When you were looking at the HTML of a single job posting, you identified that this specific parent element with the class name card-content contains all the information you need.
Now you can adapt the code in your for loop to iterate over the parent elements instead:
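The adapted loop could look like this; the inline HTML again stands in for the scraped page and mirrors its nesting:

```python
from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title">Senior Python Developer</h2>
      <h3 class="company">Payne, Roberts and Davis</h3>
    </div>
  </div>
  <p class="location">Stewartbury, AA</p>
</div>
"""

results = BeautifulSoup(html, "html.parser")

python_job_cards = [
    h2_element.parent.parent.parent
    for h2_element in results.find_all(
        "h2", string=lambda text: "python" in text.lower()
    )
]

# Each card now contains the title and its sibling elements,
# so the familiar lookups succeed again:
for job_card in python_job_cards:
    print(job_card.find("h2", class_="title").text.strip())
    print(job_card.find("h3", class_="company").text.strip())
    print(job_card.find("p", class_="location").text.strip())
```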
When you run your script another time, you’ll see that your code once again has access to all the relevant information. That’s because you’re now looping over the full job cards instead of just the title elements.
Using the .parent attribute that each BeautifulSoup object comes with gives you an intuitive way to step through your DOM structure and address the elements that you need. You can also access child elements and sibling elements in a similar manner. Read up on navigating the tree for more information.
At this point, you’ve already written code that scrapes the site and filters its HTML for relevant job postings. Well done! However, what’s still missing is fetching the link to apply for a job.
While inspecting the page, you found two links at the bottom of each card. If you use .text on the link elements in the same way you did for the other elements, then you won’t get the URLs that you’re interested in:
If you execute the code shown above, then you’ll get the link text for Learn and Apply instead of the associated URLs.
That’s because the .text attribute leaves only the visible content of an HTML element. It strips away all HTML tags, including the attributes that contain the URL, and leaves you with just the link text. To get the URL instead, you need to extract the value of one of the HTML attributes rather than discarding it.
The URL of a link element is associated with the href HTML attribute. The specific URL that you’re looking for is the value of the href attribute of the second <a> tag at the bottom of the HTML for a single job posting.
Start by fetching all the <a> elements in a job card. Then, extract the value of their href attributes using square-bracket notation:
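A sketch of that step, with a stand-in card footer containing the two links:

```python
from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <footer class="card-footer">
    <a href="https://www.realpython.com">Learn</a>
    <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
  </footer>
</div>
"""

job_card = BeautifulSoup(html, "html.parser")

# find_all("a") gathers both links; ["href"] digs out each URL:
links = job_card.find_all("a")
for link in links:
    link_url = link["href"]
    print(link_url)
```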
In this code snippet, you first fetch all the links from each of the filtered job postings. Then, you extract the href attribute, which contains the URL, using ["href"] and print it to your console.
Each job card has two links associated with it. However, you’re only looking for the second link, so you’ll apply a small edit to the code:
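The edited version could look like this, again with a stand-in footer in place of the scraped card:

```python
from bs4 import BeautifulSoup

html = """
<footer class="card-footer">
  <a href="https://www.realpython.com">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
</footer>
"""

job_card = BeautifulSoup(html, "html.parser")

# [1] picks the second link, then ["href"] extracts its URL in one go:
link_url = job_card.find_all("a")[1]["href"]
print(f"Apply here: {link_url}")
```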
In the updated code snippet, you use indexing to pick the second link element from the results of .find_all() using its index ([1]). Then, you directly extract the URL using square-bracket notation with the "href" key, thereby fetching the value of the href attribute.
You can use the same square-bracket notation to extract other HTML attributes as well.
Assemble Your Code in a Script
You’re now happy with the results and are ready to put it all together into your scraper.py script. When you assemble the useful lines of code that you wrote during your exploration, you’ll end up with a Python web scraping script that extracts the job title, company, location, and application link from the scraped website:
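One way the assembled script could look is sketched below. The helper name find_python_jobs is a choice made for this sketch, and a trimmed-down HTML sample stands in for the live page; in your own scraper.py, you’d pass in page.content from page = requests.get("https://realpython.github.io/fake-jobs/") instead:

```python
from bs4 import BeautifulSoup


def find_python_jobs(html):
    """Return (title, company, location, url) for each Python job card."""
    soup = BeautifulSoup(html, "html.parser")
    results = soup.find(id="ResultsContainer")
    python_job_cards = [
        h2_element.parent.parent.parent
        for h2_element in results.find_all(
            "h2", string=lambda text: "python" in text.lower()
        )
    ]
    jobs = []
    for job_card in python_job_cards:
        jobs.append(
            (
                job_card.find("h2", class_="title").text.strip(),
                job_card.find("h3", class_="company").text.strip(),
                job_card.find("p", class_="location").text.strip(),
                job_card.find_all("a")[1]["href"],
            )
        )
    return jobs


# Trimmed-down sample of the page's HTML, standing in for page.content:
SAMPLE_HTML = """
<div id="ResultsContainer">
  <div class="card-content">
    <div class="media"><div class="media-content">
      <h2 class="title">Senior Python Developer</h2>
      <h3 class="company">Payne, Roberts and Davis</h3>
    </div></div>
    <p class="location">Stewartbury, AA</p>
    <footer>
      <a href="https://www.realpython.com">Learn</a>
      <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
    </footer>
  </div>
</div>
"""

for title, company, location, url in find_python_jobs(SAMPLE_HTML):
    print(title)
    print(company)
    print(location)
    print(f"Apply here: {url}\n")
```

Wrapping the parsing in a function keeps the fetching and the extraction separate, which makes the script easier to test and to adapt to other job boards later.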
You could continue to work on your script and refactor it, but at this point, it does the job you wanted and presents you with the information you need when you want to apply for a Python developer job:
All you need to do now to check for new Python jobs on the job board is run your Python script. This leaves you with plenty of time to get out there and catch some waves!
Keep Practicing
If you’ve written the code alongside this tutorial, then you can run your script as is to see the fake job information pop up in your terminal. Your next step is to tackle a real-life job board! To keep practicing your new skills, you can revisit the web scraping process described in this tutorial by using any or all of the following sites:
The linked websites return their search results as static HTML responses, similar to the Fake Python job board. Therefore, you can scrape them using only Requests and Beautiful Soup.
Start going through this tutorial again from the beginning using one of these other sites. You’ll see that each website’s structure is different and that you’ll need to rebuild the code in a slightly different way to fetch the data you want. Tackling this challenge is a great way to practice the concepts that you just learned. While it might make you sweat every so often, your coding skills will be stronger in the end!
During your second attempt, you can also explore additional features of Beautiful Soup. Use the documentation as your guidebook and inspiration. Extra practice will help you become more proficient at web scraping with Python, Requests, and Beautiful Soup.
To wrap up your journey, you could then give your code a final makeover and create a command-line interface (CLI) app that scrapes one of the job boards and filters the results by a keyword that you can input on each execution. Your CLI tool could allow you to search for specific types of jobs, or jobs in particular locations.
If you’re interested in learning how to adapt your script as a command-line interface, then check out the Build Command-Line Interfaces With Python’s argparse tutorial.
Conclusion
The Requests library provides a user-friendly way to scrape static HTML from the internet with Python. You can then parse the HTML with another package called Beautiful Soup. You’ll find that Beautiful Soup will cater to most of your parsing needs, including navigation and advanced searching. Both packages will be trusted and helpful companions on your web scraping adventures.
In this tutorial, you’ve learned how to:
- Step through a web scraping pipeline from start to finish
- Inspect the HTML structure of your target site with your browser’s developer tools
- Decipher the data encoded in URLs
- Download the page’s HTML content using Python’s Requests library
- Parse the downloaded HTML with Beautiful Soup to extract relevant information
- Build a script that fetches job offers from the web and displays relevant information in your console
With this broad pipeline in mind and two powerful libraries in your toolkit, you can go out and see what other websites you can scrape. Have fun, and always remember to be respectful and use your programming skills responsibly. Happy scraping!
This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Web Scraping With Beautiful Soup and Python.
Martin likes automation, goofy jokes, and snakes, all of which fit into the Python community. He enjoys learning and exploring and is up for talking about it, too. He writes and records content for Real Python and CodingNomads.