Building a Web3 Job Scraper in Python

Tyan Brooker Thorpe
May 6, 2022

Web3 is one of the most popular buzzwords, and I would argue industries, to get into as of late. Having previously worked in it as part of a family office, I found it a rewarding space to be involved in.

Unfortunately, I am currently out of a job, as my role at that family office did not work out. Being a self-taught coder and hacker, I thought it would be a good idea to build a Web3 job scraper, since it lets me browse listings much more quickly and find the ones that best fit my industry and expertise.

Here is a link to the GitHub repo.

I want to give full credit to this project’s original creator, PythonEatsSquirrel, who made a job scraper for Indeed. I have adapted that code to scrape jobs from a custom website; in this instance, the website is Web3 Jobs (web3.career). If you want the link to the original video, here you go.

The main things to keep in mind when going through this process are Extract, Transform, and Load.

The external libraries that this project uses are listed below, with the matching imports shown right after:
- BeautifulSoup
- Pandas
- Requests
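
BeautifulSoup ships in the beautifulsoup4 package on PyPI, so a minimal setup, assuming you use pip, is pip install beautifulsoup4 pandas requests. These are the only imports the rest of the post relies on:

import requests
import pandas as pd
from bs4 import BeautifulSoup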

We want to start by creating a function called extract, which takes a single argument called page.

To start, you want to take the URL of the page you are scraping and convert it into an f-string so you can iterate through the different pages you will need to scrape.
You also want to find your user agent to ensure the request runs correctly from your machine. It will be different for everyone, but typing “my user agent” into Google should surface it. Next, make a GET request that takes the URL and the headers. Building on this, we now use one of the libraries, BeautifulSoup, passing it r.content and asking for the HTML parser. Make sure the function returns soup; you can find the code for this function below.

def extract(page):
    # Pretend to be a regular browser; swap in your own user agent here.
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.88 Safari/537.36"}
    # The page number is interpolated so we can walk through result pages.
    url = f"https://web3.career/marketing+remote-jobs?page={page}"
    # headers must be passed as a keyword argument, or requests treats it as params.
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup
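
As a quick sanity check, assuming the site is reachable from your machine, you can call extract on the first page and confirm you get parsed HTML back:

# Fetch page 1 and print the page title to confirm the request worked.
soup = extract(1)
print(soup.title.text)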

This is where we move on to the Transform aspect of the codebase.
Here we create a new function called transform and pass it the soup from the extract function. This is where things get tricky and will differ for your own code. You will want to open your browser’s developer tools and start reading the HTML to find the element your job scraper needs to read. It will most likely be a div that holds all the information for the job postings you are looking for. You then pull out each aspect of the job role you care about, e.g. the job title and salary, and store each in a variable. The final step is to collect these in a dictionary on each iteration and append that to the job list.

def transform(soup):
    # Each job posting on the page lives in a div with these classes.
    divs = soup.find_all('div', class_="d-flex align-middle")
    for item in divs:
        title = item.find('a').text.strip()
        company = item.find('div', class_="mt-auto d-block d-md-flex").text.strip()
        try:
            salary = item.find('p', class_="text-salary").text.strip()
        except AttributeError:  # not every posting lists a salary
            salary = ''
        job = {
            "title": title,
            "company": company,
            "salary": salary,
        }
        joblist.append(job)  # joblist is created in the load step below
    return

The Load aspect of the project is relatively straightforward. You only need to create the job list where all the jobs will be stored, run extract and transform for each page, and format the result so it all fits within a pandas DataFrame.
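
Here is a minimal sketch of that load step, assuming the extract and transform functions above; the page range and the CSV filename are my own choices, so adjust them to taste:

joblist = []  # transform() appends each job dictionary here

# Scrape the first five result pages; widen the range for more coverage.
for page in range(1, 6):
    soup = extract(page)
    transform(soup)

# Load everything into a pandas DataFrame and save it for later filtering.
df = pd.DataFrame(joblist)
print(df.head())
df.to_csv("jobs.csv", index=False)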
