In this project I will try to determine the level of competition in the staffing and recruiting sector, based on data that I will scrape from the LinkedIn platform.
First, I would like to say that this process is for educational purposes only, with full respect for the LinkedIn platform and the companies involved.
So, in this project we will work with the Beautiful Soup and Selenium libraries. You also need to install a web driver; in my case I installed the latest version of the Chrome web driver. Note that the driver must match your current browser version. If you don't know your browser version, you can download the latest driver version and update your browser. The last thing to take care of before jumping to the scraping part is to save the driver to a path that you can find easily. I saved the driver in the same folder as my code files, so there is no path to remember or search for.
After that step we are ready to open an editor. I used Jupyter Notebook, a platform where you can print the result of every step, and I think it is an amazing environment for web scraping and for programming in general. Let's start by importing the required libraries.
Now let's open a browser using the web driver. I will work with Chrome, but you can use whatever browser you prefer. Note that it's essential to keep the following window open at all times; this window will be the base of our next search steps.
Above is the base window. To reach the LinkedIn web page we need a specific URL; below are the URL and the code that I used.
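A minimal sketch of this step could look like the following. It assumes Selenium 4, a driver file called `chromedriver` stored next to the code files (on Windows the file would be `chromedriver.exe`), and the public sign-in address shown in `LOGIN_URL`. The names `local_driver_path` and `open_login_page` are illustrative only.

```python
from pathlib import Path

# Assumption: the public LinkedIn sign-in address; adjust it if yours differs.
LOGIN_URL = "https://www.linkedin.com/login"


def local_driver_path(folder=".", name="chromedriver"):
    """Return the absolute path of a driver binary stored next to the code files."""
    return Path(folder).resolve() / name


def open_login_page(driver_path):
    """Start Chrome through the given driver and load the sign-in page."""
    # Imported here so the path helper above also works where Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=str(driver_path)))
    driver.get(LOGIN_URL)
    return driver  # keep this object: every later step reuses the same window
```

Calling `driver = open_login_page(local_driver_path())` once and reusing `driver` afterwards matches the idea of keeping a single base window open.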
Running the code, we get the following window, where we should enter our personal information. To keep my personal information private, I created a txt file in which I wrote my email and my password on two lines. If you are not planning to share your code, you can instead define two variables that contain your personal information.
And here is the code that I used to log in:
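A minimal sketch of the log-in step, under a few assumptions: the text file holds the email on its first line and the password on its second, `driver` is a Selenium browser that already shows the sign-in page, and the form fields have the ids `username` and `password` with a submit button of type `submit`. Those locators are guesses about the page markup and may change, so check them with the browser's inspector. The function names are illustrative.

```python
def read_credentials(path):
    """Read the email (line 1) and the password (line 2) from a text file."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    if len(lines) < 2:
        raise ValueError("expected two lines: email, then password")
    # Only the email is stripped; the password is kept exactly as written.
    return lines[0].strip(), lines[1]


def log_in(driver, email, password):
    """Fill in the sign-in form and submit it."""
    # Imported here so read_credentials can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    # Assumed locators: verify the ids and the button in the page inspector.
    driver.find_element(By.ID, "username").send_keys(email)
    driver.find_element(By.ID, "password").send_keys(password)
    driver.find_element(By.XPATH, '//button[@type="submit"]').click()
```

With a file named, say, `credentials.txt`, the call would be `log_in(driver, *read_credentials("credentials.txt"))`. Keeping that file out of any shared folder or repository is what keeps the credentials private.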
So now we are on the initial LinkedIn page and we want to parse data based on a certain search. We could implement this step in different ways. I think the easiest is to determine the link of the page that we are looking for. Otherwise we would have to use Selenium to click in the search box, type our query, and then apply different filters in order to get the information that we need. Using the direct link definitely saves many lines of code.
So, the following code is adjustable, and anyone can use it to search for companies in a specific sector. As I said, if you are interested in something else, change the link and follow the next steps.
Notice that I used Python's time module and its sleep function. Don't skip this call: give the browser some time to load the page so that the page source can be parsed properly.
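The direct-link approach with a pause can be sketched as follows. The search address and its `keywords` parameter are assumptions for illustration; the safest source for the real link is the browser's address bar after running the search by hand. `driver` is any object with Selenium's `get` method and `page_source` attribute, and the five-second default is an arbitrary choice.

```python
import time
from urllib.parse import urlencode

# Assumption: an illustrative company-search address, not a verified one.
SEARCH_BASE = "https://www.linkedin.com/search/results/companies/"


def build_search_url(keywords):
    """Build the direct link for a company search with the given keywords."""
    return SEARCH_BASE + "?" + urlencode({"keywords": keywords})


def load_page_source(driver, url, wait_seconds=5):
    """Open the URL, wait for the page to load, and return its HTML source."""
    driver.get(url)
    time.sleep(wait_seconds)  # give the browser time to render the results
    return driver.page_source
```

Searching another sector then only means passing different keywords, or replacing the link altogether.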
We are now on the page that contains the information we need to parse. At this step it is necessary to know some HTML structure in order to inspect the elements that we need. For parsing elements I used the Beautiful Soup library. We are on a static page now, and we can locate the elements with the browser's inspect tool (right-click an element and choose Inspect).
After inspecting the page and determining the elements that we need, it's time to develop a scraping function. This is the part that scared me when I started with web scraping, but the easiest way to reach what you need from a page is to find a big element, for instance a class that contains a list of the objects you need, and then break that list down into smaller elements such as title, description, number of followers, etc. Let's jump to the point and see the code:
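A minimal sketch of such a function with Beautiful Soup is shown here. The tag and class names (`result-card`, `result-title`, `result-description`, `result-followers`) are hypothetical placeholders; the real ones have to be read from the inspector, and they change whenever the site's markup changes.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching child, or None."""
    node = parent.find(tag, class_=class_name)
    return node.get_text(strip=True) if node else None


def parse_companies(page_source):
    """Find each big result element, then break it into smaller fields."""
    soup = BeautifulSoup(page_source, "html.parser")
    companies = []
    # Hypothetical class names: replace them with the ones seen in the inspector.
    for card in soup.find_all("li", class_="result-card"):
        companies.append({
            "title": text_or_none(card, "span", "result-title"),
            "description": text_or_none(card, "p", "result-description"),
            "followers": text_or_none(card, "span", "result-followers"),
        })
    return companies
```

Returning `None` for a missing field keeps one incomplete result from stopping the whole scrape.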
We are almost done with the scraping process. I say almost because we also have to iterate through all the available search result pages. In my case there are 10 pages, and I figured out that I could scrape every page by changing the URL, adding the page number at the end of it, as in the following code.
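Building one link per page can be sketched like this. The parameter name `page` is an assumption; compare the address bar on page 1 and page 2 of a real search to see what actually changes. The example address in the usage note is a placeholder.

```python
def page_urls(search_url, total_pages):
    """Return one URL per result page, numbered from 1 to total_pages."""
    # Use "&" when the link already has a query string, "?" otherwise.
    separator = "&" if "?" in search_url else "?"
    return [f"{search_url}{separator}page={number}"
            for number in range(1, total_pages + 1)]
```

For a search with 10 pages, `page_urls("https://www.example.com/search?keywords=staffing", 10)` gives ten links ending in `&page=1` through `&page=10`.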
As you can see, I found the list that contains the available pages of my search, and then I took the total number of elements in that list. I'm definitely not saying that this is the optimal solution; maybe you can find something more efficient for parsing data from all the pages. You also have to consider that web pages have different structures, so occasionally you need to be creative. Now we have the parts that we need to iterate through all the pages, so let's finish the scraping by adding everything to a list.
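Putting the parts together might look like the sketch below. It assumes the page links are already known, that `driver` offers Selenium's `get` method and `page_source` attribute, and that `parse_page` is whatever scraping function turns one page's HTML into a list of records. All names are illustrative.

```python
import time


def scrape_all_pages(driver, urls, parse_page, wait_seconds=5):
    """Visit every result page and collect the parsed records in one list."""
    all_rows = []
    for url in urls:
        driver.get(url)
        time.sleep(wait_seconds)  # let the page finish loading before parsing
        all_rows.extend(parse_page(driver.page_source))
    return all_rows
```

Passing the parser in as an argument keeps the loop unchanged when the page structure, and therefore the parsing code, changes.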
Lastly, we can also convert the list to a CSV file, name the columns as we prefer, and then we are ready to start analysing our data.
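One way to do the export, sketched with Python's built-in `csv` module (a swap-in for illustration; a data-frame library would work just as well). It assumes the scraped records are dictionaries; the column names and the file name in the usage note are examples.

```python
import csv


def save_to_csv(rows, path, columns):
    """Write a list of dictionaries to a CSV file with the given column order."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()    # first row: the column names
        writer.writerows(rows)  # keys missing from a record are left empty
```

For example, `save_to_csv(companies, "companies.csv", ["title", "description", "followers"])` writes one row per company under those three headers.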