Web Scraping using Selenium and Python

What is Selenium:-

Selenium is an open-source tool for automating web browsers. It is mostly used in industry for testing, but it can also be used for web scraping. We'll use Chrome, but you can use any supported browser; it will work almost as well.

Quickstart:-

To scrape the web with Python and Selenium, you'll need Chrome, ChromeDriver (the Chrome web driver), and the selenium package. You'll also want a Python IDE to work in. For this project I'm using PyCharm, but you may use whichever IDE you like.

Run the following line in your command prompt to install the Selenium package:-


    pip install selenium

When you start a Chrome session from Python, Chrome launches in headful mode (a regular, visible Chrome window that is controlled by your Python code). A notice explaining that the browser is being controlled by automated test software should appear.
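
Starting such a session can be sketched as follows. This is a minimal example that assumes Selenium 4.6 or later, where the bundled Selenium Manager is expected to locate a matching driver automatically, so no driver path is passed. The open_chrome helper and its import inside the function are illustrative choices, not part of Selenium.

```python
def open_chrome(url):
    """Start a visible (headful) Chrome window and load `url`.

    Hypothetical helper for illustration; it is not part of Selenium.
    """
    # Imported inside the function so that merely defining the helper
    # does not require Selenium or a browser to be present.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Selenium 4.6+ can find a driver
    driver.get(url)
    return driver
```

Calling driver = open_chrome("https://www.nintendo.com/") should open the window; call driver.quit() when you are finished so the browser process does not linger.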

You can also run Chrome in headless mode (without a graphical user interface), for example on a server:

    
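        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        from selenium.webdriver.chrome.service import Service

        DRIVER_PATH = "/path/to/chromedriver"  # placeholder: path to your ChromeDriver binary

        options = Options()
        options.add_argument("--headless=new")  # no visible window; older Chrome versions use "--headless"
        options.add_argument("--window-size=1920,1200")

        # Selenium 4 takes the driver path through a Service object
        driver = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH), options=options)
        driver.get("https://www.nintendo.com/")
        print(driver.page_source)  # HTML of the loaded page
        driver.quit()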

    

driver.page_source returns the HTML of the entire page. Here are two more WebDriver properties worth knowing:

  • driver.title gets the page's title.
  • driver.current_url gets the current URL (this can be useful when the website redirects and you need the final URL).
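
To see these attributes together, here is a small sketch. The describe_page helper is hypothetical (it is not part of Selenium) and only reads the three attributes mentioned above from whatever driver you pass in.

```python
def describe_page(driver):
    """Return the title, final URL and HTML size of the page in `driver`.

    Hypothetical helper; `driver` is any started Selenium WebDriver that
    has already loaded a page with driver.get(...).
    """
    return {
        "title": driver.title,                   # text of the <title> tag
        "url": driver.current_url,               # URL after any redirects
        "html_length": len(driver.page_source),  # size of the returned HTML
    }
```

Comparing the "url" value with the address you originally requested is a quick way to notice that the site redirected you.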

    Locating Elements:-

    One of Selenium's major use cases is locating data on a website, either for a test suite (ensuring that a certain element is present/absent on the page) or to extract data and save it for later analysis (web scraping).

    The Selenium API provides several ways to select elements on a page. You can use:

    • Tag name
    • Class name
    • IDs
    • XPath
    • CSS selectors

    Find Elements:-

    To find these locators for an element on a web page, right-click the page and choose Inspect. With the element picker enabled, hover over the element you want and click it. The Elements panel of the developer tools then highlights the HTML for that element. Right-click the highlighted line and choose Copy; the submenu offers options such as Copy selector and Copy XPath, and you can paste any of them into your code.

    find_element:-

    There are several ways to locate an element in Selenium. Let's suppose we're looking for the h1 tag in this HTML:

    
        <html>
            <head>
                ... some stuff
            </head>
            <body>
                <h1 class="someclass" id="greatID">Super title</h1>
            </body>
        </html>
    
    
        
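        from selenium.webdriver.common.by import By

        # Each call below locates the same <h1> element with a different strategy.
        h1 = driver.find_element(By.TAG_NAME, 'h1')
        h1 = driver.find_element(By.CLASS_NAME, 'someclass')
        h1 = driver.find_element(By.XPATH, '//h1')
        h1 = driver.find_element(By.ID, 'greatID')
        h1 = driver.find_element(By.CSS_SELECTOR, 'h1.someclass')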
    
        
    

    Each of these strategies also works with find_elements (plural), which returns a list of all matching elements.

    Use the following code to obtain all anchors on a page:

        
        from selenium.webdriver.common.by import By
        all_links = driver.find_elements(By.TAG_NAME, 'a')
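
Once you have the list of anchors, you usually want their targets rather than the element objects. Here is a minimal sketch, assuming each item is a Selenium WebElement (which provides get_attribute); the collect_hrefs name is made up for this example.

```python
def collect_hrefs(elements):
    """Return the unique, non-empty href values of `elements`, in page order.

    Hypothetical helper; `elements` is the list returned by a plural
    find_elements call for anchor tags.
    """
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")  # None when the anchor has no href
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs
```

For example, collect_hrefs(all_links) would turn the all_links list into plain URL strings that are ready to save or visit.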
        
    

    When an ID or a simple class name isn't enough to pinpoint an element, you'll need an XPath expression. This often happens because many elements can share the same class, whereas an ID is supposed to be unique.

    XPath is my preferred method of finding elements on a web page. It's a powerful way to select any element on a page based on its absolute or relative location in the DOM.
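
To make the absolute versus relative distinction concrete, the sketch below writes both kinds of expression for the h1 in the sample HTML and adds a tiny builder for attribute-based expressions. xpath_with_attrs is a hypothetical helper, not a Selenium function, and it assumes attribute values contain no single quotes.

```python
# Absolute: the full path from the document root down to the element.
ABSOLUTE_H1 = "/html/body/h1"

# Relative: any <h1>, wherever it sits in the document.
RELATIVE_H1 = "//h1"


def xpath_with_attrs(tag, attrs=None):
    """Build a relative XPath matching `tag` with the given attribute values.

    Hypothetical helper; values must not contain single quotes.
    """
    if not attrs:
        return f"//{tag}"
    # Join one @name='value' test per attribute with "and".
    predicate = " and ".join(
        f"@{name}='{value}'" for name, value in attrs.items()
    )
    return f"//{tag}[{predicate}]"
```

Any of these strings can be passed as the XPath argument when locating an element. Relative expressions tend to survive layout changes better than absolute ones, since they do not depend on every ancestor staying in place.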

    Taking a screenshot:-

    You can take a screenshot with:

        
        driver.save_screenshot('screenshot.png')
        
    

    When taking a screenshot with Selenium, keep in mind that several things can go wrong. First, check that the window size is set appropriately. After that, check that every asynchronous HTTP call made by the frontend JavaScript code has completed and that the page has been fully rendered.
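
One way to guard against those problems is sketched below. It fixes the window size and waits for the browser to report document.readyState as "complete" before saving the image. The helper names are made up for this example, and readyState only covers the initial page load; content fetched later by JavaScript may need a more specific wait, for instance for a particular element.

```python
def page_is_loaded(driver):
    """Return True once the browser reports the document as fully loaded."""
    return driver.execute_script("return document.readyState") == "complete"


def screenshot_when_loaded(driver, path, timeout=10):
    """Save a screenshot to `path` after the page finishes loading.

    Hypothetical helper; raises Selenium's TimeoutException if the page
    is still loading after `timeout` seconds.
    """
    # Imported here so page_is_loaded can be used without this dependency.
    from selenium.webdriver.support.ui import WebDriverWait

    driver.set_window_size(1920, 1200)  # same size as the headless example
    WebDriverWait(driver, timeout).until(page_is_loaded)
    return driver.save_screenshot(path)  # True if the file was written
```

A call such as screenshot_when_loaded(driver, "screenshot.png") would then replace the bare save_screenshot call.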

    Conclusion:-

    Selenium is frequently used to extract data from websites that rely heavily on JavaScript. Running a large number of Selenium/headless Chrome instances at scale, however, is difficult.

    Selenium is also a fantastic tool for automating nearly any online task.

    If you perform repetitive tasks such as filling out forms or checking information behind a login form on a website that doesn't have an API, it is probably a good idea to automate them with Selenium.

    I hope you enjoyed this blog.

    Thank You

    Author
    Himanshu Pant

    (Quality Analyst)
