Web Scraping Bot in Python

Image credit: www.pixabay.com

The web scraper extracts information about real estate listings from the Luxembourgish real estate website athome.lu.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

We start by determining the number of result pages that contain real estate listings. In general, each page holds 20 listings. To limit the running time, we do not extract data from all available pages, only from the first 5.

URL = 'https://www.athome.lu/en/srp/?tr=buy&sort=date_desc&q=faee1a4a&loc=L2-luxembourg'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

# Get number of pages on website
last_page = 5 # We choose to use only the first 5 pages

#last_page = int(soup.find('a',class_='page last').text) # --> total number of pages available
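The commented-out line above reads the total page count from the pagination link. As a minimal sketch, assuming the pagination bar contains an anchor whose class attribute is exactly `page last` (as that line suggests), the lookup can be wrapped so that a missing or non-numeric link does not raise an error. The function name `get_last_page`, the `max_pages` cap and the sample HTML are illustrative assumptions, not the site's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the pagination bar of a results page.
SAMPLE_HTML = """
<nav>
  <a class="page" href="?page=1">1</a>
  <a class="page" href="?page=2">2</a>
  <a class="page last" href="?page=37">37</a>
</nav>
"""

def get_last_page(html, max_pages=5):
    """Return the number of pages to scrape, capped at max_pages."""
    soup = BeautifulSoup(html, 'html.parser')
    # A string with a space matches the exact value of the class attribute.
    link = soup.find('a', class_='page last')
    if link is None:
        return 1  # no pagination link found: assume a single page
    try:
        total = int(link.get_text(strip=True))
    except ValueError:
        return 1  # link text is not a number
    return min(total, max_pages)

print(get_last_page(SAMPLE_HTML))
print(get_last_page(SAMPLE_HTML, max_pages=100))
```

Capping the result keeps the running time bounded while still using the real page count when fewer pages are available.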

As every listing has its own URL, we create a list containing all of these URLs.

urls = []  # collected listing URLs (one URL per listing)
i = 0  # computation progress counter
for pagenumber in range(last_page):
    URL = 'https://www.athome.lu/en/srp/?tr=buy&sort=date_desc&q=faee1a4a&loc=L2-luxembourg&page=' + str(pagenumber)
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Each listing is an <article> tagged with one of these classes
    results = soup.find_all('article', class_=['standard', 'silver', 'gold', 'platinum'])
    for result in results:
        urls.append('https://www.athome.lu' + result.find('link', itemprop='url')['href'])

    # Display computation progress
    i += 1
    print(str(100*i/last_page) + ' %')

# Report the number of collected URLs once all pages have been processed
print(str(len(urls)) + " URLs have been collected")
      
## 20.0 %
## 40.0 %
## 60.0 %
## 80.0 %
## 100.0 %
## 100 URLs have been collected
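The loop above builds each page URL by string concatenation. As a sketch of an alternative, the standard library can assemble the query string and resolve relative listing links. The helper names `build_page_url` and `absolute_listing_url` and the sample href are illustrative assumptions; which page number the site treats as its first page is not confirmed here.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = 'https://www.athome.lu'
SEARCH_PATH = '/en/srp/'

# Query parameters copied from the search URL used above.
SEARCH_PARAMS = {
    'tr': 'buy',
    'sort': 'date_desc',
    'q': 'faee1a4a',
    'loc': 'L2-luxembourg',
}

def build_page_url(pagenumber):
    """Return the search URL for one results page (hypothetical helper)."""
    params = dict(SEARCH_PARAMS, page=pagenumber)
    return urljoin(BASE_URL, SEARCH_PATH) + '?' + urlencode(params)

def absolute_listing_url(href):
    """Turn a relative listing link into an absolute URL (hypothetical helper)."""
    return urljoin(BASE_URL, href)

print(build_page_url(2))
print(absolute_listing_url('/en/buy/apartment/example'))  # made-up href
```

Keeping the parameters in a dictionary makes it easier to change the search (for example the sort order) without editing a long URL string.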

Then, we loop through the entire URL list while extracting all the available data for every listing.

rows = []  # one dict per listing; converted to a DataFrame after the loop
i = 0  # computation progress counter
url_non_ex = 0  # counter for non-existent urls
for url in urls:
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')

        # Find all characteristics blocks for a house
        features = soup.find_all('li', class_='feature-bloc-content-specification-content')

        names = []
        data = []
        # For every feature of the house find name of feature ('names') and the corresponding value ('data')
        for feature in features:
            names.append(feature.find('div', class_='feature-bloc-content-specification-content-name').text)
            data.append(feature.find('div', class_='feature-bloc-content-specification-content-response').text)

        # Add type of house ('genre') and location ('Lieu') to the feature list
        names.extend(['genre', 'Lieu'])
        data.append(soup.find('h1', class_='KeyInfoBlockStyle__PdpTitle-sc-1o1h56e-2 hWEtva').text.split()[0])
        address = soup.find('div', class_='block-localisation-address').text
        lieu = address[address.find('-')+2:]
        data.append(lieu)

        House = dict(zip(names, data))
        rows.append(House)

        i += 1
        print(str(round(100*i/len(urls), 2)) + ' %')

    except AttributeError:
        # A required element is missing, e.g. because the listing no longer exists
        url_non_ex += 1

# DataFrame.append was removed in pandas 2.0, so the DataFrame is built from the list of dicts
Houses = pd.DataFrame(rows)
## 1.0 %
## 2.0 %
## 3.0 %
## ...
## 100.0 %
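In the loop above, a single missing element raises `AttributeError` and the whole listing is counted as non-existent. The following sketch shows one way to extract the name/value pairs so that only the incomplete feature is skipped. It reuses the class names from the loop, but the sample HTML is made up and the function name `extract_features` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment that mimics the class names used in the loop above.
SAMPLE_HTML = """
<ul>
  <li class="feature-bloc-content-specification-content">
    <div class="feature-bloc-content-specification-content-name">Livable surface</div>
    <div class="feature-bloc-content-specification-content-response"> 94.21 m² </div>
  </li>
  <li class="feature-bloc-content-specification-content">
    <div class="feature-bloc-content-specification-content-name">Number of bedrooms</div>
    <div class="feature-bloc-content-specification-content-response">2</div>
  </li>
  <li class="feature-bloc-content-specification-content">
    <div class="feature-bloc-content-specification-content-name">Garden</div>
  </li>
</ul>
"""

PREFIX = 'feature-bloc-content-specification-content'

def extract_features(html):
    """Return a dict mapping feature names to their values for one listing."""
    soup = BeautifulSoup(html, 'html.parser')
    features = {}
    for block in soup.find_all('li', class_=PREFIX):
        name = block.find('div', class_=PREFIX + '-name')
        value = block.find('div', class_=PREFIX + '-response')
        if name is None or value is None:
            continue  # skip an incomplete block instead of failing the listing
        features[name.get_text(strip=True)] = value.get_text(strip=True)
    return features

print(extract_features(SAMPLE_HTML))
```

Using `get_text(strip=True)` also removes surrounding whitespace, which keeps column names and values consistent across listings.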

As an example, we display the first 20 rows of the resulting DataFrame.

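| Balcony | Basement | Bathroom | Closed parking space | Energy class | Laundry | Lieu | Livable surface | Living room | Number of bedrooms | Open kitchen | Pump heating | Renovation year | Sale price | Thermal insulation class | genre | Attic | Bathooms | Fitted kitchen | Garden | Gas heating | Land | Open parking space | Separate kitchen | Terrace | Lift | Pets accepted | Restroom | Shower rooms | Year of construction | Acces for mobility-impared people | Shower room | Availability | Property’s floor | Indoor parking space(s) | Number of rooms | Renovated | Monthly charges | Fireplace | Fuel heating | Parquet | Solar panels | Convertible attic | Electric heating | Wine cellar | Converted attic |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4.87 m² | Yes | 1 | 2 | A | Yes | Wiltz | 94.21 m² | Yes | 2 | Yes | Yes | 2018 | 449,435 € | B | Apartment | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | Yes | / | 1 | G | / | Biwer | 120 m² | Yes | 4 | / | / | 2018 | 984,000 € | G | House | Yes | 2 | Yes | Yes | Yes | 3 ares | 2 | Yes | Yes | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | Yes | / | 3 | NC | Yes | Bissen | 210 m² | / | 4 | Yes | / | 2018 | 1,050,324 € | NC | House | Yes | 3 | / | / | Yes | 4.76 ares | / | / | 36.62 m² | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | 1 | 1 | B | Yes | Capellen | 204 m² | Yes | 4 | / | / | 2019 | 1,875,000 € | B | Detached | / | / | Yes | Yes | Yes | 5.09 ares | 3 | Yes | Yes | Yes | Yes | 1 | 2 | 2013 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | 1 | 2 | H | Yes | Dudelange | 150 m² | / | 4 | / | / | 2015 | 1,035,000 € | I | House | Yes | / | / | / | Yes | 3.35 ares | 2 | / | / | / | Yes | / | / | 1954 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | 1 | / | A | Yes | 07 rue des Romains - Strassen | 118.68 m² | Yes | 3 | / | / | 2018 | 1,691,250 € | A | Apartment | / | / | Yes | Yes | Yes | / | / | Yes | 70 m² | Yes | / | / | / | 2021 | Yes | 1 | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | / | A | Yes | 07 rue des Romains - Strassen | 57.89 m² | Yes | 1 | / | / | 2018 | 879,835 € | A | Apartment | / | / | Yes | / | Yes | / | / | Yes | 44 m² | Yes | / | 1 | / | 2021 | Yes | 1 | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | / | A | Yes | 07 rue des Romains - Strassen | 101.31 m² | Yes | 3 | / | / | / | 1,341,010 € | A | Apartment | / | / | Yes | / | Yes | / | / | Yes | 20.4 m² | Yes | / | 1 | / | 2021 | Yes | 1 | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Yes | / | / | 1 | B | Yes | Capellen | 203 m² | / | 4 | / | / | 2018 | 1,875,000 € | B | House | / | 3 | / | 350 m² | / | 5.09 ares | 1 | / | Yes | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | 1 | NC | / | Luxembourg | 14 m² | / | / | / | / | / | 76,500 € | NC | Indoor | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | To be agreed | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | 3 | A | / | Wiltz | 264 m² | Yes | 4 | Yes | Yes | / | 799,900 € | B | House | / | 3 | / | Yes | / | 3.23 ares | / | / | Yes | / | / | Yes | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | 3 | A | / | Wiltz | 263.83 m² | Yes | 4 | Yes | Yes | 2018 | 799,900 € | B | House | / | 3 | / | Yes | / | 3.23 ares | / | / | Yes | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | 1 | / | G | / | Ehlerange | 70 m² | Yes | 2 | / | / | 2018 | 490,000 € | F | Apartment | / | / | / | Yes | / | / | / | Yes | / | / | / | / | / | 1967 | / | / | À convenir | 2 | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | 3 | A | / | Wiltz | 263.83 m² | Yes | 4 | / | Yes | / | 795,350 € | B | House | / | 3 | / | Yes | / | 3.11 ares | / | / | Yes | / | / | Yes | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Yes | Yes | 1 | / | E | Yes | Rumelange | 101 m² | Yes | 2 | Yes | / | 2018 | 575,000 € | E | Duplex | / | / | Yes | / | Yes | / | 1 | / | / | Yes | / | 1 | / | 1995 | / | / | To be agreed | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | 3 | A | / | Wiltz | 263 m² | Yes | 4 | / | Yes | 2015 | 782,900 € | B | House | / | 3 | / | Yes | / | 2.74 ares | / | / | Yes | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | 3 | A | / | Wiltz | 263 m² | Yes | 4 | Yes | Yes | 2018 | 766,900 € | B | House | / | 3 | / | Yes | / | 2.56 ares | Yes | / | Yes | / | / | Yes | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| / | / | / | 1 | NC | / | Luxembourg | / | / | / | / | / | 2018 | 885,000 € | NC | Detached | / | / | / | Yes | / | 3.93 ares | 2 | / | / | / | / | / | / | 1947 | / | / | immédiate | 0 | / | / | / | / | / | / | / | / | / | / | / | / |
| Yes | / | / | 1 | NC | / | Bettembourg | 90 m² | Yes | 2 | / | / | 2018 | 1 € | NC | Apartment | / | / | Yes | / | / | / | / | / | / | Yes | / | / | / | Sur plan | Yes | / | To be agreed | 2 | / | / | / | / | / | / | / | / | / | / | / | / |
| 10 m² | Yes | 1 | 1 | E | Yes | Steinsel | 120 m² | / | 2 | / | / | 2018 | 895,000 € | NC | Duplex | / | / | Yes | Yes | Yes | / | 2 | / | Yes | / | / | Yes | / | / | / | / | To be agreed | / | 1 | 8 | Yes | / | / | / | / | / | / | / | / | / |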

The collected data could then be used to carry out statistical analysis on the current real estate market in Luxembourg.
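Before any statistical analysis, the scraped values have to be converted from text to numbers. As a minimal sketch, assuming (as in the table above) that a comma is a thousands separator and a dot is the decimal mark, a small helper can strip units such as `€`, `m²` or `ares`. The function name `parse_number` is a hypothetical helper, not part of the original scraper.

```python
import re

def parse_number(text):
    """Extract the first number from strings such as '449,435 €' or '94.21 m²'.

    Assumes a comma is a thousands separator and a dot is the decimal mark.
    Returns None when the text holds no number (for example '/').
    """
    if text is None:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))

# Values taken from the table above
for raw in ['449,435 €', '94.21 m²', '3.23 ares', '/', 'To be agreed']:
    print(repr(raw), '->', parse_number(raw))
```

With pandas, such a helper could be applied to a column, for example `Houses['Sale price'].map(parse_number)`, assuming missing entries are first replaced by `None`.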

Fred Philippy
Master’s degree student in Statistics

I am a 2nd-year Master’s degree student in Statistics with a particular interest in Machine Learning, Computational Statistics, Data Analysis and High-Dimensional Statistics.
