This repository contains R scripts for scraping and processing housing data from the DuProprio website. The project includes scripts for extracting property URLs, scraping detailed property information, and performing feature engineering on the gathered data.
- URL_SCRAPPER.R: This script collects URLs for individual property listings from DuProprio. It identifies relevant links from the main search pages to prepare for detailed scraping.
- HOUSING_PAGE_SCRAPPER.R: This script accesses each property’s page using the URLs from URL_SCRAPPER.R, then extracts key information such as price, location, property features, and descriptions.
- FEATURE_ENGINEERING.R: After data collection, this script processes and refines the data for analysis, including transformations, creating derived features, and handling missing values.
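As a rough illustration of the two scraping stages described above, the sketch below parses inline HTML with rvest. The markup, the CSS selectors (`a.listing-link`, `.address`, `.price`, `.description`), the base URL, and the helper names `extract_listing_urls` and `parse_listing` are all assumptions made for this example; they are not taken from the scripts in this repository or from DuProprio's actual pages.

```r
library(rvest)

# Hypothetical search-page markup; the real site's HTML will differ.
search_html <- '
<html><body>
  <a class="listing-link" href="/en/montreal/house-1">House 1</a>
  <a class="listing-link" href="/en/quebec/condo-2">Condo 2</a>
  <a class="nav" href="/en/about">About</a>
</body></html>'

# Hypothetical property-page markup.
listing_html <- '
<html><body>
  <h1 class="address">123 Example St, Montreal</h1>
  <span class="price">$350,000</span>
  <p class="description">Bright two-storey house.</p>
</body></html>'

# Assumed helper: collect absolute listing URLs from one search page.
extract_listing_urls <- function(page, base_url, selector = "a.listing-link") {
  links <- html_elements(page, selector)
  hrefs <- html_attr(links, "href")
  unique(xml2::url_absolute(hrefs, base_url))
}

# Assumed helper: pull a few text fields from one property page.
# A selector that matches nothing yields NA rather than an error.
parse_listing <- function(page) {
  text_of <- function(selector) html_text2(html_element(page, selector))
  data.frame(
    address     = text_of(".address"),
    price       = text_of(".price"),
    description = text_of(".description"),
    stringsAsFactors = FALSE
  )
}

urls   <- extract_listing_urls(read_html(search_html), "https://duproprio.com")
record <- parse_listing(read_html(listing_html))

print(urls)
print(record)
```

In a real run, `read_html()` would be given each listing URL instead of an inline string, and the selectors would need to be adapted to the live page structure.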
To run these scripts, you will need:
- R 4.0 or newer
- R packages:
  - rvest (for web scraping)
  - tidyverse (for data manipulation)
  - httr (for handling HTTP requests)
  - jsonlite (for JSON parsing, if applicable)
To install the required packages, run the following commands in your R environment:
install.packages("rvest")
install.packages("tidyverse")
install.packages("httr")
install.packages("jsonlite")
- URL Extraction: Start by running URL_SCRAPPER.R to collect URLs of individual property listings. This script will output a list of URLs for further scraping.
- Property Data Scraping: Use HOUSING_PAGE_SCRAPPER.R to extract detailed information from each property page, based on the URLs gathered in the first step.
- Feature Engineering: After gathering the data, run FEATURE_ENGINEERING.R to clean and preprocess the data for analysis. This script will handle data transformations and create additional features as needed.
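To give a sense of what the feature engineering step can involve, here is a minimal sketch using dplyr on a toy table. The column names (`price_raw`, `area_sqft`, `city`), the sample values, and the choice of median imputation are assumptions for illustration only and may not match what FEATURE_ENGINEERING.R actually does.

```r
library(dplyr)

# Toy records standing in for scraped output; column names are hypothetical.
listings <- tibble(
  price_raw = c("$350,000", "$525,000", NA),
  area_sqft = c(1400, NA, 900),
  city      = c("Montreal", "Quebec", "Laval")
)

features <- listings %>%
  mutate(
    # Transformation: strip currency symbols and separators, then convert.
    price = as.numeric(gsub("[^0-9.]", "", price_raw)),
    # Missing values: flag them, then impute with the median of observed areas.
    area_missing = is.na(area_sqft),
    area_sqft = if_else(area_missing, median(area_sqft, na.rm = TRUE), area_sqft),
    # Derived feature: price per square foot.
    price_per_sqft = price / area_sqft
  )

print(features)
```

Keeping the `area_missing` flag alongside the imputed value lets later analysis distinguish measured areas from filled-in ones.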
- Compliance: Ensure compliance with DuProprio's terms of service when using this scraper, as some websites prohibit automated data extraction.
- Data Storage: Modify scripts as needed to save output data in the desired format (e.g., CSV, JSON, database).
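For the data storage note above, one possible approach is sketched here: writing the same data frame to CSV with base R and to JSON with jsonlite. The data frame contents, the file names, and the use of `tempdir()` as the output location are placeholders chosen for this example.

```r
library(jsonlite)

# Placeholder data standing in for scraped listings.
listings <- data.frame(
  url   = c("https://example.com/a", "https://example.com/b"),
  price = c(350000, 525000),
  stringsAsFactors = FALSE
)

# Write to a temporary directory; swap in a project path as needed.
out_dir   <- tempdir()
csv_path  <- file.path(out_dir, "listings.csv")
json_path <- file.path(out_dir, "listings.json")

write.csv(listings, csv_path, row.names = FALSE)
write_json(listings, json_path, pretty = TRUE)

# Read both files back to confirm they round-trip.
csv_back  <- read.csv(csv_path, stringsAsFactors = FALSE)
json_back <- fromJSON(json_path)

print(csv_back)
print(json_back)
```

CSV is convenient for flat tables, while JSON tends to suit nested fields such as lists of property features; for a database, a package such as DBI would be the usual route.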
This repository is for educational and personal use.