Master Thesis MIIS - 02-07-24.docx

Metadata

Author: readwise.io
Full Title: Master Thesis MIIS - 02-07-24.docx
URL: https://readwise.io/reader/document_raw_content/192737311

Highlights

authors like Thamarai & Malarvizhi have used attributes which refer to quality of life for their house price prediction, such as traveling and school facilities from home or shopping malls close to the house. (View Highlight)
The aim is to create a system that can predict housing prices considering not only traditional features such as location and size, but also aspects related to sustainability and quality of life in different neighborhoods of the city. This will be accomplished using housing listing data and evaluating relevant indicators for the SDGs, such as access to education, water quality and energy efficiency. (View Highlight)
This approach is expected to enable homebuyers and sellers to make more informed decisions, promoting equity and sustainability in the Barcelona real estate market. Furthermore, it seeks to contribute to the development of more effective urban policies and the improvement of quality of life in the city. (View Highlight)
Within this sector, house pricing stands out as a central concern, with various factors significantly influencing property valuations. Traditionally, parameters such as property size, location, type, and year of construction have been pivotal in determining house prices. However, in the contemporary landscape, there is a growing recognition of the need to integrate additional factors, particularly those related to sustainability and quality of life, into housing valuations. (View Highlight)
The main objective of this project is to generate a Machine Learning recommendation system that adjusts housing prices based on certain parameters. These parameters will include not only traditional ones, such as the size of the property, whether it has land, whether it is a house or an apartment, its location in the city, and the number of rooms, but also parameters related to SDG 3 (availability of hospitals and pharmacies) and SDG 4 (quality education, accessible and equitable). For both SDGs, data from OpenData BCN will be used. This is an open-data service created by the Ajuntament de Barcelona, which includes not only datasets from several areas, but also visualizations, statistics and applications. (View Highlight)
For SDG 3, as previously mentioned, the data will be used to create features representing the number of hospitals and the number of pharmacies within a 2 km radius. For SDG 4, the parameters to be used will be the number of nursery schools, primary schools, secondary schools, vocational education centers, and universities within a radius of 2 kilometers. This distance has been selected because any person in full health can cover it in about 10-15 minutes. It is also emphasized that they should be public schools, as the SDG focuses on ensuring access to education. (View Highlight)
As for the current state of the art in the context of the study, several related studies have been carried out in recent years. Using “Idealista” as the main data source and XGBoost as the model has shown good accuracy in predicting house prices in Barcelona using only the traditional parameters (Miravé Carreño, 2023). Models like GARandom Forest, deep-Random Forest (DRF) and LightGBM have also shown good performance in the same task in Singapore (Li, Z., & Li, Z., 2023). (View Highlight)
Using a Stacking-Sorted-Weighted-Ensemble (SSWE) has also shown quite good results in predicting house prices in Guangdong (Li, Li, Xie, & Zhang, 2022). Some other studies have also included features similar to the ones considered in this study, such as travelling and school facilities from home, and shopping malls close to each house (Thamarai & Malarvizhi, 2020), with remarkable results. (View Highlight)
To predict the adjusted price of each house while considering SDG-related attributes, it is essential to obtain data for each relevant attribute. In this study, health and education are the two key concepts included in the algorithm. (View Highlight)
each house will have a count of the items of each type within a 2-kilometer radius. (View Highlight)
The source data for creating these attributes comes from the OpenData BCN portal. Specifically, the education dataset is named “Educacio ensenyament reglat” and the health dataset is named “Llista equipaments sanitat”. To validate that the context of these datasets is appropriate for this study, it is noted in their descriptions that they pertain to SDGs 4 and 3, respectively. The “GeoPandas” library, widely used in Python for handling geospatial data, is employed to create these features. (View Highlight)
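The per-house counts described in these highlights can be sketched as follows. This is a minimal illustration, not the thesis code: it swaps GeoPandas for a plain haversine distance from the standard library, and the function names, record layout, and coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within_radius(house, facilities, radius_km=2.0):
    """Count facilities of each type that lie within radius_km of the house."""
    counts = {}
    for facility in facilities:
        distance = haversine_km(house["lat"], house["lon"], facility["lat"], facility["lon"])
        if distance <= radius_km:
            counts[facility["type"]] = counts.get(facility["type"], 0) + 1
    return counts


# Hypothetical coordinates, for illustration only (not thesis data).
house = {"lat": 41.3874, "lon": 2.1686}
facilities = [
    {"type": "pharmacy", "lat": 41.3900, "lon": 2.1700},  # well under 1 km away
    {"type": "hospital", "lat": 41.4000, "lon": 2.1750},  # roughly 1.5 km away
    {"type": "hospital", "lat": 41.4500, "lon": 2.2500},  # far outside the radius
]
counts = count_within_radius(house, facilities)
```

With many houses and facilities, a spatial index or a projected buffer-and-join in GeoPandas would likely scale better than this pairwise loop.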
To obtain the attributes from the education dataset, the file is preprocessed by first filtering the “secondary_filters_name” column, which specifies the type of entity for each record, to retain only kindergartens, primary and secondary schools, vocational training centers, and universities. A similar process is applied to the health dataset, keeping only “Pharmacy,” “CAPs,” “CUAPs,” and “Hospitals and Clinics.” Additionally, since the study focuses only on the AMB (Área Metropolitana de Barcelona), the “addresses_town” column is filtered to retain only locations in the municipality of Barcelona. Finally, all formats are validated to ensure correctness. (View Highlight)
Once the two SDG datasets are prepared, the features mentioned in the previous section are generated for each property. It is important to note that machine learning models are primarily designed to handle numerical data. To manage these categorical features, two main techniques are commonly used: Label Encoding and One-Hot Encoding. (View Highlight)
Label Encoding assigns a unique numerical value to each possible value of the categorical feature, whereas One-Hot Encoding creates n binary columns, where n is the number of unique values in the original feature. (View Highlight)
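The filtering step can be sketched as below, assuming the records have been loaded as dictionaries. Only the two column names come from the highlight; the category labels, sample rows, and the `filter_records` helper are hypothetical.

```python
# Hypothetical category labels; the real values in the dataset may differ.
KEPT_EDUCATION_TYPES = {
    "Kindergarten",
    "Primary school",
    "Secondary school",
    "Vocational training",
    "University",
}


def filter_records(records, kept_types, town="Barcelona"):
    """Keep records whose entity type is in kept_types and whose town matches."""
    return [
        record
        for record in records
        if record["secondary_filters_name"] in kept_types
        and record["addresses_town"] == town
    ]


# Made-up rows, for illustration only.
records = [
    {"name": "School A", "secondary_filters_name": "Primary school", "addresses_town": "Barcelona"},
    {"name": "School B", "secondary_filters_name": "Primary school", "addresses_town": "Badalona"},
    {"name": "Academy C", "secondary_filters_name": "Music school", "addresses_town": "Barcelona"},
]
kept = filter_records(records, KEPT_EDUCATION_TYPES)
```

The same helper would apply to the health dataset by passing a different set of kept types.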
Label Encoding is particularly useful when the values of the feature have an ordinal relationship (e.g., poor, fair, good, excellent), while One-Hot Encoding is used when the values do not follow any logical order. (View Highlight)
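The difference between the two encodings can be shown with a small sketch in plain Python. The helper names are hypothetical, and a real pipeline would more likely rely on a library encoder.

```python
def label_encode(values, order):
    """Map each value to its position in an explicit ordinal ranking."""
    index = {value: position for position, value in enumerate(order)}
    return [index[value] for value in values]


def one_hot_encode(values):
    """Return (categories, rows) with one binary column per unique value, sorted by name."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


# Ordinal feature: the ranking carries meaning, so integer codes are reasonable.
quality = ["good", "poor", "excellent"]
quality_codes = label_encode(quality, order=["poor", "fair", "good", "excellent"])
# quality_codes == [2, 0, 3]

# Nominal feature: no meaningful order, so each category gets its own column.
property_types = ["flat", "studio", "flat", "duplex"]
categories, rows = one_hot_encode(property_types)
# categories == ["duplex", "flat", "studio"]
# rows == [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Label-encoding a nominal feature would impose an artificial ranking, for example "duplex" < "flat" < "studio", which a model could mistake for a real ordering.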
In this study, the categorical features in the dataset are ‘propertyType’, ‘district’, ‘neighborhood’, and ‘status’. The ‘propertyType’ feature represents the type of property, with possible values being “countryHouse”, “duplex”, “flat”, “penthouse”, and “studio”. While it is theoretically possible to rank these property types from best to worst, such an assessment would be subjective and observer-dependent. Therefore, One-Hot Encoding is used to process this feature. Similarly, both ‘district’ and ‘neighborhood’ do not have any inherent numerical ranking, so the same technique is applied. (View Highlight)
The ‘status’ of the property is less straightforward, as it only includes new or renovated properties. If the categories were new versus second-hand, it would be clear that one option is generally preferred over the other. However, the difference between new and renovated properties is minimal. Therefore, to maintain consistency with the other categorical features, One-Hot Encoding is also used for the ‘status’ feature. This approach ensures standardization in the data processing code. (View Highlight)
To evaluate the performance of each model on the dataset, fine-tuning will be performed for each model using GridSearchCV. (View Highlight)
As seen in the previous sections, the new SDG-related attributes do not significantly impact house price prediction. However, the Random Forest regressor is the better predictor on which to build a system that adjusts the original price based on the weights assigned to each attribute by the best model. This system increases the value of houses with favorable SDG indicators and decreases it for those with less favorable indicators. (View Highlight)
the tuned RandomForestRegressor is used to extract the weights assigned to each SDG attribute. The weights are then normalized so that the sum of all the weights for each house does not exceed 0.1. While any other value could be used, 0.1 (10%) was selected for convenience after testing different values and evaluating the results. (View Highlight)
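One possible reading of this adjustment is sketched below. The highlight does not give the exact formula, so the following are assumptions: the raw importances are rescaled to sum to 0.1, and each house has a per-attribute score in [-1, 1] that says how favorable its indicator is. The attribute names, scores, and helper functions are hypothetical.

```python
def normalize_weights(importances, cap=0.10):
    """Rescale raw importances so that they sum to `cap` (10% by default)."""
    total = sum(importances.values())
    if total == 0:
        return {name: 0.0 for name in importances}
    return {name: cap * value / total for name, value in importances.items()}


def adjust_price(predicted_price, weights, scores):
    """Shift the price by sum(weight * score); scores are assumed to lie in [-1, 1]."""
    factor = sum(weights[name] * scores[name] for name in weights)
    return predicted_price * (1 + factor)


# Hypothetical importances for three SDG attributes (not thesis results).
importances = {"n_hospitals": 0.002, "n_pharmacies": 0.001, "n_primary_schools": 0.001}
weights = normalize_weights(importances)

# Assumed scores: +1 is the most favorable indicator, -1 the least favorable.
best_case = adjust_price(300_000, weights, {name: 1.0 for name in weights})
worst_case = adjust_price(300_000, weights, {name: -1.0 for name in weights})
```

Under these assumptions the adjustment is bounded: no house can move by more than 10% of its predicted price in either direction.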
After fine-tuning the models, Figure 5 shows the results for the MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), R2 (R-squared) and MAPE (Mean Absolute Percentage Error). (View Highlight)
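For reference, the five metrics can be computed as in this minimal sketch, which uses the standard definitions and made-up values rather than thesis results. It assumes the true values are non-zero (for MAPE) and not all identical (for R2).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2 and MAPE (in percent) for paired observations."""
    n = len(y_true)
    errors = [true - pred for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(error) for error in errors) / n
    mse = sum(error ** 2 for error in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(error ** 2 for error in errors)
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    r2 = 1 - ss_res / ss_tot
    mape = 100 * sum(abs(error / true) for error, true in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "MAPE": mape}


# Made-up prices, for illustration only.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]
metrics = regression_metrics(y_true, y_pred)
```

MAPE is scale-free, which makes it convenient for comparing errors across houses of very different prices.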
As can be seen from the results, the models that were theorized to perform best with this dataset indeed show superior performance. Specifically, focusing on the MAPE metric, (View Highlight)
a significant difference is observed between XGBoost, Random Forest, and Gradient Boosting compared to the other models. (View Highlight)
For informative purposes, it is important to note that the values of the smallest bubbles represent approximately a 0.5% deviation, the medium-sized bubbles represent around a 5-10% deviation, and the largest bubbles represent around a 50% deviation. However, it should be highlighted that only 1.5% of the homes exhibit a deviation greater than 10% in the price prediction. (View Highlight)
Interestingly, for two of these models (XGBoost and Random Forest), the results are slightly better when considering the SDG attributes than when not considering them. This also occurs with SVR, Linear Regression, Ridge Regression, and Lasso Regression. On one hand, it makes sense that personal valuations by sellers and buyers might subconsciously factor in elements such as how well the neighborhood is connected by public transport, proximity to shops, presence of parks, etc. (View Highlight)
When examining the feature importances of the models (both with and without SDG attributes), it is observed that 99% of the decision weight is determined by the size (around 71% depending on the model) and the price per square meter (28%). Given these values, only 1% is left, which is distributed among the classic characteristics of the property (number of bathrooms, presence of a garage, etc.) and those related to the SDG (proximity to schools, medical centers, etc.). Therefore, it seems irrelevant to introduce parameters related to the SDG in a price prediction model (View Highlight)
To confirm this, a Student’s t-test is conducted to determine whether the inclusion of SDG attributes in the Machine Learning model is significant or not. The null hypothesis H0 states that there is no significant difference between the mean relative errors of the models with SDG attributes and the models without SDG attributes. (View Highlight)
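The test statistic behind such a comparison can be sketched as follows. The highlight does not say whether the test was paired or independent, so this sketch assumes the independent two-sample Student's t-test with pooled variance. The sample values are made up, and turning the statistic into a p-value would additionally need the t distribution, which a statistics library provides.

```python
import math
from statistics import mean, variance


def student_t_statistic(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance; returns (t, degrees_of_freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Made-up relative errors, for illustration only (not thesis results).
errors_with_sdg = [0.040, 0.050, 0.060, 0.050]
errors_without_sdg = [0.045, 0.055, 0.060, 0.050]
t_stat, dof = student_t_statistic(errors_with_sdg, errors_without_sdg)
```

A value of t close to zero relative to the critical value for the given degrees of freedom means the null hypothesis H0 of equal mean relative errors cannot be rejected.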
It is pertinent to display, on a map of Barcelona using the Microsoft Power BI reporting application, the price differences between the tuned Random Forest models with and without SDG attributes. The visualization shows the price delta, with positive values indicating cases where the prediction with SDG attributes is higher than the prediction without those attributes. Additionally, the color legend indicates the district to which each property belongs. (View Highlight)
Regarding information on housing listings, access to data is straightforward and simple. The website used for data collection in this project was the Idealista real estate portal, which offers an API with certain free but limited functions for developers. Other websites, such as “pisos.com,” which also has an API for data extraction, were also tested. However, these two websites posed a challenge for the project due to the limited availability of information on the energy consumption of the houses, which is especially understandable in the case of old houses with the worst possible energy rating. In these cases, it is logical to assume that sellers would prefer not to disclose this information to potential buyers. (View Highlight)
On the other hand, information related to the Sustainable Development Goals (SDGs) is equally accessible through the OpenData portal (in this case, the OpenData portal of the Barcelona City Council was used). This system allows for data extraction in both CSV and JSON formats, which greatly facilitates the work for developers when including these datasets in their systems or models. Moreover, the extracted data is already cleaned, and is updated once a week. However, there are also some challenges in obtaining the data. Specifically, this project initially aimed to distinguish between public and private schools, and similarly with hospitals, but it was impossible to obtain this information from the datasets. The only way to obtain this data is by visiting the website https://escoles.barcelona/es/, but this requires web scraping techniques that may violate the website’s policies. This site does not provide any means to extract data in a file or through an API. (View Highlight)
● SDG-related attributes integration: As seen in the “Model Performance and Selection” section, the attributes generated with the collected information are not sufficiently relevant to be considered by the Machine Learning model. The most appropriate way to integrate them into a price calculation model is to create a system that adjusts the prices based on the prediction made by the most suitable Machine Learning model. (View Highlight)
This way, more importance can be given to the SDG-related attributes which, although not considered important by the model, need to be highlighted to align the model more closely with the goals of the 2030 Agenda. The results of this model are promising, as it achieves the objective of slightly increasing the value of properties with better SDG indicators. (View Highlight)
Observing the results of the adjusted prices, a pattern is evident in which houses on the outskirts of Barcelona experience a price increase, while those in the city center are adjusted downwards. (View Highlight)
This is an unexpected but interesting result, as it suggests that the houses in the outskirts of Barcelona may be better suited in terms of educational entities and medical centers. In order to check whether this segregation is due to just the location or if there is any other parameter affecting the results, some plots are made to find any patterns with the SDG-related variables and with the normal variables. (View Highlight)
After analyzing the results obtained in this project, it can be concluded that integrating Sustainable Development Goals (SDGs) into housing price calculation or prediction models is feasible. Although the dominant parameters such as size and price per square meter overshadow the influence of SDG-related attributes, a more diverse and extensive dataset could enhance this integration. (View Highlight)
An important interpretation of the results is that houses with better SDG indicators tend to have higher prices, which aligns with the capitalist socio-economic environment where desirable attributes command higher prices. Therefore, implementing a housing price calcula (View Highlight)
● Inclusion of More SDG-Related Attributes: Expanding the model to incorporate additional SDG-related attributes could provide a more comprehensive view of how sustainable and socially responsible factors impact housing prices. Attributes such as energy efficiency, water quality, access to green spaces, and air quality should be considered. There are plenty of them available in the OpenData BCN Data Repository.
● Expansion to More Cities: To enhance the generalizability of the model, it is crucial to apply the methodology to multiple cities. This would help in understanding how different urban settings and regional characteristics influence the integration of SDG-related attributes in housing price predictions.
● Distinguishing Between Public and Private Entities: Differentiating between public and private educational and healthcare institutions can provide more nuanced insights into their respective impacts on housing prices. This distinction could refine the model’s accuracy and relevance.
● Considering Entity Capacity Relative to Neighborhood Population: It is essential to factor in the capacity of educational and healthcare entities relative to the population of their surrounding neighborhoods. This would provide a more accurate measure of accessibility and service quality, enhancing the model’s predictive power. It is better to have two hospitals in a 2 km radius with capacity for 2,000 people rather than having two small hospitals, but currently the model cannot take that into account. (View Highlight)
In addition to the areas for future work, it’s important to note that there are some limitations that need to be addressed to improve future models.
1. Dominance of Property Size and Price Per Square Meter: One of the main limitations identified in this study is the excessive importance of traditional parameters such as property size and price per square meter. These factors overshadow the influence of SDG-related attributes, making it challenging to fully integrate sustainability indicators into the housing valuation model.
2. Data Availability Constraints: The availability of comprehensive and detailed data is a significant constraint. While Idealista provides valuable information, the lack of certain critical data points, such as energy efficiency ratings, limits the model’s effectiveness in evaluating sustainability-related attributes.
3. Complexity of Model Integration: Incorporating SDG-related attributes into a pricing model that accurately reflects market dynamics is complex. The current approach suggests a promising direction, but further refinement and validation are necessary to ensure that the adjusted pricing model aligns with real-world market behaviors. (View Highlight)
