Information extraction on real estate descriptions

Introduction

Online real estate listings often contain unstructured textual descriptions, while some of their structured feature values may be missing. These missing values can degrade the performance of machine learning models used for price prediction or other tasks. In this project, we apply Information Extraction (IE) to recover relevant feature values from real estate descriptions. Specifically, we investigate the use of Question Answering (Q&A) models to retrieve missing feature values by asking a set of questions. We compare the performance of two approaches: one that performs IE with a French language-specific model, and another that translates the descriptions into English before performing IE.

Methods

We collected a dataset of French-language real estate listings and their corresponding feature values, such as the number of bedrooms, bathrooms, and parking spaces. We split the dataset into training and testing sets and compared the performance of the two approaches described below.

Approach 1: French Language-Specific Model

We used a pre-trained French language model from Hugging Face to perform IE on the real estate descriptions. We formulated a set of questions based on the expected feature values and used the Q&A model to retrieve the missing values from each description.
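The question-based extraction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the question wording, the default model name (`etalab-ia/camembert-base-squadFR-fquad-piaf`), and the `min_score` threshold are all assumptions, since the source does not specify them. The sketch relies only on the call convention of the `transformers` question-answering pipeline, which takes `question` and `context` and returns a dictionary with `answer` and `score`.

```python
from typing import Callable, Dict, Optional

# Hypothetical questions, one per feature. The source does not give the
# actual wording that was used.
QUESTIONS_FR: Dict[str, str] = {
    "bedrooms": "Combien de chambres y a-t-il ?",
    "bathrooms": "Combien de salles de bain y a-t-il ?",
    "parking_spaces": "Combien de places de parking y a-t-il ?",
}


def load_qa(model_name: str = "etalab-ia/camembert-base-squadFR-fquad-piaf"):
    """Load an extractive Q&A pipeline (the default model is an assumption)."""
    from transformers import pipeline  # imported lazily; downloads the model

    return pipeline("question-answering", model=model_name)


def extract_features(
    description: str,
    questions: Dict[str, str],
    qa: Callable[..., dict],
    min_score: float = 0.3,  # assumed cut-off, not taken from the source
) -> Dict[str, Optional[str]]:
    """Ask one question per feature and keep answers the model is confident in."""
    values: Dict[str, Optional[str]] = {}
    for feature, question in questions.items():
        result = qa(question=question, context=description)
        # Low-confidence answers are treated as "still missing".
        values[feature] = result["answer"] if result["score"] >= min_score else None
    return values
```

Calling `extract_features(description, QUESTIONS_FR, load_qa())` would return one raw answer span (or `None`) per feature. The spans are strings copied from the description, so a further normalization step would be needed to turn them into numeric feature values.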

Approach 2: Translation + English Model

We first translated the real estate descriptions into English with a pre-trained translation model from Hugging Face, then performed IE with a pre-trained English language model. We formulated a similar set of questions to retrieve the missing values.
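A translate-then-extract pipeline of this kind might look like the sketch below. The model names (`Helsinki-NLP/opus-mt-fr-en` for translation, `distilbert-base-cased-distilled-squad` for Q&A) and the English questions are assumptions chosen for illustration; the source does not say which models or questions were used. The code depends only on the documented output shapes of the `transformers` pipelines: translation returns a list of dictionaries with a `translation_text` key, and question answering returns a dictionary with an `answer` key.

```python
from typing import Callable, Dict, Tuple

# Hypothetical English questions mirroring the French ones.
QUESTIONS_EN: Dict[str, str] = {
    "bedrooms": "How many bedrooms are there?",
    "bathrooms": "How many bathrooms are there?",
    "parking_spaces": "How many parking spaces are there?",
}


def load_models(
    translation_model: str = "Helsinki-NLP/opus-mt-fr-en",  # assumed model
    qa_model: str = "distilbert-base-cased-distilled-squad",  # assumed model
):
    """Load a French-to-English translator and an English Q&A pipeline."""
    from transformers import pipeline  # imported lazily; downloads the models

    translate = pipeline("translation", model=translation_model)
    qa = pipeline("question-answering", model=qa_model)
    return translate, qa


def translate_then_extract(
    description_fr: str,
    questions_en: Dict[str, str],
    translate: Callable[[str], list],
    qa: Callable[..., dict],
) -> Tuple[str, Dict[str, str]]:
    """Translate a French description, then ask each English question about it."""
    description_en = translate(description_fr)[0]["translation_text"]
    answers = {
        feature: qa(question=question, context=description_en)["answer"]
        for feature, question in questions_en.items()
    }
    return description_en, answers
```

One consequence of this design is that any translation error propagates into the extraction step, because the Q&A model only ever sees the translated text. Long descriptions may also exceed the translation model's input limit and need to be split before translation.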

Results

We evaluated both approaches using several metrics, including accuracy, precision, recall, and F1-score. The French language-specific model achieved higher accuracy and F1-score than the translation + English model. However, we also observed that the effectiveness of the Q&A model depends heavily on how the questions are formulated: even small changes in wording can significantly affect performance.
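The source does not define how these metrics were computed for extracted values, so the following is one plausible convention rather than the project's actual scoring. It assumes that a missing value is represented by `None`, that a true positive is an extracted value equal to the gold value, that a false positive is any extracted value that does not match the gold value, and that a false negative is any gold value that was not recovered. Accuracy counts exact agreement, including cases where both are `None`.

```python
from typing import Dict, Hashable, Optional, Sequence


def evaluate(
    predictions: Sequence[Optional[Hashable]],
    gold: Sequence[Optional[Hashable]],
) -> Dict[str, float]:
    """Score extracted values against gold values (None means 'no value')."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")

    tp = fp = fn = correct = 0
    for pred, true in zip(predictions, gold):
        if pred == true:
            correct += 1  # exact agreement, including None == None
        if pred is not None and pred == true:
            tp += 1  # extracted the right value
        if pred is not None and pred != true:
            fp += 1  # extracted a wrong or spurious value
        if true is not None and pred != true:
            fn += 1  # failed to recover an existing value

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(gold) if len(gold) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Under this convention, a wrong extracted value counts as both a false positive and a false negative. Running `evaluate` once per question formulation would be a simple way to quantify how sensitive each model is to the wording of the questions.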

Conclusion

Our project demonstrates the potential of Information Extraction and Question Answering models to retrieve missing feature values from unstructured text. We compared two approaches for IE on real estate descriptions and showed how the formulation of questions affects the effectiveness of Q&A models. Our results suggest that a language-specific model may be more effective than translation followed by an English model, but further research is needed to confirm this finding.

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:

@online{azizi2022,
  author = {Azizi, Ilia},
  title = {Information Extraction on Real Estate Descriptions},
  date = {2022-06-06},
  url = {https://iliaazizi.com/projects/real_estate_ie/},
  langid = {en}
}

For attribution, please cite this work as:

Azizi, Ilia. 2022. “Information Extraction on Real Estate Descriptions.” June 6, 2022. https://iliaazizi.com/projects/real_estate_ie/.