The Data Observatory is a modular platform designed to support empirical research on immigration discourse. It combines automated web scraping, LLM-assisted preprocessing, and human-in-the-loop validation to collect and analyze texts from parliamentary debates, policy documents, news media, and online forums.
Goals
- Create a reusable infrastructure for collecting immigration-related data from diverse sources (e.g., parliamentary records, news sites, public forums).
- Support annotation and analysis tasks such as stance detection, framing analysis, sentiment analysis, and topic labeling.
- Enable researchers and community partners to explore how migrants and immigration policies are discussed over time.
- Embed responsible AI principles—transparency, reproducibility, and human oversight—into the data pipeline.
Methods and Architecture
- Selenium- and API-based scrapers orchestrated through reproducible Python pipelines (a minimal collection-and-storage sketch follows this list).
- Storage of raw and processed data in structured formats (e.g., JSON, CSV, vector stores) with clear metadata.
- Use of LLMs for assistive tasks such as document filtering, query expansion, and drafting candidate annotations (see the filtering sketch below).
- Integration with HPC environments for large-scale processing (e.g., embeddings, topic models); see the sharded embedding sketch below.
- Human-in-the-loop workflows to validate and correct automated outputs, with attention to ethical and legal constraints (see the review-loop sketch below).
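As an illustration of the collection and storage steps, here is a minimal sketch of an API-based fetch that writes raw records as JSON alongside provenance metadata. The endpoint URL, query parameter, and output layout are placeholder assumptions, not the Observatory's actual configuration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

# Placeholder endpoint; the real sources include parliamentary records,
# news sites, and public forums, each with its own scraper.
API_URL = "https://example.org/api/debates"

def fetch_and_store(query: str, out_dir: Path) -> Path:
    """Fetch documents matching `query` and persist them with provenance metadata."""
    response = requests.get(API_URL, params={"q": query}, timeout=30)
    response.raise_for_status()

    payload = {
        "metadata": {
            "source": API_URL,
            "query": query,
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
            # A content hash lets later re-runs detect upstream changes.
            "sha256": hashlib.sha256(response.content).hexdigest(),
        },
        "records": response.json(),
    }

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{query}_{payload['metadata']['retrieved_at'][:10]}.json"
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
    return out_path
```

Keeping the query, retrieval timestamp, and content hash next to the raw records is what makes downstream analyses reproducible and auditable.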
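For LLM-assisted filtering, one plausible pattern is a yes/no relevance screen applied before documents enter the corpus. The prompt wording, model name, and OpenAI backend below are illustrative assumptions; any chat-completion API could fill the same role.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FILTER_PROMPT = (
    "You are screening documents for a study of immigration discourse. "
    "Answer YES if the text below substantively discusses migrants or "
    "immigration policy, otherwise answer NO.\n\nText:\n{text}"
)

def is_relevant(text: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the LLM judges the document relevant. This is a candidate
    label only; final decisions go through human validation downstream."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": FILTER_PROMPT.format(text=text[:4000])}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().upper().startswith("YES")
```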
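On the HPC side, a common pattern is to shard the corpus and embed each shard as one array-job task. The sketch below uses sentence-transformers as an example backend; the model choice, shard file format, and scheduler wiring are assumptions for illustration.

```python
import json
import sys
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

def embed_shard(shard_path: Path, out_dir: Path) -> None:
    """Embed one corpus shard and save vectors alongside document IDs.
    Meant to run as a single HPC array-job task, one shard per task."""
    docs = json.loads(shard_path.read_text())  # assumes a list of {"id", "text"} dicts
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    vectors = model.encode([d["text"] for d in docs], batch_size=64)

    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / f"{shard_path.stem}_vectors.npy", vectors)
    (out_dir / f"{shard_path.stem}_ids.json").write_text(
        json.dumps([d["id"] for d in docs])
    )

if __name__ == "__main__":
    # e.g., shard path derived from the scheduler's array index (such as SLURM_ARRAY_TASK_ID)
    embed_shard(Path(sys.argv[1]), Path("embeddings"))
```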
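The human-in-the-loop step can start as a simple review queue over candidate labels. The command-line reviewer below is a minimal sketch under assumed field names; a production workflow would more likely use a dedicated annotation tool, but the principle of recording every acceptance or correction is the same.

```python
import json
from pathlib import Path

def review_candidates(in_path: Path, out_path: Path) -> None:
    """Walk a reviewer through LLM-generated candidate labels, recording
    acceptances and corrections so automated outputs never ship unchecked."""
    candidates = json.loads(in_path.read_text())  # assumes [{"text", "label"}, ...]
    validated = []
    for item in candidates:
        print(f"\nTEXT: {item['text'][:300]}")
        print(f"CANDIDATE LABEL: {item['label']}")
        answer = input("Accept label? [y/N or type a correction] ").strip()
        if answer.lower() == "y":
            item["validated_label"] = item["label"]
        elif answer:
            item["validated_label"] = answer  # reviewer's correction wins
        else:
            item["validated_label"] = None  # left unresolved for a second pass
        item["reviewed_by_human"] = True
        validated.append(item)
    out_path.write_text(json.dumps(validated, ensure_ascii=False, indent=2))
```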
Status and Outputs
- Status: Under active development and in use across multiple projects (e.g., parliamentary discourse on AI and immigration, Reddit-based immigration studies).
- Outputs:
  - A shared data infrastructure for the Bridging Divides program.
  - Scripts and documentation for reproducible scraping and analysis.
  - Datasets and case studies on immigration debates and narratives in Canada and beyond.