The Data Observatory is a modular platform designed to support empirical research on immigration discourse. It combines automated web scraping, LLM-assisted preprocessing, and human-in-the-loop validation to collect and analyze texts from parliamentary debates, policy documents, news media, and online forums.
Goals
- Create a reusable infrastructure for collecting immigration-related data from diverse sources (e.g., parliamentary records, news sites, public forums).
- Support annotation and analysis tasks such as stance detection, framing analysis, sentiment analysis, and topic labeling.
- Enable researchers and community partners to explore how migrants and immigration policies are discussed over time.
- Embed responsible AI principles—transparency, reproducibility, and human oversight—into the data pipeline.
Methods and Architecture
- Selenium- and API-based scrapers orchestrated through reproducible Python pipelines (a minimal collection-and-storage sketch follows this list).
- Storage of raw and processed data in structured formats (e.g., JSON, CSV, vector stores) with clear metadata.
- Use of LLMs for assistive tasks such as document filtering, query expansion, and drafting candidate annotations (see the filtering sketch below).
- Integration with HPC environments for large-scale processing (e.g., embeddings, topic models); see the sharded embedding sketch below.
- Human-in-the-loop workflows to validate and correct automated outputs, with attention to ethical and legal constraints (see the review-loop sketch below).
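As an illustration of the collection and storage steps, here is a minimal sketch of an API-based fetch that writes raw records as JSON alongside provenance metadata. The endpoint URL, query parameter, and output layout are placeholder assumptions, not the Observatory's actual configuration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

# Placeholder endpoint; the real sources include parliamentary records,
# news sites, and public forums, each with its own scraper.
API_URL = "https://example.org/api/debates"

def fetch_and_store(query: str, out_dir: Path) -> Path:
    """Fetch documents matching `query` and persist them with provenance metadata."""
    response = requests.get(API_URL, params={"q": query}, timeout=30)
    response.raise_for_status()

    payload = {
        "metadata": {
            "source": API_URL,
            "query": query,
            "retrieved_at": datetime.now(timezone.utc).isoformat(),
            # A content hash lets later re-runs detect upstream changes.
            "sha256": hashlib.sha256(response.content).hexdigest(),
        },
        "records": response.json(),
    }

    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{query}_{payload['metadata']['retrieved_at'][:10]}.json"
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
    return out_path
```

Keeping the query, retrieval timestamp, and content hash next to the raw records is what makes downstream analyses reproducible and auditable.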
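For LLM-assisted filtering, one plausible pattern is a yes/no relevance screen applied before documents enter the corpus. The prompt wording, model name, and OpenAI backend below are illustrative assumptions; any chat-completion API could fill the same role.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FILTER_PROMPT = (
    "You are screening documents for a study of immigration discourse. "
    "Answer YES if the text below substantively discusses migrants or "
    "immigration policy, otherwise answer NO.\n\nText:\n{text}"
)

def is_relevant(text: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the LLM judges the document relevant. This is a candidate
    label only; final decisions go through human validation downstream."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": FILTER_PROMPT.format(text=text[:4000])}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().upper().startswith("YES")
```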
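On the HPC side, a common pattern is to shard the corpus and embed each shard as one array-job task. The sketch below uses sentence-transformers as an example backend; the model choice, shard file format, and scheduler wiring are assumptions for illustration.

```python
import json
import sys
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

def embed_shard(shard_path: Path, out_dir: Path) -> None:
    """Embed one corpus shard and save vectors alongside document IDs.
    Meant to run as a single HPC array-job task, one shard per task."""
    docs = json.loads(shard_path.read_text())  # assumes a list of {"id", "text"} dicts
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    vectors = model.encode([d["text"] for d in docs], batch_size=64)

    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / f"{shard_path.stem}_vectors.npy", vectors)
    (out_dir / f"{shard_path.stem}_ids.json").write_text(
        json.dumps([d["id"] for d in docs])
    )

if __name__ == "__main__":
    # e.g., shard path derived from the scheduler's array index (such as SLURM_ARRAY_TASK_ID)
    embed_shard(Path(sys.argv[1]), Path("embeddings"))
```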
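The human-in-the-loop step can start as a simple review queue over candidate labels. The command-line reviewer below is a minimal sketch under assumed field names; a production workflow would more likely use a dedicated annotation tool, but the principle of recording every acceptance or correction is the same.

```python
import json
from pathlib import Path

def review_candidates(in_path: Path, out_path: Path) -> None:
    """Walk a reviewer through LLM-generated candidate labels, recording
    acceptances and corrections so automated outputs never ship unchecked."""
    candidates = json.loads(in_path.read_text())  # assumes [{"text", "label"}, ...]
    validated = []
    for item in candidates:
        print(f"\nTEXT: {item['text'][:300]}")
        print(f"CANDIDATE LABEL: {item['label']}")
        answer = input("Accept label? [y/N or type a correction] ").strip()
        if answer.lower() == "y":
            item["validated_label"] = item["label"]
        elif answer:
            item["validated_label"] = answer  # reviewer's correction wins
        else:
            item["validated_label"] = None  # left unresolved for a second pass
        item["reviewed_by_human"] = True
        validated.append(item)
    out_path.write_text(json.dumps(validated, ensure_ascii=False, indent=2))
```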
Status and Outputs
- Status: Under active development and in use across multiple projects (e.g., parliamentary discourse on AI and immigration, Reddit-based immigration studies).
- Outputs:
  - A shared data infrastructure for the Bridging Divides program.
  - Scripts and documentation for reproducible scraping and analysis.
  - Datasets and case studies on immigration debates and narratives in Canada and beyond.