1. Business Need
A healthcare organization needed to improve how it accesses and retrieves data for epidemiological research. The existing process was manual: field selection, population filtering, and dataset structuring were all handled through Word documents and email threads, which led to inconsistencies, errors, and inefficiencies across research teams.
2. The Challenge
- Aggregating data from diverse sources (Data Lake, databases)
- Translating technical tables into human-readable research fields
- Coordinating between multiple roles: researcher, epidemiologist, and data engineer
- Maintaining consistency and version control across iterations
- Supporting external inputs (e.g., CSV files)
3. The Solution
Phase 1: ETL to Data Lake
A scheduled, automated ETL process built with Talend loads data from the various source systems into Cloudera Hadoop. The data is standardized and stored in Parquet format for structured access.
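To illustrate what "standardized" can mean in practice, here is a minimal sketch of mapping rows from two sources onto one shared schema. It is not the Talend job itself (Talend jobs are designed in its own tooling); the source names, column names, target schema, and date formats below are all assumptions made up for the example.

```typescript
// Illustrative only: every name and format here is hypothetical.
type RawRow = Record<string, string>;

interface StandardRow {
  patient_id: string;
  event_date: string; // ISO 8601 date (YYYY-MM-DD)
  source_system: string;
}

// Assumed per-source mapping from technical column names to the shared schema.
const COLUMN_MAP: Record<string, { patientId: string; eventDate: string }> = {
  lab_db: { patientId: "PAT_NO", eventDate: "TEST_DT" },
  clinic_db: { patientId: "patient", eventDate: "visit_date" },
};

// Converts DD/MM/YYYY to YYYY-MM-DD; values already in ISO form pass through.
function toIsoDate(value: string): string {
  const trimmed = value.trim();
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(trimmed);
  if (m) return `${m[3]}-${m[2]}-${m[1]}`;
  if (/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) return trimmed;
  throw new Error(`Unrecognized date format: ${value}`);
}

// Maps one raw row from a named source onto the shared schema.
function standardize(source: string, row: RawRow): StandardRow {
  const map = COLUMN_MAP[source];
  if (!map) throw new Error(`No mapping for source: ${source}`);
  const id = row[map.patientId];
  const date = row[map.eventDate];
  if (id === undefined || date === undefined) {
    throw new Error(`Row from ${source} is missing a mapped column`);
  }
  return { patient_id: id.trim(), event_date: toIsoDate(date), source_system: source };
}
```

Once every source is mapped to the same column names and value formats, the rows can be written to a single columnar layout such as Parquet and queried uniformly.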
Phase 2: Web-Based Research Management
A full-featured web application (React + Spring Boot) was developed. Its main features:
- Support for calculated fields and external research documents
- Human-readable catalog of all Ministry of Health data sources
- Advanced field filtering with version history per iteration
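The features listed above can be sketched as a small data model: a catalog that ties human-readable labels to technical tables, and an extraction protocol that stores each iteration as a new version. This is a minimal sketch under assumptions; the type names, properties, and filter operators are hypothetical and do not come from the actual application.

```typescript
// Hypothetical model: none of these names are taken from the real system.
interface CatalogField {
  id: string;           // stable identifier referenced by protocols
  label: string;        // human-readable name shown to researchers
  sourceTable: string;  // technical table in the data lake
  sourceColumn: string; // technical column name
}

interface FieldFilter {
  fieldId: string;
  operator: "eq" | "gte" | "lte";
  value: string | number;
}

interface ProtocolVersion {
  version: number;
  fieldIds: string[];
  filters: FieldFilter[];
  createdAt: string; // ISO timestamp
}

class ExtractionProtocol {
  private catalog: Map<string, CatalogField>;
  private versions: ProtocolVersion[] = [];

  constructor(catalog: Map<string, CatalogField>) {
    this.catalog = catalog;
  }

  // Stores a new iteration; earlier versions are kept unchanged.
  save(fieldIds: string[], filters: FieldFilter[]): ProtocolVersion {
    for (const id of [...fieldIds, ...filters.map((f) => f.fieldId)]) {
      if (!this.catalog.has(id)) throw new Error(`Unknown catalog field: ${id}`);
    }
    const saved: ProtocolVersion = {
      version: this.versions.length + 1,
      fieldIds: [...fieldIds],
      filters: filters.map((f) => ({ ...f })),
      createdAt: new Date().toISOString(),
    };
    this.versions.push(saved);
    return saved;
  }

  history(): readonly ProtocolVersion[] {
    return this.versions;
  }

  // Lists field ids added and removed between two saved versions.
  diff(from: number, to: number): { added: string[]; removed: string[] } {
    const a = this.getVersion(from).fieldIds;
    const b = this.getVersion(to).fieldIds;
    return {
      added: b.filter((id) => !a.includes(id)),
      removed: a.filter((id) => !b.includes(id)),
    };
  }

  private getVersion(version: number): ProtocolVersion {
    const found = this.versions.find((v) => v.version === version);
    if (!found) throw new Error(`No such version: ${version}`);
    return found;
  }
}
```

Keeping each iteration as a separate, append-only version is one simple way to make changes auditable: a reviewer can compare any two versions instead of reconstructing them from email threads.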
Phase 3: Output Generation
The system generates both HTML previews and XLSX files for structured data delivery. Final outputs are sent to the data engineering team for extraction execution.
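As a rough illustration of the HTML preview step, the sketch below renders a list of selected fields as an HTML table, escaping user-entered text. It assumes a simple row shape (`PreviewRow`) and function names invented for this example; the real system's output layout is not described in the source, and XLSX generation, which would need a spreadsheet library, is left out.

```typescript
// Hypothetical row shape for one selected research field.
interface PreviewRow {
  label: string;  // human-readable field name
  source: string; // technical table.column reference
  filter: string; // filter description, empty if none
}

// Escapes characters that have special meaning in HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Builds an HTML fragment with a heading and one table row per field.
function renderPreview(title: string, rows: PreviewRow[]): string {
  const body = rows
    .map(
      (r) =>
        `<tr><td>${escapeHtml(r.label)}</td><td>${escapeHtml(r.source)}</td><td>${escapeHtml(r.filter)}</td></tr>`
    )
    .join("\n");
  return [
    `<h1>${escapeHtml(title)}</h1>`,
    "<table>",
    "<tr><th>Field</th><th>Source</th><th>Filter</th></tr>",
    body,
    "</table>",
  ].join("\n");
}
```

Escaping matters here because field labels and filter descriptions are typed by users and may contain characters such as `<` (for example, an "Age < 18" filter).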
4. Results
- Reduced turnaround time from days to hours
- Unified, auditable extraction protocol
- Minimized manual errors
5. Technologies Used
- Frontend: React + TypeScript
- Backend: Spring Boot (REST API)
- Database: PostgreSQL
- Data Lake: Cloudera (HDFS)
- ETL: Talend

