Hello!! I am Advait Khawase,
a Data Engineer dedicated to building robust, data-driven solutions.

My Skills

Databricks

PySpark

Spark

Hadoop

Kafka

Airflow

Azure Data Factory

Azure Cloud

AWS Cloud

Git

Shell Scripting

NiFi

LangGraph

SQL

Experience

Data Engineer, Dec 2023 - Present

Jio Platforms Limited (JPL)

Delivered 300+ workflows and co-owned 6,000+ Delta Bronze tables across Oracle, MSSQL, and MongoDB sources via a reusable ingestion framework: a 19-column CSV config and a single parameterized notebook eliminated per-workflow boilerplate. (Databricks Asset Bundles, Auto Loader, ADLS Gen2)
Built and deployed 100+ serverless streaming pipelines ingesting from 20 telco BSS source systems (Ericsson, CleverTap, NSE, NSD, EPC XML) into Delta Bronze with exactly-once semantics. (Kafka, Airflow, DLT, Unity Catalog, ADLS Gen2)
Migrated 50+ Partner Center workflows (1M–10M+ daily records) from legacy Informatica to the Databricks gold layer, removing per-workflow license dependency and improving runtime by ~15%.
Engineered incremental file-detection across a NAS → ADLS document pipeline (DocFlow), eliminating redundant reprocessing of 10,000+ daily files (PDFs, DOCXs, JPGs). (Airflow, ADF, Databricks)
Drove the cloud-migration architecture POC — recommending NiFi for 1TB+/unit historical ingestion and ADF for full & incremental loads, adopted across 400+ pipelines.
Built a Trade Compliance aggregation pipeline powering downstream BI reports — consolidated 50+ source tables (10K–1M+ records each) into a unified Databricks aggregate, cutting report-refresh runtime to 30 min via caching, repartitioning, and broadcast joins.
Maintained 99% pipeline uptime over 12+ months by monitoring 200+ Airflow DAGs, resolving Spark/YARN bottlenecks, and implementing failure alerting.
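
The config-driven ingestion pattern from the first bullet can be sketched roughly as below. This is a minimal illustration, not the production framework: the five column names (`source_type`, `source_table`, `target_table`, `load_type`, `watermark_column`) and the helpers `load_jobs` and `build_query` are hypothetical stand-ins, since the actual 19-column layout is not described here, and only SQL-style sources are covered.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IngestionJob:
    """One config row = one source table to land in Bronze (hypothetical columns)."""
    source_type: str
    source_table: str
    target_table: str
    load_type: str                      # "full" or "incremental"
    watermark_column: Optional[str]     # only used for incremental loads


def load_jobs(config_text: str) -> List[IngestionJob]:
    """Parse the CSV config into job specs that a single notebook could loop over."""
    jobs = []
    for row in csv.DictReader(io.StringIO(config_text)):
        jobs.append(IngestionJob(
            source_type=row["source_type"].strip().lower(),
            source_table=row["source_table"].strip(),
            target_table=row["target_table"].strip(),
            load_type=row["load_type"].strip().lower(),
            watermark_column=row["watermark_column"].strip() or None,
        ))
    return jobs


def build_query(job: IngestionJob, last_watermark: Optional[str] = None) -> str:
    """Build the extraction query; incremental jobs filter on their watermark column."""
    base = f"SELECT * FROM {job.source_table}"
    if job.load_type == "incremental" and job.watermark_column and last_watermark is not None:
        return f"{base} WHERE {job.watermark_column} > '{last_watermark}'"
    return base


SAMPLE_CONFIG = (
    "source_type,source_table,target_table,load_type,watermark_column\n"
    "oracle,SALES.ORDERS,bronze.orders,incremental,UPDATED_AT\n"
    "mssql,dbo.Partners,bronze.partners,full,\n"
)
```

Adding a new table under this design means adding a config row rather than writing a new notebook, which is where the boilerplate saving would come from.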

About Me

Data Engineer at Jio Platforms co-owning 300+ workflows across 6,000+ Delta tables in trade compliance, partner operations, and document processing. I design and deliver end-to-end data solutions — from raw ingestion to gold-layer analytics — on Databricks, Hive/HDFS, and Azure.

Specialized in Medallion architecture with schema evolution, data quality enforcement, and exactly-once streaming at scale. Core stack: PySpark, Databricks, ADF, Airflow, Kafka, NiFi. I've migrated 50+ workflows off legacy Informatica onto a single lakehouse and led the NiFi/ADF cloud-migration POC adopted across 400+ pipelines.

Beyond enterprise pipelines, I build AI-augmented tools — including lazyfruit.com, a full-stack job platform powered by Gemini and LangGraph that received a pre-seed funding offer (operating bootstrapped by choice). I hold a B.Tech in Computer Science (CGPA 8.7) and am an AWS Certified Cloud Practitioner.

Projects

AI-Powered Job Aggregation Platform

Python, Playwright, Gemini API, LangGraph, FastAPI, PostgreSQL, AWS

Full-stack job platform with a rule-free, stateful scraper that keeps 3,000+ live listings fresh across hundreds of ATS layouts (Workday, Greenhouse, Lever, iCIMS). Gemini extracts structured data from cleaned HTML while plain Python owns all control flow — a two-stage pipeline (sequential listing discovery → parallel detail extraction across 5 async workers on one playwright-stealth browser) at ~3.6s/job, with CDC-style lifecycle tracking, schema normalization, and dedup on incremental loads into PostgreSQL. Cheap-by-default model escalation and a FastAPI + WebSocket monitoring UI; orchestration is being rebuilt on LangGraph. Received a pre-seed funding offer (bootstrapped by choice).
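
The two-stage shape described above (sequential discovery, then parallel detail extraction across five workers) can be sketched with plain `asyncio`. This is an assumption-laden illustration rather than the platform's code: `discover_listings`, `extract_detail`, and `run_pipeline` are hypothetical names, and the page loads and Gemini extraction are replaced with placeholders.

```python
import asyncio

NUM_WORKERS = 5  # mirrors the five async workers mentioned above


async def discover_listings(pages):
    """Stage 1: walk listing pages one at a time and collect job URLs."""
    urls = []
    for page in pages:
        await asyncio.sleep(0)  # placeholder for a real browser page load
        urls.extend(page)
    return urls


async def extract_detail(url):
    """Placeholder for loading a job page and extracting structured fields."""
    await asyncio.sleep(0)
    return {"url": url, "title": url.rsplit("/", 1)[-1]}


async def worker(queue, results):
    """Stage 2 worker: pull URLs off the shared queue until cancelled."""
    while True:
        url = await queue.get()
        try:
            # Results are keyed by URL, so a URL seen twice collapses to one record.
            results[url] = await extract_detail(url)
        finally:
            queue.task_done()


async def run_pipeline(pages):
    urls = await discover_listings(pages)
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(NUM_WORKERS)]
    await queue.join()          # wait until every queued URL has been processed
    for task in workers:
        task.cancel()           # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Keeping control flow in ordinary Python like this leaves the model responsible only for the extraction step, which matches the division of labor the description gives.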

Read the engineering write-up

PySpark DataFrame Lineage Tool

Python, Python AST, PySpark, ReactFlow, Dagre, React, Vite

Static analysis tool parsing PySpark via Python AST — no execution required — extracting 10+ operation types into a node-edge graph. React frontend renders interactive dataframe dependency graphs with column-level lineage and Dagre auto-layout.

Real-Time Data Streaming Pipeline

Apache Airflow, Kafka, Spark, Cassandra, Docker, PostgreSQL, Python

Designed and implemented a real-time data pipeline for tracking new users with Apache Airflow, Kafka, Spark, and Cassandra, with all components containerized via Docker. A scheduled Airflow DAG fetched user data from an API and published it to a Kafka topic; Spark then processed the stream and wrote it to Cassandra, all with a latency of under 10 seconds.

Stock API Batch Processing

Python, AWS S3, AWS Glue, Amazon Redshift, REST API

Developed a data pipeline that collects real-time stock market data through an external API and stores it in Amazon S3. An AWS Glue crawler automatically detects and catalogs the latest data in S3. The data is then processed using AWS Glue jobs and loaded into Amazon Redshift, enabling efficient querying and advanced analysis for reporting and business intelligence use cases.

Contact Me

Want to start a Conversation?
Let's get Connected.

Open to new opportunities, collaborations, and interesting conversations. I'll get back to you as soon as possible.

Send Email Resume

linkedin.com/in/advait-khawase

GitHub

github.com/advaitkhawase15
