
Computer science undergraduate specializing in data engineering and artificial intelligence. Developed high-throughput data processing systems in Python and Apache Spark, covering both batch and real-time streaming architectures. Built retrieval-augmented generation (RAG) pipelines, constructed knowledge graphs with Neo4j, and implemented hybrid search with Elasticsearch. Proficient in automated data acquisition, HTML parsing, and containerizing environments with Docker, with a focus on scalable, reliable software.
Assisted in developing a high-throughput data ingestion pipeline that collects tick-by-tick stock data from financial APIs and WebSockets.
Supported engineering of the speed layer, using Spark Streaming to process and aggregate raw market events with millisecond latency.
Contributed to implementing real-time windowing functions that transform raw tick data into OHLCV candlesticks across multiple timeframes (1m, 5m, 15m).
GitHub: https://github.com/baoduy2048/bigdata
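The windowing step above can be illustrated with a minimal sketch in plain Python. This is not the project's Spark Streaming code: the tick layout `(timestamp_seconds, price, volume)`, the function name `to_candles`, and the sample values are assumptions chosen only to show how tumbling windows turn ticks into OHLCV candles.

```python
def to_candles(ticks, window_seconds):
    """Aggregate (timestamp_seconds, price, volume) ticks into OHLCV candles.

    Ticks are grouped into tumbling (non-overlapping) windows of
    `window_seconds`; each candle is keyed by its window start time.
    """
    candles = {}
    # Process ticks in time order so "open" is the first price seen in a
    # window and "close" is the last.
    for ts, price, volume in sorted(ticks, key=lambda t: t[0]):
        start = ts - ts % window_seconds  # start of the window containing ts
        candle = candles.get(start)
        if candle is None:
            candles[start] = {
                "open": price,
                "high": price,
                "low": price,
                "close": price,
                "volume": volume,
            }
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price
            candle["volume"] += volume
    return candles


# Hypothetical sample ticks: (timestamp in seconds, price, volume).
ticks = [(0, 100.0, 5), (20, 101.5, 3), (59, 99.0, 2), (60, 99.5, 4), (310, 102.0, 1)]

one_minute = to_candles(ticks, 60)    # 1m candles
five_minute = to_candles(ticks, 300)  # 5m candles
```

Running the same function with different window sizes yields the different timeframes; in a streaming engine the equivalent grouping would typically be expressed with the engine's own window operators rather than a dictionary in memory.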
Assisted in developing a web crawling system using BeautifulSoup4 to collect legal documents from government HTML portals.
Supported the design of an HTML parsing pipeline to convert unstructured web data into structured formats while maintaining hierarchical relationships.
Contributed to building a knowledge graph in Neo4j that models complex legal hierarchies and cross-references between different sets of laws.
GitHub: https://github.com/baoduy2048/RAG-KG
Programming languages: Python, C, SQL
Data processing: Apache Spark, Pandas, NumPy
AI and RAG: LangChain, vector search, LLM integration
Web scraping and extraction: BeautifulSoup, HTML parsing