Hello I'm

Sai Santosh Eedupuganti

Data Engineer/Data Analyst

About Me

Hi, I'm Sai, a seasoned Business Intelligence Engineer with a proven track record at Amazon, Verizon, Capital One, and SUNY Albany. At Amazon, I spearheaded the development of analytical solutions, driving significant operational improvements and cost reductions. My expertise in SQL, Python, PySpark, Hadoop, AWS, and Tableau has been instrumental in analyzing complex datasets and designing ETL pipelines. I am passionate about transforming data into actionable insights, and my career reflects a commitment to enhancing business processes through innovative data-driven strategies. I am a highly organized individual who collaborates well in teams and communicates clearly with business stakeholders and team members.

  • Python
  • SQL
  • PySpark
  • Apache Airflow
  • AWS QuickSight
  • Tableau
  • Snowflake
  • AWS S3
  • AWS EMR
  • MongoDB
  • Data warehousing
  • Databricks
  • MySQL
  • Git
  • Natural Language Processing
  • Dask
  • Pandas
  • Numpy
  • Scikit-Learn
  • NLTK
  • Matplotlib
  • Seaborn
  • NetworkX
  • AWS EC2
  • AWS SageMaker
Download CV

What I do

Data Analysis

Extracting, cleaning, and transforming data, and deriving insights that support better decision-making, using open-source Python libraries and data analysis tools.

Data Engineering

Building pipelines using modern ETL techniques, designing schemas based on the business use case, and migrating data from one platform to another.

Data Visualization

Creating meaningful plots and dashboards using Qlik Sense, Tableau, and open-source libraries, and visualizing high-dimensional data using dimensionality reduction methods and topological tools.

Technical Skills

Python
9/10
SQL
9/10
PySpark
8/10
Tableau/QuickSight/Qlik Sense
9/10
Redshift/Snowflake
8/10
Apache Airflow
9/10
AWS
8/10

Data Science Skills

  • Data Analysis
  • Building Pipelines
  • Data Visualization
  • Machine Learning

Recent Portfolio

  • All
  • Machine Learning
  • Data Analysis
  • Data Visualization

Social Network Analysis

Developed a link prediction model that estimates whether two users will become friends in the future.

Created graph-based features such as shortest path, Jaccard and cosine distance, PageRank, Adamic-Adar index, and Katz centrality, then built a random forest model and tuned its hyperparameters for the best precision and recall, achieving an F1 score of 93.
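As an illustration of two of the graph-based features named above, here is a minimal sketch, not the project's actual code. It assumes an undirected graph stored as a dictionary of neighbor sets; the toy graph and the function names are hypothetical. Note that it computes the Jaccard coefficient (a similarity), from which a distance can be derived as one minus the coefficient.

```python
import math


def jaccard(graph, u, v):
    """Jaccard coefficient: shared neighbors divided by all neighbors of u or v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def adamic_adar(graph, u, v):
    """Adamic-Adar index: sum of 1 / log(degree) over the common neighbors."""
    return sum(
        1.0 / math.log(len(graph[w]))
        for w in graph[u] & graph[v]
        if len(graph[w]) > 1  # guard against log(1) == 0
    )


# Hypothetical toy friendship graph (undirected, adjacency sets).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

# Feature values for the candidate link (a, d), which is not yet an edge.
features = {
    "jaccard": jaccard(graph, "a", "d"),
    "adamic_adar": adamic_adar(graph, "a", "d"),
}
```

Values like these, computed for every candidate pair of users, could then serve as input columns for a classifier such as a random forest.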

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • NetworkX
  • Random forest
  • XGBoost
View Code

Quora Question Pair Similarity

Developed a model to predict whether a pair of questions are duplicates, which makes it possible to instantly provide answers to questions that have already been answered.

Performed text feature engineering and implemented random forest, logistic regression, linear SVM, and XGBoost models with hyperparameter tuning, achieving a log loss of 0.313.
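For readers unfamiliar with the metric, the following is a minimal sketch of binary log loss under its standard definition. It is not the project's evaluation code; the function name and the clipping constant `eps` are illustrative assumptions.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# A model that always predicts 0.5 scores ln(2), roughly 0.693.
uninformed = binary_log_loss([1, 0], [0.5, 0.5])

# Confident, correct predictions score closer to 0.
confident = binary_log_loss([1, 0], [0.9, 0.1])
```

Lower is better, so a score of 0.313 sits well below the 0.693 that an uninformed 50/50 guess would receive.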

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • Plotly
  • NLTK
  • BeautifulSoup
  • XGBoost
View Code

Taxi demand prediction in NYC

Cleaned New York City taxi data and performed exploratory data analysis and feature engineering using Python and Dask.

Segmented the data using K-means clustering and applied baseline models along with linear regression, random forest, and XGBoost regressors to predict the number of taxi pickups in a given location during a particular time interval, achieving a mean absolute percentage error of 11.57.
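Assuming the error metric reported above is the mean absolute percentage error (MAPE), a minimal sketch of its standard per-sample definition looks like this. The function name is hypothetical, and skipping intervals with zero actual pickups is an assumption made here to avoid division by zero, not necessarily what the project did.

```python
def mean_absolute_percentage_error(actual, predicted):
    """MAPE in percent, averaged over samples with a non-zero actual value."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Hypothetical pickup counts for two time intervals in one region.
actual_pickups = [100, 200]
predicted_pickups = [110, 180]
error = mean_absolute_percentage_error(actual_pickups, predicted_pickups)
```

Both predictions in this toy example are off by 10% of the actual count, so the resulting error is 10.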

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • Matplotlib
  • Sqlite3
  • Dask
  • Folium
  • Gpxpy
  • XGBoost
View Code

Personalized Cancer Diagnosis

The goal of this project is to classify genetic variations/mutations based on evidence from text-based clinical literature. Given the gene, the variation, and text from published research, the task is to determine the mutation type.

Performed univariate analysis on text features and gene features, then implemented various machine learning models and stacked models.

Achieved an average log loss of 1.0578 using a stacking classifier after the necessary hyperparameter tuning.

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • Matplotlib
  • Sqlite3
  • Seaborn
  • Scipy
  • NLTK
  • Naive Bayes
  • KNN
  • Logistic regression
  • Linear SVM
  • Random forest
  • Stacking classifier
View Code

Human Activity Recognition

The goal of this project is to predict which of six activities a smartphone user is performing during a given 2.56-second time window.

Performed exploratory data analysis (EDA), applied dimensionality reduction methods such as T-SNE, and implemented classification models and an LSTM.

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • Matplotlib
  • Keras
  • Seaborn
  • hyperas
  • T-SNE
  • Linear SVM
  • RBF SVM
  • Decision Trees
  • Random Forest
  • Gradient Boosted Decision tree
  • LSTM
View Code

Credit Card Fraud Detection

The aim of this project is to detect fraudulent credit card transactions.

Performed exploratory data analysis on the full dataset, removed outliers using "LocalOutlierFactor", and then trained a KNN classifier to predict whether a transaction is fraudulent. Also applied T-SNE to visualize fraudulent and genuine transactions in 2-D.

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • Matplotlib
  • T-SNE
  • K-NN
View Code

Heart attack prediction

The motivation for this project is to predict whether a patient is at risk of a heart attack.

Collected data from the Framingham Heart Study, performed EDA, generated synthetic data using SMOTE to balance the classes, used a random forest model to select features, and implemented various models to identify patients likely to have a heart attack.

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • Matplotlib
  • SMOTE
  • K-NN
  • Logistic regression
  • SVM
View Code

Stack Overflow Tag Prediction

Collaborated with a team of 3 to design a model that suggests tags for questions posted on Stack Overflow.

Interpreted the data, performed statistical analysis of question content, framed a multi-class classification problem, and built logistic regression and support vector models to predict tags from question content.

Achieved an average F1 score of 0.51 after the necessary hyperparameter tuning.

  • Python
  • Numpy
  • Pandas
  • Scikit-learn
  • Matplotlib
  • scipy
  • NLTK
  • sqlite3
View Code

Instacart Market Basket Analysis

The goal of the competition is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

Concluded that reorder ratios are considerably higher in the early morning than in the later half of the day.

  • Python
  • Numpy
  • Pandas
  • Seaborn
  • Bokeh
View Code

Uber's Ridership Data Analysis

Visualize Uber's ridership growth in NYC during the period

Characterize the demand based on identified patterns in the time series.

Estimate the value of the NYC market for Uber, and its revenue growth.

Other insights about the usage of the service.

  • Python
  • Numpy
  • Pandas
  • Seaborn
  • Plotly
View Code

Topological data analysis using Mapper

The Mapper algorithm is a method of replacing a topological space with a simpler one, known as a simplicial complex, which captures topological and geometric features of the original space. The purpose is not only to obtain a visualization of high-dimensional data sets in 3-D, but also to exploit the mathematical properties of simplicial complexes, which allow algebraic calculations that facilitate the classification of the topological features of the complex and, by extension, of the original topological space.

Implemented various lenses and clustering algorithms on the Iris dataset to derive insights from petal length, petal width, sepal length, and sepal width.

  • Python
  • Numpy
  • Pandas
  • Scikit-Learn
  • Kmapper
View Code

Zomato web scraping

Scraped area-wise restaurant names, addresses, average price for two people, cuisine, rating, and number of votes.

  • Python
  • requests
  • Pandas
  • BeautifulSoup
View Code