Hello, I'm Kiran.
Welcome to this short summary, my answer to "Tell me something about yourself".
Let me introduce myself without taking much of your time.
My name is Kiran Kumar Sahu, and I am pursuing a Master's in Computer Applications with an AI/ML specialization.
Photography is my hobby, but to pursue it I am looking for a job that also does justice to my skills. This page is my resume, covering the details of my academic journey, and I hope you will check out the skills and projects listed below. I am also active on LinkedIn, where you can find me and my certifications.
Thank you
I created a dashboard using Excel and SQL to analyze road accidents, categorized by casualty severity (fatal, serious, slight), vehicle types, road and surface types, locations (urban or rural), lighting conditions, and date-based trends. The dashboard also compares current year vs previous year casualty data.
This dashboard uses Excel and SQL to analyze various aspects of road accidents. The data is categorized by the severity of casualties, including fatal, serious, and slight casualties, providing insight into how different types of accidents affect overall safety.
The dashboard also breaks down casualties by vehicle types, covering cars and a wide range of
other vehicles, allowing for a detailed comparison of how each contributes to accident
statistics. It includes a comparison of current year vs previous year data to track changes and
improvements over time.
Another key element of the analysis is the categorization of accidents by road type (e.g.,
highways, local roads) and surface type (e.g., wet, dry), which helps in understanding the
environmental factors contributing to accidents. The dashboard also distinguishes between
accidents occurring in urban vs rural areas, shedding light on how location impacts casualty
rates.
I also analyzed the effect of lighting conditions on accident severity, comparing accidents that
occurred in daylight vs dark. In addition, the dashboard allows for date-based analysis,
enabling the user to track accident trends over specific periods.
This comprehensive analysis provides a clear and detailed view of road safety trends and helps
identify key areas for improvement in road and vehicle safety measures.
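The dashboard itself was built in Excel with SQL; purely as an illustration of the severity breakdown and year-over-year comparison, here is a hedged pandas sketch with assumed column names (accident_year, casualty_severity):

```python
import pandas as pd

# Illustrative only: the real dashboard performs this aggregation in SQL/Excel.
accidents = pd.read_csv("road_accidents.csv")   # assumed columns: accident_year, casualty_severity

# Casualty counts by severity, compared for the current vs previous year
severity_by_year = (accidents
                    .groupby(["accident_year", "casualty_severity"])
                    .size()
                    .unstack(fill_value=0))
current = severity_by_year.index.max()
previous = current - 1
print(severity_by_year.loc[[previous, current]])
```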
I created a dashboard for a computer shop that sells a wide range of components, from budget to high-end, for both gaming and productivity. The dashboard revealed that monitors, CPUs, and graphics cards are the most profitable components; it also compares supervisor performance, highlights the top states by sales and profit, and features an animated graph showing trends over time.
I developed a detailed dashboard for a computer shop that sells a comprehensive range of
computer-related components, including gaming and productivity-focused items. The products range
from budget-friendly options to high-end components, covering everything from A to Z in the
computer world. Through the data, I discovered that monitors, CPUs, and graphics cards generate
significantly more profit compared to other components, providing valuable insights into the
shop's top-performing products.
Additionally, I included a comparison of supervisor performance, showing which supervisors
contributed the most profit, and this analysis can be broken down by year and month for a more
detailed view. The dashboard also highlights the top-performing states in terms of sales,
profit, and specific product categories, giving the store a clear view of geographical
performance.
One of the standout features of this dashboard is the animated graph I created, which visualizes
sales and profit trends over time, providing a dynamic way to track changes and performance
across different periods. This dashboard offers a clear, data-driven picture of the store's
operations and helps inform future sales and marketing strategies.
I developed a dashboard for a hardware shop that analyzes total sales, quantity sold, top regions, markets, customer segments, and product types. The dashboard visualizes trends based on year and month for effective decision-making.
I developed a comprehensive dashboard for a hardware store that specializes in products such as
brick, mortar, sand, and various other hardware items. The goal of the dashboard was to provide
actionable insights by leveraging the store's sales data. It analyzes key performance indicators
such as total sales, total quantity sold, and how these metrics vary across different regions
and markets. The dashboard also differentiates between distributed products and the store's own
products, allowing for a more granular understanding of which product types are performing
better.
One of the key features of the dashboard is its ability to identify top-performing regions and
markets, giving insights into where the shop is most successful. I also included a sales
breakdown by customer, which provides information on customer buying behavior, allowing the
store to identify its most valuable customers.
Furthermore, the dashboard enables the visualization of data based on specific time periods,
including by year and month, making it easy to spot trends and seasonal patterns in sales. This
time-based analysis helps the shop make better decisions regarding inventory management,
marketing strategies, and targeting specific markets during peak periods.
By providing clear and interactive visualizations, the dashboard offers the store a powerful
tool to understand its performance at a detailed level. This ultimately helps in optimizing
business strategies, improving product offerings, and focusing efforts on areas that generate
the most value for the business.
This project is a facial attendance system built with Flask, TensorFlow, and OpenCV. It captures faces through a webcam, recognizes them with a trained deep learning model, and records attendance. The system lets users register and mark their attendance through a GUI, stores each registered user's face data, and flags unregistered faces instead of marking them. A MobileNetV2-based model performs the face recognition, and attendance is saved to a CSV file.
This facial attendance system is designed for real-time face detection, recognition, and
attendance marking using deep learning and web technologies. The system is built using the Flask
web framework for the user interface, TensorFlow/Keras for deep learning model development, and
OpenCV for image and video capture.
Key components of the system include:
Face Detection:
The system uses OpenCV's CascadeClassifier to detect faces from
webcam input. It ensures real-time face detection using Haar cascades and captures images when
users interact with the system.
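A minimal sketch of such a Haar-cascade detection loop, assuming the default webcam and OpenCV's bundled frontal-face cascade; the scale factor and neighbor count are illustrative:

```python
import cv2

# OpenCV ships a set of Haar cascades; load the frontal-face one
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detectMultiScale returns (x, y, w, h) boxes for each detected face
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("Face detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```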
Face Recognition Model:
The face recognition model is based on the MobileNetV2
architecture, which is pretrained on the ImageNet dataset. The model is fine-tuned for the task
of facial recognition. It outputs a softmax-activated vector indicating the recognized user. The
model is trained on resized face images, and a dropout layer is added for regularization to
prevent overfitting.
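A hedged Keras sketch of a MobileNetV2-based classifier of this kind; the input size (224x224), dropout rate, and optimizer are assumptions rather than the project's exact settings:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

def build_face_model(num_users, img_size=(224, 224)):
    # MobileNetV2 pretrained on ImageNet, without its classification head
    base = MobileNetV2(input_shape=img_size + (3,), include_top=False, weights="imagenet")
    base.trainable = False  # transfer learning: freeze the convolutional base

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.3),                            # regularization against overfitting
        layers.Dense(num_users, activation="softmax"),  # one class per registered user
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```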
Training and Retraining:
If new faces are added, the system supports retraining by
either modifying the output layer to accommodate new classes (new users) or by creating a new
model with MobileNetV2 as the base. Transfer learning is employed, where only the last few
layers are retrained for faster training and better generalization.
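One way such a head replacement could look in Keras, assuming the Sequential layout sketched above; the project may instead rebuild a fresh MobileNetV2-based model, as described:

```python
import tensorflow as tf

def extend_for_new_users(model, new_num_users):
    # Drop the old softmax head and attach a wider one for the new user count.
    # Assumes the Sequential layout from build_face_model() above.
    model.pop()
    model.add(tf.keras.layers.Dense(new_num_users, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```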
Attendance Marking:
Once a face is recognized, the system logs the user's name,
roll number, and the time of recognition into a CSV file named after the current date. The
system also checks for unidentified users using a confidence threshold to avoid marking
attendance for unregistered faces.
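An illustrative helper for this kind of CSV logging; the Attendance-<date>.csv naming scheme and the 0.80 confidence threshold are assumptions, not the project's exact values:

```python
import csv
import os
from datetime import datetime

CONFIDENCE_THRESHOLD = 0.80  # below this, the face is treated as unregistered

def mark_attendance(name, roll_no, confidence):
    if confidence < CONFIDENCE_THRESHOLD:
        return False  # do not log unidentified users
    today = datetime.now().strftime("%Y-%m-%d")
    path = f"Attendance/Attendance-{today}.csv"
    os.makedirs("Attendance", exist_ok=True)
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["Name", "Roll", "Time"])
        writer.writerow([name, roll_no, datetime.now().strftime("%H:%M:%S")])
    return True
```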
Flask-based GUI:
The project includes a Flask-based web interface where users can
interact with the system, view attendance logs, and register their presence. The system uses
HTML templates to render the attendance page and show live video feed for real-time face
detection.
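A minimal Flask route in the spirit of this interface; the home.html template name and the CSV location reuse the assumptions from the logging sketch above:

```python
import os
from datetime import datetime

import pandas as pd
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def home():
    # Read today's attendance CSV (same naming scheme as the logging sketch above)
    today = datetime.now().strftime("%Y-%m-%d")
    path = f"Attendance/Attendance-{today}.csv"
    if os.path.exists(path):
        records = pd.read_csv(path)
    else:
        records = pd.DataFrame(columns=["Name", "Roll", "Time"])
    # "home.html" is an assumed template rendering the attendance table
    return render_template("home.html", records=records.to_dict("records"),
                           total=len(records))

if __name__ == "__main__":
    app.run(debug=True)
```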
Face Registration and Attendance:
The project ensures that all registered users'
face images are saved in a structured folder format, and attendance is recorded in a CSV file
specific to the current date. The system can extract attendance data for a particular day and
display the users who have marked their presence, along with the time.
In this project for credit card fraud detection, several tools and techniques are employed to preprocess, analyze, and model the dataset. The dataset is first loaded using pandas, and exploratory data analysis is performed with matplotlib and seaborn to visualize the distributions of transaction times and amounts. The boxcox transformation from the scipy library is used to normalize the skewed transaction amount data.
The SGDClassifier from scikit-learn is used for classification, with GridSearchCV applied for hyperparameter tuning, specifically optimizing the loss, penalty, and alpha parameters. Metrics such as the accuracy score, classification report, and confusion matrix are used to evaluate the model's performance. Finally, pickle is employed to save and reload the trained model for future predictions.
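A rough sketch of this pipeline under assumed column names (Amount, Class, as in the common Kaggle credit card dataset); the file paths and parameter grid are illustrative:

```python
import pickle

import pandas as pd
from scipy.stats import boxcox
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("creditcard.csv")                      # path and columns are assumptions
df["Amount"], _ = boxcox(df["Amount"] + 1)              # +1 because boxcox requires positive values

X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

param_grid = {
    "loss": ["hinge", "log_loss"],
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))

with open("fraud_sgd.pkl", "wb") as f:                  # persist the tuned model
    pickle.dump(search.best_estimator_, f)
```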
This project focuses on identifying fraudulent credit card transactions using machine learning techniques, training models to detect anomalies and fraudulent behavior with both supervised and unsupervised learning methods. Various classification algorithms, such as Logistic Regression, Random Forest, and XGBoost, were used alongside hyperparameter tuning to achieve high precision and recall scores.
In this movie recommendation system project, various tools and techniques are used to recommend movies based on multiple parameters such as genres, keywords, cast, director, and more. The dataset is first preprocessed using pandas, with missing values filled appropriately. TF-IDF (Term Frequency-Inverse Document Frequency) from the sklearn library is used to convert text data (movie attributes) into feature vectors, capturing the importance of words within the dataset.
Cosine similarity is then applied to measure the similarity between the feature vectors, which helps identify the movies most similar to the one entered by the user. Difflib is used to find close matches for the user-provided movie title, ensuring the search is accurate. Finally, the most similar movies are displayed by sorting the similarity scores and presenting the top recommendations.
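A condensed sketch of this flow, assuming a movies.csv with title, genres, keywords, cast, and director columns:

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies = pd.read_csv("movies.csv")                       # file and column names are assumptions
features = ["genres", "keywords", "cast", "director"]
combined = movies[features].fillna("").agg(" ".join, axis=1)

vectors = TfidfVectorizer().fit_transform(combined)      # text -> TF-IDF feature vectors
similarity = cosine_similarity(vectors)                  # pairwise movie similarity

def recommend(title, top_n=10):
    # Tolerate typos in the user-supplied title
    match = difflib.get_close_matches(title, movies["title"], n=1)
    if not match:
        return []
    idx = movies.index[movies["title"] == match[0]][0]
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    return [movies["title"].iloc[i] for i, _ in scores[1:top_n + 1]]

print(recommend("Avatar"))
```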
This project involves building a content-based movie recommendation system that suggests movies to users based on their preferences and viewing history. Using techniques like cosine similarity and collaborative filtering, I provided personalized movie recommendations. The project showcases the use of recommendation algorithms, feature extraction, and matrix factorization.
The diabetes prediction project begins by importing the necessary libraries, such as pandas and numpy, along with machine learning tools like LogisticRegression and SVC. The dataset is read from a CSV file, and exploratory data analysis (EDA) reveals no missing values, no object-type columns, and some invalid values in certain columns. Age is converted into a categorical variable to make the results easier to interpret.
Next, the dataset is examined for correlations between features, which are found to be weak. The EDA is visualized with bar plots for features such as Pregnancies, Glucose, and BMI, where patterns such as higher pregnancy counts and glucose levels are observed in diabetic individuals. Outliers are handled by capping extreme values for features such as pregnancies, glucose, blood pressure, and insulin.
After preprocessing, a logistic regression model is trained on the processed features. The model achieves an accuracy of around 75%, and hyperparameter tuning with GridSearchCV slightly improves it. Additionally, support vector machine (SVM) and decision tree models are built and optimized using RandomizedSearchCV, each achieving competitive performance metrics.
Finally, hyperparameter tuning for the SVM and decision tree models further refines their performance, and confusion matrices, classification reports, and ROC-AUC scores are calculated to evaluate the predictive accuracy of the models.
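A rough sketch of the capping and tuning steps, assuming Pima-style column names such as Glucose, BloodPressure, and Outcome; the percentile caps and parameter grid are illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("diabetes.csv")                          # Pima-style columns assumed

# Cap extreme values at the 1st/99th percentiles for a few skewed features
for col in ["Pregnancies", "Glucose", "BloodPressure", "Insulin"]:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)

X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=42)

params = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"], "max_iter": [500]}
grid = GridSearchCV(LogisticRegression(), params, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```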
In this project, I created a model to predict the likelihood of diabetes based on patient health data, such as age, BMI, glucose levels, and blood pressure. Multiple classification models were implemented, including Decision Trees, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). Hyperparameter tuning was performed to optimize the model's performance, achieving a balance between precision and recall.
This Audi car price prediction project employs multiple machine learning regression techniques to predict the prices of Audi cars based on features such as model, year, mileage, fuel type, and engine size. The dataset undergoes preprocessing, including label encoding for categorical features like transmission and fuel type, allowing the models to interpret the non-numeric data.
The project begins by splitting the data into training and testing sets and evaluating different regression models. First, Linear Regression is applied to establish a baseline for price prediction. This is followed by the Support Vector Regressor (SVR), which is particularly useful for capturing non-linear relationships within the data. To improve performance, ensemble methods like the Random Forest Regressor and Extra Trees Regressor are also tested. These models combine multiple decision trees to enhance prediction accuracy and better handle variance in the data.
To fine-tune the Random Forest model, both RandomizedSearchCV and GridSearchCV are used for hyperparameter tuning. RandomizedSearchCV provides faster results by sampling a subset of hyperparameter combinations, while GridSearchCV exhaustively tests all combinations for the best result, at the cost of longer computation time. The CatBoost Regressor is also explored as a more specialized method for handling categorical data without extensive preprocessing.
For performance evaluation, metrics such as R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE) are used to compare the models. The project concludes by serializing the final model with Pickle, making it deployable for future predictions without retraining from scratch.
This comprehensive approach demonstrates how different machine learning techniques can be
applied to regression problems, with each method's strengths and limitations highlighted
throughout the process.
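An illustrative version of the Random Forest tuning step; the audi.csv path, column names, and parameter ranges are assumptions based on the common Kaggle used-car dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("audi.csv")                         # path and columns are assumptions
for col in ["model", "transmission", "fuelType"]:
    df[col] = LabelEncoder().fit_transform(df[col])  # label-encode categorical features

X, y = df.drop(columns=["price"]), df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_dist,
                            n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)

preds = search.predict(X_test)
print("R2:", r2_score(y_test, preds), "MAE:", mean_absolute_error(y_test, preds))
```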
In this project, I developed a machine learning model to predict the price of Audi cars based on features such as model, year, mileage, and fuel type. I used multiple regression models and performed hyperparameter tuning to improve prediction accuracy. This project demonstrates the ability to preprocess automotive datasets, engineer features, and optimize models using advanced techniques.
This movie genre classification project employs a combination of Natural Language Processing (NLP) and machine learning techniques to predict a film's genre based on its plot summary. The process begins with data loading using the pandas library, which handles the datasets containing movie descriptions and genres. The datasets are read from text files and structured into DataFrames for easier manipulation and analysis.
Text preprocessing is a critical step in preparing the data for machine learning. The nltk library is used for natural language processing tasks such as tokenization and stopword removal. A custom function cleans the plot summaries by converting text to lowercase, removing URLs and punctuation, and eliminating common English stopwords, thereby retaining only the meaningful words that contribute to the classification task. The cleaned text is stored in a new column within the DataFrame.
To convert the cleaned text into a numerical format that machine learning algorithms can interpret, the code employs the TfidfVectorizer from sklearn. This technique transforms the text data into TF-IDF vectors, capturing the importance of words in relation to the overall corpus. These vectors serve as input features for the classification models.
The models used for classification include Multinomial Naive Bayes, Logistic Regression, and the Support Vector Classifier (SVC), all from the sklearn library. The data is split into training and testing sets using train_test_split, allowing for model training and evaluation. Each model is trained on the training data and then evaluated on the test data, using metrics such as the accuracy score and classification report.
To visualize the performance of each classifier, a pie chart shows the accuracy of each model in predicting the genres. The results are compiled into a new DataFrame that places the predicted genres alongside the actual genres from the test solution dataset, enabling a direct comparison. Finally, the trained TF-IDF vectorizer and the Logistic Regression model are saved to disk with the pickle library for future predictions on new movie descriptions.
Overall, this code provides a comprehensive pipeline for movie genre classification, showcasing the effective use of text preprocessing, feature extraction, and multiple machine learning models to achieve accurate genre predictions based on textual data.
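A compact sketch of the cleaning, vectorization, and model-comparison steps; the train_data.txt file name, ":::" separator, and column order are assumptions about the dataset layout:

```python
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def clean_plot(text):
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)                           # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP)      # drop stopwords

df = pd.read_csv("train_data.txt", sep=":::", engine="python",
                 names=["id", "title", "genre", "plot"])           # layout is an assumption
df["clean_plot"] = df["plot"].astype(str).apply(clean_plot)

X = TfidfVectorizer(max_features=50_000).fit_transform(df["clean_plot"])
X_train, X_test, y_train, y_test = train_test_split(X, df["genre"],
                                                    test_size=0.2, random_state=42)

for name, model in [("Multinomial NB", MultinomialNB()),
                    ("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("SVC", SVC())]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```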
In this project, I built a classification model to predict the genre of a movie based on its plot description. By utilizing natural language processing (NLP) techniques, such as TF-IDF vectorization, and machine learning models like Naive Bayes and Random Forest, I classified movies into various genres. Hyperparameter tuning and cross-validation were used to refine the model's performance.
In this project, credit card fraud detection is achieved through a systematic application of data preprocessing, exploratory data analysis, and machine learning techniques. The dataset consists of two parts: a training dataset with 1,296,675 rows and a test dataset with 555,719 rows, both containing 23 columns. The target variable, `is_fraud`, indicates whether a transaction is fraudulent (1) or genuine (0).
Data Import and Initial Processing
The project begins by importing essential libraries such as pandas, numpy, matplotlib, and seaborn for data manipulation and visualization. It also brings in various machine learning classes from scikit-learn, such as LogisticRegression, KNeighborsClassifier, RandomForestClassifier, and SVC, for building and evaluating models. The datasets are read into DataFrames, and unnecessary columns are dropped to streamline the analysis.
Data Preprocessing
The project performs critical preprocessing steps, including converting date columns from object types to datetime types. It generates new features from the existing data, such as categorizing transactions by the time of day (morning, afternoon, evening, night) and grouping customers by age (Young, Middle_age, Old) based on their date of birth. This enriches the dataset and provides better context for model training.
The project further employs one-hot encoding to convert categorical variables into numerical formats suitable for machine learning algorithms, ensuring the models can interpret them effectively. The dataset is split into training and testing sets using train_test_split, with 70% of the data reserved for training and 30% for testing.
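A hedged sketch of this feature engineering; the column names (trans_date_trans_time, dob) and the bucket boundaries are assumptions:

```python
import pandas as pd

def add_time_and_age_features(df):
    # Dates arrive as object dtype; convert before deriving features
    df["trans_date_trans_time"] = pd.to_datetime(df["trans_date_trans_time"])
    df["dob"] = pd.to_datetime(df["dob"])

    # Bucket the transaction hour into parts of the day
    hour = df["trans_date_trans_time"].dt.hour
    df["time_of_day"] = pd.cut(hour, bins=[-1, 5, 11, 17, 23],
                               labels=["night", "morning", "afternoon", "evening"])

    # Bucket customer age at transaction time into coarse groups
    age = (df["trans_date_trans_time"] - df["dob"]).dt.days // 365
    df["age_group"] = pd.cut(age, bins=[0, 30, 55, 120],
                             labels=["Young", "Middle_age", "Old"])

    # One-hot encode the new categoricals for the models
    return pd.get_dummies(df, columns=["time_of_day", "age_group"])
```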
Model Training and Evaluation
Multiple machine learning models are applied to the training data, including Logistic Regression, K-Neighbors Classifier, Random Forest Classifier, and Support Vector Classifier (SVC). Each model's performance is assessed using metrics such as accuracy, precision, recall, and F1-score, which show how effectively it classifies fraudulent transactions. Notably, the K-Neighbors Classifier outperforms the other models with an accuracy of 99.5%, followed closely by Logistic Regression at over 99.4%.
Hyperparameter tuning is implemented for the Random Forest Classifier using Grid Search to optimize its performance further, although it takes considerably longer to train. The SVC model is also explored, with the data standardized using StandardScaler to improve its training efficiency.
Conclusion and Recommendations
In conclusion, the project demonstrates a comprehensive approach to credit card fraud detection, utilizing a variety of data processing techniques and machine learning models. The high accuracy achieved indicates the models' robustness, making them suitable for practical application in fraud detection systems. For future work, exploring deployment strategies while retaining one-hot encoding, or switching to label encoding, could facilitate integration into web applications and enable real-time fraud detection.
This project emphasizes exploratory data analysis (EDA) and feature engineering for the credit card fraud detection dataset. Here, I visualized transaction patterns, explored class imbalance, and engineered new features to enhance model performance. By cleaning and transforming the dataset, the project showcases the importance of EDA in building robust predictive models.
This project implements a spam detection system using various libraries and techniques to process and analyze textual data. Initially, it employs Pandas to load and manipulate the dataset, which consists of SMS messages labeled as either spam or ham. The data undergoes preprocessing steps, including removing duplicates, renaming columns, and calculating features such as the number of characters, words, and sentences in each message, which helps in understanding the structure of the text.
Text normalization techniques are then applied: all text is converted to lowercase and unwanted characters are removed using regex. Additionally, the code uses NLTK to tokenize the text, eliminate stop words, and apply stemming, which reduces words to their root forms. This transformation results in a cleaner and more meaningful dataset. Following this, CountVectorizer is used to convert the processed text into a numerical format suitable for machine learning models.
For the classification task, the dataset is divided into training and testing sets using train_test_split. The models employed include Gaussian Naive Bayes and Bernoulli Naive Bayes, which are trained on the training dataset and then evaluated on the test set. The code concludes by printing the training and testing accuracy scores for both models, providing insight into their effectiveness in distinguishing between spam and ham messages. Overall, this code represents a complete pipeline for building a spam detection system using natural language processing and machine learning techniques.
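A condensed sketch of this pipeline; the spam.csv file and its v1/v2 column names are assumptions, and only Bernoulli Naive Bayes is shown since it works directly on sparse count features (Gaussian Naive Bayes would additionally need dense arrays):

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]   # column names are assumptions
df.columns = ["label", "text"]
df = df.drop_duplicates()

def normalize(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())                # lowercase, keep letters only
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop)

X = CountVectorizer().fit_transform(df["text"].apply(normalize))
y = (df["label"] == "spam").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = BernoulliNB().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```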
This project aims to classify SMS messages as either spam or legitimate using natural language processing (NLP) and machine learning techniques. I used models such as Naive Bayes and Support Vector Machines, applying hyperparameter tuning to enhance accuracy. The project demonstrates proficiency in handling text data, tokenization, and building a real-time classifier.
This was my first analysis project, focusing on the well-known Iris dataset, which is commonly used for classification tasks in machine learning. The project employs several libraries, including NumPy, Pandas, Matplotlib, and Seaborn, to facilitate data manipulation, visualization, and analysis. The initial steps involve loading the dataset with Seaborn and exploring its structure through methods such as head(), isnull(), and describe(), which provide insights into the data distribution, missing values, and overall shape of the dataset.
After the exploratory data analysis (EDA), the features and target variable are defined. The feature set (X) consists of the four measurements (sepal length, sepal width, petal length, and petal width), while the target variable (y) contains the species of the iris plants. A correlation matrix is generated to assess the relationships between the features and visualized as a heatmap, giving an understanding of how the features interact with each other.
For the modeling phase, the dataset is split into training and testing subsets using train_test_split from Scikit-learn. A Logistic Regression classifier is chosen for this classification task. To optimize the model's hyperparameters, GridSearchCV is employed, which systematically explores combinations of parameters such as penalty type, regularization strength (C), and maximum iterations. The best parameters and the corresponding accuracy score are identified once the model is fitted.
Once trained, predictions are made on the test dataset, and the performance is evaluated using accuracy scores and a classification report that provides detailed metrics for each class. Finally, the code includes an example of making predictions for specific flower measurements, showcasing the model's practical application. This project not only demonstrates fundamental data analysis and modeling techniques but also served as a foundational experience in applying machine learning to real-world data.
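A minimal sketch of this workflow; the parameter grid and the sample measurement are illustrative:

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

iris = sns.load_dataset("iris")
X, y = iris.drop(columns=["species"]), iris["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

params = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"], "max_iter": [200, 500]}
grid = GridSearchCV(LogisticRegression(), params, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))

# Example prediction for one flower (measurement values are illustrative)
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X.columns)
print(grid.predict(sample))
```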
This classic project involves classifying Iris flowers into one of three species based on petal and sepal measurements. I applied multiple machine learning algorithms such as Logistic Regression, SVM, and k-NN to classify the flowers. The project includes EDA, feature scaling, and hyperparameter tuning to achieve optimal accuracy.
In this project, I conducted an analysis of the Titanic dataset to predict passenger survival outcomes using Logistic Regression. The project begins by importing essential libraries such as Pandas, NumPy, Matplotlib, and Seaborn for data manipulation, visualization, and analysis. The dataset is loaded from a CSV file, and initial data exploration is performed using methods like head(), nunique(), and describe() to understand the data structure, unique values, and statistical summaries.
The analysis continues with an investigation of missing values, specifically in the 'Age' column, and an examination of survival counts across passenger classes using Seaborn's countplot. The categorical variable 'Sex' is encoded into numerical values using LabelEncoder to facilitate model training. This is followed by visualizing the survival distribution by sex, which shows how strongly gender influences survival rates.
To handle the missing values in the 'Age' column, the median age is computed and used to fill the gaps, ensuring a complete dataset for modeling. The features selected for prediction are 'Pclass', 'Sex', and 'Age', while the target variable is 'Survived'. The dataset is then split into training and testing sets using train_test_split from Scikit-learn.
The Logistic Regression classifier is trained on the training set, and predictions are made on the test set. The model's performance is evaluated using accuracy scores and a classification report, which provides detailed metrics for understanding the model's effectiveness. Finally, the code includes examples of predicting the survival outcome for specific passengers based on their class, gender, and age, demonstrating the practical application of the model. This project showcases my ability to apply machine learning techniques and highlights my skills in data preprocessing, visualization, and model evaluation in a real-world context.
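A short sketch of these modeling steps, assuming the classic Titanic column names (Pclass, Sex, Age, Survived) and an illustrative titanic.csv path:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("titanic.csv")                         # path and columns are assumptions
df["Age"] = df["Age"].fillna(df["Age"].median())        # fill missing ages with the median
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])     # female -> 0, male -> 1 (alphabetical)

X = df[["Pclass", "Sex", "Age"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

model = LogisticRegression(max_iter=500).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Predict survival for a hypothetical 3rd-class, 25-year-old female passenger
sample = pd.DataFrame([[3, 0, 25.0]], columns=["Pclass", "Sex", "Age"])
print(model.predict(sample))
```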
In this project, I created a predictive model to determine the likelihood of a passenger surviving the Titanic disaster based on features such as age, class, and family size. I used various machine learning algorithms, including Random Forest, Logistic Regression, and Gradient Boosting, along with feature engineering and hyperparameter tuning to increase prediction accuracy.