Hello, I'm Kiran.
Welcome to this short summary, my answer to "Tell me something about yourself".
Let me introduce myself without taking much of your time.
My name is Kiran Kumar Sahu, and I am pursuing a Master's in Computer Applications with an AI/ML specialization.
Photography is my hobby, but to pursue it I am looking for a job that also does justice to my skills. This page is my resume, covering the details of my academic journey, and I hope you will check out the skills and projects listed below. I am also active on LinkedIn, where you can find me and my certifications.
Thank you
I created a dashboard using Excel and SQL to analyze road accidents, categorized by casualty severity (fatal, serious, slight), vehicle types, road and surface types, locations (urban or rural), lighting conditions, and date-based trends. The dashboard also compares current year vs previous year casualty data.
This dashboard uses Excel and SQL to analyze various aspects of road accidents. The data is categorized by the severity of casualties, including fatal, serious, and slight casualties, providing insight into how different types of accidents affect overall safety.
The dashboard also breaks down casualties by vehicle types, covering cars and a wide range of
other vehicles, allowing for a detailed comparison of how each contributes to accident
statistics. It includes a comparison of current year vs previous year data to track changes and
improvements over time.
Another key element of the analysis is the categorization of accidents by road type (e.g.,
highways, local roads) and surface type (e.g., wet, dry), which helps in understanding the
environmental factors contributing to accidents. The dashboard also distinguishes between
accidents occurring in urban vs rural areas, shedding light on how location impacts casualty
rates.
I also analyzed the effect of lighting conditions on accident severity, comparing accidents that
occurred in daylight vs dark. In addition, the dashboard allows for date-based analysis,
enabling the user to track accident trends over specific periods.
This comprehensive analysis provides a clear and detailed view of road safety trends and helps
identify key areas for improvement in road and vehicle safety measures.
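The dashboard itself was built in Excel with SQL; purely as an illustration of the severity breakdown and year-over-year comparison, here is a hedged pandas sketch with assumed column names (accident_year, casualty_severity):

```python
import pandas as pd

# Illustrative only: the real dashboard performs this aggregation in SQL/Excel.
accidents = pd.read_csv("road_accidents.csv")   # assumed columns: accident_year, casualty_severity

# Casualty counts by severity, compared for the current vs previous year
severity_by_year = (accidents
                    .groupby(["accident_year", "casualty_severity"])
                    .size()
                    .unstack(fill_value=0))
current = severity_by_year.index.max()
previous = current - 1
print(severity_by_year.loc[[previous, current]])
```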
I created a dashboard for a computer shop that sells a wide range of components, from budget to high-end, for both gaming and productivity. The dashboard revealed that monitors, CPUs, and graphics cards are the most profitable components; it also compares supervisor performance, highlights the top states by sales and profit, and features an animated graph showing trends over time.
I developed a detailed dashboard for a computer shop that sells a comprehensive range of
computer-related components, including gaming and productivity-focused items. The products range
from budget-friendly options to high-end components, covering everything from A to Z in the
computer world. Through the data, I discovered that monitors, CPUs, and graphics cards generate
significantly more profit compared to other components, providing valuable insights into the
shop's top-performing products.
Additionally, I included a comparison of supervisor performance, showing which supervisors
contributed the most profit, and this analysis can be broken down by year and month for a more
detailed view. The dashboard also highlights the top-performing states in terms of sales,
profit, and specific product categories, giving the store a clear view of geographical
performance.
One of the standout features of this dashboard is the animated graph I created, which visualizes
sales and profit trends over time, providing a dynamic way to track changes and performance
across different periods. This dashboard offers a clear, data-driven picture of the store's
operations and helps inform future sales and marketing strategies.
I developed a dashboard for a hardware shop that analyzes total sales, quantity sold, top regions, markets, customer segments, and product types. The dashboard visualizes trends based on year and month for effective decision-making.
I developed a comprehensive dashboard for a hardware store that specializes in products such as
brick, mortar, sand, and various other hardware items. The goal of the dashboard was to provide
actionable insights by leveraging the store's sales data. It analyzes key performance indicators
such as total sales, total quantity sold, and how these metrics vary across different regions
and markets. The dashboard also differentiates between distributed products and the store's own
products, allowing for a more granular understanding of which product types are performing
better.
One of the key features of the dashboard is its ability to identify top-performing regions and
markets, giving insights into where the shop is most successful. I also included a sales
breakdown by customer, which provides information on customer buying behavior, allowing the
store to identify its most valuable customers.
Furthermore, the dashboard enables the visualization of data based on specific time periods,
including by year and month, making it easy to spot trends and seasonal patterns in sales. This
time-based analysis helps the shop make better decisions regarding inventory management,
marketing strategies, and targeting specific markets during peak periods.
By providing clear and interactive visualizations, the dashboard offers the store a powerful
tool to understand its performance at a detailed level. This ultimately helps in optimizing
business strategies, improving product offerings, and focusing efforts on areas that generate
the most value for the business.
This project is a facial attendance system built with Flask, TensorFlow, and OpenCV. It captures faces through a webcam, recognizes them with a trained deep learning model, and records attendance. The system lets users register and mark their attendance through a GUI, stores each registered user's face data, and flags unregistered faces instead of marking them. A MobileNetV2-based model performs the face recognition, and attendance is saved to a CSV file.
This facial attendance system is designed for real-time face detection, recognition, and
attendance marking using deep learning and web technologies. The system is built using the Flask
web framework for the user interface, TensorFlow/Keras for deep learning model development, and
OpenCV for image and video capture.
Key components of the system include:
Face Detection:
The system uses OpenCV's CascadeClassifier to detect faces from
webcam input. It ensures real-time face detection using Haar cascades and captures images when
users interact with the system.
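A minimal sketch of such a Haar-cascade detection loop, assuming the default webcam and OpenCV's bundled frontal-face cascade; the scale factor and neighbor count are illustrative:

```python
import cv2

# OpenCV ships a set of Haar cascades; load the frontal-face one
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detectMultiScale returns (x, y, w, h) boxes for each detected face
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("Face detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```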
Face Recognition Model:
The face recognition model is based on the MobileNetV2
architecture, which is pretrained on the ImageNet dataset. The model is fine-tuned for the task
of facial recognition. It outputs a softmax-activated vector indicating the recognized user. The
model is trained on resized face images, and a dropout layer is added for regularization to
prevent overfitting.
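A hedged Keras sketch of a MobileNetV2-based classifier of this kind; the input size (224x224), dropout rate, and optimizer are assumptions rather than the project's exact settings:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

def build_face_model(num_users, img_size=(224, 224)):
    # MobileNetV2 pretrained on ImageNet, without its classification head
    base = MobileNetV2(input_shape=img_size + (3,), include_top=False, weights="imagenet")
    base.trainable = False  # transfer learning: freeze the convolutional base

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.3),                            # regularization against overfitting
        layers.Dense(num_users, activation="softmax"),  # one class per registered user
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```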
Training and Retraining:
If new faces are added, the system supports retraining by
either modifying the output layer to accommodate new classes (new users) or by creating a new
model with MobileNetV2 as the base. Transfer learning is employed, where only the last few
layers are retrained for faster training and better generalization.
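One way such a head replacement could look in Keras, assuming the Sequential layout sketched above; the project may instead rebuild a fresh MobileNetV2-based model, as described:

```python
import tensorflow as tf

def extend_for_new_users(model, new_num_users):
    # Drop the old softmax head and attach a wider one for the new user count.
    # Assumes the Sequential layout from build_face_model() above.
    model.pop()
    model.add(tf.keras.layers.Dense(new_num_users, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```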
Attendance Marking:
Once a face is recognized, the system logs the user's name,
roll number, and the time of recognition into a CSV file named after the current date. The
system also checks for unidentified users using a confidence threshold to avoid marking
attendance for unregistered faces.
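An illustrative helper for this kind of CSV logging; the Attendance-<date>.csv naming scheme and the 0.80 confidence threshold are assumptions, not the project's exact values:

```python
import csv
import os
from datetime import datetime

CONFIDENCE_THRESHOLD = 0.80  # below this, the face is treated as unregistered

def mark_attendance(name, roll_no, confidence):
    if confidence < CONFIDENCE_THRESHOLD:
        return False  # do not log unidentified users
    today = datetime.now().strftime("%Y-%m-%d")
    path = f"Attendance/Attendance-{today}.csv"
    os.makedirs("Attendance", exist_ok=True)
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["Name", "Roll", "Time"])
        writer.writerow([name, roll_no, datetime.now().strftime("%H:%M:%S")])
    return True
```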
Flask-based GUI:
The project includes a Flask-based web interface where users can
interact with the system, view attendance logs, and register their presence. The system uses
HTML templates to render the attendance page and show live video feed for real-time face
detection.
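A minimal Flask route in the spirit of this interface; the home.html template name and the CSV location reuse the assumptions from the logging sketch above:

```python
import os
from datetime import datetime

import pandas as pd
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def home():
    # Read today's attendance CSV (same naming scheme as the logging sketch above)
    today = datetime.now().strftime("%Y-%m-%d")
    path = f"Attendance/Attendance-{today}.csv"
    if os.path.exists(path):
        records = pd.read_csv(path)
    else:
        records = pd.DataFrame(columns=["Name", "Roll", "Time"])
    # "home.html" is an assumed template rendering the attendance table
    return render_template("home.html", records=records.to_dict("records"),
                           total=len(records))

if __name__ == "__main__":
    app.run(debug=True)
```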
Face Registration and Attendance:
The project ensures that all registered users'
face images are saved in a structured folder format, and attendance is recorded in a CSV file
specific to the current date. The system can extract attendance data for a particular day and
display the users who have marked their presence, along with the time.
In this project for credit card fraud detection, several tools and techniques are employed to preprocess, analyze, and model the dataset. The dataset is first loaded using pandas, and exploratory data analysis is performed with matplotlib and seaborn to visualize the distributions of transaction times and amounts. The boxcox transformation from the scipy library is used to normalize the skewed transaction amount data.
The SGDClassifier from scikit-learn is used for classification, with GridSearchCV applied for hyperparameter tuning, specifically optimizing the loss, penalty, and alpha parameters. Metrics such as the accuracy score, classification report, and confusion matrix are used to evaluate the model's performance. Finally, pickle is employed to save and reload the trained model for future predictions.
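A rough sketch of this pipeline under assumed column names (Amount, Class, as in the common Kaggle credit card dataset); the file paths and parameter grid are illustrative:

```python
import pickle

import pandas as pd
from scipy.stats import boxcox
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("creditcard.csv")                      # path and columns are assumptions
df["Amount"], _ = boxcox(df["Amount"] + 1)              # +1 because boxcox requires positive values

X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

param_grid = {
    "loss": ["hinge", "log_loss"],
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))

with open("fraud_sgd.pkl", "wb") as f:                  # persist the tuned model
    pickle.dump(search.best_estimator_, f)
```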
This project focuses on identifying fraudulent credit card transactions using machine learning techniques, training models to detect anomalies and fraudulent behavior with both supervised and unsupervised learning methods. Various classification algorithms, such as Logistic Regression, Random Forest, and XGBoost, were used alongside hyperparameter tuning to achieve high precision and recall scores.
In this movie recommendation system project, various tools and techniques are used to recommend movies based on multiple parameters such as genres, keywords, cast, director, and more. The dataset is first preprocessed using pandas, with missing values filled appropriately. TF-IDF (Term Frequency-Inverse Document Frequency) from the sklearn library is used to convert text data (movie attributes) into feature vectors, capturing the importance of words within the dataset.
Cosine similarity is then applied to measure the similarity between the feature vectors, which helps identify the movies most similar to the one entered by the user. Difflib is used to find close matches for the user-provided movie title, ensuring the search is accurate. Finally, the most similar movies are displayed by sorting the similarity scores and presenting the top recommendations.
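A condensed sketch of this flow, assuming a movies.csv with title, genres, keywords, cast, and director columns:

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies = pd.read_csv("movies.csv")                       # file and column names are assumptions
features = ["genres", "keywords", "cast", "director"]
combined = movies[features].fillna("").agg(" ".join, axis=1)

vectors = TfidfVectorizer().fit_transform(combined)      # text -> TF-IDF feature vectors
similarity = cosine_similarity(vectors)                  # pairwise movie similarity

def recommend(title, top_n=10):
    # Tolerate typos in the user-supplied title
    match = difflib.get_close_matches(title, movies["title"], n=1)
    if not match:
        return []
    idx = movies.index[movies["title"] == match[0]][0]
    scores = sorted(enumerate(similarity[idx]), key=lambda x: x[1], reverse=True)
    return [movies["title"].iloc[i] for i, _ in scores[1:top_n + 1]]

print(recommend("Avatar"))
```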
This project involves building a content-based movie recommendation system that suggests movies to users based on their preferences and viewing history. Using techniques like cosine similarity and collaborative filtering, I provided personalized movie recommendations. The project showcases the use of recommendation algorithms, feature extraction, and matrix factorization.
The diabetes prediction project begins by importing the necessary libraries, such as pandas and numpy, along with machine learning tools like LogisticRegression and SVC. The dataset is read from a CSV file, and exploratory data analysis (EDA) reveals no missing values, no object-type columns, and some invalid values in certain columns. Age is converted into a categorical variable to make the results easier to interpret.
Next, the dataset is examined for correlations between features, which are found to be weak. The EDA is visualized with bar plots for features such as Pregnancies, Glucose, and BMI, where patterns such as higher pregnancy counts and glucose levels are observed in diabetic individuals. Outliers are handled by capping extreme values for features such as pregnancies, glucose, blood pressure, and insulin.
After preprocessing, a logistic regression model is trained on the processed features. The model achieves an accuracy of around 75%, and hyperparameter tuning with GridSearchCV slightly improves it. Additionally, support vector machine (SVM) and decision tree models are built and optimized using RandomizedSearchCV, each achieving competitive performance metrics.
Finally, hyperparameter tuning for the SVM and decision tree models further refines their performance, and confusion matrices, classification reports, and ROC-AUC scores are calculated to evaluate the predictive accuracy of the models.
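A rough sketch of the capping and tuning steps, assuming Pima-style column names such as Glucose, BloodPressure, and Outcome; the percentile caps and parameter grid are illustrative:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("diabetes.csv")                          # Pima-style columns assumed

# Cap extreme values at the 1st/99th percentiles for a few skewed features
for col in ["Pregnancies", "Glucose", "BloodPressure", "Insulin"]:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)

X, y = df.drop(columns=["Outcome"]), df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=42)

params = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"], "max_iter": [500]}
grid = GridSearchCV(LogisticRegression(), params, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```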
In this project, I created a model to predict the likelihood of diabetes based on patient health data, such as age, BMI, glucose levels, and blood pressure. Multiple classification models were implemented, including Decision Trees, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). Hyperparameter tuning was performed to optimize the model's performance, achieving a balance between precision and recall.
This Audi car price prediction project employs multiple machine learning regression techniques to predict the prices of Audi cars based on features such as model, year, mileage, fuel type, and engine size. The dataset undergoes preprocessing, including label encoding for categorical features like transmission and fuel type, allowing the models to interpret the non-numeric data.
The project begins by splitting the data into training and testing sets and evaluating different regression models. First, Linear Regression is applied to establish a baseline for price prediction. This is followed by the Support Vector Regressor (SVR), which is particularly useful for capturing non-linear relationships within the data. To improve performance, ensemble methods like the Random Forest Regressor and Extra Trees Regressor are also tested. These models combine multiple decision trees to enhance prediction accuracy and better handle variance in the data.
To fine-tune the Random Forest model, both RandomizedSearchCV and GridSearchCV are used for hyperparameter tuning. RandomizedSearchCV provides faster results by sampling a subset of hyperparameter combinations, while GridSearchCV exhaustively tests all combinations for the best result, at the cost of longer computation time. The CatBoost Regressor is also explored as a more specialized method for handling categorical data without extensive preprocessing.
For performance evaluation, metrics such as R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE) are used to compare the models. The project concludes by serializing the final model with Pickle, making it deployable for future predictions without retraining from scratch.
This comprehensive approach demonstrates how different machine learning techniques can be
applied to regression problems, with each method's strengths and limitations highlighted
throughout the process.
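An illustrative version of the Random Forest tuning step; the audi.csv path, column names, and parameter ranges are assumptions based on the common Kaggle used-car dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("audi.csv")                         # path and columns are assumptions
for col in ["model", "transmission", "fuelType"]:
    df[col] = LabelEncoder().fit_transform(df[col])  # label-encode categorical features

X, y = df.drop(columns=["price"]), df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_dist,
                            n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)

preds = search.predict(X_test)
print("R2:", r2_score(y_test, preds), "MAE:", mean_absolute_error(y_test, preds))
```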
In this project, I developed a machine learning model to predict the price of Audi cars based on features such as model, year, mileage, and fuel type. I used multiple regression models and performed hyperparameter tuning to improve prediction accuracy. This project demonstrates the ability to preprocess automotive datasets, engineer features, and optimize models using advanced techniques.
This movie genre classification project employs a combination of Natural Language Processing (NLP) and machine learning techniques to predict a film's genre based on its plot summary. The process begins with data loading using the pandas library, which handles the datasets containing movie descriptions and genres. The datasets are read from text files and structured into DataFrames for easier manipulation and analysis.
Text preprocessing is a critical step in preparing the data for machine learning. The nltk library is used for natural language processing tasks such as tokenization and stopword removal. A custom function cleans the plot summaries by converting text to lowercase, removing URLs and punctuation, and eliminating common English stopwords, thereby retaining only the meaningful words that contribute to the classification task. The cleaned text is stored in a new column within the DataFrame.
To convert the cleaned text into a numerical format that machine learning algorithms can interpret, the code employs the TfidfVectorizer from sklearn. This technique transforms the text data into TF-IDF vectors, capturing the importance of words in relation to the overall corpus. These vectors serve as input features for the classification models.
The models used for classification include Multinomial Naive Bayes, Logistic Regression, and the Support Vector Classifier (SVC), all from the sklearn library. The data is split into training and testing sets using train_test_split, allowing for model training and evaluation. Each model is trained on the training data and then evaluated on the test data, using metrics such as the accuracy score and classification report.
To visualize the performance of each classifier, a pie chart shows the accuracy of each model in predicting the genres. The results are compiled into a new DataFrame that places the predicted genres alongside the actual genres from the test solution dataset, enabling a direct comparison. Finally, the trained TF-IDF vectorizer and the Logistic Regression model are saved to disk with the pickle library for future predictions on new movie descriptions.
Overall, this code provides a comprehensive pipeline for movie genre classification, showcasing the effective use of text preprocessing, feature extraction, and multiple machine learning models to achieve accurate genre predictions based on textual data.
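A compact sketch of the cleaning, vectorization, and model-comparison steps; the train_data.txt file name, ":::" separator, and column order are assumptions about the dataset layout:

```python
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def clean_plot(text):
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)                           # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP)      # drop stopwords

df = pd.read_csv("train_data.txt", sep=":::", engine="python",
                 names=["id", "title", "genre", "plot"])           # layout is an assumption
df["clean_plot"] = df["plot"].astype(str).apply(clean_plot)

X = TfidfVectorizer(max_features=50_000).fit_transform(df["clean_plot"])
X_train, X_test, y_train, y_test = train_test_split(X, df["genre"],
                                                    test_size=0.2, random_state=42)

for name, model in [("Multinomial NB", MultinomialNB()),
                    ("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("SVC", SVC())]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```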
In this project, I built a classification model to predict the genre of a movie based on its plot description. By utilizing natural language processing (NLP) techniques, such as TF-IDF vectorization, and machine learning models like Naive Bayes and Random Forest, I classified movies into various genres. Hyperparameter tuning and cross-validation were used to refine the model's performance.
In this project, credit card fraud detection is achieved through a systematic application of data preprocessing, exploratory data analysis, and machine learning techniques. The dataset consists of two parts: a training dataset with 1,296,675 rows and a test dataset with 555,719 rows, both containing 23 columns. The target variable, `is_fraud`, indicates whether a transaction is fraudulent (1) or genuine (0).
Data Import and Initial Processing
The project begins by importing essential libraries such as pandas, numpy, matplotlib, and seaborn for data manipulation and visualization. It also brings in various machine learning classes from scikit-learn, such as LogisticRegression, KNeighborsClassifier, RandomForestClassifier, and SVC, for building and evaluating models. The datasets are read into DataFrames, and unnecessary columns are dropped to streamline the analysis.
Data Preprocessing
The project performs critical preprocessing steps, including converting date columns from object types to datetime types. It generates new features from the existing data, such as categorizing transactions by the time of day (morning, afternoon, evening, night) and grouping customers by age (Young, Middle_age, Old) based on their date of birth. This enriches the dataset and provides better context for model training.
The project further employs one-hot encoding to convert categorical variables into numerical formats suitable for machine learning algorithms, ensuring the models can interpret them effectively. The dataset is split into training and testing sets using train_test_split, with 70% of the data reserved for training and 30% for testing.
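A hedged sketch of this feature engineering; the column names (trans_date_trans_time, dob) and the bucket boundaries are assumptions:

```python
import pandas as pd

def add_time_and_age_features(df):
    # Dates arrive as object dtype; convert before deriving features
    df["trans_date_trans_time"] = pd.to_datetime(df["trans_date_trans_time"])
    df["dob"] = pd.to_datetime(df["dob"])

    # Bucket the transaction hour into parts of the day
    hour = df["trans_date_trans_time"].dt.hour
    df["time_of_day"] = pd.cut(hour, bins=[-1, 5, 11, 17, 23],
                               labels=["night", "morning", "afternoon", "evening"])

    # Bucket customer age at transaction time into coarse groups
    age = (df["trans_date_trans_time"] - df["dob"]).dt.days // 365
    df["age_group"] = pd.cut(age, bins=[0, 30, 55, 120],
                             labels=["Young", "Middle_age", "Old"])

    # One-hot encode the new categoricals for the models
    return pd.get_dummies(df, columns=["time_of_day", "age_group"])
```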
Model Training and Evaluation
Multiple machine learning models are applied to the training data, including Logistic Regression, K-Neighbors Classifier, Random Forest Classifier, and Support Vector Classifier (SVC). Each model's performance is assessed using metrics such as accuracy, precision, recall, and F1-score, which show how effectively it classifies fraudulent transactions. Notably, the K-Neighbors Classifier outperforms the other models with an accuracy of 99.5%, followed closely by Logistic Regression at over 99.4%.
Hyperparameter tuning is implemented for the Random Forest Classifier using Grid Search to optimize its performance further, although it takes considerably longer to train. The SVC model is also explored, with the data standardized using StandardScaler to improve its training efficiency.
Conclusion and Recommendations
In conclusion, the project demonstrates a comprehensive approach to credit card fraud detection, utilizing a variety of data processing techniques and machine learning models. The high accuracy achieved indicates the models' robustness, making them suitable for practical application in fraud detection systems. For future work, exploring deployment strategies while retaining one-hot encoding, or switching to label encoding, could facilitate integration into web applications and enable real-time fraud detection.
This project emphasizes exploratory data analysis (EDA) and feature engineering for the credit card fraud detection dataset. Here, I visualized transaction patterns, explored class imbalance, and engineered new features to enhance model performance. By cleaning and transforming the dataset, the project showcases the importance of EDA in building robust predictive models.
This project implements a spam detection system using various libraries and techniques to process and analyze textual data. Initially, it employs Pandas to load and manipulate the dataset, which consists of SMS messages labeled as either spam or ham. The data undergoes preprocessing steps, including removing duplicates, renaming columns, and calculating features such as the number of characters, words, and sentences in each message, which helps in understanding the structure of the text.
Text normalization techniques are then applied: all text is converted to lowercase and unwanted characters are removed using regex. Additionally, the code uses NLTK to tokenize the text, eliminate stop words, and apply stemming, which reduces words to their root forms. This transformation results in a cleaner and more meaningful dataset. Following this, CountVectorizer is used to convert the processed text into a numerical format suitable for machine learning models.
For the classification task, the dataset is divided into training and testing sets using train_test_split. The models employed include Gaussian Naive Bayes and Bernoulli Naive Bayes, which are trained on the training dataset and then evaluated on the test set. The code concludes by printing the training and testing accuracy scores for both models, providing insight into their effectiveness in distinguishing between spam and ham messages. Overall, this code represents a complete pipeline for building a spam detection system using natural language processing and machine learning techniques.
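A condensed sketch of this pipeline; the spam.csv file and its v1/v2 column names are assumptions, and only Bernoulli Naive Bayes is shown since it works directly on sparse count features (Gaussian Naive Bayes would additionally need dense arrays):

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]   # column names are assumptions
df.columns = ["label", "text"]
df = df.drop_duplicates()

def normalize(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())                # lowercase, keep letters only
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop)

X = CountVectorizer().fit_transform(df["text"].apply(normalize))
y = (df["label"] == "spam").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = BernoulliNB().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```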
This project aims to classify SMS messages as either spam or legitimate using natural language processing (NLP) and machine learning techniques. I used models such as Naive Bayes and Support Vector Machines, applying hyperparameter tuning to enhance accuracy. The project demonstrates proficiency in handling text data, tokenization, and building a real-time classifier.
This was my first analysis project, focusing on the well-known Iris dataset, which is commonly used for classification tasks in machine learning. The project employs several libraries, including NumPy, Pandas, Matplotlib, and Seaborn, to facilitate data manipulation, visualization, and analysis. The initial steps involve loading the dataset with Seaborn and exploring its structure through methods such as head(), isnull(), and describe(), which provide insights into the data distribution, missing values, and overall shape of the dataset.
After the exploratory data analysis (EDA), the features and target variable are defined. The feature set (X) consists of the four measurements (sepal length, sepal width, petal length, and petal width), while the target variable (y) contains the species of the iris plants. A correlation matrix is generated to assess the relationships between the features and visualized as a heatmap, giving an understanding of how the features interact with each other.
For the modeling phase, the dataset is split into training and testing subsets using train_test_split from Scikit-learn. A Logistic Regression classifier is chosen for this classification task. To optimize the model's hyperparameters, GridSearchCV is employed, which systematically explores combinations of parameters such as penalty type, regularization strength (C), and maximum iterations. The best parameters and the corresponding accuracy score are identified once the model is fitted.
Once trained, predictions are made on the test dataset, and the performance is evaluated using accuracy scores and a classification report that provides detailed metrics for each class. Finally, the code includes an example of making predictions for specific flower measurements, showcasing the model's practical application. This project not only demonstrates fundamental data analysis and modeling techniques but also served as a foundational experience in applying machine learning to real-world data.
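A minimal sketch of this workflow; the parameter grid and the sample measurement are illustrative:

```python
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

iris = sns.load_dataset("iris")
X, y = iris.drop(columns=["species"]), iris["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

params = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"], "max_iter": [200, 500]}
grid = GridSearchCV(LogisticRegression(), params, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))

# Example prediction for one flower (measurement values are illustrative)
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X.columns)
print(grid.predict(sample))
```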
This classic project involves classifying Iris flowers into one of three species based on petal and sepal measurements. I applied multiple machine learning algorithms such as Logistic Regression, SVM, and k-NN to classify the flowers. The project includes EDA, feature scaling, and hyperparameter tuning to achieve optimal accuracy.
In this project, I conducted an analysis of the Titanic dataset to predict passenger survival outcomes using Logistic Regression. The project begins by importing essential libraries such as Pandas, NumPy, Matplotlib, and Seaborn for data manipulation, visualization, and analysis. The dataset is loaded from a CSV file, and initial data exploration is performed using methods like head(), nunique(), and describe() to understand the data structure, unique values, and statistical summaries.
The analysis continues with an investigation of missing values, specifically in the 'Age' column, and an examination of survival counts across passenger classes using Seaborn's countplot. The categorical variable 'Sex' is encoded into numerical values using LabelEncoder to facilitate model training. This is followed by visualizing the survival distribution by sex, which shows how strongly gender influences survival rates.
To handle the missing values in the 'Age' column, the median age is computed and used to fill the gaps, ensuring a complete dataset for modeling. The features selected for prediction are 'Pclass', 'Sex', and 'Age', while the target variable is 'Survived'. The dataset is then split into training and testing sets using train_test_split from Scikit-learn.
The Logistic Regression classifier is trained on the training set, and predictions are made on the test set. The model's performance is evaluated using accuracy scores and a classification report, which provides detailed metrics for understanding the model's effectiveness. Finally, the code includes examples of predicting the survival outcome for specific passengers based on their class, gender, and age, demonstrating the practical application of the model. This project showcases my ability to apply machine learning techniques and highlights my skills in data preprocessing, visualization, and model evaluation in a real-world context.
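A short sketch of these modeling steps, assuming the classic Titanic column names (Pclass, Sex, Age, Survived) and an illustrative titanic.csv path:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("titanic.csv")                         # path and columns are assumptions
df["Age"] = df["Age"].fillna(df["Age"].median())        # fill missing ages with the median
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])     # female -> 0, male -> 1 (alphabetical)

X = df[["Pclass", "Sex", "Age"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

model = LogisticRegression(max_iter=500).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Predict survival for a hypothetical 3rd-class, 25-year-old female passenger
sample = pd.DataFrame([[3, 0, 25.0]], columns=["Pclass", "Sex", "Age"])
print(model.predict(sample))
```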
In this project, I created a predictive model to determine the likelihood of a passenger surviving the Titanic disaster based on features such as age, class, and family size. I used various machine learning algorithms, including Random Forest, Logistic Regression, and Gradient Boosting, along with feature engineering and hyperparameter tuning to increase prediction accuracy.