MACHINE LEARNING
Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such
algorithms operate by building a model based on inputs and using that to
make predictions or decisions, rather than following only explicitly programmed
instructions.
Deploying a machine learning model typically takes the
following five steps:
1. Data collection.
2. Data preprocessing:
   a. Data cleaning;
   b. Data transformation;
   c. Divide data into training and testing sets.
3. Model Building: Build a model on training data.
4. Model Evaluation: Evaluate the model on the test data.
5. If the performance is satisfactory, deploy to the real system.
This process can be iterative, meaning we can restart from step 1. For example, after a model is deployed, we can collect new data and repeat the process. Let’s look at the details of each step:
1. Data Collection:
At this stage, we want to collect all relevant data. For an online business, user clicks, search queries, and browsing information should all be captured and saved into the database.
In manufacturing, log data capture machine status and activities. Such data are used to produce maintenance schedules and predict required parts for replacement.
2. Data Preprocessing:
The data used in machine learning describe factors, attributes, or features of an observation. A simple first step in looking at the data is to find missing values. What is the significance of a missing value? Would replacing it with the median value for the feature be acceptable? For example, perhaps a person filling out a questionnaire doesn't want to reveal their salary. This could be because the person has a very low salary or a very high salary. In this case, using other features to predict the missing salary might be appropriate; one might infer the salary from the person’s zip code. The fact that the value is missing may itself be important. There are also machine learning methods that ignore missing values, and one of these could be used for this data set.
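The median replacement mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `impute_with_median` and the salary figures are made up for the example, and the extra list of "was missing" flags is one possible way to keep the information that a value was absent.

```python
from statistics import median

def impute_with_median(values):
    """Replace None entries with the median of the observed values.

    Returns the filled list plus a parallel list of flags marking which
    entries were originally missing, so that fact is not thrown away.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Hypothetical questionnaire salaries; None marks a skipped answer.
salaries = [42000, None, 58000, 51000, None, 250000]
filled, was_missing = impute_with_median(salaries)
print(filled)        # [42000, 54500.0, 58000, 51000, 54500.0, 250000]
print(was_missing)   # [False, True, False, False, True, False]
```

The median is used here rather than the mean because a single very high salary (250000 in the example) would pull the mean upward.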
Data Transformation:
In general we work with both numerical and categorical data. Numerical data consist of actual numbers, while categorical data take one of a few discrete values. Examples of categorical data include eye color, species type, marital status, or gender. A zip code is also categorical: it is written as a number, but adding two zip codes has no meaning. Categorical data may or may not have an order. For instance, good, better, best is categorical data that has an order.
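One common way to turn these two kinds of categorical data into numbers is sketched below: indicator (one-hot) columns for unordered categories such as zip codes, and a rank for ordered ones such as good, better, best. The helper names `one_hot` and `ordinal` and the sample zip codes are assumptions made for this example, not part of any particular library.

```python
def one_hot(value, categories):
    """Encode an unordered category as a list of 0/1 indicators."""
    return [1 if value == c else 0 for c in categories]

def ordinal(value, ordered_levels):
    """Encode an ordered category as its position in ordered_levels."""
    return ordered_levels.index(value)

# Hypothetical zip codes, kept as strings because arithmetic on them is meaningless.
zip_codes = ["15213", "94305", "15213"]
zip_levels = sorted(set(zip_codes))          # ['15213', '94305']
encoded_zips = [one_hot(z, zip_levels) for z in zip_codes]
print(encoded_zips)                          # [[1, 0], [0, 1], [1, 0]]

# Ordered categories keep their order as 0 < 1 < 2.
rating_levels = ["good", "better", "best"]
ratings = ["better", "good", "best"]
encoded_ratings = [ordinal(r, rating_levels) for r in ratings]
print(encoded_ratings)                       # [1, 0, 2]
```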
Splitting the data:
After the data have been cleaned and transformed, they need to be split into a training set and a test set.
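A minimal sketch of such a split, using only the Python standard library, might look like the following. The function name `split_train_test`, the 25% test fraction, and the fixed seed are illustrative choices rather than recommendations from the text.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))                       # stand-in for 20 observations
train, test = split_train_test(rows)
print(len(train), len(test))                 # 15 5
```

Shuffling before splitting matters when the rows are stored in some meaningful order (for example by date), since otherwise the test set would not resemble the training set.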
3. Model Building:
The training data set is used to create the model, which is then used to predict the answers for new cases in which the answer or target is unknown. Several different modeling techniques have been introduced and will be discussed in detail in future sections. Various models can be built using the same training data set.
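To illustrate building different models from the same training data, here is a small sketch with two deliberately simple models: a baseline that always predicts the most common target, and a 1-nearest-neighbor model. Both are stand-ins chosen for brevity, not the techniques the later sections cover, and the training rows are invented for the example.

```python
from collections import Counter

def fit_majority(train):
    """Baseline model: always predict the most common training target."""
    most_common = Counter(target for _, target in train).most_common(1)[0][0]
    return lambda features: most_common

def fit_nearest_neighbor(train):
    """1-nearest-neighbor: predict the target of the closest training row."""
    def predict(features):
        def squared_distance(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], features))
        return min(train, key=squared_distance)[1]
    return predict

# Hypothetical training rows: ((height_cm, weight_kg), target)
train = [((150, 50), "small"), ((155, 55), "small"), ((160, 58), "small"),
         ((180, 85), "large"), ((185, 90), "large")]

majority_model = fit_majority(train)
nn_model = fit_nearest_neighbor(train)

new_case = (181, 86)                         # target unknown
print(majority_model(new_case))              # small
print(nn_model(new_case))                    # large
```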
4. Model Evaluation:
Once the model is built with the training data, it is used to predict the targets for the test data. First the target values are removed from the test data set. The model is then applied to the test data set to predict the target values. Each predicted value is compared with the actual target value. The accuracy of the model is the percentage of correct predictions made. These accuracies can be used to compare the different models.
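The evaluation procedure just described can be sketched as follows. The two models and the four test rows are hypothetical placeholders; only the `accuracy` calculation (percentage of correct predictions) comes from the text.

```python
def accuracy(predicted, actual):
    """Percentage of predictions that match the actual targets."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical test set: (features, target)
test_rows = [((1.0,), "yes"), ((2.0,), "no"), ((3.0,), "yes"), ((4.0,), "yes")]
features = [f for f, _ in test_rows]         # targets removed before predicting
actual = [t for _, t in test_rows]

def model_a(x):
    return "yes"                             # always predicts "yes"

def model_b(x):
    return "yes" if x[0] > 1.5 else "no"     # simple threshold rule

acc_a = accuracy([model_a(f) for f in features], actual)
acc_b = accuracy([model_b(f) for f in features], actual)
print(acc_a, acc_b)                          # 75.0 50.0
```

On this toy test set, model_a would be preferred because it gets three of the four predictions right.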
5. Model Deployment:
This is the most important step. If the speed and accuracy of the model are acceptable, then that model should be deployed in the real system. The model that is used in production should be built with all the available data, since models generally improve as more data are used to create them. The results of the model need to be incorporated into the business strategy. Data mining models provide valuable information that can give companies a significant advantage.
Real World Applications of Machine Learning
1. Speech Recognition
2. Computer Vision
3. Bio-surveillance
4. Robot Control
5. Accelerating Empirical Sciences
Speech Recognition
Currently available commercial systems for speech recognition all use machine learning in one fashion or another to train the system to recognize speech. The reason is simple: speech recognition accuracy is greater if one trains the system than if one attempts to program it by hand. In fact, many commercial speech recognition systems involve two distinct learning phases: one before the software is shipped (training the general system in a speaker-independent fashion), and a second after the user purchases the software (to achieve greater accuracy by training in a speaker-dependent fashion).
Computer Vision
Many current vision systems,
from face recognition systems, to systems that automatically classify
microscopic images of cells, are developed using machine learning, again
because the resulting systems are more accurate than hand-crafted programs. One
massive-scale application of computer vision trained using machine learning is
its use by the US Post Office to automatically sort letters containing
handwritten addresses. Over 85% of handwritten mail in the US is sorted
automatically, using handwriting analysis software trained to very high
accuracy using machine learning over a very large data set.
Bio-surveillance
A variety of government
efforts to detect and track disease outbreaks now use machine learning. For
example, the RODS project involves real-time collection of admissions reports
to emergency rooms across western Pennsylvania, and the use of machine learning
software to learn the profile of typical admissions so that it can detect
anomalous patterns of symptoms and their geographical distribution. Current
work involves adding in a rich set of additional data, such as retail purchases
of over-the-counter medicines to increase the information flow into the system,
further increasing the need for automated learning methods given this even more
complex data set.
Robot Control
Machine learning methods
have been successfully used in a number of robot systems. For example, several
researchers have demonstrated the use of machine learning to acquire control
strategies for stable helicopter flight and helicopter aerobatics. The recent
DARPA-sponsored competition involving a robot driving autonomously for over 100
miles in the desert was won by a robot that used machine learning to refine its
ability to detect distant objects (training itself from self-collected data
consisting of terrain seen initially in the distance, and seen later up close).
Accelerating Empirical Sciences
Many data-intensive sciences
now make use of machine learning methods to aid in the scientific discovery
process. Machine learning is being used to learn models of gene expression in
the cell from high-throughput data, to discover unusual astronomical objects
from massive data collected by the Sloan sky survey, and to characterize the
complex patterns of brain activation that indicate different cognitive states
of people in fMRI scanners. Machine learning methods are reshaping the practice
of many data-intensive empirical sciences, and many of these sciences now hold
workshops on machine learning as part of their field’s conferences.