Abstract: A growing number of applications that generate massive streams of data need intelligent data processing and online analysis. Real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments are examples of such applications. The pressing need to turn such unprocessed data into useful information and knowledge drives the development of systems, algorithms and frameworks that address data streaming challenges. The storage, querying and mining of such data sets are computationally demanding tasks. Mining data streams is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Generally, the two main challenges are designing fast mining methods for data streams and promptly detecting changing concepts and data distributions, which arise from the highly dynamic nature of data streams. The goal of this article is to analyze and classify the application of diverse data mining techniques to the different challenges of data stream mining. In this paper, we present the theoretical foundations of data stream analysis and propose an analytical framework for data stream mining techniques.
Keywords: Data Stream, Data Stream Mining, Stream Preprocessing.
1. Introduction
Data mining techniques are well suited to simple and structured data sets such as relational databases, transactional databases and data warehouses. The fast and continuous development of advanced database systems, data collection technologies, and the World Wide Web makes data grow rapidly in various and complex forms such as semi-structured and unstructured data, spatial and temporal data, and hypertext and multimedia data. Therefore, mining such complex data has become an important task in the data mining realm. In recent years, different approaches have been proposed to overcome the challenges of storing and processing fast and continuous streams of data. A data stream can be conceived as a continuous and changing sequence of data that continuously arrives at a system to be stored or processed. Imagine a satellite-mounted remote sensor that is constantly generating data. The data are massive (e.g., terabytes in volume), temporally ordered, fast changing, and potentially infinite. These features cause challenging problems in the data streams field. Traditional OLAP and data mining methods typically require multiple scans of the data and are therefore infeasible for stream data applications. Because data streams can be produced in many fields, it is crucial to modify mining techniques to fit data streams. Data stream mining has many applications and is an active research area. With recent progress in hardware and software technologies, different measurements can be made in various fields.
These measurements can be taken continuously, even for data that change at a high rate. Common applications that require mining large amounts of data to find new patterns are sensor networks, storage and search of web events, and computer network traffic. These patterns are valuable for decision making. Data stream mining refers to the extraction of informational structures, as models and patterns, from continuous data streams. Data streams pose challenges in many aspects, such as computation, storage, querying and mining. According to recent research, the requirements of data streams make it necessary to design new techniques to replace the old ones. Traditional methods would require the data to be first stored and then processed off-line using complex algorithms that make several passes over the data, but a data stream is infinite and is generated at a high rate, so it is impossible to store it.
Therefore, there are two main research challenges. The first is designing fast and light mining methods for data streams, for example algorithms that require only one pass over the data and work with limited memory. The second is created by the highly dynamic nature of data streams: stream mining algorithms need to promptly detect changing concepts and data distributions and adapt to them.
2. Data stream mining
High-volume and potentially infinite data streams are generated by many sources such as real-time surveillance systems, communication networks, Internet traffic, on-line transactions in the financial market or retail industry, electric power grids, industry production processes, scientific and engineering experiments, remote sensors, and other dynamic environments. In the data stream model, data items can be relational tuples such as network measurements and call records. In comparison with traditional data sets, a data stream flows continuously through a system with a varying update rate. Data streams are continuous, temporally ordered, fast changing, massive and potentially infinite.
Due to the huge volume and high storage cost, it is impossible to store an entire data stream or to scan through it multiple times. This creates many challenges for the storage, computation and communication capabilities of computing systems. Because of the high volume and speed of the input data, semi-automatic interactive techniques are needed to extract embedded knowledge from the data. Data stream mining is the extraction of knowledge structures, represented as models and patterns, from infinite streams of information. To extract knowledge or patterns from data streams, it is crucial to develop methods that analyze and process streams of data in a multidimensional, multi-level, single-pass and online manner. These methods should not be limited to data streams, because they are also needed for any large volume of data. Moreover, because of the constraints of data streams, the proposed methods are based on statistics, computation and complexity theory. For example, summarization techniques derived from statistics can be used to cope with memory limitations. In addition, some techniques from computation theory can be used to implement time- and space-efficient algorithms. With these techniques, common data mining approaches can also be applied to data streams after some modification.
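As an illustration of a statistical summary that copes with the memory limitation, the following minimal sketch maintains the count, mean and variance of a numeric stream in a single pass and in constant memory. It uses Welford's update rule, which is a choice made here for illustration and is not prescribed by the text; the names `RunningStats` and `summarize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-memory summary of a numeric stream (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new item into the summary; the item is not stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two items have been seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream):
    """Consume an iterable once and return its running summary."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Because each item is folded into three numbers and then discarded, memory use stays the same no matter how long the stream runs.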
Several solutions have been proposed for the problems and challenges of data stream mining. Data-based techniques either choose a subset of the incoming stream to be analyzed or summarize the whole dataset. Sampling, load shedding and sketching represent the former; synopsis data structures and aggregation represent the latter. Task-based techniques are methods that modify existing techniques or invent new ones in order to address the computational challenges of data stream processing. Approximation algorithms, sliding window and algorithm output granularity represent this category.
Sampling refers to the probabilistic choice of whether a data item is processed or not. The problem with using sampling in the context of data stream analysis is that the dataset size is unknown. Thus, the treatment of a data stream requires a special analysis to find the error bounds. Another problem is that sampling can miss anomalies, which matter in applications such as surveillance analysis, so it may not be the right choice for such applications. Sampling also does not address the problem of fluctuating data rates. It would be worth investigating the relationship among three parameters: data rate, sampling rate and error bounds.
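One well-known way to sample when the dataset size is unknown is reservoir sampling. The sketch below is an illustrative example and not a method taken from this paper: it keeps a uniform random sample of `k` items from a stream of unknown length. The function name and the optional `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable
    whose length is not known in advance (classic reservoir sampling)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1) by choosing
            # a slot uniformly from 0..i and replacing only if it is < k.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The sample size is fixed, so this does not adapt to fluctuating data rates, and rare anomalies are still likely to be missed, which is consistent with the limitations noted above.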
Load shedding refers to the process of dropping a sequence of data items from the stream. It has been used successfully in querying data streams, and it has the same problems as sampling. Load shedding is difficult to use with mining algorithms because it drops chunks of the data stream that could be used in structuring the generated models or that might represent a pattern of interest in time series analysis.
Sketching is the process of randomly projecting a subset of the features, that is, of sampling the incoming stream vertically. Sketching has been applied in comparing different data streams and in aggregate queries. Its major drawback is accuracy, which makes it hard to use in the context of data stream mining.
Creating a synopsis of data refers to applying summarization techniques that are capable of summarizing the incoming stream for further analysis. Wavelet analysis, histograms, quantiles and frequency moments have been proposed as synopsis data structures. Since a synopsis does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures.
Aggregation is the process in which the input stream is represented in a summarized form. The aggregate data can be used in data mining algorithms. The main problem of this method is that highly fluctuating data distributions reduce its efficiency.
Approximation algorithms have their roots in algorithm design, which is concerned with designing algorithms for computationally hard problems. These algorithms produce an approximate solution with error bounds. The idea is that mining data streams is a computationally hard problem, given the continuity and speed of the stream and the resource-constrained environment that generates it.
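To make the notion of an approximate solution with error bounds concrete, the following sketch uses the Misra-Gries frequent-items algorithm. It is chosen here purely as an illustration and is not named in the text. With at most `k - 1` counters and one pass over the stream, each reported count is never above the true count and falls short of it by at most `n / k`, where `n` is the number of items seen.

```python
def misra_gries(stream, k):
    """One-pass approximate frequency counts using at most k - 1 counters.

    Returns (counters, n). For every item, the estimate
    counters.get(item, 0) satisfies
        true_count - n / k <= estimate <= true_count.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    n = 0
    for item in stream:
        n += 1
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those
            # that reach zero. The new item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, n
```

Choosing a larger `k` tightens the error bound at the cost of more memory, which is the trade-off between accuracy and resources that the paragraph above describes.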
Preprocessing techniques for data stream mining:
· Data-based solutions
  1. Sampling
  2. Load shedding
  3. Sketching
  4. Synopsis data structures
  5. Aggregation
· Task-based solutions
  1. Approximation algorithms
  2. Sliding window
  3. Algorithm output granularity
Approximation algorithms have attracted researchers as a direct solution to data stream mining problems. However, the problem of data rates with regard to the available resources cannot be solved using approximation algorithms alone. Other tools should be used along with these algorithms in order to adapt to the available resources. Approximation algorithms have been used in.
The inspiration behind the sliding window is that the user is more concerned with the analysis of the most recent data. Thus, a detailed analysis is done over the most recent data items and over summarized versions of the old ones.
3. Classification of data stream challenges
There are different challenges in data stream mining that give rise to many research issues in this field. Given the requirements of data streams, developing stream mining algorithms needs more study than traditional mining methods. We can classify stream mining challenges into five categories: irregular and time-varying data arrival rate; quality of mining results; bounded memory size and huge amount of data; limited resources, e.g., memory space and computation power; and facilitating data analysis and quick decision making for users. Each of them is described in the following.
One of the most important issues in data stream mining is optimizing the memory space consumed by the mining algorithm. Memory management is a main challenge in stream processing because many real data streams have an irregular arrival rate that varies over time. In many applications, such as sensor networks, stream mining algorithms with a high memory cost are not applicable. Therefore, it is necessary to develop summarizing techniques for collecting valuable information from data streams.
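The sliding window idea from Section 2 is one way to keep memory bounded under an irregular arrival rate. The sketch below is a minimal illustration under stated assumptions: the window is count-based, items are numeric, and expired items are summarized only by their count and sum. The class name `SlidingWindowSummary` and its methods are hypothetical.

```python
from collections import deque


class SlidingWindowSummary:
    """Keep the most recent items in full detail and only a small
    summary (count and sum) of the items that have left the window."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.window = deque()   # detailed view of recent items
        self.old_count = 0      # summary of expired items
        self.old_sum = 0.0

    def add(self, x):
        """Append a new item; fold the oldest one into the summary
        once the window is full."""
        self.window.append(x)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.old_count += 1
            self.old_sum += expired

    def recent_mean(self):
        """Mean of the items currently in the window, or None if empty."""
        return sum(self.window) / len(self.window) if self.window else None

    def old_mean(self):
        """Mean of all expired items, or None if nothing has expired."""
        return self.old_sum / self.old_count if self.old_count else None
```

Memory use is bounded by `window_size` plus two numbers, regardless of how many items arrive or how unevenly they arrive.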
Data preprocessing is an important and time-consuming phase in the knowledge discovery process and must be taken into consideration when mining data streams. Designing a lightweight preprocessing technique that can guarantee the quality of the mining results is crucial. The challenge here is to automate such a process and integrate it with the mining techniques.
Considering the size of memory and the huge amount of stream data that continuously arrives at the system, a compact data structure is needed to store, update and retrieve the collected information. Without such a data structure, the efficiency of the mining algorithm will decrease considerably. Even if the information is stored on disk, the additional I/O operations will increase the processing time. Since it is impossible to rescan the entire input data, incremental maintenance of the data structure is indispensable. Furthermore, novel indexing, storage and querying techniques are required to manage the continuous and changing flow of data streams.
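One possible example of a compact, incrementally maintained structure is a Count-Min sketch, shown below as a minimal illustration. The paper does not name this structure; it is used here only to show what storing, updating and retrieving in fixed memory can look like. The default `width` and `depth` values and the use of salted BLAKE2b hashes are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table updated incrementally, one item at a
    time. Estimates never fall below the true count but may exceed it
    when different items collide in every row."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived by salting
        # BLAKE2b with the row number.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        """Record `count` occurrences of `item` without storing it."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return an upper-biased estimate of the item's frequency."""
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )
```

The table size is fixed at `width * depth` counters, so no rescan of the input and no disk access is needed to answer a frequency query.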
It is crucial to consider limited resources such as memory space and computation power in order to reach accurate estimates in data stream mining. If stream mining algorithms consume the available resources without any consideration, the accuracy of their results will decrease dramatically. Several papers discuss this issue and propose solutions for resource-aware mining.
Visualization is a powerful way to facilitate data analysis. The absence of suitable tools for visualizing mining results causes many problems in data analysis and quick decision making by the user. This challenge is still an open research issue; one of the proposed approaches is intelligent monitoring.
4. The proposed analytical framework
This research results in an analytical framework. The framework tries to show the efficiency of data mining applications in developing novel data stream mining algorithms. These algorithms are classified based on the data mining tasks. We described the details of these algorithms based on the preprocessing steps and the following steps. In addition, this framework can direct future work in this field. Some of the most important results reached during this research are:
(1) Mining data streams has raised a number of research challenges for the data mining community. Due to the resource and time constraints, many summarization and approximation techniques have been adopted from the fields of statistics and computational theory.
(2) There are many open issues that need to be addressed. The development of systems that will fully address these issues is crucial for accelerating scientific discovery in the fields of physics and astronomy, as well as in business and financial applications.
5. Conclusion
In this paper we reviewed and analyzed data mining applications for solving data stream mining challenges. First, we presented a comprehensive classification of data stream mining algorithms based on data mining applications. In this classification, we separate algorithms with preprocessing from those without preprocessing. In addition, we classify preprocessing techniques in a distinct classification. The layered architecture of the classification then represents almost all of the challenges mentioned in various studies. We then discussed the application of data mining techniques to addressing the challenges of data stream mining. The results show that it is necessary to adopt many summarization and approximation techniques from the fields of statistics and computational theory, in addition to the crucial changes needed in common data mining techniques. In spite of the research that has been done so far on the application of data mining to data stream mining, there are still wide areas for further research.