I. INTRODUCTION

Data analysis is the process of inspecting, transforming, and shaping data with the goal of discovering useful information and drawing conclusions. It supports decision making.
Information retrieval (IR) is carried out by various IR methods, and the data is then analyzed further. Typically, an IR system evaluates the relevance of document representations with respect to a query. There are various models for representing documents and queries, each with its own pros and cons. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names in different business, science, and social science domains. Data analysis is also closely associated with the visualization and dissemination of data.
The term data analysis is often used interchangeably with data modelling. Information retrieval refers to the task of extracting information resources relevant to an information need from a collection of information resources. Searches can usually be performed on metadata or on full-text (or other content-based) indexing. Hidden Markov models have been successfully designed and implemented over the last two decades for a wide variety of speech- and language-related recognition problems, including speech recognition, named entity finding, optical character recognition, topic identification, and many more [1].
In the present work, we describe an application of this technology to the ad hoc IR task [2]. In every HMM implementation, the observed data are modelled as the output produced by passing some unknown key through a noisy channel. In the case of ad hoc IR, we take the observed data to be the query, and the unknown key to be the desired relevant document. Thus, for each document we can compute the probability that it was the relevant document the user had in mind, given the query. We then rank the documents by this measure.

II. EXISTING MODEL

Data mining is a distinct data analysis technique that does not concentrate on purely descriptive purposes, but rather focuses on modelling and knowledge discovery for predictive purposes.
Business intelligence covers data analysis that relies heavily on aggregation and is aimed at business information. Customer data and IT tools form the substructure on which a successful CRM strategy is built. Moreover, the rapid expansion of the web and related technologies has substantially increased the number of marketing opportunities and has altered the way relationships between firms and their clients are balanced and supervised [3]. Predictive analytics aims at applying statistical models for estimation or categorization, while statistical, linguistic, and structural techniques are applied in text analysis to acquire and classify information from textual resources. Retrieving information from the web involves handling the abstractness and volume of the data contained on the internet.
Aspects such as word ambiguity and a large number of typographical errors make the task increasingly difficult. There are several key pitfalls in IR: relevance, evaluation, and information needs. However, this is not the complete set of issues involved in IR.
Common information retrieval problems include performance, scalability, and the frequency of page updates. Relevance is the relational value, within a dataset, of an input provided by the user in the form of a query; it is calculated using a ranking algorithm. The larger complications of IR, evaluation and relevance, remain significant subjects that require attention, among others.

The documents and the respective queries form a corpus of terms, where every term within a document is indexed. In the Boolean model, 1 and 0 denote the presence and absence of a term in a text source, respectively [4], [5]. Maintaining an inverted index of every term is necessary in order to match documents against queries. Nonetheless, this model has certain limitations, as explained below.
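As an illustrative sketch (the documents and query terms below are invented), the Boolean model with an inverted index can be implemented as follows:

```python
# Minimal sketch of Boolean retrieval over an inverted index.
# The documents and query terms are invented for illustration.
from collections import defaultdict

docs = {
    1: "data analysis supports decision making",
    2: "information retrieval ranks documents",
    3: "data mining and information retrieval",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def boolean_and(*terms):
    """Return ids of documents containing every term (1 = present, 0 = absent)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(boolean_and("information", "retrieval"))  # {2, 3}
```

Note that only the presence or absence of a term is recorded, which is precisely the binary decision criterion whose drawbacks are discussed next.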
The binary decision criterion has the disadvantage of lacking any concept of a grading scale. Another problem is the overloading of documents. Some researchers have worked to control the fragility of this model by improving on it, while others have approached data analysis with a different, vector-based search strategy. This is known as the vector space model [5], which denotes documents and queries as vectors.
In this model, every query and document is expressed as a vector in a |V|-dimensional space, where V, the vocabulary, is the set of all distinct terms in the document collection [5].

Markov processes were first proposed by the Russian mathematician Andrei Markov. In probability theory, a Markov model is a stochastic model used to model systems that change randomly. In this model, it is assumed that future states depend only on the present state, rather than on the sequence of events that preceded it [1], [2], [6].
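As a toy illustration of the Markov property (the states and transition probabilities below are invented), the next state is sampled using only the current state, with no memory of earlier states:

```python
import random

# Toy Markov chain: next-state probabilities depend only on the current state.
# States and transition probabilities are invented for illustration.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng):
    """Sample the next state given only the current one."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return rng.choices(states, weights=weights)[0]

rng = random.Random(0)
state = "sunny"
walk = [state]
for _ in range(5):
    state = next_state(state, rng)
    walk.append(state)
print(walk)
```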
There are four Markov models, used in different situations depending on the degree to which each sequential state is observable.

III. MODEL TO BE IMPLEMENTED

A hidden Markov model (HMM) is a statistical Markov model.
Here, the system to be modelled is assumed to be a Markov process with hidden, i.e. unobserved, states. The HMM can be regarded as the simplest dynamic Bayesian network [7].

To measure the effectiveness of ad hoc information retrieval in the standard way, we require a test collection consisting of three things:

- A collection of documents
- A test suite of information needs, expressed as a set of queries
- A set of relevance judgements, standardly a binary assessment of either relevant or irrelevant for each query-document pair
Earlier, the following parameters were used to evaluate the performance of IR systems:

1. Precision: the fraction of relevant documents among all retrieved documents. In practice, it measures the accuracy of the result.

   Precision = |Ra| / |A|

2. Recall: the fraction of retrieved relevant documents among all relevant documents. In practice, it measures the coverage of the result.

   Recall = |Ra| / |R|

where
Ra: set of relevant documents retrieved
A: set of all documents retrieved
R: set of all relevant documents

In pattern recognition and in IR with binary classification, precision refers to the fraction of retrieved instances that are relevant, while recall refers to the fraction of relevant instances that are retrieved.
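A minimal sketch of these two measures, over invented sets of document identifiers:

```python
# Precision and recall as defined above, on invented sets of document ids.
retrieved = {1, 2, 3, 4}                   # A: all documents returned by the system
relevant = {2, 4, 5, 6}                    # R: all documents that are truly relevant
relevant_retrieved = retrieved & relevant  # Ra: relevant documents retrieved

precision = len(relevant_retrieved) / len(retrieved)  # |Ra| / |A|
recall = len(relevant_retrieved) / len(relevant)      # |Ra| / |R|
print(precision, recall)  # 0.5 0.5
```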
Both precision and recall are therefore based on an understanding and measure of relevance.

A. Language used - Python

Python provides constructs that enable clear programs on both a small and a large scale. Its features include a dynamic type system and automatic memory management.
It also has a huge, comprehensive standard library, which provides tools suited to numerous tasks, including modules for creating GUIs, connecting to relational databases, generating pseudorandom numbers, performing decimal arithmetic with arbitrary precision, and manipulating regular expressions.
It is also capable of performing unit testing.

B. Dataset used

The OHSUMED test collection is a set of 348,566 references from MEDLINE, the online database of medical information available on the World Wide Web. The available fields in the database are title, MeSH indexing terms, author, source, and abstract. The existing OHSUMED topics define real information needs, although the relevance judgements do not have the same coverage as those produced by the TREC pooling process. The information needs are not expressed directly in MeSH terms, but these terms govern the indexing.
The topic statements are provided in the standard TREC format. The relevance judgments include:

(1) OHSU relevance judgments (files: qrels.ohsu.*)

Each query was replicated by four searchers: two physicians experienced in searching and two medical librarians. A completely different set of physicians assessed the results for relevance on a three-point scale: definitely, possibly, or not relevant. All documents judged either definitely relevant or possibly relevant are considered relevant.

(2) MeSH relevance judgments (files: qrels.mesh.*)

A document is considered relevant to a MeSH topic if the topic's concept is included in its list of MeSH term fields.

C. WHOOSH: Python Library

Whoosh was created by Matt Chaput.

- Whoosh is pure Python, so it runs anywhere Python runs, without requiring a compiler, and is reasonably fast.
- Whoosh uses the Okapi BM25F ranking function by default, but this can easily be changed.
- Whoosh creates fairly small indexes compared to many other search libraries.
- All indexed text in Whoosh must be Unicode.

Whoosh lets you index free-form or structured text and quickly search for matching documents using either simple or complex search criteria.

Some predefined field types are provided by Whoosh:

whoosh.fields.TEXT
This type is used for indexing text and storing the positions of terms. These positions allow phrase searching.
whoosh.fields.ID
The ID field indexes the entire value of the field as a single unit, rather than breaking it into separate terms.

whoosh.fields.STORED
This type is neither indexed nor searchable. It is useful for displaying information to the user in the search results.
whoosh.fields.KEYWORD
An indexed and searchable type, designed for comma- or space-separated words.
whoosh.fields.NUMERIC
This type can store int, long, or floating-point numbers in a compact, sortable format.

whoosh.fields.BOOLEAN
This field indexes boolean values and allows users to search with terms such as: true, false, 1, 0, t, f, yes, no.

whoosh.fields.DATETIME
This field stores datetime objects in a compact, highly sortable format.

A Format object defines what kind of information a field records about each term and how that information is stored on disk. For example, the Existence format stores only the fact that a term occurs in a document, whereas the Positions format also stores the positions at which it occurs. The indexing code passes the Unicode string for a field to the field's Format object.
The Format object calls an analyser, which breaks the string into tokens, and then encodes information about each token.

The inverted index maps terms to the documents in which they appear. It is also sometimes useful to store a term vector, a forward index that maps each document back to the terms that occur in it.

D. Creating an Index Object

To open an existing index in a directory:
1. Import whoosh.index.
2. Open the index directory.

To create an index in a directory:
1. Import os and os.path.
2. If the index directory does not exist, create it with os.mkdir("indexdir").
3. Create the index, passing the schema as a parameter.

The schema with which the index is created is stored with the index itself. Several indexes can be kept in the same directory by using the indexname keyword. For convenience, the index can be created and opened with the schema and index name as parameters; alternatively, using a Storage object, call its create-index method with the schema (and index name) and then open the storage.

The relevance of documents under the hidden Markov model is compared with the tf.idf approach.
Tf.idf is a numerical statistic used in the vector model. It reflects how important a word is to a document in a collection of documents, and it is often used in IR and data mining as a weighting factor. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the collection.
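As a rough sketch of this weighting (the corpus below is invented, and this variant uses raw term frequency with a log-scaled inverse document frequency; many other variants exist):

```python
import math

# Rough tf-idf sketch over an invented toy corpus.
corpus = [
    "data analysis supports decisions".split(),
    "information retrieval ranks documents".split(),
    "retrieval of medical information".split(),
]

def tf_idf(term, doc, corpus):
    """Raw term frequency times log-scaled inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)          # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0   # rarer terms weigh more
    return tf * idf

print(tf_idf("retrieval", corpus[1], corpus))
```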
This offsets the fact that some words are generally more frequent than others.

For the implementation, the first step is to design the schema; indexing is then performed [5], and tf.idf values are calculated using the Whoosh library in Python. For the HMM calculation, the observed data are taken to be the query Q, and the unknown key to be the desired relevant document D.
The mind of the user acts as a noisy channel: the user has some precise or rough notion of the documents required, and the channel transforms that notion into the query text Q. Hence, for each document D we compute the probability that it was the relevant one in the user's mind, given that Q was the query produced, i.e. P(D is R | Q). We then rank the documents by this measure [6]. This can be represented in the form of graphs.
These graphical structures represent information about an uncertain domain; in particular, nodes denote random variables and edges denote probabilistic dependencies between the random variables [8]. The term "hidden" signifies that an observer cannot see the state transitions or the underlying sequence by which the output is generated, but only the output itself [9]. The output distribution P(q | D) of a document D is set to the sample distribution of the words appearing in that document.
For any document Dk, we can explicitly set

P(q | Dk) = (number of times q appears in Dk) / (length of Dk),

the distribution with the maximum probability of producing Dk by repeated sampling. For the state "General English", the output distribution is estimated by

P(q | GE) = (Σk number of times q appears in Dk) / (Σk length of Dk),

where the summations are taken over all documents in the collection. Using the parameters estimated above, P(Q | Dk is R) is given by

P(Q | Dk is R) = Π over q in Q of [ a0 · P(q | GE) + a1 · P(q | Dk) ],

where a0 and a1 (with a0 + a1 = 1) are the transition probabilities into the "General English" state and the document state, respectively.

IV. ADVANTAGES

1. The hidden Markov model is a formal foundation for building probabilistic models of linear sequence labelling problems. It provides a conceptual toolkit from which complex models can be built just by drawing an intuitive picture. HMMs are at the heart of a diverse set of programs.
These programs include gene finding, multiple sequence alignment, profile searches, and regulatory site identification.

2. The HMM is a fully probabilistic model: the overall 'scores' generated for sequences and the parameters calculated are all probabilities [6], [9]. Hence, Bayesian probability theory can be used to manipulate these numbers in more powerful ways.
This includes the optimization of parameters and the interpretation of the significance of scores [5].

3. HMMs are useful for modelling processes that pass through distinct stages in a definite order [9]. If, for example, you want to model the behaviour of a technical system that first boots, then operates, then enters sleep mode, and later alternates between sleep and operation, you might use three states (boot, operate, sleep) and apply this process model to infer what is going on in the system at any given time. The same holds for a human biological system, where the observations can be the sequence of a patient's symptoms.
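The boot/operate/sleep example can be sketched as a small HMM. Every probability below is invented for illustration, and the observable (a CPU-load level) is a hypothetical choice; Viterbi decoding then recovers the most likely hidden state sequence from the observations:

```python
# Toy HMM for the boot -> operate <-> sleep example; all probabilities
# are invented for illustration. Viterbi decoding finds the most likely
# hidden state sequence given a sequence of observations.
states = ["boot", "operate", "sleep"]
start_p = {"boot": 1.0, "operate": 0.0, "sleep": 0.0}
trans_p = {
    "boot":    {"boot": 0.2, "operate": 0.8, "sleep": 0.0},
    "operate": {"boot": 0.0, "operate": 0.7, "sleep": 0.3},
    "sleep":   {"boot": 0.0, "operate": 0.4, "sleep": 0.6},
}
# Hypothetical emissions: observed CPU-load level in each hidden state.
emit_p = {
    "boot":    {"high": 0.7, "low": 0.3},
    "operate": {"high": 0.6, "low": 0.4},
    "sleep":   {"high": 0.1, "low": 0.9},
}

def viterbi(obs):
    """Return the most likely hidden state sequence for the observations."""
    # prob[s] = probability of the best path ending in state s
    prob = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    paths = {s: [s] for s in states}
    for o in obs[1:]:
        new_prob, new_paths = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prob[p] * trans_p[p][s])
            new_prob[s] = prob[best_prev] * trans_p[best_prev][s] * emit_p[s][o]
            new_paths[s] = paths[best_prev] + [s]
        prob, paths = new_prob, new_paths
    best = max(states, key=lambda s: prob[s])
    return paths[best]

print(viterbi(["high", "high", "low", "low"]))  # ['boot', 'operate', 'sleep', 'sleep']
```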
The Human Genome Project also makes use of HMMs for DNA sequencing and RNA structure analysis [10].

V. CHALLENGES

Complications such as scalability and the frequency of page updates are familiar IR issues. Ranking algorithms are implemented using methods that elucidate relationships between the given query and the accumulated documents. All the feedback provided by the IR system has to be evaluated, which is another issue in IR: the way the system behaves may or may not match the user's expectations, and the documents retrieved may not all be relevant to a given query.
The way a user interacts with the IR system is captured by the notion of information needs. In some systems, retrieving too much information can be disruptive; in others, failing to return the complete set of relevant data may be inadequate. In practice, handling voluminous information from the internet can be extremely difficult because of the sheer number of documents a server manages. A simple retrieval query can return thousands of documents, many of which are only loosely related to the original retrieval criteria.
To deal with this, an IR system needs query management that is efficient and capable enough to give priority, through weighting, to the documents that are most relevant to the user's query.