Text Mining - Concepts, Implementation, and Big Data Challenge

by: Taeho Jo

Springer-Verlag, 2018

ISBN: 9783319918150, 376 pages

Format: PDF, read online

Copy protection: watermark

Mac OSX, Windows PC, all DRM-capable eReaders, Apple iPad, Android tablet PCs; online reading for: Mac OSX, Linux, Windows PC

Price: 149.79 EUR

More about the contents

Preface

Contents

Part I Foundation

1 Introduction

1.1 Definition of Text Mining

1.2 Texts

1.2.1 Text Components

1.2.2 Text Formats

1.3 Data Mining Tasks

1.3.1 Classification

1.3.2 Clustering

1.3.3 Association

1.4 Data Mining Types

1.4.1 Relational Data Mining

1.4.2 Web Mining

1.4.3 Big Data Mining

1.5 Summary

2 Text Indexing

2.1 Overview of Text Indexing

2.2 Steps of Text Indexing

2.2.1 Tokenization

2.2.2 Stemming

2.2.3 Stop-Word Removal

2.2.4 Term Weighting

2.3 Text Indexing: Implementation

2.3.1 Class Definition

2.3.2 Stemming Rule

2.3.3 Method Implementations

2.4 Additional Steps

2.4.1 Index Filtering

2.4.2 Index Expansion

2.4.3 Index Optimization

2.5 Summary

3 Text Encoding

3.1 Overview of Text Encoding

3.2 Feature Selection

3.2.1 Wrapper Approach

3.2.2 Principal Component Analysis

3.2.3 Independent Component Analysis

3.2.4 Singular Value Decomposition

3.3 Feature Value Assignment

3.3.1 Assignment Schemes

3.3.2 Similarity Computation

3.4 Issues of Text Encoding

3.4.1 Huge Dimensionality

3.4.2 Sparse Distribution

3.4.3 Poor Transparency

3.5 Summary

4 Text Association

4.1 Overview of Text Association

4.2 Data Association

4.2.1 Functional View

4.2.2 Support and Confidence

4.2.3 Apriori Algorithm

4.3 Word Association

4.3.1 Word Text Matrix

4.3.2 Functional View

4.3.3 Simple Example

4.4 Text Association

4.4.1 Functional View

4.4.2 Simple Example

4.5 Overall Summary

Part II Text Categorization

5 Text Categorization: Conceptual View

5.1 Definition of Text Categorization

5.2 Data Classification

5.2.1 Binary Classification

5.2.2 Multiple Classification

5.2.3 Classification Decomposition

5.2.4 Regression

5.3 Classification Types

5.3.1 Hard vs Soft Classification

5.3.2 Flat vs Hierarchical Classification

5.3.3 Single vs Multiple Viewed Classification

5.3.4 Independent vs Dependent Classification

5.4 Variants of Text Categorization

5.4.1 Spam Mail Filtering

5.4.2 Sentimental Analysis

5.4.3 Information Filtering

5.4.4 Topic Routing

5.5 Summary and Further Discussions

6 Text Categorization: Approaches

6.1 Machine Learning

6.2 Lazy Learning

6.2.1 K Nearest Neighbor

6.2.2 Radius Nearest Neighbor

6.2.3 Distance-Based Nearest Neighbor

6.2.4 Attribute Discriminated Nearest Neighbor

6.3 Probabilistic Learning

6.3.1 Bayes Rule

6.3.2 Bayes Classifier

6.3.3 Naive Bayes

6.3.4 Bayesian Learning

6.4 Kernel Based Classifier

6.4.1 Perceptron

6.4.2 Kernel Functions

6.4.3 Support Vector Machine

6.4.4 Optimization Constraints

6.5 Summary and Further Discussions

7 Text Categorization: Implementation

7.1 System Architecture

7.2 Class Definitions

7.2.1 Classes: Word, Text, and PlainText

7.2.2 Interface and Class: Classifier and KNearestNeighbor

7.2.3 Class: TextClassificationAPI

7.3 Method Implementations

7.3.1 Class: Word

7.3.2 Class: PlainText

7.3.3 Class: KNearestNeighbor

7.3.4 Class: TextClassificationAPI

7.4 Graphic User Interface and Demonstration

7.4.1 Class: TextClassificationGUI

7.4.2 Preliminary Tasks and Encoding

7.4.3 Classification Process

7.4.4 System Upgrading

7.5 Summary and Further Discussions

8 Text Categorization: Evaluation

8.1 Evaluation Overview

8.2 Text Collections

8.2.1 NewsPage.com

8.2.2 20NewsGroups

8.2.3 Reuter21578

8.2.4 OHSUMED

8.3 F1 Measure

8.3.1 Contingency Table

8.3.2 Micro-Averaged F1

8.3.3 Macro-Averaged F1

8.3.4 Example

8.4 Statistical t-Test

8.4.1 Student's t-Distribution

8.4.2 Unpaired Difference Inference

8.4.3 Paired Difference Inference

8.4.4 Example

8.5 Summary and Further Discussions

Part III Text Clustering

9 Text Clustering: Conceptual View

9.1 Definition of Text Clustering

9.2 Data Clustering

9.2.1 SubSubsectionTitle

9.2.2 Association vs Clustering

9.2.3 Classification vs Clustering

9.2.4 Constraint Clustering

9.3 Clustering Types

9.3.1 Static vs Dynamic Clustering

9.3.2 Crisp vs Fuzzy Clustering

9.3.3 Flat vs Hierarchical Clustering

9.3.4 Single vs Multiple Viewed Clustering

9.4 Derived Tasks from Text Clustering

9.4.1 Cluster Naming

9.4.2 Subtext Clustering

9.4.3 Automatic Sampling for Text Categorization

9.4.4 Redundant Project Detection

9.5 Summary and Further Discussions

10 Text Clustering: Approaches

10.1 Unsupervised Learning

10.2 Simple Clustering Algorithms

10.2.1 AHC Algorithm

10.2.2 Divisive Clustering Algorithm

10.2.3 Single Pass Algorithm

10.2.4 Growing Algorithm

10.3 K Means Algorithm

10.3.1 Crisp K Means Algorithm

10.3.2 Fuzzy K Means Algorithm

10.3.3 Gaussian Mixture

10.3.4 K Medoid Algorithm

10.4 Competitive Learning

10.4.1 Kohonen Networks

10.4.2 Learning Vector Quantization

10.4.3 Two-Dimensional Self-Organizing Map

10.4.4 Neural Gas

10.5 Summary and Further Discussions

11 Text Clustering: Implementation

11.1 System Architecture

11.2 Class Definitions

11.2.1 Classes in Text Categorization System

11.2.2 Class: Cluster

11.2.3 Interface: ClusterAnalyzer

11.2.4 Class: AHCAlgorithm

11.3 Method Implementations

11.3.1 Methods in Previous Classes

11.3.2 Class: Cluster

11.3.3 Class: AHC Algorithm

11.4 Class: ClusterAnalysisAPI

11.4.1 Class: ClusterAnalysisAPI

11.4.2 Class: ClusterAnalyzerGUI

11.4.3 Demonstration

11.4.4 System Upgrading

11.5 Summary and Further Discussions

12 Text Clustering: Evaluation

12.1 Introduction

12.2 Cluster Validations

12.2.1 Intra-Cluster and Inter-Cluster Similarities

12.2.2 Internal Validation

12.2.3 Relative Validation

12.2.4 External Validation

12.3 Clustering Index

12.3.1 Computation Process

12.3.2 Evaluation of Crisp Clustering

12.3.3 Evaluation of Fuzzy Clustering

12.3.4 Evaluation of Hierarchical Clustering

12.4 Parameter Tuning

12.4.1 Clustering Index for Unlabeled Documents

12.4.2 Simple Clustering Algorithm with Parameter Tuning

12.4.3 K Means Algorithm with Parameter Tuning

12.4.4 Evolutionary Clustering Algorithm

12.5 Summary and Further Discussions

Part IV Advanced Topics

13 Text Summarization

13.1 Definition of Text Summarization

13.2 Text Summarization Types

13.2.1 Manual vs Automatic Text Summarization

13.2.2 Single vs Multiple Text Summarization

13.2.3 Flat vs Hierarchical Text Summarization

13.2.4 Abstraction vs Query-Based Summarization

13.3 Approaches to Text Summarization

13.3.1 Heuristic Approaches

13.3.2 Mapping into Classification Task

13.3.3 Sampling Schemes

13.3.4 Application of Machine Learning Algorithms

13.4 Combination with Other Text Mining Tasks

13.4.1 Summary-Based Classification

13.4.2 Summary-Based Clustering

13.4.3 Topic-Based Summarization

13.4.4 Text Expansion

13.5 Summary and Further Discussions

14 Text Segmentation

14.1 Definition of Text Segmentation

14.2 Text Segmentation Type

14.2.1 Spoken vs Written Text Segmentation

14.2.2 Ordered vs Unordered Text Segmentation

14.2.3 Exclusive vs Overlapping Segmentation

14.2.4 Flat vs Hierarchical Text Segmentation

14.3 Machine Learning-Based Approaches

14.3.1 Heuristic Approaches

14.3.2 Mapping into Classification

14.3.3 Encoding Adjacent Paragraph Pairs

14.3.4 Application of Machine Learning

14.4 Derived Tasks

14.4.1 Temporal Topic Analysis

14.4.2 Subtext Retrieval

14.4.3 Subtext Synthesization

14.4.4 Virtual Text

14.5 Summary and Further Discussions

15 Taxonomy Generation

15.1 Definition of Taxonomy Generation

15.2 Relevant Tasks to Taxonomy Generation

15.2.1 Keyword Extraction

15.2.2 Word Categorization

15.2.3 Word Clustering

15.2.4 Topic Routing

15.3 Taxonomy Generation Schemes

15.3.1 Index-Based Scheme

15.3.2 Clustering-Based Scheme

15.3.3 Association-Based Scheme

15.3.4 Link Analysis-Based Scheme

15.4 Taxonomy Governance

15.4.1 Taxonomy Maintenance

15.4.2 Taxonomy Growth

15.4.3 Taxonomy Integration

15.4.4 Ontology

15.5 Summary and Further Discussions

16 Dynamic Document Organization

16.1 Definition of Dynamic Document Organization

16.2 Online Clustering

16.2.1 Online Clustering in Functional View

16.2.2 Online K Means Algorithm

16.2.3 Online Unsupervised KNN Algorithm

16.2.4 Online Fuzzy Clustering

16.3 Dynamic Organization

16.3.1 Execution Process

16.3.2 Maintenance Mode

16.3.3 Creation Mode

16.3.4 Additional Tasks

16.4 Issues of Dynamic Document Organization

16.4.1 Text Representation

16.4.2 Binary Decomposition

16.4.3 Transition into Creation Mode

16.4.4 Variants of DDO System

16.5 Summary and Further Discussions

References

Index