Enriching Education Through Data Mining

Education is acknowledged to be the primary vehicle for improving the economic well-being of people. Textbooks have a direct bearing on the quality of education imparted to students, as they are the primary conduits for delivering content knowledge. They are also indispensable for fostering teacher learning and constitute a key component of the ongoing professional development of teachers. Many textbooks, particularly from emerging countries, lack clear and adequate coverage of important concepts. In this talk, we present our early explorations into developing a data mining based approach for enhancing the quality of textbooks. We discuss techniques for algorithmically augmenting different sections of a book with links to selective content mined from the Web. For finding authoritative articles, we first identify the set of key concept phrases contained in a section. Using these phrases, we find web (Wikipedia) articles that represent the central concepts presented in the section and augment the section with links to them. We also describe a framework for finding images that are most relevant to a section of the textbook, while respecting global relevancy to the entire chapter to which the section belongs. We pose this problem of matching images to sections in a textbook chapter as an optimization problem and present an efficient algorithm for solving it. We also present a diagnostic tool for identifying those sections of a book that are not well written and hence should be candidates for enrichment. We propose a probabilistic decision model for this purpose, which is based on the syntactic complexity of the writing and the newly introduced notion of the dispersion of key concepts mentioned in the section. The model is learned using a tune set which is automatically generated in a novel way.
This procedure maps sampled textbook sections to the closest versions of Wikipedia articles having similar content and uses the maturity of those versions to assign need-for-enrichment labels. The maturity of a version is computed by considering the revision history of the corresponding Wikipedia article and convolving the changes in size with a smoothing filter. We also provide the results of applying the proposed techniques to a corpus of widely used high school textbooks published by the National Council of Educational Research and Training (NCERT), India. We consider books from grades IX-XII, covering four broad subject areas, namely, Sciences, Social Sciences, Commerce, and Mathematics. The preliminary results are encouraging and indicate that developing technological approaches to enhancing the quality of textbooks could be a promising direction for research for our field.
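
The maturity computation described above (changes in article size across the revision history, convolved with a smoothing filter) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the talk: the function names `size_changes` and `smooth`, the choice of a box (moving-average) filter, the window width of 3, and the sample revision sizes. The abstract does not say how the smoothed signal is turned into a maturity score, so the sketch stops at the smoothed change, where smaller values suggest a more stable version.

```python
def size_changes(sizes):
    """Absolute change in article size between consecutive revisions.

    `sizes` is a chronological list of revision sizes (hypothetical input).
    """
    return [abs(later - earlier) for earlier, later in zip(sizes, sizes[1:])]


def smooth(signal, width=3):
    """Smooth `signal` with a box (moving-average) filter of odd `width`.

    Assumption: a box filter stands in for the unspecified smoothing filter.
    At the edges the window is truncated and the average is taken over the
    samples that are actually available.
    """
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical revision sizes in bytes: rapid early growth, then stabilization.
sizes = [100, 900, 1500, 1600, 1610, 1612]
changes = size_changes(sizes)
smoothed = smooth(changes, width=3)

for revision, value in enumerate(smoothed, start=1):
    print(f"revision {revision}: smoothed size change = {value:.1f}")
```

In this toy history the smoothed change falls steadily as the article settles, which is the kind of signal that could be used to treat later versions as more mature than earlier ones.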