Identifying and analyzing research articles for machine learning and natural language processing is complicated by the lack of consistent organizing principles and heading naming conventions across publishers and journals. Understanding how research articles organize their content around common functions, and which heading terms, or forms, they use to label those functions, is therefore a critical step in applying machine learning techniques to data and information mining across a corpus of articles. To address this need, we developed and implemented an article heading form and function analysis across 12 publishers, covering both research and nonresearch articles. Our aim was to (a) identify each labeled section used by research articles, define these sections by their rhetorical function, and determine their frequency of use; (b) determine, within the given data set, all alternative labels used to identify these sections; and (c) determine whether these sections can be used to consistently establish whether an article is (1) a true research article or (2) not a research article. The results indicated wide variability in the organization of research articles, with 24 common sections known by 186 different names both within and across publishing houses.
All Science Journal Classification (ASJC) codes
- Library and Information Sciences
- Numerical Analysis
- Cultural Studies