1 Introduction

The present project was created as the assignment for the course “Historical Inquiries with R”, taught by Prof. Maxim Romanov during the spring semester of 2020 at the University of Vienna. It is part of my Ph.D. dissertation, which has the working title The Zheng He Missions: A Critical Discourse Analysis of Recent Mainland Chinese Historiography.

The seven large-scale maritime missions ordered by China’s Ming dynasty and led by the imperial admiral Zheng He between 1405 and 1433 CE included visits to a number of port cities in Southeast Asia, South Asia, the Middle East, and East Africa. In the early 20th century, Chinese historians rediscovered the missions as a symbol of the country’s past maritime might, especially in the context of anti-colonial struggles. Since the start of the post-Mao ‘Reform and Opening-up’ era (1978-), the mostly peaceful missions have often served as a symbol of ‘peaceful rise’, economic openness, and ‘win-win’ cooperation with the outside world, principles propagated by the country’s leadership.

In my Ph.D. dissertation I investigate the historiography on the missions during the last two decades in the light of China’s growing self-perception as a (re)emerging regional and global power. The present project analyzes publication data (incl. abstracts) on 3,467 items related to the missions, downloaded from China’s largest academic database, the China National Knowledge Infrastructure (CNKI).

library(knitr)
library(yaml)
library(tinytex)
library(dplyr)
library(stringr)
library(NLP)
library(ggplot2)
library(tidyr)
library(xlsx)
library(tidytext)
library(stopwords)
library(readr)
library(tm)
library(topicmodels)
library(tictoc)
library(wordcloud)
library(LDAvis)
library(slam)
library(servr)

2 Pre-processing of the data

The data used for the project was exported from CNKI as a TXT file structured according to the RefWorks format. The original TXT file was reformatted into a TSV file using RegEx in Sublime Text and Python (for the original search terms and the reformatting process, see Hompot_Zheng_He_project_TXTtoTSV.html).

#Adjustments in Sublime: Reference_type entries decapitalized, underscores added,
#the unnecessary initial string "<正>" removed from a number of abstracts.
tsv_file = read.csv("C:/Users/Siming/Desktop/siming/classes/R/AA_project/PRO_ZHENG_HE1.csv", sep="\t", encoding="UTF-8", header=TRUE, na.strings=c("","NA"))
glimpse(tsv_file)
## Rows: 3,466
## Columns: 11
## $ Reference_type   <chr> "journal_article", "journal_article", "journal_art...
## $ Ref_subtype      <chr> NA, NA, NA, NA, "<U+7855><U+58EB>", "<U+7855><U+58EB>", NA, NA, NA, NA, "<U+7855><U+58EB>", ...
## $ Title            <chr> "<U+62C9><U+65AF><U+6D77><U+9A6C><U+963F><U+5C14><U+9A6C><U+5854><U+592B><U+9057><U+5740>2019<U+5E74><U+8003><U+53E4><U+6536><U+83B7>", "<U+90D1><U+548C><U+4E0B><U+897F><U+6D0B><U+5BF9><U+6211><U+56FD><U+519C><U+4E1A><U+76F8><U+5173><U+7ECF><U+6D4E><U+7684><U+5F71><U+54CD>", "<U+90D1><U+548C><U+4E0B><U+897F>...
## $ Language         <chr> "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "...
## $ Author           <chr> "<U+7FDF><U+6BC5>;<U+90A2><U+589E><U+9510>;<U+4E01><U+94F6><U+5FE0>;<U+738B><U+5149><U+5C27>;<U+5F20><U+7136>;<U+5F20><U+5E0C><U+7AE0>;<U+82CF><U+5929><U+654F>;<U+5F6D><U+535A>;<U+738B><U+601D><U+9633>;<U+5F20><U+5B87><U+4EAE>;<U+80E1><U+5B50><U+5C27>;<U+80E1><U+5146><U+8F89>;", "...
## $ Author_address   <chr> "<U+6545><U+5BAB><U+535A><U+7269><U+9662>;<U+963F><U+8054><U+914B><U+62C9><U+65AF><U+6D77><U+9A6C><U+53E4><U+7269><U+4E0E><U+535A><U+7269><U+9986><U+90E8>;<U+82F1><U+56FD><U+675C><U+4F26><U+5927><U+5B66><U+8003><U+53E4><U+7CFB>;<U+5409><U+6797><U+5927><U+5B66><U+8003><U+53E4><U+5B66><U+9662>;", "<U+5317><U+4EAC><U+4E2D><U+519C><U+5BCC>...
## $ Publishing_place <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Publisher        <chr> "<U+6545><U+5BAB><U+535A><U+7269><U+9662><U+9662><U+520A>", "<U+7518><U+8083><U+519C><U+4E1A>", "<U+56FD><U+5BB6><U+822A><U+6D77>", "<U+4E2D><U+56FD><U+53F2><U+7814><U+7A76><U+52A8><U+6001>", "<U+5357><U+4EAC><U+5E08><U+8303><U+5927><U+5B66>", "<U+9655>...
## $ Year             <int> 2020, 2020, 2018, 2012, 2012, 2012, 2012, 2012, 20...
## $ Keywords         <chr> "<U+6731><U+5C14><U+6CD5>;<U+963F><U+5C14><U+9A6C><U+5854><U+592B>;<U+660E><U+521D><U+5FA1><U+7A91><U+4EA7><U+54C1>;<U+90D1><U+548C>; Jufar;;Al-Mataf;;early Ming ...
## $ Abstract         <chr> "2019<U+5E74>11<U+6708>-12<U+6708>,<U+6545><U+5BAB><U+535A><U+7269><U+9662><U+8003><U+53E4><U+7814><U+7A76><U+6240><U+548C><U+963F><U+8054><U+914B><U+62C9><U+65AF><U+6D77><U+9A6C><U+53E4><U+7269><U+4E0E><U+535A><U+7269><U+9986><U+90E8><U+3001><U+82F1><U+56FD><U+675C><U+4F26><U+5927><U+5B66><U+8003><U+53E4><U+7CFB><U+53CA>...
df <- tsv_file %>%
  arrange(Year)

#Adding ID columns, ID_str including a unique id, reference type, and publication year.
df$ID_num <- str_pad(seq(nrow(df)), 4, pad="0")

df <- df %>%
  mutate(ID_str = paste0(ID_num, "_", Reference_type, "_", Year))

df <- df %>%
  select(ID_num, ID_str, Reference_type, Title, Language, Author, Author_address, Publishing_place, Publisher, Year, Keywords, Abstract)
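
The reading and ID-building steps above can be tried on a toy table. The following is a minimal sketch under stated assumptions, not the project's code: the file contents and the names `toy_path` and `toy` are invented for illustration, and `sprintf("%04d", ...)` is swapped in for `str_pad()` so that the block needs only base R.

```r
# Toy stand-in for the CNKI export; all values are invented for illustration.
toy_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Reference_type\tTitle\tYear",
  "journal_article\tTitle A\t2005",
  "standard\tTitle B\t",
  "ma_thesis\tTitle C\t1985"
), toy_path)

# Empty fields and the literal string "NA" are both read as missing values.
toy <- read.delim(toy_path, encoding = "UTF-8", na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)

# Sort by year (rows with a missing year go last), then number the rows.
toy <- toy[order(toy$Year), ]
toy$ID_num <- sprintf("%04d", seq_len(nrow(toy)))
toy$ID_str <- paste0(toy$ID_num, "_", toy$Reference_type, "_", toy$Year)

toy[, c("ID_str", "Title")]
```

One detail the sketch makes visible: items without a publication year receive an ID ending in `_NA`, because `paste0()` converts a missing year to the string "NA".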

3 Analysis and visualization

3.1 Publication years and types of items

In the following section, I summarize publication years, and the number and ratio of various types of items within the corpus (journal articles, dissertations, etc.).

pub_years <- df %>%
  group_by(Year) %>%
  count(sort=TRUE)

pub_years
## # A tibble: 56 x 2
## # Groups:   Year [56]
##     Year     n
##    <int> <int>
##  1  2005   497
##  2    NA   473
##  3  2014   173
##  4  2015   157
##  5  2006   156
##  6  2013   126
##  7  2012   107
##  8  2011   105
##  9  1985   100
## 10  2004    99
## # ... with 46 more rows
ggplot(pub_years, aes(x=Year, y=n)) +
  geom_line() +
  scale_x_continuous(breaks=seq(1930,2020,5)) +
  labs(title="Number of articles by year")

reftype_n <- df %>%
  group_by(Reference_type) %>%
  count(sort=T)

reftype_n
## # A tibble: 10 x 2
## # Groups:   Reference_type [10]
##    Reference_type            n
##    <chr>                 <int>
##  1 journal_article        2450
##  2 standard                472
##  3 conference_proceeding   286
##  4 ma_thesis               120
##  5 newspaper_article        89
##  6 phd_dissertation         26
##  7 patent                   19
##  8 other                     2
##  9 dissertation_thesis       1
## 10 <NA>                      1
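
The section announces ratios as well as counts, so here is a short sketch of how the shares can be derived from the counts printed above. The counts are copied from the output; the name `unclassified_NA` for the row without a reference type is my own label.

```r
# Counts copied from the reftype_n output above.
reftype_counts <- c(
  journal_article = 2450, standard = 472, conference_proceeding = 286,
  ma_thesis = 120, newspaper_article = 89, phd_dissertation = 26,
  patent = 19, other = 2, dissertation_thesis = 1,
  unclassified_NA = 1  # the <NA> row; label chosen for this sketch
)

# Share of each type in percent, rounded to one decimal place.
reftype_share <- round(100 * reftype_counts / sum(reftype_counts), 1)
reftype_share
```
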
#Checking out what kind of articles are referenced as 'standard' and 'patent'.
#Since RStudio does not properly display Chinese documents (shows Unicodes in the format
#<U+xxxx> instead), I wrote out most results into separate data sets and chose them from
#the 'Data' list to view them (equivalent to View() function)
standard_titles <- df %>%
  filter(Reference_type == "standard") %>%
  select(Title)

patent_titles <- df %>%
  filter(Reference_type == "patent") %>%
  select(Title)

‘Standards’ are apparently mostly related to the organization of commemorative events on the missions, as well as to the establishment of Zheng He-themed touristic sites. ‘Patents’ are mostly related to replicas of Zheng He’s ships.

reftype_year <- df %>%
  group_by(Year, Reference_type) %>%
  count()

reftype_year <- reftype_year %>%
  filter(!is.na(Year)) %>%
  filter(!is.na(Reference_type))

#reftype_year_spread <- spread(reftype_year, Reference_type, n)

#reftype_year_spread <- reftype_year_spread %>%
  #select(journal_article, standard, conference_proceeding, ma_thesis, newspaper_article, phd_dissertation)

ggplot(reftype_year, aes(x=Year, y=n)) +
  geom_line(aes(color=Reference_type)) +
  scale_x_continuous(breaks=seq(1930,2020,10))

The dataset shows that journal articles are by far the most common type of publication on the Zheng He missions. The number of publications went through several spikes during the last decades. The first one, around 1985, is apparently related to the commemoration of the 580th anniversary of the first mission (launched in 1405 CE), which was strongly promoted by the Chinese state at the time, in the context of China’s recently launched post-Mao ‘Reform and Opening-up Policy’ (1978-). The largest spike, in 2005, is apparently related to the 600th anniversary of the first mission, which was also widely commemorated. A third spike can be seen around 2014, which is likely to be related to the inauguration of the “21st Century Maritime Silk Roads” initiative (part of the Chinese state’s global development strategy known as the Belt & Road Initiative, BRI, earlier known as “One Belt, One Road”, OBOR). In recent years, the number of articles has dropped. I believe this is likely to be related to the rising popularity of other keywords (esp. Belt & Road Initiative, Maritime Silk Roads) in publications with overlapping topics.

#Checking the numbers of items other than journal articles (also removing "other" and
#the unclassified "dissertation_thesis", which have marginal values only)
reftype_year2 <- reftype_year %>%
  filter(Reference_type != "journal_article") %>%
  filter(Reference_type != "other") %>%
  filter(Reference_type != "dissertation_thesis")

ggplot(reftype_year2, aes(x=Year, y=n)) +
  geom_line(aes(color=Reference_type)) +
  scale_x_continuous(breaks=seq(1930,2020,10))
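
A caveat about line plots of sparse counts: `geom_line()` connects consecutive observations, so a year in which a type has no items is drawn as a slope between its neighbours instead of as zero. The sketch below shows one way to make the zeros explicit with `tidyr::complete()`. The toy values and the names `toy_counts` and `toy_filled` are invented; with the real data, `reftype_year` would first need `ungroup()`, since it is still grouped after `count()`.

```r
library(dplyr)
library(tidyr)

# Toy counts standing in for reftype_year (values invented for illustration).
toy_counts <- tibble(
  Year = c(2003, 2005, 2005, 2008),
  Reference_type = c("ma_thesis", "ma_thesis", "standard", "standard"),
  n = c(2L, 5L, 40L, 1L)
)

# Add a row for every year/type combination in the range; missing counts become 0.
toy_filled <- toy_counts %>%
  complete(Year = full_seq(Year, 1), Reference_type, fill = list(n = 0L))

toy_filled
```

The filled table can be passed to the same `ggplot()` call as above.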

3.2 Data on publishing languages, institutions, journals

lang_count <- df %>%
  group_by(Language) %>%
  count(sort=T)

The language count is as follows: Chinese: 2952 (“Chinese;” 2863 + “Chinese” 89), NA: 510, English: 2, Chinese-English: 1, Other: 1

# Checking articles with NA and 'Other' value:
lang_check <- df %>%
  filter(is.na(Language)|Language=='其他;')

Most of the NA value articles are of ‘standard’ and ‘patent’ type, for which no language data is provided by CNKI (they are apparently all in Chinese). There is a limited number of English articles, as well as a few instances of Korean and Japanese titles.

publisher_count_jour_pre1980 <- df %>%
  filter(Year < 1980 & Reference_type == "journal_article") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_jour_1980to1999 <- df %>%
  filter(Year < 2000 & Year >= 1980 & Reference_type == "journal_article") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_jour_2000to2009 <- df %>%
  filter(Year < 2010 & Year >= 2000 & Reference_type == "journal_article") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_jour_since2010 <- df %>%
  filter(Year >= 2010 & Reference_type == "journal_article") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_conf_1980to1999 <- df %>%
  filter(Year < 2000 & Year >= 1980 & Reference_type == "conference_proceeding") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_conf_since2000 <- df %>%
  filter(Year >= 2000 & Reference_type == "conference_proceeding") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_news_1980to1999 <- df %>%
  filter(Year < 2000 & Year >= 1980 & Reference_type == "newspaper_article") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_news_since2000 <- df %>%
  filter(Year >= 2000 & Reference_type == "newspaper_article") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_maThesis_1980to1999 <- df %>%
  filter(Year < 2000 & Year >= 1980 & Reference_type == "ma_thesis") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_maThesis_since2000 <- df %>%
  filter(Year >= 2000 & Reference_type == "ma_thesis") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_phdDissertation_1980to1999 <- df %>%
  filter(Year < 2000 & Year >= 1980 & Reference_type == "phd_dissertation") %>%
  group_by(Publisher) %>%
  count(sort=T)

publisher_count_phdDissertation_since2000 <- df %>%
  filter(Year >= 2000 & Reference_type == "phd_dissertation") %>%
  group_by(Publisher) %>%
  count(sort=T)

pubPlace_count_conf_1980to1999 <- df %>%
  filter(Year < 2000 & Year >= 1980 & Reference_type == "conference_proceeding") %>%
  group_by(Publishing_place) %>%
  count(sort=T)

pubPlace_count_conf_since2000 <- df %>%
  filter(Year >= 2000 & Reference_type == "conference_proceeding") %>%
  group_by(Publishing_place) %>%
  count(sort=T)

authorAddress_count_1980to1999 <- df %>%
  filter(Year < 2000 & Year >= 1980) %>%
  group_by(Author_address) %>%
  count(sort=T)

authorAddress_count_since2000 <- df %>%
  filter(Year >= 2000) %>%
  group_by(Author_address) %>%
  count(sort=T)

#For further analysis, the top results were copied into
#"PRO_ZHENG_HE_publications_data.xlsx", where English translations were added.
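
The period-by-period counts above repeat the same pattern many times. As a possible alternative, the periods can be encoded once with `cut()` and counted in a single table. This is a sketch on invented toy data; the `Period` column and its labels are my own naming, chosen to mirror the variable names used above.

```r
library(dplyr)

# Toy records; publisher names and years are invented for illustration.
toy_df <- tibble(
  Year = c(1975, 1985, 1985, 2005, 2005, 2012, NA),
  Reference_type = c("journal_article", "journal_article", "journal_article",
                     "journal_article", "conference_proceeding",
                     "journal_article", "standard"),
  Publisher = c("Journal A", "Journal A", "Journal A", "Journal B",
                "Society C", "Journal B", NA)
)

# right = FALSE makes the intervals [1980, 2000), [2000, 2010), etc.
publisher_counts <- toy_df %>%
  filter(!is.na(Year)) %>%
  mutate(Period = cut(Year,
                      breaks = c(-Inf, 1980, 2000, 2010, Inf),
                      labels = c("pre1980", "1980to1999", "2000to2009", "since2010"),
                      right = FALSE)) %>%
  count(Reference_type, Period, Publisher, sort = TRUE)

# One slice of the combined table, e.g. journal articles from 1980 to 1999.
publisher_counts %>%
  filter(Reference_type == "journal_article", Period == "1980to1999")
```

The same pipeline works for `Publishing_place` or `Author_address` by swapping the last column in `count()`.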

Journals in which publications on the Zheng He missions appeared are mostly related to maritime technology, maritime history, the Muslim Hui minority of China (to which Zheng He belonged), history teaching, and his birthplace (near Kunming, Yunnan province). For more information, including other types of publications, see “PRO_ZHENG_HE_publications_data.xlsx”.

3.3 Authors and keywords

#The tricky thing about the authors and keywords data in CNKI is that it is provided as a
#list usually including several elements. They need to be reorganized into a list of
#single values for a meaningful analysis of how often one certain author or keyword
#appears. I will analyze authors and keywords of the last 20 years in academic
#publications.
authors_academic_since2000 <- df %>%
  filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation") %>%
  select(Author)
string_authors1 <- paste(unlist(authors_academic_since2000), collapse = ";")

authors1_list <- as.list(strsplit(string_authors1, ';|;;|,'))
authors1_df <- data.frame(matrix(unlist(authors1_list), nrow=length(authors1_list), byrow=TRUE))

authors1_df <- t(authors1_df)

authors1_df <- as.data.frame(authors1_df)

authors1_count <- authors1_df %>%
  group_by(V1) %>%
  count(sort=T)

#View(authors1_count)
keywords_academic_since2000 <- df %>%
  filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation") %>%
  select(Keywords)

string_kw1 <- paste(unlist(keywords_academic_since2000), collapse = ";")


kw1_list <- as.list(strsplit(string_kw1, ';|;;|,|; |;; '))

kw1_df <- data.frame(matrix(unlist(kw1_list), nrow=length(kw1_list), byrow=TRUE))

kw1_df <- t(kw1_df)

kw1_df <- as.data.frame(kw1_df)

kw1_count <- kw1_df %>%
  group_by(V1) %>%
  count(sort=T)

#View(kw1_count)
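
The splitting step can also be written as a small helper. The sketch below is an alternative under stated assumptions, not the code used for the results reported here: `split_multi` is a hypothetical name, the toy keyword strings are invented, and the single pattern `\\s*[;,]+\\s*` is my substitute for the alternation above. It also drops missing records before splitting, so that they are not pasted in as the string "NA" and counted as a value.

```r
# Split multi-valued fields (authors, keywords) into single values.
split_multi <- function(x) {
  # One or more ';' or ',' with optional surrounding whitespace is one separator.
  parts <- unlist(strsplit(x[!is.na(x)], "\\s*[;,]+\\s*"))
  parts <- trimws(parts)
  parts[nzchar(parts)]  # drop empty strings left by leading separators
}

# Toy keyword field in the style of the CNKI export (values invented).
toy_keywords <- c("Zheng He; Jufar;;treasure ships", NA, "Zheng He;Malacca;")

keyword_counts <- sort(table(split_multi(toy_keywords)), decreasing = TRUE)
keyword_counts
```
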

The analysis of the keywords allows a number of interesting insights into the recent academic discourse on the missions. Apart from “Zheng He”, “missions”, etc., which are prominent for obvious reasons, the following keywords from the top 20 deserve some attention:

  • Ming Chengzu (Yongle emperor): the emperor (r. 1402-1424) who ordered most of the missions (#7, 159 occurrences)
  • treasure ships: referring to the largest of Zheng He’s ships (#9, 90)
  • “Zheng He spirit”: the assertion that the missions should be seen as having a referential value for China’s present-day maritime foreign policy (openness, peaceful development, etc.) (#10, 72)
  • Zhu Di: the personal name of Ming Chengzu / the Yongle emperor, used prior to his accession to the throne (#13, 59)
  • Columbus: as China’s most famous navigator, Zheng He is frequently compared, and often contrasted, with the figure of Columbus in Western history (Chinese pacifism vs. Western colonialism) (#15, 54)
  • Menzies (Gavin): British author who claims that Zheng He sailed to America, a claim rejected by most scholars (#18, 46)
  • tribute-trade: referring to the political-economic system of Ming China’s foreign relations (#20, 41)

Further analysis of keyword co-occurrence and co-authorship can be done with CiteSpace, which is compatible with the RefWorks export format of CNKI. The picture below shows a keyword co-occurrence visualization (using the example of the keywords co-occurring with “Malacca”, one of the important locations visited by the Zheng He missions), as well as a summary of keyword centrality (on the left).
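
For a quick check without leaving R, raw co-occurrence counts can be computed from the keyword field directly. This is only a sketch of the counting idea on invented toy records; it does not reproduce CiteSpace's centrality measures or its visualization.

```r
# Toy keyword fields, one string per record (values invented for illustration).
toy_keywords <- c(
  "Zheng He;Malacca;tribute-trade",
  "Zheng He;Malacca",
  "Zheng He;treasure ships",
  "Malacca"
)

# For each record, list every unordered pair of distinct keywords.
pairs_per_record <- lapply(strsplit(toy_keywords, ";"), function(k) {
  k <- trimws(k)
  # Radix sorting is locale-independent, so each pair has one fixed spelling.
  k <- sort(unique(k[nzchar(k)]), method = "radix")
  if (length(k) < 2) return(character(0))
  apply(combn(k, 2), 2, paste, collapse = " | ")
})

# Number of records in which each pair occurs together.
cooccurrence <- sort(table(unlist(pairs_per_record)), decreasing = TRUE)
cooccurrence
```
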

3.4 Word/ngram frequencies of titles and abstracts

stop_dict <- stopwords::stopwords(language="zh", source="stopwords-iso")
stopwords_ch <- as.data.frame(stop_dict)
titles1 <- df %>%
  filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation" | Reference_type == "ma_thesis", !is.na(Title)) %>%
  select(Title)

#As I realized, the tidytext package has a Chinese segmentation tool automatically applied
#when calling unnest_tokens() on Chinese text.
tidy1 <- titles1 %>%
  unnest_tokens(word, Title)

clean1 <- tidy1 %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

word_count_titles_since2000_academic <- clean1 %>%
  group_by(word) %>%
  count(sort=T)

#View(word_count_titles_since2000_academic)

set.seed(1234)
wordcloud(word_count_titles_since2000_academic$word, word_count_titles_since2000_academic$n,
          min.freq = 10, rot.per = .15, random.order=FALSE, scale=c(5,2),
          max.words=100, colors=brewer.pal(8, "Dark2"))

The top words (black, pink, orange) of academic titles since 2000 are as follows: “Zheng He, Western Ocean, culture, research, China, navigation, history.” “Western Ocean” is part of the traditional name of the missions (“Zheng He sailing the Western Ocean”). In pre-modern Chinese it is a spatializing term referring to the South China Sea, the Indian Ocean, and territories located along their shores.

#Checking out some non-academic publication types as well
titles_standard <- df %>%
  filter(Reference_type == "standard", !is.na(Title)) %>%
  select(Title)

tidy_standard <- titles_standard %>%
  unnest_tokens(word, Title)

clean_standard <- tidy_standard %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

word_count_titles_since2000_standards <- clean_standard %>%
  group_by(word) %>%
  count(sort=T)

set.seed(1234)
wordcloud(word_count_titles_since2000_standards$word, word_count_titles_since2000_standards$n,
          min.freq = 10, rot.per = .15, random.order=FALSE, scale=c(5,2),
          max.words=100, colors=brewer.pal(8, "Dark2"))

The top words (black, purple, orange) of the titles of standards since 2000 are as follows: “Zheng He, Western Ocean, commemoration, culture, navigation, Taicang (name of the place in China from where the missions started), China, anniversary, activity, 600.” The main topic in standards is apparently the organization of commemorative events on the Zheng He missions, especially the 600th anniversary in 2005.

titles_news <- df %>%
  filter(Reference_type == "newspaper_article", !is.na(Title)) %>%
  select(Title)

tidy_news <- titles_news %>%
  unnest_tokens(word, Title)

clean_news <- tidy_news %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

word_count_titles_since2000_news <- clean_news %>%
  group_by(word) %>%
  count(sort=T)

set.seed(1234)
wordcloud(word_count_titles_since2000_news$word, word_count_titles_since2000_news$n,
          min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
          max.words=100, colors=brewer.pal(8, "Dark2"))

The top words (black, purple, orange) of newspaper titles since 2000 are as follows: “Zheng He, Western Ocean, China, new, [Silk] Roads, culture, maritime, navigation, Silk [Roads], spirit, ocean, history, sea.” Newspapers are apparently oriented towards the political/cultural diplomacy aspects of the topic (e.g. the construction of the “21st Century Maritime Silk Roads”, “maritime culture”, “Zheng He spirit” [of friendly relations with foreign countries], etc.)

bigrams1 <- titles1 %>%
  unnest_tokens(bigram, Title, token="ngrams", n=2) %>%
  count(bigram, sort=TRUE)

bigrams_sep1 <- bigrams1 %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_sep1 %>%
  filter(!word1 %in% stopwords_ch$stop_dict) %>%
  filter(!word2 %in% stopwords_ch$stop_dict)

#View()
trigrams1 <- titles1 %>%
  unnest_tokens(trigram, Title, token="ngrams", n=3) %>%
  count(trigram, sort=TRUE)

trigrams_sep1 <- trigrams1 %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

trigrams_filtered <- trigrams_sep1 %>%
  filter(!word1 %in% stopwords_ch$stop_dict) %>%
  filter(!word2 %in% stopwords_ch$stop_dict) %>%
  filter(!word3 %in% stopwords_ch$stop_dict)

#View()
#For the analysis of the abstracts I wished to see the "range" of each ngram
#(the number of items in which it appears, apart from the basic word/ngram frequency),
#as well as to analyze collocates and keywords, among others.
#[Antconc](https://www.laurenceanthony.net/software/antconc/) has built-in functions
#for all of these, therefore I decided to use that for these parts.

abstracts <- df %>%
  filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation", !is.na(Abstract)) %>%
  select(Abstract)
#Altogether 1812 items.

out_csv = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/abstracts.csv'
write_csv(abstracts, out_csv, col_names = FALSE)
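
To make the "range" measure concrete, here is a minimal base R sketch that computes both frequency and range for bigrams. It assumes documents that are already segmented into space-separated tokens (as a segmenter such as Jieba would produce); the toy documents and the helper name `bigrams_of` are invented, and the sketch is not meant to reproduce Antconc's output exactly.

```r
# Toy documents, already segmented into space-separated tokens.
toy_docs <- c(
  d1 = "zheng he western ocean",
  d2 = "zheng he spirit",
  d3 = "western ocean zheng he zheng he"
)

# All adjacent word pairs of one document.
bigrams_of <- function(text) {
  w <- strsplit(text, " +")[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

per_doc <- lapply(toy_docs, bigrams_of)

# Frequency: every occurrence counts. Range: each document counts once.
freq <- table(unlist(per_doc, use.names = FALSE))
range_n <- table(unlist(lapply(per_doc, unique), use.names = FALSE))

bigram_stats <- data.frame(
  Ngram = names(freq),
  Freq = as.integer(freq),
  Range = as.integer(range_n[names(freq)]),
  stringsAsFactors = FALSE
)
bigram_stats <- bigram_stats[order(-bigram_stats$Range, -bigram_stats$Freq), ]
bigram_stats
```
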

#I renamed the output as a TXT file and segmented it using
#[SegmentAnt](https://www.laurenceanthony.net/software/segmentant/) with the Jieba
#engine. I wrote the abstracts into separate files by using Python (in order to easily
#keep track of "ranges" of ngram occurrences), and used Antconc for the analysis of
#word/ngram frequencies, ngram ranges, collocates/concordance, as well as keywords (using the
#[Lancaster Corpus of Mandarin Chinese/LCMC](https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/)
#as reference corpus for the latter). Antconc does not have a stopwords filtering option,
#so I perform that step in R here.

in_csv = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/bigrams_abstracts_academic.tsv' #Count of bigrams in the abstracts sorted by range
bigrams_abstracts_csv <- read_tsv(in_csv, col_names = TRUE)
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Freq = col_double(),
##   Range = col_double(),
##   Ngram = col_character()
## )
bigrams_abstracts_csv_sep <- bigrams_abstracts_csv %>%
  separate(Ngram, c("word1", "word2"), sep = " ")

bigrams_abstracts_filtered <- bigrams_abstracts_csv_sep %>%
  filter(!word1 %in% stopwords_ch$stop_dict) %>%
  filter(!word2 %in% stopwords_ch$stop_dict)

#View()

outfile_2gramsfilt = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/bigrams_abstracts_academic_stopWordsFiltered.tsv'
write_tsv(bigrams_abstracts_filtered, outfile_2gramsfilt, col_names = TRUE)
#Example of a keywords output from Antconc, LCMC used as reference corpus
keywords_infile = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/keywords_abstracts_academic.tsv'
keywords_example <- read_tsv(keywords_infile, col_names = TRUE)
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Freq = col_double(),
##   `Keyness(LL4)` = col_double(),
##   `Effect(DICE)` = col_double(),
##   Keyword = col_character()
## )
head(keywords_example, n=10)
## # A tibble: 10 x 5
##     Rank  Freq `Keyness(LL4)` `Effect(DICE)` Keyword
##    <dbl> <dbl>          <dbl>          <dbl> <chr>  
##  1     1  3224         12030.         0.0409 <U+90D1><U+548C>   
##  2     2  1707          6267.         0.0219 <U+897F><U+6D0B>   
##  3     3  1039          3836.         0.0134 <U+822A><U+6D77>   
##  4     4  1772          3117.         0.0225 <U+4E2D><U+56FD>   
##  5     5   661          2251.         0.0085 <U+6D77><U+6D0B>   
##  6     6   856          1949.         0.011  <U+6587><U+5316>   
##  7     7  1348          1704.         0.0172 <U+4E0B>     
##  8     8   773          1491.         0.0099 <U+5386><U+53F2>   
##  9     9   385          1396.         0.005  <U+8239><U+961F>   
## 10    10   377          1298.         0.0049 <U+6D77><U+4E0A>
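
For readers unfamiliar with the keyness column: as far as I understand, "LL4" refers to the four-term log-likelihood statistic, which compares a word's frequency in the target corpus with its frequency in the reference corpus over a 2x2 contingency table. The sketch below implements the standard textbook formula; it is an assumption that Antconc computes the value in exactly this way, and the function name and the example numbers are invented.

```r
# Four-term log-likelihood (G2) for one word.
# a: frequency in target corpus, b: frequency in reference corpus,
# c: size of target corpus,      d: size of reference corpus (in tokens).
log_likelihood_4 <- function(a, b, c, d) {
  observed <- matrix(c(a, b,
                       c - a, d - b), nrow = 2, byrow = TRUE)
  # Expected counts if the word were equally frequent in both corpora.
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Cells with an observed count of zero contribute nothing to the sum.
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Same relative frequency in both corpora: the statistic is (numerically) zero.
log_likelihood_4(a = 10, b = 20, c = 100, d = 200)

# Strong over-representation in the target corpus: a large positive value.
log_likelihood_4(a = 50, b = 5, c = 1000, d = 1000)
```
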

Bigram, trigram, and keyword counts show the centrality of terms related to the “21st Century Maritime Silk Roads” and the 600th anniversary celebrations.

3.5 Topic modeling of abstracts

abstracts2 <- df %>%
  filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation", !is.na(Abstract)) %>%
  select(ID_str, Abstract)

abstracts_tidy <- abstracts2 %>%
  unnest_tokens(word, Abstract, drop = FALSE) %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

abstract_dtm <- abstracts_tidy %>%
  count(ID_str, word) %>%
  cast_dtm(ID_str, word, n)

glimpse(abstract_dtm)
## List of 6
##  $ i       : int [1:100168] 1 196 398 1168 1 564 1 178 496 1480 ...
##  $ j       : int [1:100168] 1 1 1 1 2 2 3 3 3 3 ...
##  $ v       : num [1:100168] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 1811
##  $ ncol    : int 13429
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:1811] "0657_journal_article_2000" "0658_journal_article_2000" "0659_journal_article_2000" "0661_journal_article_2000" ...
##   ..$ Terms: chr [1:13429] "<U+4E00><U+756A>" "<U+4E00><U+822C><U+4EBA>" "<U+4E0D><U+5927>" "<U+4E5F><U+6709>" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
tic()

abstracts_lda <- LDA(abstract_dtm, k = 5, control = list(seed = 1234))

abstracts_lda


toc()

glimpse(abstracts_lda)

#save(abstracts_lda, file = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/lda_model.rda")
lda_model_topic_term_prob <- tidy(abstracts_lda, matrix="beta")

topic1 <- lda_model_topic_term_prob %>%
  filter(topic == 1) %>%
  arrange(desc(beta))

topic2 <- lda_model_topic_term_prob %>%
  filter(topic == 2) %>%
  arrange(desc(beta))

topic3 <- lda_model_topic_term_prob %>%
  filter(topic == 3) %>%
  arrange(desc(beta))

topic4 <- lda_model_topic_term_prob %>%
  filter(topic == 4) %>%
  arrange(desc(beta))

topic5 <- lda_model_topic_term_prob %>%
  filter(topic == 5) %>%
  arrange(desc(beta))

out_tm1 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic1.tsv"
write_tsv(topic1, out_tm1, col_names = TRUE)

out_tm2 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic2.tsv"
write_tsv(topic2, out_tm2, col_names = TRUE)

out_tm3 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic3.tsv"
write_tsv(topic3, out_tm3, col_names = TRUE)

out_tm4 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic4.tsv"
write_tsv(topic4, out_tm4, col_names = TRUE)

out_tm5 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic5.tsv"
write_tsv(topic5, out_tm5, col_names = TRUE)
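
The per-topic tables above can also be obtained in one step with `slice_max()`. A minimal sketch on an invented table with the same columns as the `tidy(..., matrix="beta")` output (`topic`, `term`, `beta`):

```r
library(dplyr)

# Toy word-per-topic probabilities (terms and values invented for illustration).
toy_beta <- tibble(
  topic = rep(1:2, each = 3),
  term = c("trade", "port", "ship", "anniversary", "event", "ship"),
  beta = c(0.5, 0.3, 0.2, 0.6, 0.3, 0.1)
)

# The n most probable terms of every topic, in one table.
top_terms <- toy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 2, with_ties = FALSE) %>%
  ungroup()

top_terms
```
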

Roughly the following five topics were identified:

  • Topic 1: The Zheng He missions and maritime trade / economic interaction
  • Topic 2: The 600th anniversary of the missions (2005) and related events
  • Topic 3: The study of the history of Maritime Silk Roads, promotion of “Zheng He spirit”, China’s maritime strategy
  • Topic 4: Zheng He and religions, esp. Islam
  • Topic 5: Zheng He and the Yongle emperor, maritime cartography, Menzies. This topic is somewhat vague: it is apparently a mixture of several topics that might separate if the number of topics were set higher. The other four were well separated, judging by my qualitative research so far.
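
One way to explore whether a higher number of topics would help is to fit models for several values of k and compare their perplexity (lower is better). The sketch below is not part of the analysis reported here. To stay self-contained it uses the `AssociatedPress` example corpus shipped with topicmodels instead of the CNKI abstracts, and it evaluates on the training documents, which tends to favour larger k; a held-out set would be the more careful choice.

```r
library(topicmodels)

# Small public example corpus shipped with topicmodels, used here only
# to keep the sketch self-contained; it is unrelated to the CNKI data.
data("AssociatedPress", package = "topicmodels")
toy_dtm <- AssociatedPress[1:100, ]

# Fit one model per candidate k and record its perplexity.
candidate_k <- c(2, 3, 4)
perplexities <- sapply(candidate_k, function(k) {
  fit <- LDA(toy_dtm, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = toy_dtm)
})

data.frame(k = candidate_k, perplexity = perplexities)
```
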
lda_model_topic_doc_prob <- tidy(abstracts_lda, matrix="gamma")

top_doc1 <- lda_model_topic_doc_prob %>%
  filter(topic == 1) %>%
  arrange(desc(gamma))

top_doc2 <- lda_model_topic_doc_prob %>%
  filter(topic == 2) %>%
  arrange(desc(gamma))

top_doc3 <- lda_model_topic_doc_prob %>%
  filter(topic == 3) %>%
  arrange(desc(gamma))

top_doc4 <- lda_model_topic_doc_prob %>%
  filter(topic == 4) %>%
  arrange(desc(gamma))

top_doc5 <- lda_model_topic_doc_prob %>%
  filter(topic == 5) %>%
  arrange(desc(gamma))

#View()
#Checking out the articles deemed most typical of each topic

top1_joint <- merge(top_doc1, df, by.x="document", by.y="ID_str")
top1_joint <- top1_joint %>%
  arrange(desc(gamma))

top2_joint <- merge(top_doc2, df, by.x="document", by.y="ID_str")
top2_joint <- top2_joint %>%
  arrange(desc(gamma))

top3_joint <- merge(top_doc3, df, by.x="document", by.y="ID_str")
top3_joint <- top3_joint %>%
  arrange(desc(gamma))

top4_joint <- merge(top_doc4, df, by.x="document", by.y="ID_str")
top4_joint <- top4_joint %>%
  arrange(desc(gamma))

top5_joint <- merge(top_doc5, df, by.x="document", by.y="ID_str")
top5_joint <- top5_joint %>%
  arrange(desc(gamma))

The topic-per-document results of the first four topics mostly conform to what was inferred from the word-per-topic counts, although they often indicate a slightly different focus. The most typical articles of topic 2 indicate the importance of discussions of Zheng He’s relation to the cult of the sea goddess Mazu in the broader discourse on “maritime culture” and the “Zheng He spirit”, while the highest-ranking articles in topic 4 indicate a stronger focus on Buddhism than on Islam. Topic 5 apparently lumps together articles that could not be placed in other topics; some of its high-ranking articles are among the few non-Chinese-language articles.
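
A complementary view is to assign every document to its single most probable topic and count the documents per topic, which shows how large each topic is in the corpus. A minimal sketch on an invented table with the same columns as the `tidy(..., matrix="gamma")` output (`document`, `topic`, `gamma`):

```r
library(dplyr)

# Toy topic-per-document probabilities (values invented for illustration).
toy_gamma <- tibble(
  document = rep(c("0001_journal_article_2001",
                   "0002_journal_article_2005",
                   "0003_phd_dissertation_2014"), each = 2),
  topic = rep(1:2, times = 3),
  gamma = c(0.9, 0.1, 0.2, 0.8, 0.6, 0.4)
)

# Keep, for each document, only the topic with the highest gamma.
dominant_topic <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

# Number of documents per dominant topic.
topic_sizes <- dominant_topic %>%
  count(topic)

topic_sizes
```
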

#Writing out into TSV files
out_gamma1 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic1_gamma.tsv"
write_tsv(top1_joint, out_gamma1, col_names = TRUE)

out_gamma2 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic2_gamma.tsv"
write_tsv(top2_joint, out_gamma2, col_names = TRUE)

out_gamma3 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic3_gamma.tsv"
write_tsv(top3_joint, out_gamma3, col_names = TRUE)

out_gamma4 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic4_gamma.tsv"
write_tsv(top4_joint, out_gamma4, col_names = TRUE)

out_gamma5 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic5_gamma.tsv"
write_tsv(top5_joint, out_gamma5, col_names = TRUE)
lda_file = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/lda_model.rda"
load(lda_file)


topicmodels2LDAvis <- function(x, ...){
  post <- topicmodels::posterior(x)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  mat <- x@wordassignments
  LDAvis::createJSON(
    phi = post[["terms"]], 
    theta = post[["topics"]],
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(mat, na.rm = TRUE),
    term.frequency = slam::col_sums(mat, na.rm = TRUE)
  )
}

serVis(topicmodels2LDAvis(abstracts_lda))
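
A note on the helper above: according to the topicmodels documentation, the `wordassignments` slot stores the most probable topic for each observed word, not word counts, so its row and column sums are only a rough proxy for document lengths and term frequencies. The variant below is a hypothetical alternative that takes these two quantities from the document-term matrix itself; the function name `lda_to_ldavis_json` is my own, and the variant has not been run on the project data.

```r
# Hypothetical variant of the helper above: document lengths and term
# frequencies are taken from the document-term matrix, not from the model.
lda_to_ldavis_json <- function(lda_model, dtm) {
  post <- topicmodels::posterior(lda_model)
  if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
  # Guard against a matrix whose vocabulary does not match the model.
  stopifnot(identical(colnames(dtm), colnames(post[["terms"]])))
  LDAvis::createJSON(
    phi = post[["terms"]],
    theta = post[["topics"]],
    vocab = colnames(post[["terms"]]),
    doc.length = slam::row_sums(dtm, na.rm = TRUE),
    term.frequency = slam::col_sums(dtm, na.rm = TRUE)
  )
}

# Usage, assuming abstracts_lda was fitted on abstract_dtm as above:
# LDAvis::serVis(lda_to_ldavis_json(abstracts_lda, abstract_dtm))
```
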

4 Analysis of selected corpora

4.1 The 50 most cited articles of the last 20 years

The following selected corpus includes the 50 most cited articles of the last 20 years (CNKI, as of 2020-06-28; #1: 115 citations, #50: 13 citations).

#Getting the titles out of the file exported from the citation numbers-filtered CNKI
#search and merging them with the titles in the original data frame of the project.
cit_list_file = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_cited/most_cited_list.tsv'
cit_list <- read_tsv(cit_list_file, col_names = TRUE)
## Parsed with column specification:
## cols(
##   Item = col_character()
## )
df_cit <- merge(cit_list, df, by.x = "Item", by.y = "Title")

glimpse(df_cit)
## Rows: 50
## Columns: 12
## $ Item             <chr> "?1100kV<U+7279><U+9AD8><U+538B><U+76F4><U+6D41><U+7CFB><U+7EDF><U+8BD5><U+9A8C><U+65B9><U+6848><U+7814><U+7A76>", "“<U+4E00><U+5E26><U+4E00><U+8DEF>”<U+4E0E><U+4E9A><U+975E><U+6218><U+7565><U+5408><U+4F5C><U+4E2D><U+7684>“<U+5B97><U+6559><U+56E0><U+7D20>”", "...
## $ ID_num           <chr> "2495", "2510", "2606", "2546", "0680", "2066", "1...
## $ ID_str           <chr> "2495_journal_article_2015", "2510_journal_article...
## $ Reference_type   <chr> "journal_article", "journal_article", "journal_art...
## $ Language         <chr> "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "...
## $ Author           <chr> "<U+6768><U+4E07><U+5F00>;<U+5370><U+6C38><U+534E>;<U+73ED><U+8FDE><U+5E9A>;<U+66FE><U+5357><U+8D85>;", "<U+9A6C><U+4E3D><U+84C9>;", "<U+9A6C><U+4E3D><U+84C9>;", "<U+5468><U+7490><U+94ED>", "<U+8D75><U+541B><U+5C27>", ...
## $ Author_address   <chr> "<U+4E2D><U+56FD><U+7535><U+529B><U+79D1><U+5B66><U+7814><U+7A76><U+9662>;", "<U+4E0A><U+6D77><U+5916><U+56FD><U+8BED><U+5927><U+5B66><U+4E2D><U+4E1C><U+7814><U+7A76><U+6240>;", "<U+4E0A><U+6D77><U+5916><U+56FD><U+8BED><U+5927><U+5B66><U+4E2D><U+4E1C><U+6240>;", NA, ...
## $ Publishing_place <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Publisher        <chr> "<U+7535><U+7F51><U+6280><U+672F>", "<U+897F><U+4E9A><U+975E><U+6D32>", "<U+4E16><U+754C><U+5B97><U+6559><U+6587><U+5316>", "<U+4E2D><U+5171><U+4E2D><U+592E><U+515A><U+6821>", "<U+804C><U+5927><U+5B66><U+62A5>", "<U+5357><U+4EAC><U+5E08><U+8303><U+5927>...
## $ Year             <int> 2015, 2015, 2015, 2015, 2000, 2012, 2005, 2006, 20...
## $ Keywords         <chr> "±1100 kV<U+7279><U+9AD8><U+538B><U+76F4><U+6D41><U+793A><U+8303><U+5DE5><U+7A0B>;±800 kV<U+7279><U+9AD8><U+538B><U+7CFB><U+7EDF><U+8BD5><U+9A8C>;<U+5206><U+5C42><U+63A5><U+5165>;<U+8BD5><U+9A8C><U+65B9><U+6848>", "<U+5B97><U+6559><U+5916>...
## $ Abstract         <chr> "<U+5206><U+6790><U+4E86><U+51C6><U+4E1C>—<U+7696><U+5357>?1100 k V<U+7279><U+9AD8><U+538B><U+76F4><U+6D41><U+793A><U+8303><U+5DE5><U+7A0B><U+7684><U+7279><U+70B9><U+4EE5><U+53CA><U+54C8><U+90D1><U+548C><U+6EAA><U+6D59>?800 k V<U+76F4><U+6D41><U+5DE5><U+7A0B><U+73B0>...
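
Because the merge relies on exact title matches, it can silently drop items whose titles differ slightly between the two exports, or duplicate items that share a title. A minimal sketch of a sanity check on toy data follows; the data frames toy_list and toy_df and their contents are invented for illustration.

```r
library(dplyr)

# Invented stand-ins for cit_list and df
toy_list <- tibble(Item = c("Title A", "Title B", "Title C"))
toy_df <- tibble(Title = c("Title A", "Title B", "Title B", "Title D"),
                 Year = c(2005L, 2010L, 2011L, 2015L))

# Titles in the list that have no counterpart in the main data frame
unmatched <- anti_join(toy_list, toy_df, by = c("Item" = "Title"))

# Titles occurring more than once in the main data frame; each of them
# would produce several rows in the merged result
duplicated_titles <- toy_df %>%
  count(Title) %>%
  filter(n > 1)

unmatched
duplicated_titles
```

If both checks come back empty for the real data, the merged frame should contain exactly one row per listed title.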
reftype_cit <- df_cit %>%
  group_by(Reference_type) %>%
  count(sort=T)

reftype_cit
## # A tibble: 4 x 2
## # Groups:   Reference_type [4]
##   Reference_type          n
##   <chr>               <int>
## 1 journal_article        31
## 2 ma_thesis              10
## 3 phd_dissertation        8
## 4 dissertation_thesis     1
pub_years_cit <- df_cit %>%
  group_by(Year) %>%
  count(sort=TRUE)

ggplot(pub_years_cit, aes(x=Year, y=n)) +
  geom_line() +
  scale_x_continuous(breaks=seq(2000,2020,1)) +
  labs(title="Number of articles by year (50 most cited)")
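
Since count() only returns years that actually occur in the subset, geom_line() connects neighbouring years directly and a year without any item is not drawn as zero. A sketch of one way to make such gaps explicit with tidyr::complete(), using invented counts:

```r
library(dplyr)
library(tidyr)

# Invented yearly counts with gaps (no items in 2001 and 2003)
toy_years <- tibble(Year = c(2000L, 2002L, 2004L), n = c(1L, 3L, 2L))

# Add the missing years with a count of zero
toy_years_full <- toy_years %>%
  complete(Year = 2000:2004, fill = list(n = 0L))

toy_years_full
```

The data frame produced by group_by() followed by count() is still grouped by Year, so an ungroup() would be needed before complete() when applying this to the real counts.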

publishers_cit <- df_cit %>%
  group_by(Publisher) %>%
  count(sort=T)

#View()
authorAddresses_cit <- df_cit %>%
  group_by(Author_address) %>%
  count(sort=T)

#View()
authors_cit <- df_cit$Author

# Join all author fields into one string, skipping missing values
string_authors_cit <- paste(authors_cit[!is.na(authors_cit)], collapse = ";")

# Split on runs of semicolons/commas, trim whitespace, drop empty strings
authors_cit_vec <- str_trim(unlist(strsplit(string_authors_cit, "[;,]+")))
authors_cit_vec <- authors_cit_vec[authors_cit_vec != ""]

# One author per row; the column keeps the name V1
authors_cit_df <- data.frame(V1 = authors_cit_vec, stringsAsFactors = FALSE)

authors_cit_count <- authors_cit_df %>%
  group_by(V1) %>%
  count(sort=T)

#View()
keywords_cit <- df_cit$Keywords

# Join all keyword fields into one string, skipping missing values
string_keywords_cit <- paste(keywords_cit[!is.na(keywords_cit)], collapse = ";")

# Split on runs of semicolons/commas, trim whitespace, drop empty strings
keywords_cit_vec <- str_trim(unlist(strsplit(string_keywords_cit, "[;,]+")))
keywords_cit_vec <- keywords_cit_vec[keywords_cit_vec != ""]

# One keyword per row; the column keeps the name V1
keywords_cit_df <- data.frame(V1 = keywords_cit_vec, stringsAsFactors = FALSE)

keywords_cit_count <- keywords_cit_df %>%
  group_by(V1) %>%
  count(sort=T)

#View()
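
The author and keyword fields above pack several values into one string, separated by semicolons or commas. A sketch of an alternative that keeps one value per row with tidyr::separate_rows(), on invented data; the keywords in toy are placeholders.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Invented keyword field with trailing separators, stray spaces and an NA
toy <- tibble(Keywords = c("Zheng He;maritime culture;",
                           "Zheng He; tribute,trade",
                           NA))

toy_keyword_count <- toy %>%
  filter(!is.na(Keywords)) %>%                 # skip items without keywords
  separate_rows(Keywords, sep = "[;,]+") %>%   # one keyword per row
  mutate(Keywords = str_trim(Keywords)) %>%    # remove surrounding spaces
  filter(Keywords != "") %>%                   # drop empty fragments
  count(Keywords, sort = TRUE)

toy_keyword_count
```

Working on the column directly keeps each keyword attached to its row, so other columns such as Year remain available for later grouping.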
titles_cit <- df_cit %>%
  filter(!is.na(Item)) %>%
  select(Item)

tidy_cit <- titles_cit %>%
  unnest_tokens(word, Item)

clean_cit <- tidy_cit %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

word_count_titles_since2000_cit <- clean_cit %>%
  group_by(word) %>%
  count(sort=T)
#View()

set.seed(1234)
wordcloud(word_count_titles_since2000_cit$word, word_count_titles_since2000_cit$n,
          min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
          max.words=100, colors=brewer.pal(8, "Dark2"))

The top words (black, green, pink, purple) of the titles of the 50 most cited articles are as follows: “research, China, Zheng He, Western Ocean, history, strategy, Ming Dynasty, ocean, tribute (likely a reference to the pre-modern ‘tributary system’ of foreign relations), Asia.”
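
Reading the top words off the colours of a word cloud is approximate. A sketch of how the same ranking could be pulled out as a table with dplyr::slice_max(), using invented counts and English glosses in place of word_count_titles_since2000_cit:

```r
library(dplyr)

# Invented word counts (English glosses instead of the Chinese tokens)
toy_counts <- tibble(word = c("research", "China", "Zheng He", "ocean", "tribute"),
                     n = c(12L, 9L, 9L, 4L, 2L))

# Keep the three most frequent words; ties at the cut-off are retained
top_words <- toy_counts %>%
  ungroup() %>%
  slice_max(n, n = 3, with_ties = TRUE)

top_words
```

The ungroup() call has no effect on the toy data but matters for the real counts, which are still grouped by word after group_by() and count().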

abstracts_cit <- df_cit %>%
  filter(!is.na(Abstract)) %>%
  select(Abstract)

tidy_cit_abs <- abstracts_cit %>%
  unnest_tokens(word, Abstract)

clean_cit_abs <- tidy_cit_abs %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

word_count_abstracts_since2000_cit <- clean_cit_abs %>%
  group_by(word) %>%
  count(sort=T)
#View()

set.seed(1234)
wordcloud(word_count_abstracts_since2000_cit$word, word_count_abstracts_since2000_cit$n,
          min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
          max.words=100, colors=brewer.pal(8, "Dark2"))

The top words (black, pink, purple) of the abstracts of the 50 most highly cited articles are as follows: “culture, China, foreign [~relations], strategy, history, Zheng He, development.” The prominence of “culture” is notable throughout the discourse and is especially pronounced here.

bigrams_cit <- abstracts_cit %>%
  unnest_tokens(bigram, Abstract, token="ngrams", n=2) %>%
  count(bigram, sort=TRUE)

bigrams_cit <- bigrams_cit %>%
  separate(bigram, c("word1", "word2"), sep = " ")

abstracts_cit_bigrams_filtered <- bigrams_cit %>%
  filter(!word1 %in% stopwords_ch$stop_dict) %>%
  filter(!word2 %in% stopwords_ch$stop_dict)

#View()

#outfile_2gramsfilt_2 = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_cited/cited_abs_bigrams.tsv'
#write_tsv(abstracts_cit_bigrams_filtered, outfile_2gramsfilt_2, col_names = TRUE)
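
The bigram step tokenises each abstract into overlapping word pairs, splits each pair into two columns, and keeps only the pairs in which neither word is a stop word. A small sketch on an invented English sentence; toy_abs and toy_stop are made up and stand in for abstracts_cit and stopwords_ch.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Invented one-sentence "abstract" and a two-word stop list
toy_abs <- tibble(Abstract = "the voyages of Zheng He shaped the maritime culture of the Ming")
toy_stop <- tibble(stop_dict = c("the", "of"))

toy_bigrams <- toy_abs %>%
  unnest_tokens(bigram, Abstract, token = "ngrams", n = 2) %>%  # word pairs
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%          # split pairs
  filter(!word1 %in% toy_stop$stop_dict,                        # drop pairs with
         !word2 %in% toy_stop$stop_dict)                        # a stop word

toy_bigrams
```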

Co-citation networks can also be generated directly on CNKI.

4.2 The 50 most downloaded articles of the last 20 years

The following selected corpus includes the 50 most downloaded articles of the last 20 years (CNKI, as of 2020-06-28; #1: 8311 downloads, #50: 1281 downloads). The analysis starts by identifying the number of matches between the lists of the most cited and most downloaded articles of the last 20 years. After that, the procedure is identical to the one above.

down_list_file = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_downloaded/most_downloaded_list.tsv'
down_list <- read_tsv(down_list_file, col_names = TRUE)
## Parsed with column specification:
## cols(
##   Item = col_character()
## )
df_down <- merge(down_list, df, by.x = "Item", by.y = "Title")

glimpse(df_down)
## Rows: 50
## Columns: 12
## $ Item             <chr> "“一带一路”与亚非战略合作中的“宗教因素”", "“21世纪海上丝绸之路”对海洋强国战略的影响研究...
## $ ID_num           <chr> "2510", "2563", "1932", "2417", "2546", "2691", "1...
## $ ID_str           <chr> "2510_journal_article_2015", "2563_ma_thesis_2015"...
## $ Reference_type   <chr> "journal_article", "ma_thesis", "phd_dissertation"...
## $ Language         <chr> "中文;", "中文;", "中文;", "中文;", "中文;", "中文;", "中文;", "...
## $ Author           <chr> "马丽蓉;", "赵泓博", "谢茜", "杜拉(D.Sarananda)", "周璐铭", "穆罕...
## $ Author_address   <chr> "上海外国语大学中东研究所;", NA, NA, NA, NA, NA, NA, "武汉理工大学,武...
## $ Publishing_place <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Publisher        <chr> "西亚非洲", "西北师范大学", "武汉大学", "华中师范大学", "中共中央党校", "扬州大...
## $ Year             <int> 2015, 2015, 2010, 2014, 2015, 2016, 2007, 2006, 20...
## $ Keywords         <chr> "宗教外交;“一带一路”;亚非合作;全球治理; Religious Diplomacy;;\"One...
## $ Abstract         <chr> "宗教外交自古以来就是中国开展亚非合作重要的外交形态之一。郑和包容性的宗教外交,不仅开辟了明朝与30...
cit_down_intersect <- intersect(cit_list, down_list)
#View()
#22 articles are both in the most cited and most downloaded lists
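
With dplyr loaded, intersect() compares whole rows of two data frames, so for single-column title lists it returns the titles present in both. A sketch on invented titles that also reports the overlap as a share of the list:

```r
library(dplyr)

# Invented title lists standing in for cit_list and down_list
toy_cit <- tibble(Item = c("A", "B", "C", "D"))
toy_down <- tibble(Item = c("C", "D", "E", "F"))

# Rows (titles) that appear in both lists
toy_overlap <- intersect(toy_cit, toy_down)

# Overlap as a share of the most cited list
overlap_share <- nrow(toy_overlap) / nrow(toy_cit)

toy_overlap
overlap_share
```

For the real lists, the 22 shared titles amount to 22/50 = 44% of each list.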
reftype_down <- df_down %>%
  group_by(Reference_type) %>%
  count(sort=T)

reftype_down
## # A tibble: 4 x 2
## # Groups:   Reference_type [4]
##   Reference_type          n
##   <chr>               <int>
## 1 journal_article        21
## 2 phd_dissertation       15
## 3 ma_thesis              13
## 4 dissertation_thesis     1

PhD dissertations apparently enjoy high popularity among users of CNKI: they make up 15 of the 50 most downloaded items (30%), a share far above their share in the entire corpus (df).
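
The comparison behind this observation can be made explicit by computing the share of each reference type in the subset and in the full corpus. A sketch on invented data; toy_subset and toy_corpus stand in for df_down and df, and share_by_type is a hypothetical helper.

```r
library(dplyr)

# Invented corpora: the subset over-represents PhD dissertations
toy_corpus <- tibble(Reference_type = c(rep("journal_article", 8),
                                        rep("phd_dissertation", 1),
                                        rep("ma_thesis", 1)))
toy_subset <- tibble(Reference_type = c(rep("journal_article", 2),
                                        rep("phd_dissertation", 2)))

# Hypothetical helper: count items per reference type and add their share
share_by_type <- function(data) {
  data %>%
    count(Reference_type) %>%
    mutate(share = n / sum(n))
}

# Shares side by side; types absent from one corpus show up as NA there
comparison <- full_join(share_by_type(toy_subset), share_by_type(toy_corpus),
                        by = "Reference_type", suffix = c("_subset", "_corpus"))

comparison
```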

pub_years_down <- df_down %>%
  group_by(Year) %>%
  count(sort=TRUE)

ggplot(pub_years_down, aes(x=Year, y=n)) +
  geom_line() +
  scale_x_continuous(breaks=seq(2000,2020,1)) +
  labs(title="Number of articles by year (50 most downloaded)")

publishers_down <- df_down %>%
  group_by(Publisher) %>%
  count(sort=T)

#View()
authorAddresses_down <- df_down %>%
  group_by(Author_address) %>%
  count(sort=T)

#View()
authors_down <- df_down$Author

# Join all author fields into one string, skipping missing values
string_authors_down <- paste(authors_down[!is.na(authors_down)], collapse = ";")

# Split on runs of semicolons/commas, trim whitespace, drop empty strings
authors_down_vec <- str_trim(unlist(strsplit(string_authors_down, "[;,]+")))
authors_down_vec <- authors_down_vec[authors_down_vec != ""]

# One author per row; the column keeps the name V1
authors_down_df <- data.frame(V1 = authors_down_vec, stringsAsFactors = FALSE)

authors_down_count <- authors_down_df %>%
  group_by(V1) %>%
  count(sort=T)

#View()
keywords_down <- df_down$Keywords

# Join all keyword fields into one string, skipping missing values
string_keywords_down <- paste(keywords_down[!is.na(keywords_down)], collapse = ";")

# Split on runs of semicolons/commas, trim whitespace, drop empty strings
keywords_down_vec <- str_trim(unlist(strsplit(string_keywords_down, "[;,]+")))
keywords_down_vec <- keywords_down_vec[keywords_down_vec != ""]

# One keyword per row; the column keeps the name V1
keywords_down_df <- data.frame(V1 = keywords_down_vec, stringsAsFactors = FALSE)

keywords_down_count <- keywords_down_df %>%
  group_by(V1) %>%
  count(sort=T)

#View()
titles_down <- df_down %>%
  filter(!is.na(Item)) %>%
  select(Item)

tidy_down <- titles_down %>%
  unnest_tokens(word, Item)

clean_down <- tidy_down %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

word_count_titles_since2000_down <- clean_down %>%
  group_by(word) %>%
  count(sort=T)

#View()

set.seed(1234)
wordcloud(word_count_titles_since2000_down$word, word_count_titles_since2000_down$n,
          min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
          max.words=100, colors=brewer.pal(8, "Dark2"))

The top words (black, brown, orange, pink, purple) of titles of the 50 most downloaded articles are as follows: “research, Western Ocean, Zheng He, China, culture, Ming era, development, history, Ming Dynasty, ocean, tribute, trade.”

abstracts_down <- df_down %>%
  filter(!is.na(Abstract)) %>%
  select(Abstract)

tidy_down_abs <- abstracts_down %>%
  unnest_tokens(word, Abstract)

clean_down_abs <- tidy_down_abs %>%
  anti_join(stopwords_ch, by= c("word" = "stop_dict"))

word_count_abstracts_since2000_down <- clean_down_abs %>%
  group_by(word) %>%
  count(sort=T)
#View()

set.seed(1234)
wordcloud(word_count_abstracts_since2000_down$word, word_count_abstracts_since2000_down$n,
          min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
          max.words=100, colors=brewer.pal(8, "Dark2"))

The top words (black, green, pink, purple) of the abstracts of the 50 most downloaded articles are as follows: “culture, China, development, history, strategy, research, Western Ocean, Zheng He, foreign [relations], trade, country, influence”.
The prominence of discussions of “strategy” (~present-day foreign policy) is also notable in the most cited and most downloaded articles.

bigrams_down <- abstracts_down %>%
  unnest_tokens(bigram, Abstract, token="ngrams", n=2) %>%
  count(bigram, sort=TRUE)

bigrams_down <- bigrams_down %>%
  separate(bigram, c("word1", "word2"), sep = " ")

abstracts_down_bigrams_filtered <- bigrams_down %>%
  filter(!word1 %in% stopwords_ch$stop_dict) %>%
  filter(!word2 %in% stopwords_ch$stop_dict)

#View()

#outfile_2gramsfilt_3 = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_downloaded/downloaded_abs_bigrams.tsv'
#write_tsv(abstracts_down_bigrams_filtered, outfile_2gramsfilt_3, col_names = TRUE)

5 Conclusion

The present project is a valuable contribution to my Ph.D. dissertation The Zheng He Missions: A Critical Discourse Analysis of Recent Mainland Chinese Historiography. It provides quantitative data for the dissertation, reinforcing many of my previous observations based on qualitative analysis and pointing to aspects of the discourse that have so far been less investigated. As the results show, the discourse on the Zheng He missions goes beyond discussing the historical events themselves: the missions are widely seen as having referential value for China’s present-day and future foreign policy. The evidence for this is the high frequency of terms connected to the Chinese state’s maritime development strategy “21st Century Silk Roads” (part of the larger Belt & Road Initiative, BRI, a.k.a. OBOR), as well as of concepts such as “maritime culture” (rediscovering China’s traditions as a maritime power) or “Zheng He spirit” (of openness and peaceful development). The 600th anniversary of the first mission, marked in 2005, played a central role in the recent discourse and prompted the highest number of publications on the topic.

The present project serves as an example not only of how digital methods can be used in Chinese studies, but also of how valuable they are as tools in discourse studies in general. In the 21st century, discourse analysis cannot remain a merely qualitative method but has to adapt to the technological opportunities of the present era. Throughout my Ph.D. dissertation I wish to further advance the combination of qualitative and quantitative methods for the study of Chinese discourses, for which the present project provides a valuable starting point.