The present project was created as an assignment for the course “Historical Inquiries with R”, taught by Prof. Maxim Romanov during the spring semester of 2020 at the University of Vienna. It is part of my Ph.D. dissertation, which has the working title The Zheng He Missions: A Critical Discourse Analysis of Recent Mainland Chinese Historiography. The seven large-scale maritime missions ordered by China’s Ming dynasty and led by the imperial admiral Zheng He between 1405 and 1433 CE included visits to a number of port cities in Southeast Asia, South Asia, the Middle East, and East Africa. In the early 20th century, Chinese historians rediscovered Zheng He’s missions as a symbol of the country’s past maritime might, especially in the context of anti-colonial struggles. Since the ‘Reform and Opening-up’ (1978-) of the post-Mao era, the mostly peaceful missions have often served as a symbol of ‘peaceful rise’, economic openness and ‘win-win’ cooperation with the outside world, principles propagated by the country’s leadership. In my Ph.D. dissertation I investigate the historiography on the missions during the last two decades in the light of China’s growing self-perception as a (re)emerging regional and global power. The present project is an analysis of publication data (incl. abstracts) for 3,467 items related to the missions, downloaded from China’s largest academic database, the China National Knowledge Infrastructure (CNKI).
library(knitr)
library(yaml)
library(tinytex)
library(dplyr)
library(stringr)
library(NLP)
library(ggplot2)
library(tidyr)
library(xlsx)
library(tidytext)
library(stopwords)
library(readr)
library(tm)
library(topicmodels)
library(tictoc)
library(wordcloud)
library(LDAvis)
library(slam)
library(servr)
The data used for the project was exported from CNKI as a TXT file structured according to the RefWorks format. The original TXT file was reformatted into a TSV file using RegEx in Sublime Text and Python (for the original search terms and the reformatting process, see Hompot_Zheng_He_project_TXTtoTSV.html).
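The reformatting itself was done outside R, but the same step can be sketched in R. The following is a minimal illustration only, assuming that each record in the export consists of lines that start with a two-character RefWorks tag (e.g. RT, A1, T1, YR) followed by a space, and that records are separated by single blank lines. The sample text and the function name `parse_refworks` are made up for the example; this is not the procedure actually used for the project.

```r
# Hypothetical sample in a RefWorks-like tagged layout (not taken from the real export)
sample_txt <- c(
  "RT Journal Article",
  "A1 Author One;Author Two;",
  "T1 First example title",
  "YR 2005",
  "",
  "RT Dissertation/Thesis",
  "A1 Author Three",
  "T1 Second example title",
  "YR 2012"
)

parse_refworks <- function(lines) {
  # Blank lines separate records: number the records by counting blank lines so far
  rec_id <- cumsum(lines == "") + 1
  keep <- lines != ""
  lines <- lines[keep]
  rec_id <- rec_id[keep]
  # Split each line into its two-character tag and the value after the space
  tag <- substr(lines, 1, 2)
  value <- substring(lines, 4)
  tags <- unique(tag)
  # One row per record, one column per tag; tags missing from a record stay NA.
  # If a tag is repeated within a record, the last value wins in this sketch.
  out <- matrix(NA_character_, nrow = max(rec_id), ncol = length(tags),
                dimnames = list(NULL, tags))
  out[cbind(rec_id, match(tag, tags))] <- value
  as.data.frame(out, stringsAsFactors = FALSE)
}

records <- parse_refworks(sample_txt)
records

# Writing with tab separators produces a TSV file
write.table(records, file = tempfile(fileext = ".tsv"), sep = "\t",
            quote = FALSE, row.names = FALSE, fileEncoding = "UTF-8")
```

In a real export the tag columns would still need to be renamed (e.g. to Reference_type, Title, Year) before further use.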
#Adjustments in Sublime: Reference_type entries decapitalized, underscores added,
#the unnecessary initial string "<正>" removed from a number of abstracts.
tsv_file = read.csv("C:/Users/Siming/Desktop/siming/classes/R/AA_project/PRO_ZHENG_HE1.csv", sep="\t", encoding="UTF-8", header=TRUE, na.strings=c("","NA"))
glimpse(tsv_file)
## Rows: 3,466
## Columns: 11
## $ Reference_type <chr> "journal_article", "journal_article", "journal_art...
## $ Ref_subtype <chr> NA, NA, NA, NA, "<U+7855><U+58EB>", "<U+7855><U+58EB>", NA, NA, NA, NA, "<U+7855><U+58EB>", ...
## $ Title <chr> "<U+62C9><U+65AF><U+6D77><U+9A6C><U+963F><U+5C14><U+9A6C><U+5854><U+592B><U+9057><U+5740>2019<U+5E74><U+8003><U+53E4><U+6536><U+83B7>", "<U+90D1><U+548C><U+4E0B><U+897F><U+6D0B><U+5BF9><U+6211><U+56FD><U+519C><U+4E1A><U+76F8><U+5173><U+7ECF><U+6D4E><U+7684><U+5F71><U+54CD>", "<U+90D1><U+548C><U+4E0B><U+897F>...
## $ Language <chr> "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "...
## $ Author <chr> "<U+7FDF><U+6BC5>;<U+90A2><U+589E><U+9510>;<U+4E01><U+94F6><U+5FE0>;<U+738B><U+5149><U+5C27>;<U+5F20><U+7136>;<U+5F20><U+5E0C><U+7AE0>;<U+82CF><U+5929><U+654F>;<U+5F6D><U+535A>;<U+738B><U+601D><U+9633>;<U+5F20><U+5B87><U+4EAE>;<U+80E1><U+5B50><U+5C27>;<U+80E1><U+5146><U+8F89>;", "...
## $ Author_address <chr> "<U+6545><U+5BAB><U+535A><U+7269><U+9662>;<U+963F><U+8054><U+914B><U+62C9><U+65AF><U+6D77><U+9A6C><U+53E4><U+7269><U+4E0E><U+535A><U+7269><U+9986><U+90E8>;<U+82F1><U+56FD><U+675C><U+4F26><U+5927><U+5B66><U+8003><U+53E4><U+7CFB>;<U+5409><U+6797><U+5927><U+5B66><U+8003><U+53E4><U+5B66><U+9662>;", "<U+5317><U+4EAC><U+4E2D><U+519C><U+5BCC>...
## $ Publishing_place <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Publisher <chr> "<U+6545><U+5BAB><U+535A><U+7269><U+9662><U+9662><U+520A>", "<U+7518><U+8083><U+519C><U+4E1A>", "<U+56FD><U+5BB6><U+822A><U+6D77>", "<U+4E2D><U+56FD><U+53F2><U+7814><U+7A76><U+52A8><U+6001>", "<U+5357><U+4EAC><U+5E08><U+8303><U+5927><U+5B66>", "<U+9655>...
## $ Year <int> 2020, 2020, 2018, 2012, 2012, 2012, 2012, 2012, 20...
## $ Keywords <chr> "<U+6731><U+5C14><U+6CD5>;<U+963F><U+5C14><U+9A6C><U+5854><U+592B>;<U+660E><U+521D><U+5FA1><U+7A91><U+4EA7><U+54C1>;<U+90D1><U+548C>; Jufar;;Al-Mataf;;early Ming ...
## $ Abstract <chr> "2019<U+5E74>11<U+6708>-12<U+6708>,<U+6545><U+5BAB><U+535A><U+7269><U+9662><U+8003><U+53E4><U+7814><U+7A76><U+6240><U+548C><U+963F><U+8054><U+914B><U+62C9><U+65AF><U+6D77><U+9A6C><U+53E4><U+7269><U+4E0E><U+535A><U+7269><U+9986><U+90E8><U+3001><U+82F1><U+56FD><U+675C><U+4F26><U+5927><U+5B66><U+8003><U+53E4><U+7CFB><U+53CA>...
df <- tsv_file %>%
arrange(Year)
#Adding ID columns, ID_str including a unique id, reference type, and publication year.
df$ID_num <- str_pad(seq(nrow(df)), 4, pad="0")
df <- df %>%
mutate(ID_str = paste0(ID_num, "_", Reference_type, "_", Year))
df <- df %>%
select(ID_num, ID_str, Reference_type, Title, Language, Author, Author_address, Publishing_place, Publisher, Year, Keywords, Abstract)
In the following section, I summarize publication years, and the number and ratio of various types of items within the corpus (journal articles, dissertations, etc.).
pub_years <- df %>%
group_by(Year) %>%
count(sort=TRUE)
pub_years
## # A tibble: 56 x 2
## # Groups: Year [56]
## Year n
## <int> <int>
## 1 2005 497
## 2 NA 473
## 3 2014 173
## 4 2015 157
## 5 2006 156
## 6 2013 126
## 7 2012 107
## 8 2011 105
## 9 1985 100
## 10 2004 99
## # ... with 46 more rows
ggplot(pub_years, aes(x=Year, y=n)) +
geom_line() +
scale_x_continuous(breaks=seq(1930,2020,5)) +
labs(title="Number of articles by year")
reftype_n <- df %>%
group_by(Reference_type) %>%
count(sort=T)
reftype_n
## # A tibble: 10 x 2
## # Groups: Reference_type [10]
## Reference_type n
## <chr> <int>
## 1 journal_article 2450
## 2 standard 472
## 3 conference_proceeding 286
## 4 ma_thesis 120
## 5 newspaper_article 89
## 6 phd_dissertation 26
## 7 patent 19
## 8 other 2
## 9 dissertation_thesis 1
## 10 <NA> 1
#Checking what kind of items are referenced as 'standard' and 'patent'.
#Since RStudio does not properly display Chinese text here (it shows Unicode code points
#in the format <U+xxxx> instead), I wrote most results into separate data sets and opened
#them from the 'Data' list to view them (equivalent to the View() function).
standard_titles <- df %>%
filter(Reference_type == "standard") %>%
select(Title)
patent_titles <- df %>%
filter(Reference_type == "patent") %>%
select(Title)
‘Standards’ are apparently mostly related to the organization of commemorative events on the missions, as well as to the establishment of Zheng He-themed touristic sites. ‘Patents’ are mostly related to replicas of Zheng He’s ships.
reftype_year <- df %>%
group_by(Year, Reference_type) %>%
count()
reftype_year <- reftype_year %>%
filter(!is.na(Year)) %>%
filter(!is.na(Reference_type))
#reftype_year_spread <- spread(reftype_year, Reference_type, n)
#reftype_year_spread <- reftype_year_spread %>%
#select(journal_article, standard, conference_proceeding, ma_thesis, newspaper_article, phd_dissertation)
ggplot(reftype_year, aes(x=Year, y=n)) +
geom_line(aes(color=Reference_type)) +
scale_x_continuous(breaks=seq(1930,2020,10))
The dataset shows that journal articles are by far the most common type of publication on the Zheng He missions. The number of publications went through several spikes during the last decades. The first one, around 1985, is apparently related to the commemoration of the 580th anniversary of the first mission (launched in 1405 CE), which was strongly promoted by the Chinese state at the time, in the context of China’s recently launched post-Mao ‘Reform and Opening-up Policy’ (1978-). The largest spike, in 2005, is apparently related to the 600th anniversary of the first mission, which was also widely commemorated. A third spike can be seen around 2014, which is likely to be related to the inauguration of the “21st Century Maritime Silk Roads” initiative (part of the Chinese state’s global development strategy known as the Belt & Road Initiative, BRI, earlier known as “One Belt, One Road”, OBOR). In the last few years, the number of articles has dropped. I believe this is likely to be related to the rising popularity of other keywords (esp. Belt & Road Initiative, Maritime Silk Roads) in publications with overlapping topics.
#Checking the numbers of items other than journal articles (also removing "other" and
#the unclassified "dissertation_thesis", which have marginal values only)
reftype_year2 <- reftype_year %>%
filter(Reference_type != "journal_article") %>%
filter(Reference_type != "other") %>%
filter(Reference_type != "dissertation_thesis")
ggplot(reftype_year2, aes(x=Year, y=n)) +
geom_line(aes(color=Reference_type)) +
scale_x_continuous(breaks=seq(1930,2020,10))
lang_count <- df %>%
group_by(Language) %>%
count(sort=T)
The language count is as follows: Chinese: 2952 (“Chinese;” 2863 + “Chinese” 89), NA: 510, English: 2, Chinese-English: 1, Other: 1
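The two spellings of the Chinese label (with and without a trailing semicolon) could also be merged before counting. A minimal sketch on a made-up vector, assuming the only difference between the variants is the trailing ";". The values below are illustrative labels, not the real column.

```r
# Toy stand-in for df$Language (illustrative values only)
language_raw <- c("中文;", "中文", "中文;", NA, "英文;", "其他;")

# Drop one or more trailing semicolons so that both spellings collapse into one label
language_clean <- sub(";+$", "", language_raw)

# useNA = "ifany" keeps the missing values visible in the count
lang_table <- sort(table(language_clean, useNA = "ifany"), decreasing = TRUE)
lang_table
```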
# Checking articles with NA and 'Other' value:
lang_check <- df %>%
filter(is.na(Language)|Language=='其他;')
Most of the NA value articles are of ‘standard’ and ‘patent’ type, for which no language data is provided by CNKI (they are apparently all in Chinese). There is a limited number of English articles, as well as a few instances of Korean and Japanese titles.
publisher_count_jour_pre1980 <- df %>%
filter(Year < 1980 & Reference_type == "journal_article") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_jour_1980to1999 <- df %>%
filter(Year < 2000 & Year >= 1980 & Reference_type == "journal_article") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_jour_2000to2009 <- df %>%
filter(Year < 2010 & Year >= 2000 & Reference_type == "journal_article") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_jour_since2010 <- df %>%
filter(Year >= 2010 & Reference_type == "journal_article") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_conf_1980to1999 <- df %>%
filter(Year < 2000 & Year >= 1980 & Reference_type == "conference_proceeding") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_conf_since2000 <- df %>%
filter(Year >= 2000 & Reference_type == "conference_proceeding") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_news_1980to1999 <- df %>%
filter(Year < 2000 & Year >= 1980 & Reference_type == "newspaper_article") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_news_since2000 <- df %>%
filter(Year >= 2000 & Reference_type == "newspaper_article") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_maThesis_1980to1999 <- df %>%
filter(Year < 2000 & Year >= 1980 & Reference_type == "ma_thesis") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_maThesis_since2000 <- df %>%
filter(Year >= 2000 & Reference_type == "ma_thesis") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_phdDissertation_1980to1999 <- df %>%
filter(Year < 2000 & Year >= 1980 & Reference_type == "phd_dissertation") %>%
group_by(Publisher) %>%
count(sort=T)
publisher_count_phdDissertation_since2000 <- df %>%
filter(Year >= 2000 & Reference_type == "phd_dissertation") %>%
group_by(Publisher) %>%
count(sort=T)
pubPlace_count_conf_1980to1999 <- df %>%
filter(Year < 2000 & Year >= 1980 & Reference_type == "conference_proceeding") %>%
group_by(Publishing_place) %>%
count(sort=T)
pubPlace_count_conf_since2000 <- df %>%
filter(Year >= 2000 & Reference_type == "conference_proceeding") %>%
group_by(Publishing_place) %>%
count(sort=T)
authorAddress_count_1980to1999 <- df %>%
filter(Year < 2000 & Year >= 1980) %>%
group_by(Author_address) %>%
count(sort=T)
authorAddress_count_since2000 <- df %>%
filter(Year >= 2000) %>%
group_by(Author_address) %>%
count(sort=T)
#For further analysis, the top results were copied into "PRO_ZHENG_HE_publications_data.xlsx", where English translations were added.
The journals in which publications on the Zheng He missions appeared are mostly related to maritime technology, maritime history, the Muslim Hui minority of China (to which Zheng He belonged), history teaching, and his birthplace (near Kunming, Yunnan province). For more information, including other types of publications, see “PRO_ZHENG_HE_publications_data.xlsx”.
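The repeated filter/group/count blocks above all follow the same pattern, so they could be wrapped in one small helper. A minimal sketch, assuming a data frame with the columns Year, Reference_type and Publisher; the helper name `count_publishers` and the toy data are invented for the illustration.

```r
library(dplyr)

# Toy stand-in for df with only the columns the helper needs (values are invented)
toy_df <- data.frame(
  Year = c(1985, 1985, 1992, 2005, 2005, 2014, NA),
  Reference_type = c("journal_article", "journal_article", "conference_proceeding",
                     "journal_article", "journal_article", "journal_article",
                     "standard"),
  Publisher = c("Journal A", "Journal A", "Conference X",
                "Journal B", "Journal A", "Journal B", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count publishers for one reference type within [from, to]
count_publishers <- function(data, ref_type, from = -Inf, to = Inf) {
  data %>%
    filter(Reference_type == ref_type, Year >= from, Year <= to) %>%
    count(Publisher, sort = TRUE)
}

count_publishers(toy_df, "journal_article", from = 1980, to = 1999)
count_publishers(toy_df, "journal_article", from = 2000)
```

The period bounds are inclusive here, so `to = 1999` corresponds to the `Year < 2000` condition used above.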
stop_dict <- stopwords::stopwords(language="zh", source="stopwords-iso")
stopwords_ch <- as.data.frame(stop_dict)
titles1 <- df %>%
filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation" | Reference_type == "ma_thesis", !is.na(Title)) %>%
select(Title)
#As it turned out, unnest_tokens() from the tidytext package segments Chinese text into
#words automatically, so no separate segmentation step is needed here.
tidy1 <- titles1 %>%
unnest_tokens(word, Title)
clean1 <- tidy1 %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
word_count_titles_since2000_academic <- clean1 %>%
group_by(word) %>%
count(sort=T)
#View(word_count_titles_since2000_academic)
set.seed(1234)
wordcloud(word_count_titles_since2000_academic$word, word_count_titles_since2000_academic$n,
min.freq = 10, rot.per = .15, random.order=FALSE, scale=c(5,2),
max.words=100, colors=brewer.pal(8, "Dark2"))
The top words (black, pink, orange) of academic titles since 2000 are as follows: “Zheng He, Western Ocean, culture, research, China, navigation, history.” “Western Ocean” is part of the traditional name of the missions (“Zheng He sailing the Western Ocean”). In pre-modern Chinese it is a spatializing term referring to the South China Sea, the Indian Ocean, and territories located along their shores.
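A note on the segmentation behind these counts: as far as I understand, unnest_tokens() hands word tokenization to the tokenizers package, which in turn relies on the ICU word-boundary rules exposed by stringi; for Chinese these rules are dictionary-based. The following sketch calls stringi directly on the traditional name of the missions to show the kind of split this produces. The exact tokens depend on the ICU version, so no expected output is given, and the one-item stop-word list is a stand-in for the illustration only.

```r
library(stringi)

title <- "郑和下西洋"  # "Zheng He sailing the Western Ocean"

# Split at ICU word boundaries; skip_word_none = TRUE drops pieces that are
# not words (spaces, punctuation)
tokens <- stri_split_boundaries(title, type = "word", skip_word_none = TRUE)[[1]]
tokens

# A toy stop-word filter, mirroring the anti_join() step on a plain vector
toy_stopwords <- c("下")
tokens[!tokens %in% toy_stopwords]
```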
#Checking out some non-academic publication types as well
titles_standard <- df %>%
filter(Reference_type == "standard", !is.na(Title)) %>%
select(Title)
tidy_standard <- titles_standard %>%
unnest_tokens(word, Title)
clean_standard <- tidy_standard %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
word_count_titles_since2000_standards <- clean_standard %>%
group_by(word) %>%
count(sort=T)
set.seed(1234)
wordcloud(word_count_titles_since2000_standards$word, word_count_titles_since2000_standards$n,
min.freq = 10, rot.per = .15, random.order=FALSE, scale=c(5,2),
max.words=100, colors=brewer.pal(8, "Dark2"))
The top words (black, purple, orange) of the titles of standards since 2000 are as follows: “Zheng He, Western Ocean, commemoration, culture, navigation, Taicang (name of the place in China from where the missions started), China, anniversary, activity, 600.” The main topic in standards is apparently the organization of commemorative events on the Zheng He missions, especially the 600th anniversary in 2005.
titles_news <- df %>%
filter(Reference_type == "newspaper_article", !is.na(Title)) %>%
select(Title)
tidy_news <- titles_news %>%
unnest_tokens(word, Title)
clean_news <- tidy_news %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
word_count_titles_since2000_news <- clean_news %>%
group_by(word) %>%
count(sort=T)
set.seed(1234)
wordcloud(word_count_titles_since2000_news$word, word_count_titles_since2000_news$n,
min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
max.words=100, colors=brewer.pal(8, "Dark2"))
The top words (black, purple, orange) of newspaper titles since 2000 are as follows: “Zheng He, Western Ocean, China, new, [Silk] Roads, culture, maritime, navigation, Silk [Roads], spirit, ocean, history, sea.” Newspapers are apparently oriented towards the political/cultural diplomacy aspects of the topic (e.g. the construction of the “21st Century Maritime Silk Roads”, “maritime culture”, “Zheng He spirit” [of friendly relations with foreign countries], etc.)
bigrams1 <- titles1 %>%
unnest_tokens(bigram, Title, token="ngrams", n=2) %>%
count(bigram, sort=TRUE)
bigrams_sep1 <- bigrams1 %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_sep1 %>%
filter(!word1 %in% stopwords_ch$stop_dict) %>%
filter(!word2 %in% stopwords_ch$stop_dict)
#View()
trigrams1 <- titles1 %>%
unnest_tokens(trigram, Title, token="ngrams", n=3) %>%
count(trigram, sort=TRUE)
trigrams_sep1 <- trigrams1 %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
trigrams_filtered <- trigrams_sep1 %>%
filter(!word1 %in% stopwords_ch$stop_dict) %>%
filter(!word2 %in% stopwords_ch$stop_dict) %>%
filter(!word3 %in% stopwords_ch$stop_dict)
#View()
#For the analysis of the abstracts I wished to see the "range" of each ngram
#(the number of items in which it appears, apart from the basic word/ngram frequency),
#as well as to analyze collocates and keywords, among others.
#[Antconc](https://www.laurenceanthony.net/software/antconc/) has built-in functions
#for all of these, therefore I decided to use it for these parts.
abstracts <- df %>%
filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation", !is.na(Abstract)) %>%
select(Abstract)
#Altogether 1812 items.
out_csv = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/abstracts.csv'
write_csv(abstracts, out_csv, col_names = FALSE)
#I renamed the output as a TXT file and segmented it using
#[SegmentAnt](https://www.laurenceanthony.net/software/segmentant/) with the Jieba
#engine. I wrote the abstracts into separate files using Python (in order to easily
#keep track of the "ranges" of ngram occurrences), and used Antconc for the analysis of
#word/ngram frequencies, ngram ranges, collocates/concordance, as well as keywords (using the
#[Lancaster Corpus of Mandarin Chinese/LCMC](https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/)
#as reference corpus for the latter). Antconc does not have a stopword filtering option,
#so I perform that step in R here.
in_csv = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/bigrams_abstracts_academic.tsv' #Count of bigrams in the abstracts sorted by range
bigrams_abstracts_csv <- read_tsv(in_csv, col_names = TRUE)
## Parsed with column specification:
## cols(
## Rank = col_double(),
## Freq = col_double(),
## Range = col_double(),
## Ngram = col_character()
## )
bigrams_abstracts_csv_sep <- bigrams_abstracts_csv %>%
separate(Ngram, c("word1", "word2"), sep = " ")
bigrams_abstracts_filtered <- bigrams_abstracts_csv_sep %>%
filter(!word1 %in% stopwords_ch$stop_dict) %>%
filter(!word2 %in% stopwords_ch$stop_dict)
#View()
outfile_2gramsfilt = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/bigrams_abstracts_academic_stopWordsFiltered.tsv'
#write_tsv() is used so that the content matches the .tsv extension of the output file
write_tsv(bigrams_abstracts_filtered, outfile_2gramsfilt, col_names = TRUE)
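For reference, the "range" figure reported by Antconc (the number of documents in which an n-gram occurs, as opposed to its total frequency) could also be computed in R. A minimal sketch, assuming the abstracts have already been segmented into space-separated words; the three toy documents and all object names are invented for the example.

```r
# Toy pre-segmented documents (words separated by spaces)
docs <- c(
  d1 = "郑和 下 西洋 的 历史",
  d2 = "郑和 下 西洋 与 海洋 文化",
  d3 = "海洋 文化 研究 郑和 下 西洋 郑和 下 西洋"
)

# Build the bigrams of one document from its word sequence
make_bigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigram_list <- lapply(docs, make_bigrams)

# Freq: all occurrences; Range: number of documents containing the bigram
freq <- table(unlist(bigram_list))
range_n <- table(unlist(lapply(bigram_list, unique)))

bigram_stats <- data.frame(
  Ngram = names(freq),
  Freq  = as.integer(freq),
  Range = as.integer(range_n[names(freq)]),
  stringsAsFactors = FALSE
)
bigram_stats[order(-bigram_stats$Range, -bigram_stats$Freq), ]
```

Counting each bigram once per document (via `unique`) is what separates Range from Freq.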
#Example of a keywords output from Antconc, LCMC used as reference corpus
keywords_infile = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/antconc/keywords_abstracts_academic.tsv'
keywords_example <- read_tsv(keywords_infile, col_names = TRUE)
## Parsed with column specification:
## cols(
## Rank = col_double(),
## Freq = col_double(),
## `Keyness(LL4)` = col_double(),
## `Effect(DICE)` = col_double(),
## Keyword = col_character()
## )
head(keywords_example, n=10)
## # A tibble: 10 x 5
## Rank Freq `Keyness(LL4)` `Effect(DICE)` Keyword
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 3224 12030. 0.0409 <U+90D1><U+548C>
## 2 2 1707 6267. 0.0219 <U+897F><U+6D0B>
## 3 3 1039 3836. 0.0134 <U+822A><U+6D77>
## 4 4 1772 3117. 0.0225 <U+4E2D><U+56FD>
## 5 5 661 2251. 0.0085 <U+6D77><U+6D0B>
## 6 6 856 1949. 0.011 <U+6587><U+5316>
## 7 7 1348 1704. 0.0172 <U+4E0B>
## 8 8 773 1491. 0.0099 <U+5386><U+53F2>
## 9 9 385 1396. 0.005 <U+8239><U+961F>
## 10 10 377 1298. 0.0049 <U+6D77><U+4E0A>
Bigram, trigram, and keyword counts show the centrality of terms related to the “21st Century Maritime Silk Roads” and the 600th anniversary celebrations.
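The Keyness(LL4) column in the keyword table is a log-likelihood score that compares a word's frequency in the study corpus with its frequency in the reference corpus. As I understand it, the four-term variant sums over all four cells of the 2x2 table (the word vs. all other words, study vs. reference corpus). The sketch below implements that textbook formula under this assumption; the corpus sizes and the reference frequency are invented, so the result will not reproduce Antconc's figures.

```r
# Log-likelihood keyness over the four cells of a 2x2 table.
# a, b: frequency of the word in the study / reference corpus
# n1, n2: total number of tokens in the study / reference corpus
keyness_ll4 <- function(a, b, n1, n2) {
  observed <- c(a, b, n1 - a, n2 - b)
  # Totals across both corpora for "the word" and for "all other words"
  margin_totals <- c(a + b, a + b, (n1 - a) + (n2 - b), (n1 - a) + (n2 - b))
  corpus_sizes <- c(n1, n2, n1, n2)
  expected <- corpus_sizes * margin_totals / (n1 + n2)
  # Cells with an observed count of 0 contribute nothing to the sum
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Hypothetical numbers: the word frequency 3224 is taken from the first row of
# the keyword table, everything else is made up
keyness_ll4(a = 3224, b = 10, n1 = 150000, n2 = 1000000)

# Equal relative frequencies in both corpora give a score of 0
keyness_ll4(a = 10, b = 100, n1 = 1000, n2 = 10000)
```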
abstracts2 <- df %>%
filter(Year >= 2000, Reference_type == "journal_article" | Reference_type == "conference_proceeding" | Reference_type == "phd_dissertation", !is.na(Abstract)) %>%
select(ID_str, Abstract)
abstracts_tidy <- abstracts2 %>%
unnest_tokens(word, Abstract, drop = FALSE) %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
abstract_dtm <- abstracts_tidy %>%
count(ID_str, word) %>%
cast_dtm(ID_str, word, n)
glimpse(abstract_dtm)
## List of 6
## $ i : int [1:100168] 1 196 398 1168 1 564 1 178 496 1480 ...
## $ j : int [1:100168] 1 1 1 1 2 2 3 3 3 3 ...
## $ v : num [1:100168] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 1811
## $ ncol : int 13429
## $ dimnames:List of 2
## ..$ Docs : chr [1:1811] "0657_journal_article_2000" "0658_journal_article_2000" "0659_journal_article_2000" "0661_journal_article_2000" ...
## ..$ Terms: chr [1:13429] "<U+4E00><U+756A>" "<U+4E00><U+822C><U+4EBA>" "<U+4E0D><U+5927>" "<U+4E5F><U+6709>" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
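The i, j and v vectors in the output above are the sparse "triplet" form of the document-term matrix: each non-zero cell is stored as a row index (document), a column index (term) and a count. A minimal base-R sketch of the same idea on toy data (invented document IDs and words; this illustrates the layout and is not the slam implementation itself):

```r
# Toy token table: one row per (document, word) occurrence
tokens <- data.frame(
  doc  = c("d1", "d1", "d1", "d2", "d2", "d3"),
  word = c("航海", "文化", "航海", "文化", "历史", "航海"),
  stringsAsFactors = FALSE
)

# Dense document-term matrix of counts
dense <- unclass(table(tokens$doc, tokens$word))

# Keep only the non-zero cells as (i, j, v) triplets
nz <- which(dense > 0, arr.ind = TRUE)
triplets <- data.frame(
  i = unname(nz[, 1]),                 # document index
  j = unname(nz[, 2]),                 # term index
  v = as.integer(dense[nz]),           # count of the term in the document
  doc  = rownames(dense)[nz[, 1]],
  term = colnames(dense)[nz[, 2]],
  stringsAsFactors = FALSE
)
triplets
```

With 1811 documents and 13429 terms, storing only the 100168 non-zero cells is far more compact than the full matrix.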
tic()
abstracts_lda <- LDA(abstract_dtm, k = 5, control = list(seed = 1234))
abstracts_lda
toc()
glimpse(abstracts_lda)
#save(abstracts_lda, file = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/lda_model.rda")
lda_model_topic_term_prob <- tidy(abstracts_lda, matrix="beta")
topic1 <- lda_model_topic_term_prob %>%
filter(topic == 1) %>%
arrange(desc(beta))
topic2 <- lda_model_topic_term_prob %>%
filter(topic == 2) %>%
arrange(desc(beta))
topic3 <- lda_model_topic_term_prob %>%
filter(topic == 3) %>%
arrange(desc(beta))
topic4 <- lda_model_topic_term_prob %>%
filter(topic == 4) %>%
arrange(desc(beta))
topic5 <- lda_model_topic_term_prob %>%
filter(topic == 5) %>%
arrange(desc(beta))
out_tm1 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic1.tsv"
write_tsv(topic1, out_tm1, col_names = TRUE)
out_tm2 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic2.tsv"
write_tsv(topic2, out_tm2, col_names = TRUE)
out_tm3 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic3.tsv"
write_tsv(topic3, out_tm3, col_names = TRUE)
out_tm4 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic4.tsv"
write_tsv(topic4, out_tm4, col_names = TRUE)
out_tm5 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic5.tsv"
write_tsv(topic5, out_tm5, col_names = TRUE)
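The five near-identical blocks above could also be written as a single grouped operation. A minimal sketch on an invented beta table with the same three columns that tidy(..., matrix="beta") returns (topic, term, beta); the terms and probabilities are placeholders and the helper name `top_terms` is made up.

```r
# Toy stand-in for lda_model_topic_term_prob
beta_toy <- data.frame(
  topic = rep(1:2, each = 4),
  term  = rep(c("航海", "文化", "历史", "海洋"), times = 2),
  beta  = c(0.40, 0.30, 0.20, 0.10,
            0.05, 0.15, 0.25, 0.55),
  stringsAsFactors = FALSE
)

# Top n terms per topic, ordered by decreasing beta within each topic
top_terms <- function(beta_table, n = 3) {
  ordered <- beta_table[order(beta_table$topic, -beta_table$beta), ]
  by_topic <- split(ordered, ordered$topic)
  do.call(rbind, lapply(by_topic, head, n = n))
}

top_terms(beta_toy, n = 2)
```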
More or less the following 5 topics were identified:
lda_model_topic_doc_prob <- tidy(abstracts_lda, matrix="gamma")
top_doc1 <- lda_model_topic_doc_prob %>%
filter(topic == 1) %>%
arrange(desc(gamma))
top_doc2 <- lda_model_topic_doc_prob %>%
filter(topic == 2) %>%
arrange(desc(gamma))
top_doc3 <- lda_model_topic_doc_prob %>%
filter(topic == 3) %>%
arrange(desc(gamma))
top_doc4 <- lda_model_topic_doc_prob %>%
filter(topic == 4) %>%
arrange(desc(gamma))
top_doc5 <- lda_model_topic_doc_prob %>%
filter(topic == 5) %>%
arrange(desc(gamma))
#View()
#Checking out the articles deemed most typical of each topic
top1_joint <- merge(top_doc1, df, by.x="document", by.y="ID_str")
top1_joint <- top1_joint %>%
arrange(desc(gamma))
top2_joint <- merge(top_doc2, df, by.x="document", by.y="ID_str")
top2_joint <- top2_joint %>%
arrange(desc(gamma))
top3_joint <- merge(top_doc3, df, by.x="document", by.y="ID_str")
top3_joint <- top3_joint %>%
arrange(desc(gamma))
top4_joint <- merge(top_doc4, df, by.x="document", by.y="ID_str")
top4_joint <- top4_joint %>%
arrange(desc(gamma))
top5_joint <- merge(top_doc5, df, by.x="document", by.y="ID_str")
top5_joint <- top5_joint %>%
arrange(desc(gamma))
The topic-per-document results of the first four topics mostly confirm what was inferred from the word-per-topic counts, although they often indicate a slightly different focus. The most typical articles of topic 2 indicate the importance of discussions of Zheng He’s relation to the cult of the sea goddess Mazu within the broader discourse on “maritime culture” and the “Zheng He spirit”, while the highest-ranking articles in topic 4 indicate a stronger focus on Buddhism than on Islam. Topic 5 apparently lumps together articles that could not be placed in the other topics; some of its high-ranking articles are among the few non-Chinese-language articles.
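A complementary view is to assign each document to the topic with its highest gamma and to count documents per topic, which would show how large a residual topic is. A minimal sketch on invented values, with the three columns that tidy(..., matrix="gamma") returns (document, topic, gamma):

```r
# Toy stand-in for lda_model_topic_doc_prob (document IDs and gammas are invented)
gamma_toy <- data.frame(
  document = rep(c("0001_journal_article_2000", "0002_journal_article_2001",
                   "0003_ma_thesis_2005"), each = 3),
  topic = rep(1:3, times = 3),
  gamma = c(0.7, 0.2, 0.1,
            0.1, 0.3, 0.6,
            0.2, 0.5, 0.3),
  stringsAsFactors = FALSE
)

# For each document keep the row with the highest gamma
dominant <- do.call(rbind, lapply(split(gamma_toy, gamma_toy$document), function(d) {
  d[which.max(d$gamma), ]
}))
rownames(dominant) <- NULL
dominant

# Number of documents assigned to each topic
table(dominant$topic)
```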
#Writing out into TSV files
out_gamma1 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic1_gamma.tsv"
write_tsv(top1_joint, out_gamma1, col_names = TRUE)
out_gamma2 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic2_gamma.tsv"
write_tsv(top2_joint, out_gamma2, col_names = TRUE)
out_gamma3 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic3_gamma.tsv"
write_tsv(top3_joint, out_gamma3, col_names = TRUE)
out_gamma4 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic4_gamma.tsv"
write_tsv(top4_joint, out_gamma4, col_names = TRUE)
out_gamma5 = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/topic5_gamma.tsv"
write_tsv(top5_joint, out_gamma5, col_names = TRUE)
lda_file = "C:/Users/Siming/Desktop/siming/classes/R/AA_project/topic_modeling/lda_model.rda"
load(lda_file)
topicmodels2LDAvis <- function(x, ...){
post <- topicmodels::posterior(x)
if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
mat <- x@wordassignments
LDAvis::createJSON(
phi = post[["terms"]],
theta = post[["topics"]],
vocab = colnames(post[["terms"]]),
doc.length = slam::row_sums(mat, na.rm = TRUE),
term.frequency = slam::col_sums(mat, na.rm = TRUE)
)
}
serVis(topicmodels2LDAvis(abstracts_lda))
The following selected corpus includes the 50 most cited articles of the last 20 years (CNKI, as of 2020-06-28; #1: 115 citations, #50: 13 citations).
#Getting the titles out of the file exported from the citation-count-filtered CNKI
#search and merging them with the titles in the original data frame of the project.
cit_list_file = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_cited/most_cited_list.tsv'
cit_list <- read_tsv(cit_list_file, col_names = TRUE)
## Parsed with column specification:
## cols(
## Item = col_character()
## )
df_cit <- merge(cit_list, df, by.x = "Item", by.y = "Title")
glimpse(df_cit)
## Rows: 50
## Columns: 12
## $ Item <chr> "?1100kV<U+7279><U+9AD8><U+538B><U+76F4><U+6D41><U+7CFB><U+7EDF><U+8BD5><U+9A8C><U+65B9><U+6848><U+7814><U+7A76>", "“<U+4E00><U+5E26><U+4E00><U+8DEF>”<U+4E0E><U+4E9A><U+975E><U+6218><U+7565><U+5408><U+4F5C><U+4E2D><U+7684>“<U+5B97><U+6559><U+56E0><U+7D20>”", "...
## $ ID_num <chr> "2495", "2510", "2606", "2546", "0680", "2066", "1...
## $ ID_str <chr> "2495_journal_article_2015", "2510_journal_article...
## $ Reference_type <chr> "journal_article", "journal_article", "journal_art...
## $ Language <chr> "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "<U+4E2D><U+6587>;", "...
## $ Author <chr> "<U+6768><U+4E07><U+5F00>;<U+5370><U+6C38><U+534E>;<U+73ED><U+8FDE><U+5E9A>;<U+66FE><U+5357><U+8D85>;", "<U+9A6C><U+4E3D><U+84C9>;", "<U+9A6C><U+4E3D><U+84C9>;", "<U+5468><U+7490><U+94ED>", "<U+8D75><U+541B><U+5C27>", ...
## $ Author_address <chr> "<U+4E2D><U+56FD><U+7535><U+529B><U+79D1><U+5B66><U+7814><U+7A76><U+9662>;", "<U+4E0A><U+6D77><U+5916><U+56FD><U+8BED><U+5927><U+5B66><U+4E2D><U+4E1C><U+7814><U+7A76><U+6240>;", "<U+4E0A><U+6D77><U+5916><U+56FD><U+8BED><U+5927><U+5B66><U+4E2D><U+4E1C><U+6240>;", NA, ...
## $ Publishing_place <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Publisher <chr> "<U+7535><U+7F51><U+6280><U+672F>", "<U+897F><U+4E9A><U+975E><U+6D32>", "<U+4E16><U+754C><U+5B97><U+6559><U+6587><U+5316>", "<U+4E2D><U+5171><U+4E2D><U+592E><U+515A><U+6821>", "<U+804C><U+5927><U+5B66><U+62A5>", "<U+5357><U+4EAC><U+5E08><U+8303><U+5927>...
## $ Year <int> 2015, 2015, 2015, 2015, 2000, 2012, 2005, 2006, 20...
## $ Keywords <chr> "±1100 kV<U+7279><U+9AD8><U+538B><U+76F4><U+6D41><U+793A><U+8303><U+5DE5><U+7A0B>;±800 kV<U+7279><U+9AD8><U+538B><U+7CFB><U+7EDF><U+8BD5><U+9A8C>;<U+5206><U+5C42><U+63A5><U+5165>;<U+8BD5><U+9A8C><U+65B9><U+6848>", "<U+5B97><U+6559><U+5916>...
## $ Abstract <chr> "<U+5206><U+6790><U+4E86><U+51C6><U+4E1C>—<U+7696><U+5357>?1100 k V<U+7279><U+9AD8><U+538B><U+76F4><U+6D41><U+793A><U+8303><U+5DE5><U+7A0B><U+7684><U+7279><U+70B9><U+4EE5><U+53CA><U+54C8><U+90D1><U+548C><U+6EAA><U+6D59>?800 k V<U+76F4><U+6D41><U+5DE5><U+7A0B><U+73B0>...
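Because the citation list is matched to the corpus by title alone, it may be worth checking for titles that found no match or that matched more than one record. A minimal sketch with invented titles; `list_toy` and `corpus_toy` stand in for cit_list and df.

```r
# Toy stand-ins (titles and IDs are invented)
list_toy <- data.frame(Item = c("Title A", "Title B", "Title C"),
                       stringsAsFactors = FALSE)
corpus_toy <- data.frame(
  Title  = c("Title A", "Title B", "Title B", "Title D"),
  ID_num = c("0001", "0002", "0003", "0004"),
  stringsAsFactors = FALSE
)

merged <- merge(list_toy, corpus_toy, by.x = "Item", by.y = "Title")

# Titles from the list that are missing from the corpus
unmatched <- setdiff(list_toy$Item, corpus_toy$Title)
# Titles that matched more than one corpus record
duplicated_titles <- unique(merged$Item[duplicated(merged$Item)])

unmatched
duplicated_titles
```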
reftype_cit <- df_cit %>%
group_by(Reference_type) %>%
count(sort=T)
reftype_cit
## # A tibble: 4 x 2
## # Groups: Reference_type [4]
## Reference_type n
## <chr> <int>
## 1 journal_article 31
## 2 ma_thesis 10
## 3 phd_dissertation 8
## 4 dissertation_thesis 1
pub_years_cit <- df_cit %>%
group_by(Year) %>%
count(sort=TRUE)
ggplot(pub_years_cit, aes(x=Year, y=n)) +
geom_line() +
scale_x_continuous(breaks=seq(2000,2020,1)) +
labs(title="Number of articles by year (50 most cited)")
publishers_cit <- df_cit %>%
group_by(Publisher) %>%
count(sort=T)
#View()
authorAddresses_cit <- df_cit %>%
group_by(Author_address) %>%
count(sort=T)
#View()
authors_cit <- df_cit$Author
string_authors_cit <- paste(unlist(authors_cit), collapse = ";")
#Runs of separators are treated as one split point, so that ";;" does not produce empty items
authors_cit_list <- as.list(strsplit(string_authors_cit, '[;,]+'))
authors_cit_df <- data.frame(matrix(unlist(authors_cit_list), nrow=length(authors_cit_list), byrow=TRUE))
authors_cit_df <- t(authors_cit_df)
authors_cit_df <- as.data.frame(authors_cit_df)
authors_cit_count <- authors_cit_df %>%
group_by(V1) %>%
count(sort=T)
#View()
keywords_cit <- df_cit$Keywords
string_keywords_cit <- paste(unlist(keywords_cit), collapse = ";")
#Runs of separators are treated as one split point, so that ";;" does not produce empty items
keywords_cit_list <- as.list(strsplit(string_keywords_cit, '[;,]+'))
keywords_cit_df <- data.frame(matrix(unlist(keywords_cit_list), nrow=length(keywords_cit_list), byrow=TRUE))
keywords_cit_df <- t(keywords_cit_df)
keywords_cit_df <- as.data.frame(keywords_cit_df)
keywords_cit_count <- keywords_cit_df %>%
group_by(V1) %>%
count(sort=T)
#View()
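Multi-valued fields such as Author and Keywords are separated by ";", occasionally by ";;" or ",". The sketch below shows one way to split such strings while treating runs of separators as a single break and dropping empty pieces. The helper name `split_field` and the sample strings are invented; the second string imitates the ";;" pattern visible in the Keywords output earlier.

```r
# Toy keyword strings, one per record
keywords_toy <- c("郑和;下西洋;海洋文化;",
                  "Jufar;;Al-Mataf;;early Ming",
                  NA,
                  "郑和, 航海")

split_field <- function(x) {
  x <- x[!is.na(x)]                       # skip records without a value
  parts <- unlist(strsplit(x, "[;,]+"))   # one or more separators in a row
  parts <- trimws(parts)                  # remove stray spaces around items
  parts[parts != ""]                      # drop empty pieces
}

keyword_counts <- sort(table(split_field(keywords_toy)), decreasing = TRUE)
keyword_counts
```

Skipping NA records before splitting also prevents the literal string "NA" from being counted as an author or keyword.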
titles_cit <- df_cit %>%
filter(!is.na(Item)) %>%
select(Item)
tidy_cit <- titles_cit %>%
unnest_tokens(word, Item)
clean_cit <- tidy_cit %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
word_count_titles_since2000_cit <- clean_cit %>%
group_by(word) %>%
count(sort=T)
#View()
set.seed(1234)
wordcloud(word_count_titles_since2000_cit$word, word_count_titles_since2000_cit$n,
min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
max.words=100, colors=brewer.pal(8, "Dark2"))
The top words (black, green, pink, purple) of the titles of the 50 most cited articles are as follows: “research, China, Zheng He, Western Ocean, history, strategy, Ming Dynasty, ocean, tribute (likely ref. to pre-modern ‘tributary system’ of foreign relations), Asia.”
abstracts_cit <- df_cit %>%
filter(!is.na(Abstract)) %>%
select(Abstract)
tidy_cit_abs <- abstracts_cit %>%
unnest_tokens(word, Abstract)
clean_cit_abs <- tidy_cit_abs %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
word_count_abstracts_since2000_cit <- clean_cit_abs %>%
group_by(word) %>%
count(sort=T)
#View()
set.seed(1234)
wordcloud(word_count_abstracts_since2000_cit$word, word_count_abstracts_since2000_cit$n,
min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
max.words=100, colors=brewer.pal(8, "Dark2"))
The top words (black, pink, purple) of the abstracts of the 50 most highly cited articles are as follows: “culture, China, foreign [~relations], strategy, history, Zheng He, development.” The importance of “culture” is notable throughout the discourse, especially here.
bigrams_cit <- abstracts_cit %>%
unnest_tokens(bigram, Abstract, token="ngrams", n=2) %>%
count(bigram, sort=TRUE)
bigrams_cit <- bigrams_cit %>%
separate(bigram, c("word1", "word2"), sep = " ")
abstracts_cit_bigrams_filtered <- bigrams_cit %>%
filter(!word1 %in% stopwords_ch$stop_dict) %>%
filter(!word2 %in% stopwords_ch$stop_dict)
#View()
#outfile_2gramsfilt_2 = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_cited/cited_abs_bigrams.tsv'
#write_csv(abstracts_cit_bigrams_filtered, outfile_2gramsfilt_2, col_names = TRUE)
Co-citation networks can be generated on CNKI, see e.g.
The following selected corpus includes the 50 most downloaded articles of the last 20 years (CNKI, as of 2020-06-28; #1: 8311 downloads, #50: 1281 downloads). The analysis will start with identifying the number of matches between the lists of the most cited and most downloaded articles of the last 20 years. Following this, the procedure will be identical to the one above.
down_list_file = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_downloaded/most_downloaded_list.tsv'
down_list <- read_tsv(down_list_file, col_names = TRUE)
## Parsed with column specification:
## cols(
## Item = col_character()
## )
df_down <- merge(down_list, df, by.x = "Item", by.y = "Title")
glimpse(df_down)
## Rows: 50
## Columns: 12
## $ Item <chr> "“一带一路”与亚非战略合作中的“宗教因素”", "“21世纪海上丝绸之路”对海洋强国战略的影响研究...
## $ ID_num <chr> "2510", "2563", "1932", "2417", "2546", "2691", "1...
## $ ID_str <chr> "2510_journal_article_2015", "2563_ma_thesis_2015"...
## $ Reference_type <chr> "journal_article", "ma_thesis", "phd_dissertation"...
## $ Language <chr> "中文;", "中文;", "中文;", "中文;", "中文;", "中文;", "中文;", "...
## $ Author <chr> "马丽蓉;", "赵泓博", "谢茜", "杜拉(D.Sarananda)", "周璐铭", "穆罕...
## $ Author_address <chr> "上海外国语大学中东研究所;", NA, NA, NA, NA, NA, NA, "武汉理工大学,武...
## $ Publishing_place <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Publisher <chr> "西亚非洲", "西北师范大学", "武汉大学", "华中师范大学", "中共中央党校", "扬州大...
## $ Year <int> 2015, 2015, 2010, 2014, 2015, 2016, 2007, 2006, 20...
## $ Keywords <chr> "宗教外交;“一带一路”;亚非合作;全球治理; Religious Diplomacy;;\"One...
## $ Abstract <chr> "宗教外交自古以来就是中国开展亚非合作重要的外交形态之一。郑和包容性的宗教外交,不仅开辟了明朝与30...
cit_down_intersect <- intersect(cit_list, down_list)
#View()
#22 articles are both in the most cited and most downloaded lists
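The overlap check above can be illustrated with a minimal sketch on toy data. The tibbles `cit_toy` and `down_toy` and their titles are hypothetical stand-ins for `cit_list` and `down_list`, which are assumed to share a single column named `Item`. The sketch is not the project's actual result.

```r
library(dplyr)

# Hypothetical toy lists; the real lists each hold 50 titles in a column "Item".
cit_toy  <- tibble(Item = c("Title A", "Title B", "Title C", "Title D"))
down_toy <- tibble(Item = c("Title B", "Title D", "Title E", "Title F"))

# Rows present in both lists (dplyr's data-frame method for intersect()).
overlap_toy <- intersect(cit_toy, down_toy)

# Number of shared titles and their share of the shorter list.
n_overlap <- nrow(overlap_toy)
share_overlap <- n_overlap / min(nrow(cit_toy), nrow(down_toy))

# Titles that are in the downloaded list but not in the cited list.
only_down_toy <- anti_join(down_toy, cit_toy, by = "Item")
```

Because `intersect()` compares whole rows, it only finds matches when both inputs have the same columns and the titles are spelled identically. `anti_join()` is a convenient complement for inspecting the items that appear in only one list.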
reftype_down <- df_down %>%
group_by(Reference_type) %>%
count(sort=T)
reftype_down
## # A tibble: 4 x 2
## # Groups: Reference_type [4]
## Reference_type n
## <chr> <int>
## 1 journal_article 21
## 2 phd_dissertation 15
## 3 ma_thesis 13
## 4 dissertation_thesis 1
PhD dissertations apparently enjoy high popularity among CNKI users: their share among the most downloaded items (15 of 50) is far higher than their share in the entire corpus (df).
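One way to put a number on this comparison is to compute the share of each reference type in the subset and in the full corpus, then divide one by the other. The following is a minimal sketch on toy data. The helper `reftype_share()` and the tibbles `corpus_toy` and `subset_toy` are hypothetical, and the figures are invented for illustration only.

```r
library(dplyr)

# Hypothetical helper: share of each reference type within a data frame
# that has a column named Reference_type.
reftype_share <- function(data) {
  data %>%
    ungroup() %>%
    count(Reference_type, name = "n") %>%
    mutate(share = n / sum(n))
}

# Toy stand-ins for the full corpus (df) and the subset (df_down).
corpus_toy <- tibble(Reference_type = c(rep("journal_article", 8),
                                        "phd_dissertation", "ma_thesis"))
subset_toy <- tibble(Reference_type = c(rep("journal_article", 2),
                                        rep("phd_dissertation", 2)))

# Join the two share tables; ratio > 1 means the type is over-represented
# in the subset relative to the corpus. Types absent from the subset get NA.
comparison_toy <- full_join(
  reftype_share(subset_toy),
  reftype_share(corpus_toy),
  by = "Reference_type",
  suffix = c("_subset", "_corpus")
) %>%
  mutate(ratio = share_subset / share_corpus)
```

With the project's data, the same helper could presumably be applied as `reftype_share(df_down)` and `reftype_share(df)`, assuming both data frames carry the `Reference_type` column shown in the `glimpse()` output.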
pub_years_down <- df_down %>%
group_by(Year) %>%
count(sort=TRUE)
ggplot(pub_years_down, aes(x=Year, y=n)) +
geom_line() +
scale_x_continuous(breaks=seq(2000,2020,1)) +
labs(title="Number of articles by year (50 most downloaded)")
publishers_down <- df_down %>%
group_by(Publisher) %>%
count(sort=T)
#View()
authorAddresses_down <- df_down %>%
group_by(Author_address) %>%
count(sort=T)
#View()
authors_down <- df_down$Author
string_authors_down <- paste(unlist(authors_down), collapse = ";")
# split on runs of ';' or ',' so that ';;' does not produce empty entries
authors_down_list <- as.list(strsplit(string_authors_down, '[;,]+'))
authors_down_df <- data.frame(matrix(unlist(authors_down_list), nrow=length(authors_down_list), byrow=TRUE))
authors_down_df <- t(authors_down_df)
authors_down_df <- as.data.frame(authors_down_df)
authors_down_count <- authors_down_df %>%
group_by(V1) %>%
count(sort=T)
#View()
keywords_down <- df_down$Keywords
string_keywords_down <- paste(unlist(keywords_down), collapse = ";")
# split on runs of ';' or ',' so that ';;' does not produce empty entries
keywords_down_list <- as.list(strsplit(string_keywords_down, '[;,]+'))
keywords_down_df <- data.frame(matrix(unlist(keywords_down_list), nrow=length(keywords_down_list), byrow=TRUE))
keywords_down_df <- t(keywords_down_df)
keywords_down_df <- as.data.frame(keywords_down_df)
keywords_down_count <- keywords_down_df %>%
group_by(V1) %>%
count(sort=T)
#View()
titles_down <- df_down %>%
filter(!is.na(Item)) %>%
select(Item)
tidy_down <- titles_down %>%
unnest_tokens(word, Item)
clean_down <- tidy_down %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
word_count_titles_since2000_down <- clean_down %>%
group_by(word) %>%
count(sort=T)
#View()
set.seed(1234)
wordcloud(word_count_titles_since2000_down$word, word_count_titles_since2000_down$n,
min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
max.words=100, colors=brewer.pal(8, "Dark2"))
The top words (black, brown, orange, pink, purple) of titles of the 50 most downloaded articles are as follows: “research, Western Ocean, Zheng He, China, culture, Ming era, development, history, Ming Dynasty, ocean, tribute, trade.”
abstracts_down <- df_down %>%
filter(!is.na(Abstract)) %>%
select(Abstract)
tidy_down_abs <- abstracts_down %>%
unnest_tokens(word, Abstract)
clean_down_abs <- tidy_down_abs %>%
anti_join(stopwords_ch, by= c("word" = "stop_dict"))
word_count_abstracts_since2000_down <- clean_down_abs %>%
group_by(word) %>%
count(sort=T)
#View()
set.seed(1234)
wordcloud(word_count_abstracts_since2000_down$word, word_count_abstracts_since2000_down$n,
min.freq = 1, rot.per = .15, random.order=FALSE, scale=c(5,2),
max.words=100, colors=brewer.pal(8, "Dark2"))
The top words (black, green, pink, purple) of the abstracts of the 50 most downloaded articles are as follows: “culture, China, development, history, strategy, research, Western Ocean, Zheng He, foreign [~relations], trade, country, influence.”
The importance of discussions of “strategy” (~present-day foreign policy) is also notable in both the most cited and the most downloaded articles.
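The observation that some terms rank highly in both selected corpora can be checked with a small comparison of the two word-count tables. The sketch below uses toy data. The function `top_n_shared()`, the toy tables and the English glosses standing in for the Chinese tokens are all hypothetical, and the counts are invented.

```r
library(dplyr)

# Toy word-count tables in the same shape (columns word, n) as
# word_count_abstracts_since2000_cit and word_count_abstracts_since2000_down.
counts_cit_toy <- tibble(word = c("culture", "china", "strategy", "history"),
                         n = c(40L, 35L, 30L, 20L))
counts_down_toy <- tibble(word = c("culture", "development", "strategy", "trade"),
                          n = c(50L, 45L, 25L, 10L))

# Hypothetical helper: words that are among the n_top most frequent in BOTH tables.
# ungroup() matters because group_by(word) %>% count() returns a grouped table,
# and slice_max() would otherwise pick the top rows within each group.
top_n_shared <- function(a, b, n_top = 3) {
  top_a <- a %>% ungroup() %>% slice_max(n, n = n_top, with_ties = FALSE)
  top_b <- b %>% ungroup() %>% slice_max(n, n = n_top, with_ties = FALSE)
  inner_join(top_a, top_b, by = "word", suffix = c("_cit", "_down"))
}

shared_toy <- top_n_shared(counts_cit_toy, counts_down_toy)
```

Applied to the real tables with a larger `n_top`, a join of this kind would list the terms common to both word clouds together with their counts in each corpus, which makes the visual comparison of the two clouds easier to verify.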
bigrams_down <- abstracts_down %>%
unnest_tokens(bigram, Abstract, token="ngrams", n=2) %>%
count(bigram, sort=TRUE)
bigrams_down <- bigrams_down %>%
separate(bigram, c("word1", "word2"), sep = " ")
abstracts_down_bigrams_filtered <- bigrams_down %>%
filter(!word1 %in% stopwords_ch$stop_dict) %>%
filter(!word2 %in% stopwords_ch$stop_dict)
#View()
#outfile_2gramsfilt_3 = 'C:/Users/Siming/Desktop/siming/classes/R/AA_project/selected_corpora/most_downloaded/downloaded_abs_bigrams.tsv'
#write_tsv(abstracts_down_bigrams_filtered, outfile_2gramsfilt_3, col_names = TRUE)
The present project is a valuable contribution to my Ph.D. dissertation The Zheng He Missions: A Critical Discourse Analysis of Recent Mainland Chinese Historiography. It provides quantitative data for the dissertation, reinforcing many of my earlier observations based on qualitative analysis and pointing to aspects of the discourse that were previously less investigated. As the results show, the discourse on the Zheng He missions goes beyond discussing the historical events themselves: the missions are widely seen as having referential value for China’s present-day and future foreign policy. The evidence for this is the high frequency of terms connected to the Chinese state’s maritime development strategy, the “21st Century Silk Roads” (part of the larger Belt & Road Initiative, BRI, a.k.a. OBOR), as well as of concepts such as “maritime culture” (rediscovering China’s traditions as a maritime power) or “Zheng He spirit” (of openness and peaceful development). The 600th anniversary of the first mission, in 2005, played a central role in the recent discourse and prompted the highest number of publications on the topic.
The present project serves as an example not only of how digital methods can be used in Chinese studies, but also of how valuable they are as tools in discourse studies in general. In the 21st century, discourse analysis cannot remain a merely qualitative method but has to adapt to the technological opportunities of the present era. In my Ph.D. dissertation I wish to further advance the combination of qualitative and quantitative methods for the study of Chinese discourses, for which the present project provides a valuable starting point.