I'm scraping the keywords from this article page with rvest in R, using the code below:
#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management
#' start by reading the web page to be scraped
page <- read_html("https://www.sciencedirect.com/science/article/pii/S1877042810004568")
keyW <- page %>% html_nodes("div.Keywords.u-font-serif") %>% html_text() %>% paste(collapse = ",")
And it gave me:
> keyW
[1] "KeywordsPhysics curriculumTurkish education systemfinnish education systemPISAphysics achievement"
After removing the word "Keywords" and anything before it from the string using this line of code:
keyW <- gsub(".*Keywords","", keyW)
The new keyW is:
[1] "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"
However, my desired output is this character vector:
[1] "Physics curriculum" "Turkish education system" "finnish education system" "PISA" "physics achievement"
How should I tackle this? I think this boils down to:
- how to properly scrape the keywords from the website
- how to properly split the string
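A minimal sketch of the first option, under the assumption that each keyword sits in its own child element inside the keywords container. The inline HTML and the `div.keyword` class below are hypothetical stand-ins for the real page markup, which I have not confirmed; the selector would need adjusting to whatever the page actually uses. The idea is to select the individual keyword nodes rather than the container, so `html_text()` returns one element per keyword and nothing has to be split afterwards:

```r
library("rvest") # also re-exports the %>% pipe

# Hypothetical markup standing in for the keywords block of the article page.
# The "keyword" class on the inner divs is an assumption, not taken from the site.
html <- '
<div class="Keywords u-font-serif">
  <h2>Keywords</h2>
  <div class="keyword"><span>Physics curriculum</span></div>
  <div class="keyword"><span>Turkish education system</span></div>
  <div class="keyword"><span>finnish education system</span></div>
  <div class="keyword"><span>PISA</span></div>
  <div class="keyword"><span>physics achievement</span></div>
</div>'

page <- read_html(html)

# Select each keyword node instead of the whole container:
# html_text() then yields one string per node, and the "Keywords"
# heading is never picked up, so no gsub() is needed either.
keyW <- page %>%
  html_nodes("div.Keywords div.keyword") %>%
  html_text(trim = TRUE)

print(keyW)
```

Splitting the already-glued string is the weaker option: a boundary such as "systemfinnish" has no capital letter or other marker, so no pattern can tell where one keyword ends and the next begins.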
Thanks
question from:
https://stackoverflow.com/questions/65948543/how-to-split-merged-glued-words-with-no-delimiter-using-r