Channel: Active questions tagged r - Stack Overflow

Unable to Export (or view) Total Tf-Idf Results for text mining

As part of my efforts to text-mine research papers, I am interested in looking at tf-idf values.

So far I have had difficulty using tidytext for tf-idf because columns/objects are not detected (a recurring issue on this site). I therefore used the tm package's weighting instead and hoped to view all my results by exporting them to CSV.

The limited results I do get are in the right format (paper; term; tf-idf value), but only a few of the papers appear, even though the object reports 71 documents. (One document is unreadable and produces an error that can be ignored.)

Any help is appreciated, cheers

library(tm)        # Corpus(), readPDF(), TermDocumentMatrix(), weightTfIdf()
library(magrittr)  # provides %>% used further down

setwd('C:\\Users\\[--myname--]\\Desktop\\Text_Mine_TestSet_1')
files <- list.files(pattern = 'pdf$')
summary(files)

corpus_a1 <- Corpus(URISource(files),
                    readerControl = list(reader = readPDF()))

# Same preprocessing for both orientations; option names are case-sensitive
TDM_a1 <- TermDocumentMatrix(corpus_a1, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE))
DTM_a1 <- DocumentTermMatrix(corpus_a1, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE))

# --------------------------

tdm_TfIdf <- weightTfIdf(TDM_a1)
tdm_TfIdf # 71 documents, 32,177 terms (sparse terms could be removed here)
tdm_TfIdf %>% 
  View() # Odd table
inspect(tdm_TfIdf) # Shows limited output
print(tdm_TfIdf)

library(devtools)

tdm_inspect <- inspect(tdm_TfIdf)
tdm_DF <- as.data.frame(tdm_inspect, stringsAsFactors = FALSE)
tdm_DF
write.table(tdm_DF)
write.csv(tdm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\tdm_TfIdf.csv',
          row.names = TRUE)

# ---------------------
# Same issue, with rows and columns swapped (documents x terms)
dtm_TfIdf <- weightTfIdf(DTM_a1)
dtm_TfIdf # 71 documents, 32,177 terms (sparse terms could be removed here)
dtm_TfIdf %>% 
  View() # Odd table
inspect(dtm_TfIdf) # Shows limited output
print(dtm_TfIdf)

dtm_inspect <- inspect(dtm_TfIdf)
dtm_DF <- as.data.frame(dtm_inspect, stringsAsFactors = FALSE)
dtm_DF
write.table(dtm_DF)
write.csv(dtm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\dtm_TfIdf.csv',
          row.names = TRUE)

As stated above, only four papers and ten terms appear in the resulting CSV file. I am unsure why the results are limited in this way.
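
One possible explanation, offered as a sketch and not a confirmed diagnosis: in recent versions of tm, inspect() appears to print, and hand back, only a small sample of the matrix, so a data frame built from its return value would hold only that sample. Converting the weighted matrix with as.matrix() should give every document and term. The toy texts, the object names and the temporary output file below are hypothetical stand-ins for the PDF corpus and the Desktop path.

```r
library(tm)

# Toy corpus standing in for the 71 PDFs (hypothetical texts)
texts <- c("gene expression varies across tissue samples",
           "tissue samples show protein binding",
           "protein binding alters gene regulation")
corpus <- VCorpus(VectorSource(texts))
dtm_tfidf <- weightTfIdf(DocumentTermMatrix(corpus))

# inspect() is meant for a quick look; its return value may be only a sample
sample_only <- inspect(dtm_tfidf)

# Dense matrix with every document (rows) and every term (columns)
full <- as.matrix(dtm_tfidf)

# Long format (paper; term; tf-idf value) built from the stored sparse entries
long <- data.frame(document = Docs(dtm_tfidf)[dtm_tfidf$i],
                   term     = Terms(dtm_tfidf)[dtm_tfidf$j],
                   tf_idf   = dtm_tfidf$v,
                   stringsAsFactors = FALSE)

# Export the full matrix; a temporary file is used here as a placeholder path
out_file <- file.path(tempdir(), "dtm_TfIdf_full.csv")
write.csv(full, out_file, row.names = TRUE)
```

With 71 documents and about 32,000 terms the dense matrix is still manageable, but for much larger corpora the long data frame is likely the safer thing to export, since it stores only the entries the sparse matrix actually holds.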

