Channel: Active questions tagged r - Stack Overflow

R: how to extract pieces of text from a string based on a pattern


I have a dataset where each row contains a string of text of this type:

1)list(text = \"incredible hosts\", relevance = 0.87518, count = 1), list(text = \"Japan\", relevance = 0.675236, count = 1), list(text = \"support\", relevance = 0.625663, count = 1), list(text = \"result\", relevance = 0.359757, count = 1)


2)list(text = \"British fleet\", relevance = 0.912888, count = 1), list(text = \"worst maritime disasters\", relevance = 0.904047, count = 1), list(text = \"British history\", relevance = 0.755491, count = 1), list(text = \"Scilly Isles\", relevance = 0.716508, count = 1), list(text = \"sailors\", relevance = 0.691141, count = 1), list(text = \"evening\", relevance = 0.597375, count = 1), list(text = \"Tragedy\", relevance = 0.577141, count = 1), list(text = \"prize\", relevance = 0.565035, count = 1), list(text = \"rocks\", relevance = 0.543257, count = 1), list(text = \"innovation\", relevance = 0.529463, count = 1), list(text = \"longitude\", relevance = 0.335207, count = 1)

Basically, I would like to extract just the strings of text contained between \" and \"

and obtain something like this:

1) "incredible hosts, Japan, support, result"
2) "British fleet, worst maritime disasters, British history, Scilly Isles, sailors, evening, etc..."
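One possible way to get this with base R is sketched below. It assumes the backslashes shown above are only print escapes, so each string really contains plain double quotes, and it uses a hypothetical character vector `x` holding the rows (the second row is shortened here for readability).

```r
# Hypothetical input: one string per row (second row shortened for the example).
x <- c(
  'list(text = "incredible hosts", relevance = 0.87518, count = 1), list(text = "Japan", relevance = 0.675236, count = 1), list(text = "support", relevance = 0.625663, count = 1), list(text = "result", relevance = 0.359757, count = 1)',
  'list(text = "British fleet", relevance = 0.912888, count = 1), list(text = "worst maritime disasters", relevance = 0.904047, count = 1), list(text = "British history", relevance = 0.755491, count = 1)'
)

# Find every "..." chunk in each string; the result is a list with one
# character vector per row.
quoted <- regmatches(x, gregexpr('"[^"]*"', x))

# Drop the surrounding double quotes from each match.
texts <- lapply(quoted, function(q) gsub('"', "", q, fixed = TRUE))

# Collapse the pieces of each row into one comma-separated string.
collapsed <- vapply(texts, paste, character(1), collapse = ", ")

print(collapsed)
```

The intermediate list `texts` keeps the pieces separate, which is handy if they later need to go into individual columns.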

Moreover, I would like to create a data frame that helps me keep track of the relevance score given in the string for each piece of text (considering that different rows might have a different number of pieces of text), so as to get something like this:

 col1              col2   col3     col4    col5  col6  ...  colA1    colA2     ...
 incredible hosts  Japan  support  result  NA    NA    ...  0.87518  0.675236  ...
 British fleet     worst marit.......

Basically, the number of text columns should equal the maximum number of pieces of text in any row, and the same goes for the score columns (each relevance score refers to one piece of text, so there are equally many of each).

If I can find a way to first extract the pieces of text and separate them by commas, and then do the same with the relevance scores, I think I can easily merge the two into a data frame. So the problem is mainly extracting these two things from that text.
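A minimal base R sketch of that plan follows: extract the texts and the relevance scores per row, pad each row with NA up to the longest row, and bind everything into one wide data frame. The vector `x`, the helper `pad`, and the column names `text1...` and `relevance1...` are illustrative assumptions, and the strings are assumed to contain plain double quotes (rows shortened for readability).

```r
# Hypothetical input: one string per row (shortened for the example).
x <- c(
  'list(text = "incredible hosts", relevance = 0.87518, count = 1), list(text = "Japan", relevance = 0.675236, count = 1)',
  'list(text = "British fleet", relevance = 0.912888, count = 1), list(text = "worst maritime disasters", relevance = 0.904047, count = 1), list(text = "British history", relevance = 0.755491, count = 1)'
)

# Pieces of text: everything between double quotes, with the quotes removed.
texts <- lapply(
  regmatches(x, gregexpr('"[^"]*"', x)),
  function(q) gsub('"', "", q, fixed = TRUE)
)

# Relevance scores: the number that follows "relevance = ".
scores <- lapply(
  regmatches(x, gregexpr("relevance = [0-9.]+", x)),
  function(s) as.numeric(sub("relevance = ", "", s, fixed = TRUE))
)

# Pad a vector with NA up to length n (hypothetical helper).
pad <- function(v, n) {
  length(v) <- n
  v
}

# Width of the widest row decides how many columns each block gets.
n_max <- max(lengths(texts))

# One matrix row per input row, padded to n_max columns.
text_mat  <- do.call(rbind, lapply(texts, pad, n = n_max))
score_mat <- do.call(rbind, lapply(scores, pad, n = n_max))
colnames(text_mat)  <- paste0("text", seq_len(n_max))
colnames(score_mat) <- paste0("relevance", seq_len(n_max))

# Text columns first, then the matching score columns.
result <- data.frame(text_mat, score_mat, stringsAsFactors = FALSE)
print(result)
```

Keeping texts and scores in two parallel lists means position i in one always corresponds to position i in the other, so the padded columns stay aligned.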

Thank you in advance for your help,

Carlo

