Rewrote textrank_sentences() #7
…dded progress bar and parallelization
|
I think you should really test out the minhash algorithm. It is the better solution when you have large volumes of sentences, because without it all pairwise sentence similarities have to be calculated. Please try out the minhash algorithm. |
|
On another note, some advice on reducing the number of sentences: it is better to use text clustering first (e.g. using the topicmodels R package or the BTM package, https://cran.r-project.org/web/packages/BTM/index.html) and then apply textrank inside each cluster. |
|
Thanks for the advice. I'm trying out different approaches and have had a look at the minhash algorithm, but running the textrank_candidates_lsh function itself takes longer than running the rewritten textrank_sentences. And that's only when it runs at all: if I run it on all my 12,000 sentences, it fails and throws an error. I also had a look at the BTM package, but again it takes a long time to complete. Really, the fastest way to do it is using textrank_sentences. |
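For a sense of scale, here is a rough sketch of how many comparisons an all-pairs approach implies for the 12,000 sentences mentioned above. It is written in Python purely for illustration and is not code from the textrank package; the helper name `pairwise_comparisons` is hypothetical, and it assumes each unordered pair of sentences is compared exactly once.

```python
from math import comb


def pairwise_comparisons(n_sentences: int) -> int:
    """Number of unordered sentence pairs, i.e. n * (n - 1) / 2."""
    return comb(n_sentences, 2)


# Assumption: every unordered pair is compared once, with no candidate pruning.
print(pairwise_comparisons(12_000))  # 71994000
```

Candidate selection such as minhash is meant to reduce this number by only comparing pairs that are likely to be similar.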
|
I've read through the changes a bit. Am I correct that the speed difference is basically because you calculate the overlap in batches, grouped by textrank_id, and because you parallelise the mapply loop? |
|
Not really, I actually haven't even used the parallelise function. The reason is that I used data.table, which works by reference and therefore gives lower memory use and faster speed. |
|
Ok, but in that case, can you drop the usage of the pbapply package? In general I'm against adding package dependencies which are not needed, and adding a dependency on another package just for this seems like overkill to me. Why not add a simple trace argument and print out something every, say, 1000 comparisons? That removes another dependency which might cause maintenance problems later on. |
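The trace idea suggested above could look roughly like the following. This is a minimal sketch in Python, not the package's R code: the function names, the `every` parameter, and the use of Jaccard overlap as the similarity are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two token collections (illustrative stand-in)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(sentences, trace=False, every=1000):
    """Compare every pair of sentences, optionally printing progress.

    `trace` and `every` are hypothetical names: when `trace` is True,
    a message is printed after each `every` comparisons.
    """
    results = []
    pairs = combinations(range(len(sentences)), 2)
    for done, (i, j) in enumerate(pairs, start=1):
        results.append((i, j, jaccard(sentences[i], sentences[j])))
        if trace and done % every == 0:
            print(f"{done} comparisons done")
    return results


# Usage: three tokenised sentences, progress printed every 2 comparisons.
overlaps = pairwise_overlap([["a", "b"], ["b", "c"], ["c"]], trace=True, every=2)
```

The point of the design is that progress reporting needs nothing beyond a counter and a print statement, so no extra dependency is required.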
|
That is a good principle, one I tend to stick with as well, but I guess I got carried away :) I'll have a look at it and write the pbapply package out. |
|
great |
…textrank_sentences()
|
Removed the use of pbapply and replaced it with cat. It's not as pretty, but it gets the job done if you want to monitor the progress of the function. |
|
Thanks, I'm going to review this soon and incorporate it. |
|
I've reviewed your code and updated it to what I thought was more readable. Can you try it out on your own dataset and let me know if this is fine?
I rewrote textrank_sentences() because it could not process my dataset (it ran for 3 days without finishing). In doing so I added pbapply to show progress for the sentence_dist function as well as to enable parallelization.
A pretty solid upgrade to an already pretty solid function, if I do say so myself!