It seems that jusText cannot extract content from HTML lists (ul, ol tags). For example, only "Some text A. Some text C." is extracted from:
<p>Some text A.</p><ul><li>Some text B.</li></ul><p>Some text C.</p>
Is this normal behavior? Is it possible to fix?
Or could you point me to where I can modify this behavior, please?
Example: https://plantcaretoday.com/how-to-grow-and-care-for-bougainvillea.html
@polosatyi Hi, to be honest I don't know where the problem is. jusText has many heuristics, and it may be any one of them or a combination. I can see that <li> is a paragraph tag, so "Some text B." is treated as a new paragraph of text, and it can be dropped because it's too short or for some other reason. It's been quite a while since I last worked with jusText, so it's hard to tell. I think the best approach is to tweak the CLI args (min. length, density, ...) to their minimal/maximal values to find out which one is causing the problem. If that doesn't help, it's down to good old debugging :)
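The same experiment can be run from Python instead of the CLI. Below is a rough sketch, not a confirmed diagnosis: `sweep` and `justext_extract` are hypothetical helper names, the grid values are arbitrary extremes picked for illustration, and it assumes the `justext` package is installed and that `justext.justext()` accepts the `length_low`, `stopwords_low` and `max_link_density` keyword arguments.

```python
from itertools import product


def sweep(html, extract, grid):
    """Call extract(html, **params) for every parameter combination in grid.

    Returns a list of (params, kept_texts) pairs so the runs can be compared.
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, extract(html, **params)))
    return results


def justext_extract(html, **params):
    """Return the text of every paragraph jusText does not mark as boilerplate."""
    import justext  # third-party dependency: pip install justext

    paragraphs = justext.justext(html, justext.get_stoplist("English"), **params)
    return [p.text for p in paragraphs if not p.is_boilerplate]


if __name__ == "__main__":
    html = (
        "<html><body><p>Some text A.</p>"
        "<ul><li>Some text B.</li></ul>"
        "<p>Some text C.</p></body></html>"
    )
    # Assumed grid: each threshold at a permissive value and at a stricter one.
    grid = {
        "length_low": [0, 70],
        "stopwords_low": [0.0, 0.30],
        "max_link_density": [0.2, 1.0],
    }
    for params, kept in sweep(html, justext_extract, grid):
        print(params, "->", kept)
```

Comparing the runs in which the list item survives with those in which it does not should point to the threshold responsible. Running it on the full page from the example URL rather than the three-line snippet is likely to be more informative, since the heuristics also look at neighbouring paragraphs.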