Thursday, April 4, 2024

Stripping HTML Tags from Text Using Plain JavaScript


Introduction

Web scraping has become increasingly common over the years, which means more developers have to figure out how to work with the HTML markup of the pages they’re scraping. But what if you just want the text? Given the complexity of HTML, this might seem like a daunting task, but fortunately there are a few ways to do it with JavaScript.

Why remove HTML tags?

So why would you ever want to remove HTML tags from text? There are many reasons. For instance, you might want to extract the text content of a web page for analysis, or you might want to sanitize user input to help prevent XSS (Cross-Site Scripting) attacks. Removing HTML tags can help in both of these scenarios, and many others.

Note: XSS is a type of security vulnerability in which an attacker injects malicious scripts into web pages viewed by other users. By sanitizing user input and stripping HTML tags, we can help mitigate this risk.

How to Strip HTML Tags with JavaScript

In the following sections we’ll show a few ways to strip HTML tags from a string. You'll probably notice that, when using plain JS, the common denominator is regular expressions, which are a powerful tool for complex string manipulations like this.

The replace() Method

The replace() method is a frequently used tool for manipulating strings in JavaScript, and it can also be used to strip HTML tags from a string. It works by searching the string for a specified pattern, which in our case is an HTML tag, and replacing each match with an empty string.

The following example shows how you can use the replace() method to remove all HTML tags from a given string:

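// Single quotes on the outside so the double quotes in href="#" do not end the string.
let stringWithHtml = '<p>Hello, World!</p> <a href="#">Click Me</a>';
// The slash inside the pattern is escaped (\/) so it does not terminate the regex literal.
let strippedString = stringWithHtml.replace(/<\/?[^>]+(>|$)/g, "");
console.log(strippedString);

// Outputs: Hello, World! Click Me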

In this example, the regular expression /<\/?[^>]+(>|$)/g matches any substring that starts with a less-than sign (<), followed by an optional forward slash (/), then one or more characters that are not a greater-than sign (>), and ends with either a greater-than sign (>) or the end of the string.

The g at the end of the regular expression is a flag that tells JavaScript to replace all occurrences, not just the first one.

By replacing these matches with an empty string, we effectively strip all HTML tags from the original string, leaving just the text content.
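To make the pieces of that pattern concrete, here is a minimal sketch that wraps it in a small helper. The name stripTags is a hypothetical one chosen for illustration and is not part of any library. The sketch also shows what the g flag and the (>|$) alternative each contribute.

```javascript
// Hypothetical helper (name chosen for illustration): removes anything that
// looks like an HTML tag from a string.
function stripTags(input) {
  return String(input).replace(/<\/?[^>]+(>|$)/g, "");
}

const html = "<p>Hello, <b>World</b>!</p>";

// With the g flag, every tag is removed.
const all = stripTags(html);

// Without the g flag, only the first match (<p>) is removed.
const firstOnly = html.replace(/<\/?[^>]+(>|$)/, "");

// The (>|$) alternative also removes a tag that is cut off at the end of the string.
const truncated = stripTags("Hello <a href='#'");

console.log(all);                       // Hello, World!
console.log(firstOnly);                 // Hello, <b>World</b>!</p>
console.log(JSON.stringify(truncated)); // "Hello "
```

Wrapping the call in a function keeps the pattern in one place, which makes it easier to swap in a stricter pattern or a parser later.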

Using Libraries

While plain JavaScript is great, sometimes you might want to use a library to handle this task. One such library is Cheerio, which provides a simple API for manipulating HTML and XML documents, similar to jQuery.

Here's how you can use Cheerio to strip HTML tags:

const cheerio = require('cheerio');

let str = "<p>Hello, World!</p>";
// Parse the markup, then read the text content of the whole document.
let $ = cheerio.load(str);

console.log($.text());

This will also output: "Hello, World!".

Stripping HTML Entities

HTML entities are a different beast altogether. They are special codes used to display reserved or special characters in an HTML document. For example, &amp; is the HTML entity for the ampersand (&).

Stripping HTML entities is a bit trickier, but it can be done using the he library. Here's how:

const he = require('he');

let str = "Hello, World &amp; everyone else!";
// Convert entities such as &amp; back into the characters they represent.
let decodedStr = he.decode(str);

console.log(decodedStr);

This will output: "Hello, World & everyone else!".

Note: The he.decode() function will decode any HTML entities in your string, converting them back into their original characters.

By combining these methods, we can effectively strip all HTML tags and entities from a string using JavaScript. Remember, while libraries can make our lives easier, understanding how to do it with plain JavaScript is a great skill to have.
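As a rough illustration of combining the two steps without any library, the sketch below strips tags and then decodes a small set of entities. All names here (stripTags, decodeBasicEntities, htmlToText, NAMED_ENTITIES) are hypothetical, and the entity table covers only a handful of named entities plus numeric references, so it is not a substitute for a full decoder such as he.

```javascript
// Hypothetical helpers for illustration; only a few named entities are covered.
const NAMED_ENTITIES = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"],
  ["quot", '"'], ["apos", "'"], ["nbsp", "\u00A0"],
]);

function stripTags(input) {
  return String(input).replace(/<\/?[^>]+(>|$)/g, "");
}

function decodeBasicEntities(input) {
  // One pass over the string, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return String(input).replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left exactly as they were written.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

function htmlToText(input) {
  // Tags first, entities second: decoding first would turn "&lt;b&gt;" into
  // "<b>", which the tag pattern would then delete.
  return decodeBasicEntities(stripTags(input));
}

console.log(htmlToText("<p>Hello, World &amp; everyone else!</p>"));
// Hello, World & everyone else!
console.log(htmlToText("<p>Use &lt;b&gt; for bold &#38; &#x3c;i&#x3e; for italics</p>"));
// Use <b> for bold & <i> for italics
```

The order of the two steps is the main design choice: text that was escaped in the source, such as &lt;b&gt;, is meant to be shown literally, so it should survive tag stripping and only then be decoded.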

Handling Nested HTML Tags

Before we conclude, one thing we should probably look at is whether our approach works on nested HTML tags, which can present a bit of a challenge when trying to strip them out. Let's say we have a string like this:

let str = "<div><p>Hello <strong>World</strong></p></div>";

If we were to use a naive approach, we might end up with some unexpected results. But don’t worry, JavaScript’s replace() method, combined with a well-crafted regular expression, can handle this situation quite well. Here's how:

let str = "<div><p>Hello <strong>World</strong></p></div>";
// Each tag is matched and removed on its own, so nesting depth does not matter.
let stripped = str.replace(/<[^>]+>/g, '');
console.log(stripped);

// "Hello World"

Here, the regular expression <[^>]+> matches any sequence that starts with <, followed by one or more characters that are not >, and ends with >. This matches all HTML tags, nested or not, and replaces them with an empty string.
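Nesting is not the only thing that can trip up a tag pattern. One case worth knowing about is a > character inside a quoted attribute value, which makes the simple pattern end the tag too early. The sketch below compares the pattern above with a quote-aware variant. The variable names and the quote-aware pattern are illustrative assumptions, and neither pattern is a real HTML parser, so comments, script contents, and malformed markup are still better handled by a library.

```javascript
// The pattern from the example above.
const simple = /<[^>]+>/g;
// Illustrative variant: a quoted attribute value is consumed as one unit,
// so a ">" inside the quotes does not end the tag early.
const quoteAware = /<(?:[^>"']|"[^"]*"|'[^']*')+>/g;

const nested = "<div><p>Hello <strong>World</strong></p></div>";
const tricky = '<a title="1 > 0">link</a>';

const nestedSimple = nested.replace(simple, "");
const nestedQuoteAware = nested.replace(quoteAware, "");
const trickySimple = tricky.replace(simple, "");
const trickyQuoteAware = tricky.replace(quoteAware, "");

console.log(nestedSimple);                 // Hello World
console.log(nestedQuoteAware);             // Hello World
console.log(JSON.stringify(trickySimple)); // " 0\">link"
console.log(trickyQuoteAware);             // link
```

Both patterns agree on ordinary nested markup. They differ only when an attribute value contains >, where the simple pattern leaves part of the attribute behind in the output.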

Conclusion

In this Byte, we've explored how to strip HTML tags from text using plain JavaScript. We've learned about the replace() method and how to use regular expressions to match HTML tags. We've also covered how to handle nested HTML tags and special characters. While JavaScript provides us with the tools to do this in a fairly straightforward way, always consider the complexity and performance implications of your specific use case.


