TLDR
If you are trying to scrape data from the DOM/HTML and the data is present in an API response but not on the screen, override XMLHttpRequest to access the API response in your code.
Chrome extensions are a great way to build additional features on top of the content the user consumes via the Chrome browser. There are many use cases for this capability, such as providing additional information about the product pages a user visits on e-commerce websites, scraping information from social networks such as LinkedIn and Twitter and analyzing it for insights, or simply forwarding that information to your product for future use.
The first approach to scraping data from a web page is to inject a content script and parse the HTML or traverse the DOM tree using CSS selectors and XPath expressions. This works well for many use cases. However, more and more web applications are highly reactive and built with modern frameworks such as React or Vue, which creates various challenges when scraping sites like Twitter or LinkedIn with traditional HTML or DOM scraping techniques.
Let's take a scenario where the user is viewing a tweet on Twitter. You want to scrape the author details, likes, retweets, and replies. Getting all of this through DOM scraping is often not possible, because much of it is absent from the DOM at that moment: Twitter simply does not render it. If you open the Network tab in DevTools, however, you can see the API calls, whose responses carry a lot of information that Twitter returns internally but never shows on the screen. And yes, it is possible to scrape information from those API calls.
The second approach is to scrape data by listening to the API calls the page makes, and the way to do that is to override XMLHttpRequest. Content scripts run in an isolated world, separate from the page's own JavaScript, so the override has to run inside the page itself: the content script injects a script that replaces the native XMLHttpRequest methods with wrapped versions. Your wrappers get a chance to listen to events on every request and then call the original native methods. This way, you can listen to traffic without disrupting the original behavior of any third-party site.
Step-by-Step Guide to Overriding XMLHttpRequest
Create script.js
- IIFE: The script is wrapped in an immediately invoked function expression (IIFE). It creates a private scope for the code inside, preventing variables from polluting the global scope.
- XHR Prototype Modification: The script first saves references to the original send and open methods of the XMLHttpRequest prototype.
- Override Open Method: The script overrides the open method of XMLHttpRequest. Whenever a request is opened, the override stores the request URL as a property on the XHR object and then calls the original open method.
- Override Send Method: The script overrides the send method of XMLHttpRequest. It adds an event listener for the 'load' event. If the URL contains the specified string ("UserByScreenName"), the listener runs the code that handles the response. After registering the listener, the override calls the original send method.
- Handling the Response: If the URL includes "UserByScreenName", the handler creates a new div element with the ID "__interceptedData", sets its innerText to the intercepted response, and appends it to the document body so that the content script can read it.
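The steps above can be sketched roughly as follows. This is a minimal illustration rather than the original listing: the helper name patchXhr, its parameters, and the _url property are hypothetical, and the patching logic is written as a plain function (instead of an IIFE) so it can be exercised outside a browser. It also uses textContent where the description above uses innerText, and skips responses whose responseType is not text, since reading responseText would throw for those.

```javascript
// Sketch of script.js, which runs in the page's own JavaScript context.
// patchXhr, its parameters, and the _url property are hypothetical names.
function patchXhr(proto, needle, onResponse) {
  // Keep references to the native methods so they can still be called.
  const originalOpen = proto.open;
  const originalSend = proto.send;

  // open(method, url, ...): remember the URL on the instance.
  proto.open = function (method, url, ...rest) {
    this._url = String(url);
    return originalOpen.call(this, method, url, ...rest);
  };

  // send(body): register a "load" listener, then defer to the native send.
  proto.send = function (...args) {
    this.addEventListener("load", () => {
      // responseText is only readable for text responses.
      const isText = this.responseType === "" || this.responseType === "text";
      if (isText && this._url && this._url.includes(needle)) {
        onResponse(this.responseText, this._url);
      }
    });
    return originalSend.apply(this, args);
  };
}

// Only patch when actually running inside a browser page.
if (typeof XMLHttpRequest !== "undefined" && typeof document !== "undefined") {
  patchXhr(XMLHttpRequest.prototype, "UserByScreenName", (text) => {
    // Hand the response to the content script through a DOM element.
    const holder = document.createElement("div");
    holder.id = "__interceptedData";
    holder.style.display = "none"; // assumption: keep the raw JSON off screen
    holder.textContent = text;
    document.body.appendChild(holder);
  });
}
```

Passing the prototype in as an argument keeps the native methods untouched until the final guarded call, which makes the wrapper easy to try against a stub object first.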
Next, let's explore how the content script (contentScript.js) injects script.js into the page and reads the intercepted data:
- Creating a Script Element: The interceptData function creates a new script element, sets its type to "text/javascript", sets its source URL using chrome.runtime.getURL("script.js"), and appends it to the head of the document, which is a common way to inject a script into a web page.
- Checking for DOM Elements: The checkForDOM function checks if the document's body and head elements are present. If they are, it calls the interceptData function. If not, it schedules another call to checkForDOM using requestIdleCallback to ensure the script waits until the necessary DOM elements are available.
- Scraping Data from Profile: The scrapeDataProfile function looks for an element with the ID "__interceptedData." If found, it parses the JSON content of that element and logs it to the console as the API response. If not found, it schedules another call to scrapeDataProfile using requestIdleCallback.
- Initiating the Process: The content script starts by passing checkForDOM and scrapeDataProfile to requestIdleCallback. This ensures that the script first checks for the necessary DOM elements and then scrapes the data once the "__interceptedData" element is available.
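A content script along these lines might look like the sketch below. The function names checkForDOM, interceptData, and scrapeDataProfile follow the description above; readInterceptedData and the doc and scriptUrl parameters are hypothetical additions that keep the DOM access in small functions. One assumption worth stating: Chrome only lets a page load an extension file that is listed under web_accessible_resources in the manifest, so script.js would need an entry there.

```javascript
// Sketch of contentScript.js.
// readInterceptedData and the doc/scriptUrl parameters are hypothetical.

// Inject script.js into the page by appending a <script> element to <head>.
function interceptData(doc, scriptUrl) {
  const script = doc.createElement("script");
  script.type = "text/javascript";
  script.src = scriptUrl;
  doc.head.appendChild(script);
  return script;
}

// Return the parsed API response, or null if the element is not there yet.
function readInterceptedData(doc) {
  const holder = doc.getElementById("__interceptedData");
  if (!holder) return null;
  return JSON.parse(holder.textContent);
}

// Only start polling when running as a content script inside Chrome.
if (typeof chrome !== "undefined" && chrome.runtime && typeof document !== "undefined") {
  const checkForDOM = () => {
    if (document.body && document.head) {
      interceptData(document, chrome.runtime.getURL("script.js"));
    } else {
      requestIdleCallback(checkForDOM); // DOM not ready yet, try again later
    }
  };

  const scrapeDataProfile = () => {
    const data = readInterceptedData(document);
    if (data !== null) {
      console.log("API response:", data);
    } else {
      requestIdleCallback(scrapeDataProfile); // nothing intercepted yet
    }
  };

  requestIdleCallback(checkForDOM);
  requestIdleCallback(scrapeDataProfile);
}
```

The DOM element works as the hand-off point because the page and the content script share the same DOM even though their JavaScript contexts are separate.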
How to add the content script and run_at in manifest.json?
In manifest.json, register the content script under content_scripts. Whenever a content script is injected, Chrome decides when to execute it based on the value of `run_at`, which defaults to `document_idle`.
See the official Chrome documentation on content scripts for details.
Your manifest.json usually looks like this, and it works in most cases:
"content_scripts": [
  {
    "matches": ["<all_urls>"],
    "js": ["contentScript.js"]
  }
]
However, some pages trigger their API calls before Chrome gets the chance to run a content script at `document_idle`, so the override is installed too late and those responses are missed. To work around this, set `run_at` to `document_start`, which runs the content script at the very beginning of page load so the override can be in place before the page starts making API calls:
"content_scripts": [
  {
    "matches": ["https://*.twitter.com/*"],
    "js": ["contentScript.js"],
    "run_at": "document_start"
  }
]
Pros
You can obtain substantial information from the server response, including details that are not shown in the user interface.
Cons
The format of the server response may change over time without notice, which can break code that depends on it.
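One way to soften this is to read fields defensively, so that a renamed or missing field yields a fallback value instead of an exception. The sketch below is only an illustration: getPath is a hypothetical helper, and the sample object is an invented shape, not Twitter's actual response.

```javascript
// getPath walks a dotted path and returns `fallback` when any step is missing.
// Hypothetical helper for illustration.
function getPath(obj, path, fallback = null) {
  let current = obj;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object" || !(key in current)) {
      return fallback;
    }
    current = current[key];
  }
  return current;
}

// Invented response shape, for illustration only.
const sample = { data: { user: { name: "Ada", stats: { followers: 42 } } } };

console.log(getPath(sample, "data.user.stats.followers")); // 42
console.log(getPath(sample, "data.user.profile.bio", "n/a")); // n/a
```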
Pro Tip
You can also replay Twitter's internal API calls to fetch information that the page would not otherwise load for that particular action. For example, you can call the API that fires when you open the popup listing the people who liked a tweet, and so get the details of all those users. Keep this to a minimum, because bot-protection systems detect abnormal volumes of API calls. That is how LinkedIn detects most scrapers; it then either logs the user out or blocks the account permanently.
Conclusion
In conclusion, the right approach depends on your specific use case. Extracting data from the user interface can be challenging because the data is scattered across the page, so listening to API calls and retrieving the data in one structured response is often more straightforward. Many websites use APIs to fetch collections of entities from the backend and then bind them to the UI, which is precisely why intercepting API calls is so useful.