Web Scanner lets users target the body data of a webpage in a number of ways, so security teams can search for similar websites based on hash values, similarity scores, JavaScript data, and dark web content.
Here's a selection of useful data types. Click here for a full list of field names.
SHA-256 body data
Field names
body_analysis.body_sha256
body_analysis.header_sha256
body_analysis.footer_sha256
Explanation
The above fields are part of the HTTP response body analysis, i.e. the SHA-256 hash of whatever is contained in the <body> tag of a webpage.
Matching hashes mean that the content is exactly the same across two or more pages.
This, however, is quite rare and generally only occurs with basic web pages and holding pages, e.g. error pages or directory listings.
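As a rough illustration of why only byte-identical content matches, the sketch below hashes some body bytes with SHA-256 and groups URLs that share a digest. The sample pages and the helper names (body_sha256, group_by_hash) are hypothetical, and how Web Scanner itself extracts the body before hashing is not specified here.

```python
import hashlib
from collections import defaultdict


def body_sha256(body: bytes) -> str:
    """Return the hex SHA-256 digest of raw body bytes."""
    return hashlib.sha256(body).hexdigest()


def group_by_hash(pages: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each digest to the sorted list of URLs that share it."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[body_sha256(body)].append(url)
    # Keep only digests seen on more than one page
    return {h: sorted(urls) for h, urls in groups.items() if len(urls) > 1}


# Hypothetical sample data: two identical holding pages and one
# that differs by a single character
pages = {
    "http://a.example/": b"<h1>403 Forbidden</h1>",
    "http://b.example/": b"<h1>403 Forbidden</h1>",
    "http://c.example/": b"<h1>403 Forbidden!</h1>",
}
matches = group_by_hash(pages)
```

Only the two byte-identical pages end up in the same group; the page with one extra character gets an unrelated digest.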
JavaScript body data
Field names
body_analysis.js_sha256
body_analysis.js_ssdeep
Explanation
The above fields contain a comprehensive list of all referenced JavaScript files, whether through URLs or embedded as script, along with their corresponding SHA-256 and ssdeep hashes.
These hashes are specific to each file version and parameter. For example, a file referenced with different query parameters, such as ?v=1.2 or ?v=2.0, would result in distinct hashes. This is crucial, as even minor differences between file versions can lead to significant variations in hash values.
body_analysis.js_ssdeep, being a fuzzy hash, indicates how similar two files are. The closer two files are to being byte-identical, the more similar their ssdeep hashes will be.
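The sensitivity of SHA-256 to small changes can be seen in a minimal sketch like the one below. The two script bodies are made-up examples that differ only in a version string. ssdeep is not part of the Python standard library, so it is not shown here.

```python
import hashlib

# Hypothetical script contents that differ only in a version string
v1 = b"function track(){send('/collect?v=1.2');}"
v2 = b"function track(){send('/collect?v=2.0');}"

h1 = hashlib.sha256(v1).hexdigest()
h2 = hashlib.sha256(v2).hexdigest()

# Count hex positions where the two digests happen to agree;
# for unrelated digests this is small and carries no meaning
same_positions = sum(a == b for a, b in zip(h1, h2))

print(h1)
print(h2)
print(f"{same_positions} of {len(h1)} hex positions agree")
```

Because the SHA-256 digests share no useful structure, an exact-match search cannot link the two versions. This is the gap a fuzzy hash such as ssdeep is meant to cover.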
Language data
Field names
body_analysis.language
Explanation
The above field serves as an indicator of the intended audience and language of the website. Be wary of body_analysis.language results that don't match the TLD country code.
For example, Chinese-language content on a website with a ".co.uk" TLD.
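A check of this kind could be sketched as follows. The mapping from country-code TLD to expected languages is a deliberately tiny, hypothetical table, and the sketch assumes body_analysis.language holds a two-letter code such as "zh" or "en", which may not match the actual field format.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from country-code TLD to expected language codes
EXPECTED_LANGUAGES = {
    "uk": {"en"},
    "de": {"de"},
    "cn": {"zh"},
    "fr": {"fr"},
}


def language_mismatch(url: str, language: str) -> bool:
    """Return True when the detected language is unexpected for the URL's ccTLD."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    expected = EXPECTED_LANGUAGES.get(tld)
    if expected is None:
        # Generic or unmapped TLD: nothing to compare against
        return False
    return language.lower() not in expected


print(language_mismatch("https://shop.example.co.uk/", "zh"))
print(language_mismatch("https://shop.example.co.uk/", "en"))
```

A mismatch is only a prompt for a closer look, since plenty of legitimate sites publish in languages other than the one their TLD suggests.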
Onion data
Field names
body_analysis.onion
Explanation
The above field contains a list of Tor onion addresses referenced within the HTML body of a webpage. This information can be leveraged to connect with Tor scan data.
For example, a clearnet website promoting Command and Control (C2) or phishing kits might link to its .onion purchasing page.
By identifying these connections, you can conduct further investigation into the hosting IP, domain, and other associated data.
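To illustrate how such references might be pulled out of a page, here is a minimal sketch that matches v3 onion hostnames (56 base32 characters followed by ".onion") with a regular expression. It is an assumption about one way to do this, not a description of how body_analysis.onion is populated, and the address used is a made-up placeholder.

```python
import re

# v3 onion addresses: 56 base32 characters (a-z, 2-7) followed by ".onion"
ONION_RE = re.compile(r"(?<![a-z2-7])[a-z2-7]{56}\.onion\b", re.IGNORECASE)


def extract_onions(html: str) -> list[str]:
    """Return unique, lowercased onion hostnames in order of first appearance."""
    seen: dict[str, None] = {}
    for match in ONION_RE.finditer(html):
        seen.setdefault(match.group(0).lower(), None)
    return list(seen)


# Placeholder address with the right shape; it is not a real service
fake = ("abcdefghijklmnopqrstuvwxyz234567" * 2)[:56]
html = (
    f'<a href="http://{fake}.onion/buy">Buy</a> '
    f'<a href="http://{fake.upper()}.onion/">Mirror</a>'
)
onions = extract_onions(html)
```

Both links resolve to the same lowercased hostname, so the result contains a single entry that can then be looked up in Tor scan data.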
SHV data
Field names
body_analysis.SHV
Explanation
The Script Hash Value (SHV) is a fingerprint generated by alphabetically ordering the list of all script names (excluding parameters) that appear on a webpage. It excels at pinpointing groups of perfect matches.
This method entails a somewhat fuzzy search: because parameters are disregarded, variations such as "jquery-2.1.4.min.js?buildTime=1708035548" and "jquery-2.1.4.min.js?buildTime=999999990" are treated as identical, even though the "buildTime" value differs between the two.
Unlike a SHA-256 search (which allows only exact matches for single files), the SHV's fuzzy nature facilitates the identification of similar groups. While a single-file search using ssdeep is possible, it may not always yield useful results when finding partial hash matches.
For example, if a phishing kit includes commonly used JavaScript files such as jQuery, along with two custom JS files with varying versions denoted by a "?v=" parameter, the SHV fingerprint enables the discovery of these variations.
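The idea can be sketched roughly as below. This is an illustration under stated assumptions, not the actual SHV algorithm: it takes the "script name" to be the final path segment of each reference, joins the sorted names with newlines, and uses SHA-256 as the digest. The function names and sample script lists are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def script_name(src: str) -> str:
    """Return the file name of a script reference, dropping query string and fragment."""
    path = urlsplit(src).path
    return path.rsplit("/", 1)[-1]


def shv_sketch(script_srcs: list[str]) -> str:
    """Fingerprint a page by its alphabetically sorted script names."""
    names = sorted(script_name(src) for src in script_srcs)
    # Assumption: newline-joined names hashed with SHA-256
    return hashlib.sha256("\n".join(names).encode("utf-8")).hexdigest()


# Two hypothetical deployments of the same kit with different parameters
page_a = [
    "jquery-2.1.4.min.js?buildTime=1708035548",
    "/static/kit.js?v=1",
    "/static/form.js?v=1",
]
page_b = [
    "/static/form.js?v=7",
    "jquery-2.1.4.min.js?buildTime=999999990",
    "/static/kit.js?v=2",
]
same_group = shv_sketch(page_a) == shv_sketch(page_b)
```

Because parameters are stripped and the names are sorted before hashing, both pages yield the same fingerprint even though every script reference differs in its query string and the scripts appear in a different order.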
HTML body data
Field names
html_body_similarity
Explanation
The above field holds a numerical value ranging from 0 to 100, representing the similarity between the current scan and the previous scan of a webpage.
For example, a value of 91 implies roughly a 9% difference in data content compared to the previous scan.
This calculation is based on the html_body_ssdeep field.
This metric may not always correlate with the visual similarity of a website.
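As a small worked example of reading the score, the sketch below converts a similarity value into an approximate percentage changed and flags pages that fall below a cutoff. The threshold of 80 and the sample scores are arbitrary, hypothetical values chosen for illustration.

```python
def percent_changed(similarity: int) -> int:
    """Convert a 0-100 similarity score into an approximate percentage changed."""
    if not 0 <= similarity <= 100:
        raise ValueError("similarity must be between 0 and 100")
    return 100 - similarity


def flag_changed(scores: dict[str, int], threshold: int = 80) -> list[str]:
    """Return URLs whose similarity to the previous scan is below the threshold."""
    return [url for url, score in scores.items() if score < threshold]


# Hypothetical html_body_similarity values keyed by URL
scores = {"http://a.example/": 91, "http://b.example/": 40}
changed = flag_changed(scores)
```

Here only the second page would be flagged for review, since its body changed substantially between scans.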