Open Source Intelligence Gathering: Techniques, Automation, and Visualization

Christopher Maddalena
Oct 2, 2018 · 19 min read


One constant throughout my career has been my fascination with what can be learned about an organization from basic public records. The aggregation of a multitude of small pieces of information can paint a picture that is sometimes startling in its completeness. Some of the remaining holes can then be filled in with small logical leaps and inferences.

This always interested me as a defender because I wanted to know what an outsider could learn without ever touching infrastructure or engaging an insider. Now, I am most often looking to use this sort of data to build a foundation of insider knowledge for social engineering or, once inside the network, to better understand the environment in which I am operating.

There are numerous data points to consider, but this article will focus on network targets (e.g. IP addresses, domains, and systems) while lightly touching on collecting personnel information (e.g. email addresses, names, job titles). These are what I look for first when preparing for a new project, and the collected data then informs my decisions about what to dig into next.

The key to managing all of this data is automation. Automating the initial research phases makes the manual research that follows much simpler and easier to organize. Automation and reporting will be discussed at the end, in “Phase 4.” Let’s begin with what to look for first.

Phase 1: Mapping Networks

A target is necessary to begin the discovery process. While it is possible a target might not be easily tracked down online, most organizations have a name and at least one “primary” domain they use for email. That basic information, a name (e.g. Blizzard Entertainment, Inc.) and at least one domain (e.g. blizzard.com), is all that is needed to get started.

The Full Contact marketing database and API is a fine place to begin. It can provide basic information about an organization based on the name and domain. Full Contact tracks “key people” (e.g. executives), social media profiles, approximate employee head counts, and more. These details are good to have to provide some company background and context for the data yet to be collected. This can also be collected from sources like the company’s LinkedIn profile.
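
As a rough illustration, here is a minimal sketch of that lookup in Python. It assumes FullContact’s v3 company enrichment endpoint and a valid API key; the endpoint, authentication style, and response fields may differ by plan and over time, so treat the field names as placeholders and inspect the raw JSON yourself.

import requests

FULLCONTACT_API_KEY = "YOUR_API_KEY"  # assumption: a valid FullContact API key

def company_profile(domain):
    # Assumed v3 company enrichment endpoint; verify against FullContact's current docs.
    response = requests.post(
        "https://api.fullcontact.com/v3/company.enrich",
        headers={"Authorization": f"Bearer {FULLCONTACT_API_KEY}"},
        json={"domain": domain},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    # Field names are best-effort guesses; dump the full JSON to see what your plan returns.
    return {"name": data.get("name"), "employees": data.get("employees"), "linkedin": data.get("linkedin")}

print(company_profile("blizzard.com"))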

Domains

With the background information collected, domains and subdomains are the next stop. Additional domains can be discovered using reverse WHOIS lookups. WhoXY is a solid service for this and offers a reverse WHOIS API endpoint that accepts company names and keywords to perform searches against WHOIS records. These searches can return hundreds of additional domains.

whoxy.com
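
A minimal sketch of that reverse WHOIS lookup, assuming WhoXY’s query-string API and a valid key (verify the parameter and response field names against their current documentation):

import requests

WHOXY_API_KEY = "YOUR_API_KEY"  # assumption: a valid WhoXY API key

def reverse_whois(company_name):
    # Parameter names follow WhoXY's documented query-string API; confirm before relying on them.
    response = requests.get(
        "https://api.whoxy.com/",
        params={"key": WHOXY_API_KEY, "reverse": "whois", "company": company_name},
        timeout=30,
    )
    response.raise_for_status()
    results = response.json()
    # Each search result record should include the matched domain name.
    return [record.get("domain_name") for record in results.get("search_result", [])]

for domain in reverse_whois("Blizzard Entertainment, Inc."):
    print(domain)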

At this point it is best to pause and consider these new domains. Some organizations will purchase variations of their domain names to avoid issues with typosquatters. These domains stay “parked” or just forward visitors to the primary domain and company website. Pick through the results and select a few domains that look interesting to carry forward and then continue on to the next step, subdomain discovery.

Keep in mind that if an individual’s name is on the domain, e.g. Henry Dorsett, and that name is common enough, reverse WHOIS lookups may return hundreds or even thousands of unrelated results. For this reason do not rush ahead and blindly include every reverse WHOIS result.

Furthermore, WhoXY searches return only exact matches. If the query looks for domains tied to “Blizzard Entertainment Inc” the results will not include any domains attached to “Blizzard Entertainment, Inc” or even “Blizzard Entertainment Inc.” (with a period at the end). The good news is companies tend to stick with one variation of their name for their domain registration records, so if one name is pulled from a WHOIS record that name is a safe bet for reverse WHOIS searches.

Subdomains

censys.io

There are many tools available that will perform subdomain discovery and brute forcing, such as Aquatone and Sublist3r. Brute forcing can reveal subdomains that might never be found otherwise, but it forces you to contend with wildcard DNS and is not necessary in these early stages. DNS Dumpster and Netcraft are likely to have a good number of catalogued subdomains for the target domain(s). Also, TLS certificates pulled from crt.sh or censys.io will usually reveal additional subdomains that services like DNS Dumpster and Netcraft have not yet catalogued. Specifically, the subdomains can be pulled from a certificate’s alternate names.

For example, a Censys certificate search for “blizzard.com” returns certificates whose alternate names include a long list of subdomains.

Censys.io will parse the names from a certificate and provide them as an easy-to-digest list, both in the web UI and via the API.

Certificates tend to yield the most subdomains and searching for them is fast. However, a search for a domain like “blizzard.com” on censys.io will yield some unrelated results, like iran-blizzard.com, i.e. any domain containing the query string. Domains like that could be related to the target company, but it is more likely that many of these sorts of results will be unrelated and will only pollute the dataset.

Searching for “.blizzard.com” or using regular expressions will not work with Censys, but it is possible to search specific fields. A search for parsed.names: blizzard.com will limit the results to only certificates issued for a subdomain of “blizzard.com.”
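
crt.sh exposes the same certificate transparency data as JSON, which makes it easy to pull alternate names without scraping. A minimal sketch (the %. prefix limits results to subdomains, much like the parsed.names field search above):

import requests

def crtsh_subdomains(domain):
    # crt.sh returns a JSON array of certificates; name_value may hold several names per entry.
    response = requests.get(
        "https://crt.sh/",
        params={"q": f"%.{domain}", "output": "json"},
        timeout=60,
    )
    response.raise_for_status()
    names = set()
    for cert in response.json():
        for name in cert.get("name_value", "").splitlines():
            names.add(name.lstrip("*.").lower())
    return sorted(names)

print(crtsh_subdomains("blizzard.com"))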

Additionally, the certificate transparency logs may offer more subdomains and can be searched using the Google Transparency Report tool:

https://transparencyreport.google.com/https/certificates

DNS Records & IP Addresses

This big list of domains and subdomains needs to be resolved to IP addresses. This is easily done with Python sockets (or with Go, Ruby, etc.) and by checking DNS records. Some of the domains will not resolve for one reason or another and that is fine. Retired subdomains and those that come and go (like those that might point to a cloud asset that goes up and down) can still be useful, but more on that in a moment.

The DNS records are all useful in different ways. The A records provide IP addresses and the other records provide some situationally interesting information. Again, this DNS resolution step is easy to script with Python and other languages. For manual checks, dnsstuff.com is convenient for quick DNS record and domain ownership checks.
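
A minimal sketch of the resolution step using only the standard library; swap in a resolver library like dnspython when the other record types discussed below are needed:

import socket

subdomains = ["www.blizzard.com", "us.battle.net", "retired.example.com"]  # example input list

resolved = {}
for name in subdomains:
    try:
        resolved[name] = socket.gethostbyname(name)
    except socket.gaierror:
        # Keep the names that fail to resolve; retired or intermittent hosts are still interesting.
        resolved[name] = None

for name, address in resolved.items():
    print(f"{name} -> {address or 'did not resolve'}")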

DNS Records: MX and TXT

The DMARC and SPF records, or lack thereof, will help determine if email spoofing is in the cards for any social engineering campaigns. Many organizations do not set up DMARC, and those that do often use weak SPF records and/or weak DMARC policies (e.g. p=none). This ultimately means the organization’s email protections fail open and do nothing to prevent spoofed emails. This article does a good job of covering some of the pitfalls of these email security settings.

Nearly all Mail Transport Agents, including the ones used by Gmail and Microsoft Exchange Server, default to relying on DMARC for direction on what action to take if an email fails SPF or DKIM. If the sending domain has no DMARC record or a record with a policy of none, the mail server fails open and delivers the email.

This means that if a domain does not have an SPF, DKIM, and strict DMARC record, it can be spoofed.

—Alex DeFreese for Bishop Fox, link above

DKIMValidator.com is a handy utility for analysis of SPF and DKIM records. If it looks like email spoofing is a possibility, an email spoofed to a dkimvalidator.com address will reveal the SpamAssassin score and whether or not it passed SPF checks.
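
The records themselves are just TXT lookups, so a quick programmatic check is easy as well. A sketch using dnspython:

import dns.resolver

def get_txt(name):
    # Return TXT record strings for a name, or an empty list if none exist.
    try:
        return [rdata.to_text().strip('"') for rdata in dns.resolver.resolve(name, "TXT")]
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return []

domain = "blizzard.com"
spf = [txt for txt in get_txt(domain) if txt.startswith("v=spf1")]
dmarc = [txt for txt in get_txt(f"_dmarc.{domain}") if txt.startswith("v=DMARC1")]

print("SPF:", spf or "none published")
print("DMARC:", dmarc or "none published")
# A missing DMARC record or a p=none policy suggests spoofing is worth testing.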

Author’s Note: This may sound like a small thing, but it can be a severe issue when misunderstood. Once, a client I was working with had to respond to a nasty phishing incident. The attacker was, very convincingly, spoofing their email addresses to employees and other organizations. This simple check for DMARC and SPF records helped them understand what had happened. They thought SPF and vendor-provided email security solutions had spoofing on lockdown, so they moved to the next logical assumption: that the accounts had been compromised. However, they had never set up a DMARC record. Spoofing is a deceptively difficult problem for many organizations because email security is so frequently misunderstood and so many exceptions are made for marketing, PR, automated alert emails, and other situations where spoofed email is used legitimately.

Additionally, name servers may be vulnerable to DNS cache snooping and MX and TXT records can reveal services used by the organization (e.g. Proofpoint, Survey Monkey). These are old tricks, but can still yield some interesting information.

DNS Records: CNAMEs

This is also the time to look for Content Delivery Networks (CDNs) and cloud services mentioned in the DNS records. These records will reveal if a domain is pointed at an asset like an S3 bucket used for web hosting. Also, some of the subdomains may be usable for domain fronting or vulnerable to subdomain takeover (e.g. a dangling DNS record for a deleted S3 bucket).
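
A sketch of flagging CNAMEs that point at cloud providers; the marker strings and subdomains below are only illustrative:

import dns.resolver

# Illustrative provider markers only; extend the tuple with whatever services matter to you.
CLOUD_MARKERS = ("s3.amazonaws.com", "cloudfront.net", "azurewebsites.net",
                 "blob.core.windows.net", "herokuapp.com", "github.io")

def cname_targets(name):
    try:
        return [rdata.target.to_text().rstrip(".") for rdata in dns.resolver.resolve(name, "CNAME")]
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return []

for subdomain in ["assets.example.com", "files.example.com"]:  # hypothetical subdomains
    for target in cname_targets(subdomain):
        if any(marker in target for marker in CLOUD_MARKERS):
            # Worth a manual look: possible CDN, fronting candidate, or dangling record.
            print(f"{subdomain} -> {target}")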

If these are new ideas, exploring these items will be left as homework for the reader. This is a good resource to get started. Also, flAWS.cloud is an excellent resource for learning to detect and abuse many common AWS misconfigurations, which also translate to other cloud services (e.g. Google, Azure).

This article is dedicated to a review of cloud service providers, storage, and servers and explores some of the above topics in more detail.

Additional Network Information

Finally, RDAP and Shodan can fill in some of the gaps in the information collected for all of these IP addresses and domain names.

RDAP can provide some useful information for each of the IP addresses, such as the owner and network block. On its own, knowing one IP address belongs to Amazon is not all that interesting, but knowing 65% of a target’s IP addresses are owned by Amazon begins to suggest they make good use of Amazon Web Services. It might also be an indication of which assets are owned by the organization and which are leased / externally hosted. This is where some logical leaps and inference may begin to come into play.
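
RDAP responses are plain JSON, and the rdap.org bootstrap service will redirect a query to the appropriate regional registry, so tallying network owners is straightforward. A sketch (the field names follow the RDAP specification, but inspect a real response for your targets; the addresses below are placeholders):

from collections import Counter

import requests

def rdap_network_name(address):
    # rdap.org redirects to the registry (ARIN, RIPE, etc.) responsible for the address.
    response = requests.get(f"https://rdap.org/ip/{address}", timeout=30)
    if response.status_code != 200:
        return None
    data = response.json()
    return data.get("name") or data.get("handle")

addresses = ["192.0.2.10", "198.51.100.25"]  # placeholder IPs from the resolution step
owners = Counter(rdap_network_name(address) for address in addresses)
for owner, count in owners.most_common():
    print(f"{owner}: {count} address(es)")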

shodan.io

Last, but certainly not least, Shodan may provide a few more details for the IP addresses and domain names. Shodan can offer up information like hostnames, operating systems, open ports, and service banner data, all without touching any infrastructure. Shodan can also reveal additional hosts and domain names using keyword searches with discovered network blocks and domains (e.g. hostname:foo.bar and net:1.1.1.0/24).
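
The official shodan Python library covers both the host lookup and the keyword searches. A minimal sketch, assuming a valid API key (the IP address is a placeholder):

import shodan

api = shodan.Shodan("YOUR_API_KEY")  # assumption: a valid Shodan API key

# Passive host lookup: organization, hostnames, open ports, and banners without touching the target.
host = api.host("192.0.2.10")  # placeholder IP from the resolution step
print(host.get("org"), host.get("hostnames"), host.get("ports"))

# Keyword searches to surface additional assets tied to a hostname or network block.
for match in api.search("hostname:blizzard.com")["matches"]:
    print(match["ip_str"], match["port"], match.get("org"))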

All of these steps will create a large collection of information that any human would have a difficult time sorting through in a sane manner. It is best to set this data aside at this point until the reporting phase.

Phase 2: Discovering Contacts

hunter.io

This phase touches on personnel at the target organization. Now that some additional domains may be known, search engines (e.g. Google, Yahoo, Bing) can be used to hunt for email addresses associated with each of the domains the organization uses for their business. Email Hunter’s API, as the name suggests, can also be used to find email addresses for a domain. It is intended for sales people to find contacts and sales leads for a prospective customer, but anyone can use it and collect the email addresses. Sometimes Hunter has names, job titles, and phone numbers as well.
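
A minimal sketch of the Hunter domain search, assuming an API key; the endpoint and field names below are from Hunter’s v2 API:

import requests

HUNTER_API_KEY = "YOUR_API_KEY"  # assumption: a valid hunter.io API key

def hunter_domain_search(domain):
    response = requests.get(
        "https://api.hunter.io/v2/domain-search",
        params={"domain": domain, "api_key": HUNTER_API_KEY},
        timeout=30,
    )
    response.raise_for_status()
    for entry in response.json()["data"]["emails"]:
        # Names, positions, and phone numbers are present only when Hunter knows them.
        yield entry["value"], entry.get("first_name"), entry.get("last_name"), entry.get("position")

for email, first, last, title in hunter_domain_search("blizzard.com"):
    print(email, first, last, title)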

Beyond an Email Address

Email addresses open up opportunities for phishing and password spraying, but can be taken a step further. By checking the email addresses against a service like Troy Hunt’s HaveIBeenPwned, or a private database of security breaches and dumped passwords, employees can be matched-up with services they have used in the past.

Like most of the data collected so far, on its own this data is not terribly interesting. However, it can indicate how long each employee has been with the company (assuming they have not left by this point) and might even hint at what they do there and the sorts of services the organization uses internally. Of course, it also means their old passwords might be available and, possibly, reused for a business account.

Additionally, hunting through paste sites (e.g. pastebin, ghostbin, slexy) looking for the email addresses can yield some especially juicy information. HaveIBeenPwned also has a pastes API for quickly searching pre-indexed pastes linked to an email address. Some pastes may lead to dead ends, but not always, and sometimes these pastes contain passwords, answers to security questions, and other information. This is a boon of intelligence data, but also an immediate finding and something definitely worth noting in a report. If a paste has been removed, it is worth checking Google’s web cache and the Wayback Machine for cached versions.
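
Both the breach and paste lookups are simple GET requests against HaveIBeenPwned. Note that the current v3 API requires an API key and enforces a rate limit (the v2 API was open when this article was written). A sketch, with a placeholder email address:

import time

import requests

HIBP_API_KEY = "YOUR_API_KEY"  # assumption: a HaveIBeenPwned v3 API key
HEADERS = {"hibp-api-key": HIBP_API_KEY, "user-agent": "osint-research-script"}

def hibp_lookup(endpoint, email):
    # endpoint is "breachedaccount" or "pasteaccount"; a 404 just means no results.
    response = requests.get(
        f"https://haveibeenpwned.com/api/v3/{endpoint}/{email}",
        headers=HEADERS,
        params={"truncateResponse": "false"},
        timeout=30,
    )
    if response.status_code == 404:
        return []
    response.raise_for_status()
    return response.json()

for email in ["hdorsett@example.com"]:  # hypothetical address from the harvesting step
    breaches = hibp_lookup("breachedaccount", email)
    pastes = hibp_lookup("pasteaccount", email)
    print(email, [b["Name"] for b in breaches], [p.get("Source") for p in pastes])
    time.sleep(2)  # stay under the API's rate limit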

Author’s Note: Pastes may also reveal an email address was part of a particularly “sensitive” breach not listed on a site like HaveIBeenPwned, such as the Ashley Madison breach. This is interesting because it means the email address has been used for non-company business and accounts, but reporting a password came from such a breach is problematic. Use good judgement before blindly treating all pastes as equals in a client-facing deliverable.

Social Media Profiles

It is generally a good plan to take it easy with social media at this point in the intelligence gathering process. Wrangling what might be dozens or hundreds of social media profiles is a bit much for early reconnaissance. However, it is not difficult to pick up a few leads from LinkedIn and Twitter while discovering email addresses. These can be scraped from search engine results using many of the same tricks as email addresses.

Some basic Google searches like site:linkedin.com COMPANY will return LinkedIn profiles that might also contain email addresses, job titles, and interesting tidbits of information (e.g. “I manage the company’s deployment of Cylance Protect” or “I’m a Splunk administrator”). This is not an exact science and the searches will yield dead-end links (i.e. profiles for people who have left the company but mention it in their job history), but it can help harvest some names and information to get you started. Also, Email Hunter will return LinkedIn profile links, supposedly pre-verified, if it knows of any.

Twitter handles can also be a great source of intelligence, and the Twitter API can help verify the profiles. Just like LinkedIn profiles and email addresses, these handles will come back from searches against twitter.com.

Start by harvesting some potential Twitter handles and then use the Twitter API to verify the profile still exists and collect information like follower count, location, biography, and real name. While much of this data is provided by the user and might be missing, incomplete, or untrue, it does make it easy to quickly skim names and biographies for anything that looks promising as a good starting point for a more serious hunt later. Assuming this returns a handful of legitimate accounts, some basic link analysis is likely to reveal more accounts later.
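
A sketch of the verification step with tweepy (4.x) and the v1.1 user lookup, assuming you already have Twitter API credentials; the handles below are placeholders:

import tweepy

# Assumption: API credentials for a Twitter developer app you control.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

candidate_handles = ["Warcraft", "BlizzardCS"]  # placeholder handles scraped from search results

for handle in candidate_handles:
    try:
        user = api.get_user(screen_name=handle)
    except tweepy.TweepyException:
        continue  # the handle no longer exists, is suspended, or is otherwise unavailable
    print(user.screen_name, user.name, user.followers_count, user.location, user.description)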

A Note on Social Media

One reason social media is best saved for later is it requires some thought and careful analysis. When searching for a real person, it is not too difficult to determine who someone is and what their interests are, assuming they have an online presence. While not always the case, their true self is often reflected on social media.

However, research may lead to an entity that is more of a public or professional persona or, perhaps, an entirely fabricated identity. The social media profile does not reflect the person or people behind the account, so the information cannot be taken at face value.

A CEO may have a highly curated persona on Twitter and LinkedIn, which makes it difficult to learn much about the person behind the title and profiles. Manual analysis may lead to the CEO’s personal assistant who could have a more “honest” public presence on Twitter or Facebook.

Sometimes one or two steps removed is better than the higher value target. They are more likely to be accessible, less likely to be carefully monitored, and may offer more convenient access to the high value target.

Phase 3: The Cloud

By now most of the low hanging OSINT fruit has been picked, but there are a couple more basic searches to round out the available data.

Digging Through Files

Many corporate websites have a hoard of files sitting below their domain(s). These files may have accumulated over years and include everything from Office documents to PDFs and other miscellaneous files. Basic Google searches, like site:company.com filetype:pdf, will reveal them. These documents can be automatically downloaded and dissected for metadata, which might include software information (e.g. Office 2013) or usernames.
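
A sketch of the metadata step for PDFs using PyPDF2 (or its successor, pypdf); exiftool and similar libraries cover Office documents and other formats:

from pathlib import Path

from PyPDF2 import PdfReader

# Assumption: the documents have already been downloaded into ./downloads
for pdf_path in Path("downloads").glob("*.pdf"):
    metadata = PdfReader(str(pdf_path)).metadata
    if metadata:
        # Author, Creator, and Producer frequently leak usernames and software versions.
        print(pdf_path.name, metadata.author, metadata.creator, metadata.producer)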

If someone has been careless with the site’s web root, it is not unheard of to get hits for other file extensions, like .key or .cert for more sensitive files. There is also the possibility someone uploaded documents intended for a narrower audience without realizing anyone can download them. If a search engine has them indexed, they can be found.

Hunting for Buckets

Speaking of documents not meant for the internet, Amazon S3 buckets have become notorious for this. Bucket hunting is hot right now, but do not neglect Digital Ocean’s Spaces. Digital Ocean launched Spaces as its own S3-like service and, conveniently, deferred to the industry standard, the S3 bucket, when designing it. In other words, Spaces operate exactly like buckets, and tools made for hunting buckets will work for Spaces if they are pointed at Digital Ocean.

The existence of a bucket can be checked with a web request. A web request is made to Amazon or Digital Ocean (e.g. https://fubar.s3.amazonaws.com/ or https://fubar.nyc3.digitaloceanspaces.com/) and the service returns some XML that indicates if the bucket exists. If it exists, the XML will indicate if any data is publicly accessible. That is the sum of it. Hunting for these becomes just a matter of using a wordlist to create new web requests.
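
A sketch of that check; the HTTP status code is usually enough to tell the three cases apart:

import requests

def check_bucket(name, endpoint="s3.amazonaws.com"):
    # Anonymous existence check for an S3 bucket or Digital Ocean Space.
    response = requests.get(f"https://{name}.{endpoint}/", timeout=10)
    if response.status_code == 404:
        return "does not exist"
    if response.status_code == 403:
        return "exists, listing denied"
    if response.status_code == 200:
        return "exists and is publicly listable"
    return f"unexpected status {response.status_code}"

for name in ["blizzard", "blizzard-backup"]:  # candidate keywords from the wordlist
    print("S3:", name, check_bucket(name))
    print("Spaces:", name, check_bucket(name, "nyc3.digitaloceanspaces.com"))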

Note: Web requests work fine for Spaces, but may miss some S3 buckets. It is better to use Amazon’s awscli or the boto3 Python library (the same SDK awscli is built on) for checking buckets. These tools are authenticated with an Amazon account, and some buckets may deny anonymous access from a browser while allowing “authenticated users” to see some of their contents. This is discussed in more depth in this article.
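
With boto3 and any AWS account configured locally, head_bucket performs the same existence and permission check as an “authenticated user.” A sketch:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")  # uses whatever AWS credentials are configured locally

def check_bucket_authenticated(name):
    try:
        s3.head_bucket(Bucket=name)
        return "exists and is accessible to this account"
    except ClientError as error:
        code = error.response["Error"]["Code"]
        if code == "404":
            return "does not exist"
        if code == "403":
            return "exists, access denied"
        return f"unexpected error {code}"

print(check_bucket_authenticated("blizzard-backup"))  # hypothetical candidate name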

Since the goal is to target a specific organization, the wordlist should be related to the company. At a minimum, try to include the company’s name, any acronyms or abbreviations they use, alternate names they might have, subsidiaries, and their NASDAQ listing (if they have one).

The wordlist can and should be expanded if the company makes use of other terms related to their business. For example, Blizzard Entertainment is well known for naming teams with numbers (e.g. Team 1, Team 2, Team 3), with each number tied to one of their games. They also like to use codenames that are often pulled from their Warcraft and StarCraft lore. If the target were Blizzard, it would make sense to add team1, team2, arthas, townportal, and other Blizzard-related terms to the list.

Nuclear Launch Detected: Improve wordlists for better targeting and a higher chance of finding something interesting.

Bucket names have to be globally unique, so it is better to use different variations of the keywords. One simple option is to use various prefixes and suffixes, or “fixes.” Some common fixes are qa, doc, legacy, uat, and bak. These can be added to the beginning and end of the keywords to check for common variations on bucket names. For example, “tychus” and some fixes are combined to create several new keywords like “qa-tychus” and “tychuslegacy.”

It is worth noting bucket names can include periods in addition to hyphens, so even “blizzard.com” is a valid bucket name. In fact, resources or webpages that are hosted in an S3 bucket will have bucket names that resolve to something like hearthstone.blizzard.com.s3.amazonaws.com.

Also, some companies may add garbage to a bucket name to make it harder to discover, like tychus-79a9ba8b089e4c022c32b964cacf6b13f2aa6d9a (the shasum of tychus). They are not undiscoverable, but definitely more difficult and something to consider for later if a more intensive bucket hunt is conducted for the target.

This wordlist approach is meant to capture low hanging fruit that could lead to some leaked information. To provide an example of what can be found, this process once identified an “internal” git repository used by the organization’s developers. The bucket was full of passwords, company source code, and other sensitive information. It was made public because the company had mistakenly made it accessible to “Any Authenticated AWS User,” thinking this meant their authenticated AWS users, not any AWS user. Amazon has improved the web console UI to add warnings and make it more difficult to make this mistake, but it still happens.

Once the wordlist and fixes list are ready, smash them together and commence the hunt.
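
A sketch of the “smashing” step, using the keywords and fixes discussed above:

keywords = ["blizzard", "blizzard-entertainment", "team1", "tychus"]  # company-specific terms
fixes = ["qa", "doc", "legacy", "uat", "bak"]                          # common prefixes/suffixes

candidates = set(keywords)
for word in keywords:
    for fix in fixes:
        candidates.update({
            f"{fix}-{word}", f"{fix}{word}",  # prefixed variations
            f"{word}-{fix}", f"{word}{fix}",  # suffixed variations
        })

with open("bucket_candidates.txt", "w") as wordlist:
    wordlist.write("\n".join(sorted(candidates)))

print(f"{len(candidates)} candidate names generated")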

Phase 4: Reporting & Automating the Process

All of this would be very tedious to do by hand for every project. If you are a defender or bug bounty hunter looking to perform continuous asset discovery, this is really not something you want to do repeatedly or with individual tools. That is why numerous people have tackled automating all of this. Some notable tools are Recon-ng and Discover Scripts. I took a shot at automating everything laid out above in a tool I named ODIN.

Reinventing the Wheel?

I had a need that was not fulfilled by the current tooling available at the time. I did not / do not want to have to run several modules to get all of the data or rely on external tools being installed. My goal with ODIN was, and still remains, to create a tool that could be run on Windows, MacOS, or Linux with just Python 3. Not only that, I wanted the tool to automate basic analysis, i.e. connect some of the dots by doing things like checking for certain strings in DNS records. ODIN accomplishes this and enables an analyst to do a lot more with the data while requiring them to do less work to get it.

ODIN’s Reporting & Organization

ODIN stores all of the data it collects in a SQLite3 database that is saved for later analysis. Optionally, a multi-page HTML report is built from this data to make browsing the information as simple as opening a report in a web browser. This is good for casual reviews and references, but using this data to visualize the external perimeter can be eye opening.

Enter Neo4j

A basic schema developed for drawing relationships between all of the various entities and assets discovered during this OSINT gathering process.

I developed a simple Neo4j graph database schema for the external assets one might encounter while collecting the data outlined above. Once ODIN converts the SQLite3 database to a graph database, it is possible to create a map of the external perimeter. This is a very basic example:

A graph of a small network branching off of one root domain.

Most of the node types are represented in this graph. ODIN was run in a small lab environment for this example, so the IP addresses are internal addresses and do not have any Shodan data (i.e. open ports).

You have an organization (blue) tied to a domain (purple) and the subdomains (green) for that domain. The certificates (red) and their relationships to the domain and subdomains show which nodes share certificates. You can also see the IP addresses (yellow) to which the subdomains resolve.

Full, unfiltered graphs of large organizations can be quite intense, with many-to-one relationships, numerous IP addresses, and multiple domains.

Additional Analysis

The graphs are great for a visual analysis and picking out interesting assets, but it is not all about the graphs. Cypher queries make it quick and easy to grab statistics and create tables. For example:

MATCH (p:Port) RETURN DISTINCT p.Organization

The above query matches open ports from Shodan and returns a list of the organizations. In other words, it makes a table showing the organizations, e.g. Cloudflare or Amazon.com, in the database. It is a quick way to get an idea of the network providers the organization uses.
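
The same queries can be run from Python with the official neo4j driver, which makes it easy to fold these statistics into a report. A sketch (the connection details are placeholders for a local Neo4j instance holding the ODIN graph):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = "MATCH (p:Port) RETURN DISTINCT p.Organization AS org"

with driver.session() as session:
    for record in session.run(query):
        print(record["org"])

driver.close()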

This query will map the network while excluding subdomains that never resolved to an IP address:

MATCH (org:Organization)-[r1:OWNS]->(dom:Domain)-[:RESOLVES_TO]->(add:IP)
MATCH (sub:Subdomain)-[r2:SUBDOMAIN_OF|:RESOLVES_TO]->(n)
MATCH (add)-[r3:HAS_PORT]->(p:Port)
RETURN org,dom,sub,add,p,n,r1,r2,r3

This query first matches only the Organization, Domain, and IP nodes that have :OWNS and :RESOLVES_TO relationships. It then matches the Subdomains that have :SUBDOMAIN_OF or :RESOLVES_TO relationships with any node. Finally, it matches any Port nodes with a :HAS_PORT relationship with one of the matched IP nodes.

Conclusion

The process detailed above does not come close to collecting every piece of information one might gather about an organization. Intelligence gathering can become complex once you start dealing with very large organizations or organizations with disparate groups within them. However, you must start somewhere, and this process has served me well.

The data that is collected has value to attackers, and I argue that alone makes it valuable to defenders. Forewarned is forearmed: knowing what the available data suggests to, or outright tells, an outsider about your organization can greatly inform internal security awareness programs and help identify weak points. At the very least, this process has helped me identify assets an organization believed to be decommissioned, firewalled off from the internet, or shut down. It can be an easy win that catches a potential weak spot, like a forgotten server, before it becomes a problem.

Of course, OSINT is an organic process and will typically continue beyond the basic phases detailed here. For example, searching GitHub for hostnames, passwords, and secrets is often a worthwhile endeavor. Those steps are worthy of their own posts. For now, the process outlined here will help reveal more about an organization, discover assets, and help guide efforts going forward and throughout an operation.

By using ODIN to automate this process, you can transform a name and a domain into much more in as little as 10 minutes. ODIN runs multiple tasks in parallel with multiprocessing, so it does not take long at all. If its capabilities have piqued your interest, give it a try. The project is open source and open to feedback. I encourage defenders to leverage ODIN, or other tools and manual analysis, to visualize their external network, monitor their assets, and keep tabs on the public data available about their organization and employees.
