docs/metasploit-framework.wiki/How-to-parse-an-HTTP-response.md

This document explains how to parse an HTTP response body in the cleanest way possible.

## Getting a response

To get a response, you can either use [[Rex::Proto::Http::Client|How to send an HTTP request using Rex Proto Http Client]], or the [[HttpClient|How to Send an HTTP Request Using HttpClient]] mixin to make an HTTP request. If you are writing a module, you should use the mixin.

The following is an example of using the #send_request_cgi method from HttpClient:

```ruby
res = send_request_cgi({'uri'=>'/index.php'})
```

The return value for ```res``` is a Rex::Proto::Http::Response object, but it can also be ```nil``` if the connection fails or the response times out, so check for that before calling any method on it.
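
Because ```res``` may be ```nil```, it helps to guard before touching the body. The following is a minimal sketch of such a guard. ```FakeResponse``` and ```extract_body``` are hypothetical names, used only so the snippet runs outside of a module, and the status-code check is just an example policy:

```ruby
# Hypothetical stand-in for Rex::Proto::Http::Response, so this sketch
# can run outside of Metasploit. A real module would use the object
# returned by send_request_cgi instead.
FakeResponse = Struct.new(:code, :body)

# Returns the body only when a response arrived with the expected code.
def extract_body(res, expected_code = 200)
  return nil if res.nil?                      # timeout or connection failure
  return nil unless res.code == expected_code # unexpected status

  res.body
end

ok      = extract_body(FakeResponse.new(200, '<html></html>'))
missing = extract_body(nil)
denied  = extract_body(FakeResponse.new(403, 'Forbidden'))
```

In a module, the same early-return pattern keeps the parsing code below free of repeated nil checks.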

## Getting the response body

With a Rex::Proto::Http::Response object, here's how you can retrieve the HTTP body:

```ruby
data = res.body
```

If you want the raw HTTP response (including the response message/code, headers, body, etc.), then you can simply do:

```ruby
raw_res = res.to_s
```

However, this document focuses only on ```res.body```.

## Choosing the right parser

Format | Parser
------ | ------
HTML   | Nokogiri
XML    | Nokogiri
JSON   | JSON

If the format you need to parse isn't on the list, then fall back to ```res.body```.

## Parsing HTML with Nokogiri

When you have a Rex::Proto::Http::Response with HTML in it, the method to call is:

```ruby
html = res.get_html_document
```

This will give you a Nokogiri::HTML::Document, which allows you to use the Nokogiri API.

There are two common methods in Nokogiri to find elements: #at and #search. The main difference is that #at returns only the first result (or ```nil``` if nothing matches), while #search returns all results as a Nokogiri::XML::NodeSet, which behaves much like an array.

Consider the following example as your HTML response:

```html
<html>
<head>
	<title>Hello, World!</title>
</head>
<body>
	<div class="greetings">
		<div id="english">Hello</div>
		<div id="spanish">Hola</div>
		<div id="french">Bonjour</div>
	</div>
</body>
</html>
```

**Basic usage of #at**

If the #at method is used to find a DIV element:

```ruby
html = res.get_html_document
greeting = html.at('div')
```

Then the ```greeting``` variable should be a Nokogiri::XML::Element object that gives us this block of HTML (again, because the #at method only returns the first result):

```html
<div class="greetings">
<div id="english">Hello</div>
<div id="spanish">Hola</div>
<div id="french">Bonjour</div>
</div>
```

**Grabbing an element from a specific element tree**

```ruby
html = res.get_html_document
greeting = html.at('div//div')
```

Then the ```greeting``` variable should give us this block of HTML:

```html
<div id="english">Hello</div>
```

**Grabbing an element with a specific attribute**

Let's say I don't want the English Hello, I want the Spanish one. Then we can do:

```ruby
html = res.get_html_document
greeting = html.at('div[@id="spanish"]')
```

**Grabbing an element with a specific text**

Let's say I only know there's a DIV element that says "Bonjour", and I want to grab it, then I can do:

```ruby
html = res.get_html_document
greeting = html.at('//div[contains(text(), "Bonjour")]')
```

Or let's say I don't know what element the word "Bonjour" is in, then I can be a little vague about this:

```ruby
html = res.get_html_document
greeting = html.at('[text()*="Bonjour"]')
```

**Basic usage of #search**

The #search method returns a Nokogiri::XML::NodeSet, which you can iterate over like an array of elements. Let's say we want to find all the DIV elements, then here's how:

```ruby
html = res.get_html_document
divs = html.search('div')
```

**Accessing text**

When you have an element, you can always call the #text method to grab the text. For example:

```ruby
html = res.get_html_document
greeting = html.at('[text()*="Bonjour"]')
print_status(greeting.text)
```

The #text method can also be used as a trick to strip all the HTML tags:

```ruby
html = res.get_html_document
print_line(html.text)
```

The above prints the page text with the tags stripped. Shown as an escaped Ruby string, it looks roughly like this:

```
"\n\nHello, World!\n\n\n\nHello\nHola\nBonjour\n\n\n" 
```

If you actually want to keep the HTML tags, then instead of calling #text, call #inner_html.

**Accessing attributes**

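With an element, simply call #attributes to get a hash of attribute names mapped to Nokogiri::XML::Attr objects. To read a single attribute value, you can also index the element directly, for example ```element['id']```.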

**Walking a DOM tree**

Use the #next method to move on to the next sibling node. Note that this can be a text node, such as whitespace between tags.

Use the #previous method to roll back to the previous sibling node.

Use the #parent method to find the parent element.

Use the #children method to get all the child elements.

Use the #traverse method for complex parsing.

## Parsing XML

To get the XML body from Rex::Proto::Http::Response, do:

```ruby
xml = res.get_xml_document
```

This returns a Nokogiri::XML::Document, so #at, #search, and the other methods described for HTML work in much the same way.

## Parsing JSON

To get the JSON body from Rex::Proto::Http::Response, do:

```ruby
json = res.get_json_document
```
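
Assuming the body parses as a JSON object, the result can be used like an ordinary Ruby Hash. The sketch below calls the standard ```json``` library on a string so that it runs outside of a module. ```parse_json_body``` is a hypothetical helper, and the payload is made up for illustration:

```ruby
require 'json'

# Hypothetical helper: parse a JSON body, returning an empty Hash when
# the text is not valid JSON.
def parse_json_body(body)
  JSON.parse(body.to_s)
rescue JSON::ParserError
  {}
end

data = parse_json_body('{"version": "1.2.3", "users": [{"name": "admin"}]}')

version    = data['version']
first_user = data.dig('users', 0, 'name')  # walk nested hashes and arrays

# A non-JSON body (for example an HTML error page) yields an empty Hash.
broken = parse_json_body('<html>not json</html>')
```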

## References

* <https://nokogiri.org/tutorials/parsing_an_html_xml_document.html>
* [[How to send an HTTP request using Rex Proto Http Client]]
* [[How to Send an HTTP Request Using HttpClient]]