Commit 32ae4a0

Merge branch 'dev' into develop
2 parents 9928233 + d30145c commit 32ae4a0

20 files changed: +189 -351 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+composer.phar
+composer.lock
+/vendor/

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ language: php
 php:
   - 5.6
   - 7.0
-  - hhvm
+  - 7.1
 
 install:
   - composer self-update

README.md

Lines changed: 19 additions & 9 deletions
@@ -7,12 +7,12 @@ Version 1.7.0
 [![Coverage Status](https://coveralls.io/repos/paquettg/php-html-parser/badge.png)](https://coveralls.io/r/paquettg/php-html-parser)
 [![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/paquettg/php-html-parser/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/paquettg/php-html-parser/?branch=master)
 
-PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrap html, whether it's valid or not! This project was original supported by [sunra/php-simple-html-dom-parser](https://github.com/sunra/php-simple-html-dom-parser) but the support seems to have stopped so this project is my adaptation of his previous work.
+PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrap html, whether it's valid or not! This project was original supported by [sunra/php-simple-html-dom-parser](https://github.com/sunra/php-simple-html-dom-parser) but the support seems to have stopped so this project is my adaptation of his previous work.
 
 Install
 -------
 
-This package can be found on [packagist](https://packagist.org/packages/paquettg/php-html-parser) and is best loaded using [composer](http://getcomposer.org/). We support php 5.6, 7.0, and hhvm 2.3.
+This package can be found on [packagist](https://packagist.org/packages/paquettg/php-html-parser) and is best loaded using [composer](http://getcomposer.org/). We support php 5.6, 7.0, 7.1.
 
 Usage
 -----
@@ -35,7 +35,7 @@ The above will output "click here". Simple no? There are many ways to get the sa
 Loading Files
 ------------------
 
-You may also seamlessly load a file into the dom instead of a string, which is much more convinient and is how I except most developers will be loading the html. The following example is taken from our test and uses the "big.html" file found there.
+You may also seamlessly load a file into the dom instead of a string, which is much more convenient and is how I except most developers will be loading the html. The following example is taken from our test and uses the "big.html" file found there.
 
 ```php
 // Assuming you installed from Composer:
@@ -61,9 +61,9 @@ foreach ($contents as $content)
 }
 ```
 
-This example loads the html from big.html, a real page found online, and gets all the content-border classes to process. It also shows a few things you can do with a node but it is not an exhaustive list of methods that a node has avaiable.
+This example loads the html from big.html, a real page found online, and gets all the content-border classes to process. It also shows a few things you can do with a node but it is not an exhaustive list of methods that a node has available.
 
-Alternativly, you can always use the `load()` method to load the file. It will attempt to find the file using `file_exists` and, if succesfull, will call `loadFromFile()` for you. The same applies to a URL and `loadFromUrl()` method.
+Alternativly, you can always use the `load()` method to load the file. It will attempt to find the file using `file_exists` and, if successful, will call `loadFromFile()` for you. The same applies to a URL and `loadFromUrl()` method.
 
 Loading Url
 ----------------
@@ -102,7 +102,7 @@ As long as the Connector object implements the `PHPHtmlParser\CurlInterface` int
 Loading Strings
 ---------------
 
-Loading a string directly, with out the checks in `load()` is also easely done.
+Loading a string directly, with out the checks in `load()` is also easily done.
 
 ```php
 // Assuming you installed from Composer:
@@ -142,19 +142,19 @@ At the moment we support 7 options.
 
 **Strict**
 
-Strict, by default false, will throw a `StrickException` if it find that the html is not strict complient (all tags must have a clossing tag, no attribute with out a value, etc.).
+Strict, by default false, will throw a `StrickException` if it find that the html is not strictly compliant (all tags must have a closing tag, no attribute with out a value, etc.).
 
 **whitespaceTextNode**
 
 The whitespaceTextNode, by default true, option tells the parser to save textnodes even if the content of the node is empty (only whitespace). Setting it to false will ignore all whitespace only text node found in the document.
 
 **enforceEncoding**
 
-The enforceEncoding, by default null, option will enforce an charater set to be used for reading the content and returning the content in that encoding. Setting it to null will trigger an attempt to figure out the encoding from within the content of the string given instead.
+The enforceEncoding, by default null, option will enforce an character set to be used for reading the content and returning the content in that encoding. Setting it to null will trigger an attempt to figure out the encoding from within the content of the string given instead.
 
 **cleanupInput**
 
-Set this to `true` to skip the entire clean up phase of the parser. If this is set to true the next 3 options will be ignored. Defaults to `false`.
+Set this to `false` to skip the entire clean up phase of the parser. If this is set to true the next 3 options will be ignored. Defaults to `true`.
 
 **removeScripts**
 
@@ -217,3 +217,13 @@ $a->delete();
 unset($a);
 echo $dom; // '<div class="all"><p>Hey bro, <br /> :)</p></div>');
 ```
+
+You can modify the text of `TextNode` objects easely. Please note that, if you set an encoding, the new text will be encoded using the existing encoding.
+
+```php
+$dom = new Dom;
+$dom->load('<div class="all"><p>Hey bro, <a href="google.com">click here</a><br /> :)</p></div>');
+$a = $dom->find('a')[0];
+$a->firstChild()->setText('biz baz');
+echo $dom; // '<div class="all"><p>Hey bro, <a href="google.com">biz baz</a><br /> :)</p></div>'
+```

composer.json

Lines changed: 2 additions & 1 deletion
@@ -15,10 +15,11 @@
     ],
     "require": {
         "php": ">=5.6",
+        "ext-mbstring": "*",
         "paquettg/string-encode": "~0.1.0"
     },
     "require-dev": {
-        "phpunit/phpunit": "~5.3.0",
+        "phpunit/phpunit": "~5.7.0",
         "satooshi/php-coveralls": "~1.0.0",
         "mockery/mockery": "~0.9.0"
     },

src/PHPHtmlParser/Curl.php

Lines changed: 5 additions & 0 deletions
@@ -28,6 +28,11 @@ public function get($url)
 
         curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
         curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
+        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
+        curl_setopt($ch, CURLOPT_VERBOSE, true);
+        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
+        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36');
+        curl_setopt($ch, CURLOPT_URL, $url);
 
         $content = curl_exec($ch);
         if ($content === false) {

src/PHPHtmlParser/Dom.php

Lines changed: 15 additions & 7 deletions
@@ -79,15 +79,23 @@ class Dom
      * @var array
      */
     protected $selfClosing = [
-        'img',
+        'area',
+        'base',
+        'basefont',
         'br',
+        'col',
+        'embed',
+        'hr',
+        'img',
         'input',
-        'meta',
+        'keygen',
         'link',
-        'hr',
-        'base',
-        'embed',
+        'meta',
+        'param',
+        'source',
         'spacer',
+        'track',
+        'wbr'
     ];
 
     /**
93101
/**
@@ -569,8 +577,8 @@ protected function parseTag()
             }
 
             if (empty($name)) {
-                $this->content->fastForward(1);
-                continue;
+                $this->content->skipByToken('blank');
+                continue;
             }
 
             $this->content->skipByToken('blank');

src/PHPHtmlParser/Dom/AbstractNode.php

Lines changed: 12 additions & 4 deletions
@@ -85,7 +85,7 @@ public function __get($key)
             case 'tag':
                 return $this->getTag();
             case 'parent':
-                $this->getParent();
+                return $this->getParent();
         }
 
         return null;
@@ -179,8 +179,8 @@ public function delete()
         if ( ! is_null($this->parent)) {
             $this->parent->removeChild($this->id);
         }
-
-        $this->parent = null;
+        $this->parent->clear();
+        $this->clear();
     }
 
     /**
@@ -324,6 +324,9 @@ public function setAttribute($key, $value)
     {
         $this->tag->setAttribute($key, $value);
 
+        //clear any cache
+        $this->clear();
+
         return $this;
     }
 
@@ -337,6 +340,9 @@ public function setAttribute($key, $value)
     public function removeAttribute($key)
     {
         $this->tag->removeAttribute($key);
+
+        //clear any cache
+        $this->clear();
     }
 
     /**
@@ -348,8 +354,10 @@ public function removeAttribute($key)
     public function removeAllAttributes()
     {
         $this->tag->removeAllAttributes();
-    }
 
+        //clear any cache
+        $this->clear();
+    }
     /**
      * Function to locate a specific ancestor tag in the path to the root.
      *

src/PHPHtmlParser/Dom/HtmlNode.php

Lines changed: 3 additions & 0 deletions
@@ -186,6 +186,9 @@ protected function clear()
         $this->innerHtml = null;
         $this->outerHtml = null;
         $this->text = null;
+        if (is_null($this->parent) === false) {
+            $this->parent->clear();
+        }
     }
 
     /**

src/PHPHtmlParser/Dom/InnerNode.php

Lines changed: 4 additions & 1 deletion
@@ -254,6 +254,9 @@ public function replaceChild($childId, AbstractNode $newChild)
         $this->children = array_combine($keys, $this->children);
         $this->children[$newChild->id()] = $newChild;
         unset($oldChild);
+
+        //clear any cache
+        $this->clear();
     }
 
     /**
@@ -326,4 +329,4 @@ public function setParent(InnerNode $parent)
 
         return parent::setParent($parent);
     }
-}
+}

src/PHPHtmlParser/Dom/MockNode.php

Lines changed: 3 additions & 0 deletions
@@ -40,6 +40,9 @@ protected function clear()
         $this->innerHtml = null;
         $this->outerHtml = null;
         $this->text = null;
+        if (is_null($this->parent) === false) {
+            $this->parent->clear();
+        }
     }
 
     /**
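
Note on the `clear()` changes in `HtmlNode` and `MockNode`: the added lines make a cache reset travel up to every ancestor, so a parent never keeps rendered HTML that still contains a child's old content. The following is a minimal, self-contained sketch of that idea only. `SketchNode`, its properties, and its methods are hypothetical names made up for illustration and are not classes or signatures from the library.

```php
<?php
// Hypothetical sketch: SketchNode is NOT part of PHPHtmlParser.
// It only illustrates clearing cached output up the parent chain.
class SketchNode
{
    private $tag;
    private $text;
    private $parent = null;
    private $children = [];
    private $outerHtml = null; // cached render, null means "not cached"

    public function __construct($tag, $text = '')
    {
        $this->tag  = $tag;
        $this->text = $text;
    }

    public function addChild(SketchNode $child)
    {
        $child->parent    = $this;
        $this->children[] = $child;
        $this->clear(); // the cached render no longer includes every child

        return $child;
    }

    public function setText($text)
    {
        $this->text = $text;
        $this->clear(); // invalidate this node and all of its ancestors
    }

    public function outerHtml()
    {
        if ($this->outerHtml === null) {
            $inner = $this->text;
            foreach ($this->children as $child) {
                $inner .= $child->outerHtml();
            }
            $this->outerHtml = "<{$this->tag}>{$inner}</{$this->tag}>";
        }

        return $this->outerHtml;
    }

    protected function clear()
    {
        $this->outerHtml = null;
        // Without this step an ancestor would keep serving stale HTML.
        if ($this->parent !== null) {
            $this->parent->clear();
        }
    }
}

$div = new SketchNode('div');
$a   = $div->addChild(new SketchNode('a', 'click here'));
echo $div->outerHtml(), "\n"; // <div><a>click here</a></div>

$a->setText('biz baz');
echo $div->outerHtml(), "\n"; // <div><a>biz baz</a></div>
```

If the parent call inside `clear()` is removed from this sketch, the second `echo` prints the old "click here" markup, because the `div` still holds its cached string.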
