I need to parse a lot of .phtml files, find specific HTML tags and add a custom data attribute to them. I'm using Python's BeautifulSoup to parse the entire document and add the attributes, and this part works just fine.
The problem is that the view files (.phtml) also contain PHP tags, and these get parsed too. Below is an example of input and output.
INPUT
<?php
$stars = $this->getData('sideBarCoStars', []);
if (!$stars) return;
$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
<h3>
<a href="<?php echo $viewAllUrl; ?>" class="noContentLink white">
<?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
</a>
</h3>
OUTPUT
<?php
$stars = $this->
getData('sideBarCoStars', []);
if (!$stars) return;
$sideBarCoStarsCount = $this->getData('sideBarCoStarsCount');
$title = $this->getData('sideBarCoStarsTitle');
$viewAllUrl = $this->getData('sideBarCoStarsViewAllUrl');
$isDomain = $this->getData('isDomain');
$lazy_load = $lazy_load ?? 0;
$imageSrc = $this->getData('emptyImageData');
?>
<header>
<h3>
<a class="noContentLink white" href="<?php echo $viewAllUrl; ?>">
<?php echo "{$title} ({$sideBarCoStarsCount})"; ?>
</a>
</h3>
I tried different approaches, but did not succeed in making BeautifulSoup ignore the PHP tags. Is it possible to give html.parser, or BeautifulSoup itself, custom rules to ignore <?php ... ?> sections? Thanks!
Your best bet is to remove all of the PHP elements before giving it to BeautifulSoup to parse. This can be done using a regular expression to spot all PHP sections and replace them with safe placeholder text.
After carrying out all of your modifications using BeautifulSoup, the original PHP sections can then be put back in place of the placeholders.
As the PHP can be anywhere, i.e. also within a quoted string, it is best to use a simple unique string placeholder rather than trying to wrap it in an HTML comment (see php_sig).
re.sub() can be given a function. Each time a substitution is made, the original PHP code is stored in a list (php_elements). Then the reverse is done afterwards, i.e. search for all instances of php_sig and replace each one with the next element from php_elements. If all goes well, php_elements should be empty at the end; if it is not, then your modifications have resulted in a placeholder being removed.
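The steps described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation. The placeholder value, the regular expression and the helper names protect_php, restore_php and add_data_attribute are assumptions made for the example. The pattern only recognises sections of the form <?php ... ?>, so short tags, a file that ends without a closing ?>, or a ?> inside a PHP string would need extra handling.

```python
import re

# Assumed placeholder: any string that cannot occur in your .phtml files.
# Kept lowercase and alphanumeric so the HTML parser has no reason to alter it.
php_sig = "phpsig7f3a9c"

# Non-greedy match of one PHP section, including sections spanning several lines.
php_re = re.compile(r"<\?php.*?\?>", re.DOTALL)


def protect_php(source):
    """Swap every PHP section for php_sig; return the masked text and the saved sections."""
    php_elements = []

    def stash(match):
        php_elements.append(match.group(0))  # remember the original PHP code
        return php_sig

    return php_re.sub(stash, source), php_elements


def restore_php(text, php_elements):
    """Replace each php_sig, in order, with the next saved PHP section."""
    remaining = list(php_elements)
    restored = re.sub(re.escape(php_sig), lambda m: remaining.pop(0), text)
    if remaining:
        # Something removed a placeholder while the HTML was being edited.
        raise ValueError(f"{len(remaining)} placeholder(s) were lost")
    return restored


def add_data_attribute(source, tag_name, attr, value):
    """Hypothetical end-to-end helper: mask PHP, edit with BeautifulSoup, unmask."""
    from bs4 import BeautifulSoup  # third-party; only needed for this step

    masked, php_elements = protect_php(source)
    soup = BeautifulSoup(masked, "html.parser")
    for tag in soup.find_all(tag_name):
        tag[attr] = value
    return restore_php(str(soup), php_elements)
```

For example, add_data_attribute(text, "a", "data-track", "1") would be one way to tag every link in a file. Because the replacement passed to re.sub() is a function, its return value is inserted literally, so backslashes in the PHP code are not treated as escape sequences.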