I've tagged this post as WordPress, but I'm not entirely sure it's WordPress-specific, so I'm posting it on StackOverflow rather than WPSE. The solution doesn't have to be WordPress-specific, simply PHP.
The Scenario
I run a fishkeeping website with a number of tropical fish Species Profiles and Glossary entries.
Our website is oriented around our profiles. They are, as you may term it, the bread and butter of the website.
What I'm hoping to achieve is that, in every species profile which mentions another species or a glossary entry, I can replace those words with a link - such as you'll see here. Ideally, I would also like this to occur in news, articles and blog posts too.
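To make the goal concrete, here is a rough sketch of the kind of replacement I mean. It is not the actual filter.php: the function name `my_link_terms`, the `$terms` array (term => URL) and the URLs in the usage note are all invented for illustration. It assumes ASCII term names, builds one combined pattern, and only rewrites text that sits outside HTML tags and outside existing links.

```php
<?php
// Sketch only: my_link_terms() and the $terms map (term => URL) are hypothetical.
function my_link_terms( string $html, array $terms ): string {
    if ( empty( $terms ) ) {
        return $html;
    }

    // Longest terms first, so "Corydoras aeneus" wins over "Corydoras".
    $names = array_keys( $terms );
    usort( $names, function ( $a, $b ) {
        return strlen( $b ) - strlen( $a );
    } );

    // Case-insensitive lookup table plus one combined pattern.
    $lookup = array();
    $quoted = array();
    foreach ( $names as $name ) {
        $lookup[ strtolower( $name ) ] = $terms[ $name ];
        $quoted[] = preg_quote( $name, '/' );
    }
    $pattern = '/\b(' . implode( '|', $quoted ) . ')\b/i';

    // Split into text and tags. With one capture group, even indexes are
    // text and odd indexes are the tags themselves.
    $parts   = preg_split( '/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE );
    $in_link = 0;
    foreach ( $parts as $i => $part ) {
        if ( $i % 2 === 1 ) {
            // Track whether we are inside an existing <a>...</a>.
            if ( preg_match( '/^<a\b/i', $part ) ) {
                $in_link++;
            } elseif ( preg_match( '/^<\/a\b/i', $part ) ) {
                $in_link = max( 0, $in_link - 1 );
            }
            continue;
        }
        if ( $in_link > 0 || $part === '' ) {
            continue;
        }
        $parts[ $i ] = preg_replace_callback( $pattern, function ( $m ) use ( $lookup ) {
            $url = $lookup[ strtolower( $m[1] ) ];
            return '<a href="' . htmlspecialchars( $url, ENT_QUOTES ) . '">' . $m[1] . '</a>';
        }, $part );
    }

    return implode( '', $parts );
}
```

For example, calling it with `array( 'Corydoras' => '/glossary/corydoras', 'Corydoras aeneus' => '/species/corydoras-aeneus' )` would link the full species name once and leave any existing link untouched. With roughly 3,000 terms a single pattern may be too large for PCRE, in which case the term list could presumably be split into several smaller patterns.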
We have nearly 1400 species profiles and 1700 glossary entries. Our species profiles are often lengthy, and at last count the species profiles alone contained more than 1.7 million words of information.
What I'm Currently Attempting
Currently, I have a filter.php with a function that - I believe - does what I need it to do. The code is quite lengthy, and can be found in full here.
In addition, in my WordPress theme's functions.php, I have the following:
# ==============================================================================================
# [Filter]
#
# Every hour, using WP_Cron, `my_updated_posts` is checked. If there are new Post IDs in there,
# it will run a filter on all of the post's content. The filter will search for Glossary terms
# and scientific species names. If found, it will replace those names with links including a
# pop-up.
include "filter.php";
# ==============================================================================================
# When saving a post (new or edited), check to make sure it isn't a revision then add its ID
# to `my_updated_posts`.
add_action( 'save_post', 'my_set_content_filter' );
function my_set_content_filter( $post_id ) {
    // Ignore revisions; only real saves should be queued.
    if ( wp_is_post_revision( $post_id ) ) {
        return;
    }
    $post_type = get_post_type( $post_id );
    $is_article_or_blog = $post_type == "post" && ( in_category( "articles", $post_id ) || in_category( "blogs", $post_id ) );
    if ( $post_type == "species" || $is_article_or_blog ) {
        // Get the previous value; fall back to an empty array so in_array() never receives false.
        $ids = get_option( 'my_updated_posts', array() );
        if ( !is_array( $ids ) ) {
            $ids = array();
        }
        // Add the new value if necessary.
        if ( !in_array( $post_id, $ids ) ) {
            $ids[] = $post_id;
            update_option( 'my_updated_posts', $ids );
        }
    }
}
# ==============================================================================================
# Add the filter to WP_Cron.
add_action( 'my_filter_posts_content', 'my_filter_content' );
if( !wp_next_scheduled( 'my_filter_posts_content' ) ) {
wp_schedule_event( time(), 'hourly', 'my_filter_posts_content' );
}
# ==============================================================================================
# Run the filter.
function my_filter_content() {
    // Check to see if posts need to be parsed.
    $ids = get_option( 'my_updated_posts' );
    if ( empty( $ids ) || !is_array( $ids ) ) {
        return false;
    }
    update_option( 'error_check', $ids );
    // Parse posts.
    foreach ( $ids as $v ) {
        if ( get_post_status( $v ) == 'publish' ) {
            run_filter( $v );
        }
        update_option( 'error_check', "filter has run at least once" );
    }
    // Make sure no values have been added while the loop was running.
    $id_recheck = get_option( 'my_updated_posts' );
    my_close_out_filter( $ids, $id_recheck );
    // Once all IDs, including any added during what could be a long cron run, are done, remove the value and close out.
    delete_option( 'my_updated_posts' );
    update_option( 'error_check', 'working m8' );
    return true;
}
# ==============================================================================================
# A "difference" function to make sure no new posts have been added to `my_updated_posts` whilst
# the potentially time-consuming filter was running.
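function my_close_out_filter( $beginning_array, $end_array ) {
    if ( !is_array( $end_array ) ) {
        return;
    }
    // IDs present at the end but not at the beginning were added mid-run.
    $diff = array_diff( $end_array, $beginning_array );
    if ( empty( $diff ) ) {
        // Nothing new: stop here rather than recursing forever.
        return;
    }
    foreach ( $diff as $v ) {
        run_filter( $v );
    }
    // Check again in case more posts were saved while the extra IDs were processed.
    my_close_out_filter( $end_array, get_option( 'my_updated_posts' ) );
}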
The way this works, as (hopefully) described by the code's comments, is that each hour WordPress runs a cron job which runs the filter found above. WP-Cron is a pseudo-cron: it is triggered by visits to the site rather than by the system clock, but that doesn't really matter here as the exact timing isn't important.
The rationale behind running it on an hourly basis was that if we tried to run it when each post was saved, it would be to the detriment of the author. Once we get guest authors involved, that is obviously not an acceptable way of going about it.
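One pattern that might make the hourly run safer is to remove only the IDs that were actually processed, rather than deleting the whole option at the end, so anything saved mid-run simply waits for the next run. A rough sketch in plain PHP, with the storage passed in as callables; `my_drain_queue` and its parameters are made-up names, not anything from my current code:

```php
<?php
// Sketch only: my_drain_queue() and its callable parameters are hypothetical.
// $load returns the current queue (array of IDs), $save stores a new queue,
// and $process handles a single ID.
function my_drain_queue( callable $load, callable $save, callable $process ): array {
    $queue = $load();
    if ( ! is_array( $queue ) ) {
        $queue = array();
    }

    $done = array();
    foreach ( $queue as $id ) {
        $process( $id );
        $done[] = $id;
    }

    // Re-read the queue: IDs added while we were busy stay queued.
    $current = $load();
    if ( ! is_array( $current ) ) {
        $current = array();
    }
    $save( array_values( array_diff( $current, $done ) ) );

    return $done;
}
```

In WordPress, `$load` and `$save` would presumably wrap `get_option( 'my_updated_posts', array() )` and `update_option( 'my_updated_posts', ... )`, and `$process` would be `run_filter`. One limitation: a post that is saved again after it has already been processed in the same run would not be re-queued.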
The Problem...
For months now I've been having problems getting this filter running reliably. I don't believe that the problem lies with the filter itself, but with one of the functions that enables the filter - i.e. the cron job, or the function that chooses which posts are filtered, or the function which prepares the wordlists etc. for the filter.
Unfortunately, diagnosing the problem is quite difficult (as far as I can see), because it runs in the background and only once an hour. I've been trying to use WordPress' update_option function (which basically writes a simple database value) to error-check, but I haven't had much luck - and to be honest, I'm quite confused as to where the problem lies.
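Since each update_option call overwrites the previous value, appending timestamped lines to a log file might be easier to follow for a background job. A minimal sketch, assuming a writable file path; `my_filter_log` is a hypothetical helper, not a WordPress function:

```php
<?php
// Sketch only: my_filter_log() is a hypothetical helper, not part of WordPress.
function my_filter_log( string $file, string $message, array $context = array() ): bool {
    $line = sprintf(
        "[%s] %s %s\n",
        gmdate( 'Y-m-d H:i:s' ),                       // UTC timestamp
        $message,
        $context ? json_encode( $context ) : ''       // optional details, e.g. post IDs
    );
    // FILE_APPEND keeps earlier entries; LOCK_EX avoids interleaved writes.
    return file_put_contents( $file, $line, FILE_APPEND | LOCK_EX ) !== false;
}
```

Calling it at the start of my_filter_content and before and after each run_filter call should show which post ID a run stopped on, if it stops at all.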
We ended up putting the website live without this filter working correctly. Sometimes it seems to work, sometimes it doesn't. As a result, we now have quite a few species profiles which aren't correctly filtered.
What I'd Like...
I'm basically seeking advice on the best way to go about running this filter.
Is a cron job the answer? I can set up a .php file which runs every day, that wouldn't be a problem. How would it determine which posts need to be filtered? What impact would it have on the server at the time it ran?
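On the question of which posts to pick, one option might be to record when each post was last filtered and compare that with its modification time. A sketch in plain PHP, with both values passed in as Unix timestamps keyed by post ID; in WordPress these could plausibly come from the post's modified date and a custom post meta key, but the function name and inputs here are invented:

```php
<?php
// Sketch only: my_posts_needing_filter() and its inputs are hypothetical.
// $modified:      post ID => last-modified Unix timestamp
// $last_filtered: post ID => Unix timestamp of the last successful filter run
function my_posts_needing_filter( array $modified, array $last_filtered ): array {
    $pending = array();
    foreach ( $modified as $id => $changed_at ) {
        // Never filtered, or edited since the last filter run.
        if ( ! isset( $last_filtered[ $id ] ) || $last_filtered[ $id ] < $changed_at ) {
            $pending[] = $id;
        }
    }
    return $pending;
}
```

The "last filtered" timestamp would need to be stored after the filter has saved the post, otherwise the filter's own save would mark the post as changed again. New glossary entries would also not trigger a re-filter of older posts under this scheme.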
Alternatively, is a WordPress admin page the answer? If I knew how to do it, something along the lines of a page - utilising AJAX - which allowed me to select the posts to run the filter on would be perfect. There's a plugin called AJAX Regenerate Thumbnails which works like this, maybe that would be the most effective?
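As I understand the AJAX approach, the browser would call the server repeatedly, each call handling a small batch and reporting where the next one should start. A sketch of the server-side step in plain PHP; `my_process_batch` and its return keys are hypothetical names:

```php
<?php
// Sketch only: my_process_batch() and its return keys are hypothetical.
function my_process_batch( array $ids, int $offset, int $batch_size, callable $process ): array {
    // Guard against a zero or negative batch size, which would never advance.
    $batch_size = max( 1, $batch_size );

    $batch = array_slice( $ids, $offset, $batch_size );
    foreach ( $batch as $id ) {
        $process( $id );
    }

    $next = $offset + count( $batch );
    return array(
        'processed' => $batch,                 // IDs handled in this request
        'next'      => $next,                  // offset for the following request
        'done'      => $next >= count( $ids ), // true once every ID was handled
    );
}
```

The returned array could be JSON-encoded and sent back to the page, which would keep requesting with the `next` offset until `done` is true. Keeping each request to a handful of posts should also keep memory use per request low.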
Considerations
- The size of the database/information being affected/read/written
- Which posts are filtered
- The impact the filter has on the server, especially considering I don't seem to be able to increase the WordPress memory limit past 32MB.
- Is the actual filter itself efficient, effective and reliable?
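To put numbers on the server-impact and efficiency points, something like the following could wrap a single filter run and report time and peak memory. It is only a sketch; `my_measure` is a hypothetical helper:

```php
<?php
// Sketch only: my_measure() is a hypothetical helper.
function my_measure( callable $work ): array {
    $start_time = microtime( true );
    $result     = $work();
    return array(
        'result'      => $result,
        'seconds'     => microtime( true ) - $start_time,
        // Peak memory in bytes for the whole request so far, not just $work.
        'peak_memory' => memory_get_peak_usage( true ),
    );
}
```

Running this around the filter for one of the longest species profiles would show how close a single post gets to the 32MB limit.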
This is quite a complex question and I've inevitably (as I was distracted roughly 18 times by colleagues in the process) left out some details. Please feel free to probe me for further information.
Thanks in advance,