Importing CSS Reference
I thought it would be kind of cool to have the CSS reference as a database that I could refer to.
This is how to import it somewhere else.
Import CSS Reference Function
This is the initial function, which I had to modify later because it was crashing from making too many HTTP requests.
PHP
/* Import CSS Reference */
public function import_css_reference (
    $loop_max = 10
) {
    /* This will be run on a loaded import item, so no need to pass variables to the function */
    global $db;
    global $functions;
    $out = "";
    $for_counter = 0;
    $main_loop_tag = "#sect2 li";
    $db_table_name = $this->db->escapeString($this->db_table_name);
    $loop_max = $this->db->escapeString($loop_max);
    require_once("lib/simple_html_dom.php");
    $html = file_get_html($this->import_url);
    foreach($html->find($main_loop_tag) as $item) {
        $for_counter++;
        // note: this skips only the item at position $loop_max, it does not end the loop
        if($for_counter == $loop_max) {
            continue;
        }
        $css_link = $item->find('a',0)->href;
        $css_title = $item->find('a',0)->plaintext;
        $out .= "\$css_link:$css_link<br />";
        $out .= "\$css_title:$css_title<br />";
        // this is all we need for stage one, then open the page link and process.
        $html_source = file_get_html($css_link);
        $reply_count = 0;
        $out .= "<hr />";
        // returning inside the loop, so only the first item is output for now
        return $out;
    }
}
This returns after the first item, so it is working as intended so far.
It loops over every element matching the selector #sect2 li.
For each one it grabs the link and the title.
So far it finds the first item and gets its link and title.
Prepend the domain to the link
The links usually do not include the domain, so it has to be prepended manually to get the full link.
PHP
$x_element_link = "https://the-domain.org".$x_element_link;
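A slightly safer version of that step, as a minimal sketch: it only prepends the domain when the link is not already absolute. The helper name absolute_url and the example paths are made up for illustration, and it assumes every relative link is a root-relative path on the same site.

```php
<?php
// Hypothetical helper, not part of the original import class.
// Assumption: relative links are root-relative paths on the same site.
function absolute_url(string $href, string $base = "https://the-domain.org"): string
{
    // Already an absolute http(s) URL: leave it untouched.
    if (preg_match('#^https?://#i', $href) === 1) {
        return $href;
    }
    // Join domain and path with exactly one slash between them.
    return rtrim($base, "/") . "/" . ltrim($href, "/");
}

// Example paths are invented for illustration.
echo absolute_url("/docs/css/color"), "\n";
echo absolute_url("https://the-domain.org/docs/css/color"), "\n";
```

Both calls print the same full URL, so running the import twice over already-fixed links would not double up the domain.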
Grab the content
This part is causing a timeout on the server. Making too many HTML requests in one loop can crash the server, so this part needs to move to a separate function and request.
PHP
$html_source = file_get_html($x_element_link);
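One way to stop a single slow page from hanging the whole loop is to fetch with an explicit timeout. This is only a sketch: fetch_with_timeout is a hypothetical helper, the 5 second default is a guess, and it does not reduce the number of requests, so it would not fix the 504 on its own.

```php
<?php
// Hypothetical helper: fetch a URL with a timeout, returning null on failure.
function fetch_with_timeout(string $url, float $seconds = 5.0): ?string
{
    // The "http" context options are also used for https:// URLs.
    $context = stream_context_create([
        "http" => ["timeout" => $seconds],
    ]);
    // @ hides the warning; failure is reported through the return value instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Possible use inside the loop (needs simple_html_dom for str_get_html):
// $body = fetch_with_timeout($x_element_link);
// if ($body !== null) {
//     $html_source = str_get_html($body);
// }
```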
Importing problem: server 504 timeout error
The problem is that it only imports 11 items, so each link needs to go into a temp table first, and then the details can run as a separate import for each item.
Split the import so the first part just adds the titles and links; then another import can go through each of them and add the missing details from the second part of the function.
Increased the PHP memory limit from 128 MB to 256 MB, but the import still times out.
/etc/php/7.4/fpm$ sudo nano php.ini
# find memory_limit and change it to 256M
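The split described above can be sketched without a database. Here a plain array stands in for the temp table, keyed by the md5 of the title in the same way the import checks for duplicates. The function name queue_item and the sample links are made up for illustration.

```php
<?php
// Stage 1 sketch: queue title + link only, with no page downloads.
// $table stands in for the database table (an assumption for this example).
function queue_item(array &$table, string $title, string $link): bool
{
    $md5 = md5(trim($title));
    if (isset($table[$md5])) {
        return false; // already queued, skip the duplicate
    }
    $table[$md5] = [
        "title" => trim($title),
        "source_link" => $link,
        "category" => "CSS",
        "other" => "", // empty means "not processed yet"
    ];
    return true;
}

$table = [];
$links = [
    ["color", "https://the-domain.org/color"],
    ["margin", "https://the-domain.org/margin"],
    ["color", "https://the-domain.org/color"], // duplicate on purpose
];
$added = 0;
foreach ($links as [$title, $link]) {
    if (queue_item($table, $title, $link)) {
        $added++;
    }
}
echo "$added items queued\n";
```

Stage 2 can then pick up any row whose other field is still empty.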
Import Full
Currently crashing the server, causing a 504 timeout error.
PHP
/* Import CSS Reference */
/* this import crashes after 10 items - so need to split into smaller import chunks */
public function import_css_reference (
    $loop_max = 10
) {
    /* This will be run on a loaded import item, so no need to pass variables to the function */
    global $db;
    global $functions;
    $out = "";
    $for_counter = 0;
    $main_loop_tag = "#sect2 li";
    $db_table_name = $this->db->escapeString($this->db_table_name);
    $loop_max = $this->db->escapeString($loop_max);
    require_once("lib/simple_html_dom.php");
    $html = file_get_html($this->import_url);
    foreach($html->find($main_loop_tag) as $item) {
        $for_counter++;
        if($for_counter == $loop_max) {
            continue;
        }
        $x_element_link = $item->find('a',0)->href;
        $x_element_title = $item->find('a',0)->plaintext;
        $x_element_title = trim($x_element_title);
        $out .= "\$x_element_link:$x_element_link<br />";
        $out .= "\$x_element_title:$x_element_title<br />";
        $x_element_link = "https://the-domain.org".$x_element_link;
        // one extra HTTP request per list item
        $html_source = file_get_html($x_element_link);
        $reply_count = 0;
        $out .= "<hr />";
        foreach($html_source->find(".main-content") as $main_content) {
            $x_title = $main_content->find("h1",0)->plaintext;
            $x_title = trim($x_title);
            $out .= "\$x_title:$x_title<br />";
            $x_summary = $main_content->find("p",0)->innertext;
            $x_summary = trim($x_summary);
            $out .= "\$x_summary:$x_summary<br />";
            $x_summary_2 = $main_content->find("p",1)->innertext;
            $x_summary_2 = trim($x_summary_2);
            $out .= "\$x_summary_2:$x_summary_2<br />";
            $x_md5 = md5($x_element_title);
            $out .= "\$x_md5:$x_md5<br />";
            $x_category = "CSS";
            $out .= "\$x_category:$x_category<br />";
            $x_additional = $main_content->innertext;
            $out .= "\$x_additional:$x_additional<br />";
            // start the class
            $linked_class = new $this->linked_class;
            $linked_class->add_to_menu = false;
            $linked_class->start();
            // assign all vars
            $linked_class->title = $x_title;
            $linked_class->additional = $x_additional;
            $linked_class->category = $x_category;
            $linked_class->md5 = $x_md5;
            $linked_class->summary = $x_summary;
            $linked_class->summary_2 = $x_summary_2;
            $linked_class->element_title = $x_element_title;
            $linked_class->source_link = $x_element_link;
            // check if title md5 exists
            if(!$linked_class->md5_exists($x_md5)) {
                if($linked_class->add()) {
                    $out .= "Item $linked_class->title Added<br>";
                }
            }
        }
    }
    return $out;
}
/* Import CSS Reference */
Split the import into part 1 and part 2, so the first part should only be loading one page, but it still gives me a 504 Gateway Time-out and only adds 4 items for some reason.
Even fewer items than the more complicated import.
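A side note on the $loop_max check used in these functions: with == and continue it skips only the one item at that position and then carries on with the rest of the list. If the intent was to cap the run at $loop_max items (an assumption on my part), break does that. The two functions below are stand-ins written just for this comparison.

```php
<?php
// Same check as in the import: `==` with `continue`.
function process_with_continue(array $items, int $loop_max): array
{
    $done = [];
    $for_counter = 0;
    foreach ($items as $item) {
        $for_counter++;
        if ($for_counter == $loop_max) {
            continue; // skips only this one item
        }
        $done[] = $item;
    }
    return $done;
}

// Alternative: stop the loop once the limit has been passed.
function process_with_break(array $items, int $loop_max): array
{
    $done = [];
    $for_counter = 0;
    foreach ($items as $item) {
        $for_counter++;
        if ($for_counter > $loop_max) {
            break; // ends the loop entirely
        }
        $done[] = $item;
    }
    return $done;
}

$items = range(1, 20);
echo count(process_with_continue($items, 10)), "\n"; // 19
echo count(process_with_break($items, 10)), "\n";    // 10
```

With break in place, a full stage 1 import would need $loop_max raised above the number of list items.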
Import Part 1
This is a smaller import that only grabs the first page and does not follow URLs, so it should work better than the full import, but it only adds 4 items. Hmm...
PHP
/* Import CSS Reference - Part 1 */
/* this import crashes after 10 items - so need to split into smaller import chunks */
public function import_css_reference_part1 (
    $loop_max = 10
) {
    /* This will be run on a loaded import item, so no need to pass variables to the function */
    global $db;
    global $functions;
    $out = "";
    $for_counter = 0;
    $main_loop_tag = "#sect2 li";
    $db_table_name = $this->db->escapeString($this->db_table_name);
    $loop_max = $this->db->escapeString($loop_max);
    require_once("lib/simple_html_dom.php");
    $html = file_get_html($this->import_url);
    foreach($html->find($main_loop_tag) as $item) {
        $for_counter++;
        if($for_counter == $loop_max) {
            continue;
        }
        $x_element_link = $item->find('a',0)->href;
        $x_element_title = $item->find('a',0)->plaintext;
        $x_element_title = trim($x_element_title);
        $out .= "\$x_element_link:$x_element_link<br />";
        $out .= "\$x_element_title:$x_element_title<br />";
        $x_element_link = "https://the-domain.org".$x_element_link;
        $html_source = file_get_html($x_element_link);
        $reply_count = 0;
        $out .= "<hr />";
        // start the class
        $linked_class = new $this->linked_class;
        $linked_class->add_to_menu = false;
        $linked_class->start();
        $x_md5 = md5($x_element_title);
        $x_category = "CSS";
        // assign items
        $linked_class->title = $x_element_title;
        $linked_class->category = $x_category;
        $linked_class->md5 = $x_md5;
        $linked_class->source_link = $x_element_link;
        /* the following items can come from part 2 of the import */
        // $linked_class->additional = $x_additional;
        // $linked_class->summary = $x_summary;
        // $linked_class->summary_2 = $x_summary_2;
        // $linked_class->long_title = $long_title;
        // check if title md5 exists
        if(!$linked_class->md5_exists($x_md5)) {
            if($linked_class->add()) {
                $out .= "Item $linked_class->title Added<br>";
            }
        }
    }
    return $out;
}
/* Import CSS Reference - Part 1 */
Still causing this timeout.
Timeout Fixed
Actually I see the issue now: I left the download source line in there. Doh!
PHP
// get rid of this line and it should run ok
$html_source = file_get_html($x_element_link);
Import Stage 1 Working Now
Just the titles and the links for now.
Woo 695 CSS Attribute Items, with no crash.
Now to get the second part of the import done.
Part 2 Import
The import needs a way to check whether an item has already been processed.
Check through each item in css_reference and, if the other field is blank, process it and then set it to processed. Just do one at a time and add it to a 1-minute cron; then in 695 minutes it should all be processed. That's a long time, so maybe run it every 5 seconds and stop it after 700 x 5 seconds.
Added a cron job; remove it after a day or so.
*/3 * * * * wget --spider https://the_import_url/ > /dev/null 2>&1
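A quick bit of arithmetic on how long the queue takes to drain, assuming exactly one item is processed per run and every run succeeds. The function name cron_duration is made up for this example. Note that standard cron schedules no finer than one minute, so the 5 second idea would need something other than a plain crontab entry.

```php
<?php
// Hypothetical helper: total time to process $items at one item per cron run.
function cron_duration(int $items, int $interval_minutes): string
{
    $total_minutes = $items * $interval_minutes;
    $hours = intdiv($total_minutes, 60);
    $minutes = $total_minutes % 60;
    return sprintf("%dh %02dm", $hours, $minutes);
}

echo cron_duration(695, 1), "\n"; // 11h 35m with a 1-minute cron
echo cron_duration(695, 3), "\n"; // 34h 45m with the */3 cron above
```

So with the */3 schedule the full run is closer to a day and a half than a day.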
PHP
/* Import CSS Reference - Part 2 */
/*
This one needs to load a single item from css_reference,
grab the URL, load the content and populate the missing items.
When loading the item it also needs to add something to the other field, to mark it processed
*/
public function import_css_reference_part2 (
    $loop_max = 10
) {
    /* This will be run on a loaded import item, so no need to pass variables to the function */
    global $db;
    global $functions;
    $out = "";
    $for_counter = 0;
    $main_loop_tag = ".main-content";
    $css_reference = new css_reference;
    $css_reference->add_to_menu = false;
    $css_reference->start();
    // load item - using fields array
    $fields_array = [
        "other" => "",
    ];
    if(!$css_reference->load_from_fields_array($fields_array, $max = 1)) {
        return "nothing to load";
    }
    // new item should now be loaded
    $out .= $css_reference->title . "<br />";
    $db_table_name = $this->db->escapeString($this->db_table_name);
    $loop_max = $this->db->escapeString($loop_max);
    require_once("lib/simple_html_dom.php");
    $html = file_get_html($css_reference->source_link); // new url based on loaded item
    foreach($html->find($main_loop_tag) as $main_content) {
        if($for_counter == $loop_max) {
            continue;
        }
        $for_counter++;
        $x_title = $main_content->find("h1",0)->plaintext;
        $x_title = trim($x_title);
        $out .= "\$x_title:$x_title<br />";
        $css_reference->long_title = $x_title;
        $x_summary = $main_content->find("p",0)->innertext;
        $x_summary = trim($x_summary);
        $out .= "\$x_summary:$x_summary<br />";
        $css_reference->summary = $x_summary;
        //$x_summary_2 = $main_content->find("p",1)->innertext;
        $x_summary_2 = $main_content->find(".code-example",0)->innertext;
        $x_summary_2 = trim($x_summary_2);
        $out .= "\$x_summary_2:$x_summary_2<br />";
        $css_reference->summary_2 = $x_summary_2;
        $x_additional = $main_content->innertext;
        $out .= "\$x_additional:$x_additional<br />";
        $css_reference->additional = $x_additional;
        $css_reference->other = "processed";
        if($css_reference->update()) {
            $out .= "Item $css_reference->title Updated<br>";
        }
        // check if title md5 exists
        /*
        if(!$css_reference->md5_exists($x_md5)) {
            if($css_reference->add()) {
                $out .= "Item $css_reference->title Updated<br>";
            }
        }
        */
    }
    return $out;
}
/* Import CSS Reference - Part 2 */
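The one-item-per-run pattern in Part 2 can be sketched without the database or the HTML parser. In this sketch an array stands in for the css_reference table and a callback stands in for the page download; next_unprocessed, process_one and the "failed" marker are all made up for illustration. The point is that a row gets marked even when the fetch fails, so one bad link cannot block the queue on every run.

```php
<?php
// Returns the index of the first row whose "other" field is still empty.
function next_unprocessed(array $table): ?int
{
    foreach ($table as $key => $row) {
        if ($row["other"] === "") {
            return $key;
        }
    }
    return null;
}

// Processes exactly one row per call, like one cron run.
function process_one(array &$table, callable $fetch_details): string
{
    $key = next_unprocessed($table);
    if ($key === null) {
        return "nothing to load";
    }
    // $fetch_details returns an array of fields, or null when the page is unusable.
    $details = $fetch_details($table[$key]["source_link"]);
    $table[$key] = array_merge($table[$key], $details ?? []);
    // Mark the row either way so the next run moves on to the next item.
    $table[$key]["other"] = $details === null ? "failed" : "processed";
    return $table[$key]["title"] . " " . $table[$key]["other"];
}

// Stand-in for the real download: an empty link counts as a bad source.
$fake_fetch = function (string $link): ?array {
    return $link === "" ? null : ["summary" => "Summary for $link"];
};

$table = [
    ["title" => "color", "source_link" => "https://the-domain.org/color", "other" => ""],
    ["title" => "margin", "source_link" => "", "other" => ""],
];
echo process_one($table, $fake_fetch), "\n"; // color processed
echo process_one($table, $fake_fetch), "\n"; // margin failed
echo process_one($table, $fake_fetch), "\n"; // nothing to load
```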
Ran this overnight and some of the items imported, then the importer started timing out again, so I increased the PHP script execution time to 60 seconds, and now it seems to be working again, slowly. Maybe the source site is slow.
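For reference, the execution time limit can also be raised for just the import request instead of globally in php.ini. This is a sketch of that alternative, not what was done above. A 504 is issued by the web server in front of PHP, so its own timeout may also need raising; that part is not shown here.

```php
<?php
// Raise the limit for the current request only.
// set_time_limit() also restarts the timeout counter from zero.
set_time_limit(60);

// Confirm the value now in effect for this request.
echo "max_execution_time: ", ini_get("max_execution_time"), "\n";
```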
Found the reason it was crashing: the source link was not the correct doc link, so it was trying to import from an incorrect page, which somehow crashed the script.
So go through and delete, or mark as processed, the items with incorrect source links and it should continue.
I think the issue was that some items had disabled links that were still being used as a link source. Removing these stopped the crashing. Yay!
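A check like the following could catch those bad source links before they are queued. It is a sketch under a guess: that the disabled entries had no usable href, so the stored link ended up as just the domain with no path. The function name looks_like_doc_link and the sample URLs are made up for illustration.

```php
<?php
// Hypothetical check: does this look like a link to an actual doc page?
// Assumption: bad entries are invalid URLs, off-site links, or the bare domain.
function looks_like_doc_link(string $url, string $host = "the-domain.org"): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false; // not a URL at all
    }
    $parts = parse_url($url);
    if (($parts["host"] ?? "") !== $host) {
        return false; // points at some other site
    }
    // A bare domain has no path, or just "/".
    $path = $parts["path"] ?? "";
    return $path !== "" && $path !== "/";
}

var_dump(looks_like_doc_link("https://the-domain.org/docs/color")); // bool(true)
var_dump(looks_like_doc_link("https://the-domain.org"));            // bool(false)
```

Rows that fail the check could be deleted or marked as processed up front, which is the manual cleanup described above.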