[OPTIMIZATION] text_helper.php word_limiter()
Posted: 05 May 2007 11:00 PM
Research Assistant
Total Posts:  956
Joined  08-06-2006

Welcome to the new thread for word_limiter OPTIMIZATION discussion.

Here’s where we left the Making CI faster one-line-at-a-time thread... basically, we’ve made it a lot faster, but there are still some problems to work out to make it perform in the ‘expected’ way - i.e., not stripping CR and LF characters.

This code uses a tokenizing approach to zip quite rapidly through the first (default) 100 words. However, there are some problems with it (that it shares with the original word_limiter), namely that it strips carriage returns, linefeeds and tabs from the text. That’s unexpected behavior for a function named word_limiter().

function word_limiter($str, $limit = 100, $end_char = '…')
{
    if (strlen($str) <= $limit * 2)
        return $str;

    $word_count = 0;
    $text = '';
    $token = strtok($str, " \r\n\t");

    while ($token !== FALSE)
    {
        $word_count++;

        $text .= ' '. $token;
        $token = strtok(" \r\n\t");

        if ($word_count >= $limit)
        {
            if ($token !== FALSE)
                $text .= $end_char;

            break;
        }
    }

    return ltrim($text);
}
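
To make the whitespace problem concrete, here is a minimal sketch. The helper name rebuild_with_strtok() is made up for this illustration; it only mimics the token-and-rejoin loop of the function above, without the word limit.

```php
<?php
// Hypothetical helper: rebuilds a string the way the tokenizer above does,
// joining every token with a single space.
function rebuild_with_strtok($str)
{
    $text  = '';
    $token = strtok($str, " \r\n\t");
    while ($token !== false) {
        $text .= ' ' . $token;
        $token = strtok(" \r\n\t");
    }
    return ltrim($text);
}

$input  = "first line\nsecond\tline";
$output = rebuild_with_strtok($input);

// The newline and the tab have both been replaced by single spaces,
// so a later nl2br() call has nothing left to convert.
var_dump($output === 'first line second line');
var_dump(nl2br($output) === $output);
```

Both checks should come out true, which is exactly the behavior people relying on nl2br() would not expect.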
Geert De Deckere - 04 May 2007 04:09 PM
sophistry - 04 May 2007 03:56 PM

I like those changes. I just realized one problem with the addition of \n\r\t to the tokenizer: the string is rebuilt with space character so \n and \r and \t will be transformed into spaces. That could be a problem if you imagine this function just before a nl2br(). It might mess some things up for people expecting their newlines, carriage-returns, and tabs to be there.

True. Good Point. Again. The original word_limiter() function does the same thing. Could that be considered a bug? Also note that in case $str is shorter than twice $limit (first lines of function), the original string is returned with all original whitespace. It should at least return the same formatting as in the other case, just for the sake of consistency.

I suggest just going back to the space tokenizer on the first pass and then see if there are any newlines, carriage-returns or tabs and then on second, third and 4th passes we incrementally replace the other whitespace chars.

Could you translate that into some real code? Don’t clearly see where you’re going with that one.

Well, I went back to the original and took a good hard look at the problem. I think I’ve got the winner now. Please test this and let me know if I’ve overlooked anything.

I decided to take out the strlen check because it is extra code that will rarely be used. The preg_match looks for a run of non-whitespace characters (the \S*) followed by a run of whitespace (the \s*), and requires that sub-pattern to repeat $limit times; that’s essentially a complete word counter in regexp. The ?: tells the PCRE regex engine not to capture the sub-pattern that defines the word, which optimizes the regexp ever-so-slightly.

EDIT: fixed code to handle tabs properly. Optimized regexp more in this case by using * instead of +.

function word_limiter($str, $limit = 100, $end_char = '&#8230;')
{
    // this is here because the following regexp
    // absolutely chokes when the word count is
    // lower than the limit. but, it is super fast
    // when word count is over the limit! strange.
    // try commenting out these lines and send a
    // 99 word string. there may be some regexp fix
    // to avoid this problem
    $word_count = str_word_count($str);
    if ($word_count <= $limit)
    {
        return $str;
    }

    preg_match('/(?:\S*\s*){'.$limit.'}/', $str, $matches);

    // just in case the regexp gets confused by something
    if (!isset($matches[0]))
    {
        return $str;
    }

    return rtrim($matches[0]).$end_char;
}

On 1000 word selections this seems to be about 4 times as fast as Geert’s most recent tweak seen above.

1000 reps of 1000 words:
Original CI:  4.5775
Sophistry-Geert tokenizer: 0.4115
Sophistry slicer/dicer: 0.1112
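
For anyone who wants to reproduce timings like these, a harness along the following lines should do. This is only a sketch: time_limiter() and sample_limiter() are names invented here, the test text is 1000 copies of the same word, and the numbers will differ from machine to machine.

```php
<?php
// Hypothetical timing helper: calls $fn($text, $limit) $reps times
// and returns the elapsed wall-clock time in seconds.
function time_limiter($fn, $text, $limit = 100, $reps = 1000)
{
    $start = microtime(true);
    for ($i = 0; $i < $reps; $i++) {
        call_user_func($fn, $text, $limit);
    }
    return microtime(true) - $start;
}

// Stand-in limiter so the sketch runs on its own;
// swap in any of the functions from this thread.
function sample_limiter($str, $limit = 100)
{
    preg_match('/\s*(?:\S*\s*){' . (int) $limit . '}/', $str, $m);
    return rtrim($m[0]);
}

// 1000 space-separated words of test input.
$text = implode(' ', array_fill(0, 1000, 'word'));

printf("%.4f seconds\n", time_limiter('sample_limiter', $text));
```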

However, even though the tokenizer is (technically) broken, it scales better. So, at around 5000-word selections it starts to outperform this one I just built. I tried to come up with a way to fix the tokenizer version, but I couldn’t do it. I think this is the most versatile and fastest for the typical “blog post preview” setup.

Please note that all of these functions count punctuation that stands on its own (like an em-dash or a bullet) as a word.
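
A small sketch of what that means in practice, using the regex from above with a limit of 3 and an example sentence made up for this illustration. str_word_count() is shown only for comparison; it skips the “&” because it is not an alphabetic character.

```php
<?php
$str = "salt & pepper and more";

// Every whitespace-delimited run counts as a word for the regex,
// so the lone "&" uses up one of the three slots.
preg_match('/(?:\S*\s*){3}/', $str, $matches);
$by_regex = rtrim($matches[0]);

// For comparison: str_word_count() only collects alphabetic words.
$alpha_words = str_word_count($str, 1);

var_dump($by_regex);      // string(13) "salt & pepper"
var_dump($alpha_words);   // salt, pepper, and, more
```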

 Signature 

imap_pop_class - get email and attachments site_migrate - port sites to CI OOCalendar - OO Calendar

Posted: 06 May 2007 12:44 AM   [ # 1 ]
Research Assistant
Total Posts:  956
Joined  08-06-2006

EDIT: I made this text small since this is a dead end: the strtok() function turned out to be too unwieldy for this task. It is made for simpler parsing and is very quick at it.

Better versions are the preg_match() based function and the array_slice()/next()/key() based function, seen above and below respectively.

Bonus points for fixing the tokenizer too… I assume tabs will be OK if they are converted to spaces.

Note: this fix is really just to make sure that the function spits back the right number of words.

function word_limiter_proper_space_handling($str, $limit = 100, $end_char = '…')
{

  $word_count = 0;
  $text = '';
  $token = strtok($str, " \t");

  while ($token !== FALSE)
  {
      $word_count++;

      $text .= ' '.$token;
      $token = strtok(" \t");

      // check text for CR or LF
      // if there are CR or LF chars
      // we just increment the counter
      if (strpos($token,"\n")!==FALSE || strpos($token,"\r")!==FALSE)
      {
        $word_count++;
      }

      if ($word_count >= $limit)
      {
        if ($token !== FALSE)
        {
          $text .= $end_char;
        }
        break;
      }
  }

  return ltrim($text);
}

Posted: 06 May 2007 02:23 AM   [ # 2 ]
Lab Assistant
Total Posts:  248
Joined  02-10-2007

There are issues with your function, sophistry.

$str = "One two\nthree\nfour\nfive six seven eight";
$str = word_limiter_proper_space_handling($str, 3);
echo $str;

Output:

One two
three
four
five…

In the meantime, I started working on yet another approach to the word_limiter() function…

 Signature 

Kohana rocks!

Posted: 06 May 2007 03:41 AM   [ # 3 ]
Lab Assistant
Total Posts:  248
Joined  02-10-2007

Okay, here is what I came up with. It is slower than the strtok() variants. However, until now this seems to be the only function that works fully as expected, and it preserves the original whitespace! First functionality, then speed, right? After all, it is still twice as fast as the original CI word_limiter() function: 1000 iterations take about 2.1608 seconds.

function word_limiter_geert($str, $limit = 100, $end_char = '&#8230;') {

    // The $words array contains all that we'll need to make our word limiter work:
    // - array keys   = numeric position of the word inside the string
    // - array values = the actual word itself
    $words = str_word_count($str, 2);

    // We're done if the original string already contains less words than the limit.
    if (count($words) <= $limit)
        return $str;

    // Let's rebuild our words array so that it consists of this:
    // - array keys   = the word count (starts from 0)
    // - array values = numeric position of the word inside the string
    $words = array_keys($words);

    // Now just take a part out of the original string, easy-peasy-lemon-squeezy.
    // All original whitespace is preserved.
    $str = substr($str, 0, $words[$limit]);

    // Finally, chop off all trailing whitespace...
    $str = rtrim($str);

    // ...and add the end character.
    $str .= $end_char;

    // Done.
    return $str;
}

Posted: 06 May 2007 03:50 AM   [ # 4 ]
Lab Assistant
Total Posts:  248
Joined  02-10-2007

Update already. Managed to increase the speed of my function by using array_slice() (see inline comments for more info).

1000 iterations take about 1.3173 seconds

function word_limiter_geert($str, $limit = 100, $end_char = '&#8230;') {

    // The $words array contains all that we'll need to make our word limiter work:
    // - array keys   = numeric position of the word inside the string
    // - array values = the actual word itself
    $words = str_word_count($str, 2);

    // We're done if the original string already contains less words than the limit.
    if (count($words) <= $limit)
        return $str;

    // Chop off all words we don't need anymore.
    // This line is only needed for speed boost.
    $words = array_slice($words, 0, $limit + 1);

    // Let's rebuild our words array so that it consists of this:
    // - array keys   = the word count (starts from 0)
    // - array values = numeric position of the word inside the string
    $words = array_keys($words);

    // Now just take a part out of the original string, easy-peasy-lemon-squeezy.
    // All original whitespace is preserved.
    $str = substr($str, 0, $words[$limit]);

    // Finally, chop off all trailing whitespace...
    $str = rtrim($str);

    // ...and add the end character.
    $str .= $end_char;

    // Done.
    return $str;
}

Posted: 06 May 2007 08:36 PM   [ # 5 ]
Research Assistant
Total Posts:  956
Joined  08-06-2006

Geert. Nice job. I was thinking along these lines when I went back to the preg style function. Did you check that one out too? I think it is quite accurate - more so than the fixes I tried to make to the tokenizer we’ve been banging on.

I like this new function. Very clean. I’m going to put it under some stress testing and see what pops.

Posted: 06 May 2007 10:44 PM   [ # 6 ]
Research Assistant
Total Posts:  956
Joined  08-06-2006

Ok, I’ve looked the new code over and tested a bit… it’s good because it doesn’t count space delimited punctuation marks as a word - the super-duper fast preg_match() function I wrote last night does, but it is so fast I don’t really care if the “word” count is off by a few! That preg_match() based function is very accurate over lots of weird text and preserves the CR and LFs and TABs and all of that. I think it is worth finding out if it can be improved to not count standalone punctuation. Alternatively, we could decide that it is a trivial problem and the speed it offers is worth that little issue - it’s 10x faster than this one below!

A problem with the use of array_slice() to optimize is…

PHP4 has no preserve_keys parameter for array_slice() so array_slice() returns a re-indexed array thus destroying the word position data - darn it. That would have been nice.
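
To illustrate the re-indexing, here is a minimal sketch on a made-up four-word string. It needs PHP 5.0.2 or later for the fourth argument; on PHP 4 only the first call is available.

```php
<?php
// Keys are the byte offsets of each word, values are the words themselves.
$words = str_word_count("one two\nthree four", 2);
// keys: 0, 4, 8, 14

// Without preserve_keys the integer keys are renumbered from 0,
// so the offsets are lost.
$reindexed = array_slice($words, 0, 3);

// With preserve_keys = true the offsets survive the slice.
$preserved = array_slice($words, 0, 3, true);

print_r(array_keys($reindexed));   // 0, 1, 2
print_r(array_keys($preserved));   // 0, 4, 8
```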

But, not to worry… I built another kind of slicer using next(), each() and key() which are pretty fast. Please review. It is largely the same as yours but with a buzz-saw-like use of next() to get to the key() we need.

function word_limiter_count_and_slice($str, $limit = 100, $end_char = '&#8230;') {

    // The $word_positions_and_words array contains all that we'll need to make our word limiter work:
    // - array keys   = numeric character position of the word inside the string
    // - array values = the actual word itself
    $word_positions_and_words = str_word_count($str, 2);

    // the original string might contain fewer words than the limit.
    if (count($word_positions_and_words) <= $limit)
    {
        return $str;
    }

    // buzz through the array to get to where we want to be
    $i = 0;
    while ($i++ < $limit)
    {
        next($word_positions_and_words);
    }

    // Now just take a part out of the original
    // string, easy-peasy-lemon-squeezy
    // All original whitespace is preserved.
    // but, punctuation on its own (e.g., space dash space) is NOT counted as a word.
    $str = substr($str, 0, key($word_positions_and_words));

    // ...and add the end character.
    return rtrim($str) . $end_char;
}

Posted: 06 May 2007 11:18 PM   [ # 7 ]
Research Assistant
Total Posts:  956
Joined  08-06-2006

While we are on the subject… Here is an old thread about a problem with word_limiter() (that still exists in the versions we are writing). It describes the possibility of sending in HTML tagged text and the word limiter limiting tags away. So, the text ends up with a non-closed tag because of the function.

You might say: “well, don’t send the function HTML formatted strings”, but maybe this is an issue that should be addressed? Not pushing it since I don’t have a need for it, but it is a limitation that should be spelled out in the documentation and/or code comments.


Something like: “The use of this code on HTML is unpredictable - don’t use word_limiter() on HTML formatted text; use something else.”
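
A quick sketch of the problem, and of the strip-the-tags-first workaround. limit_words_sketch() is a hypothetical stand-in written for this example (it follows the regex approach from this thread and uses three plain dots as the end character); it is not the CI function itself.

```php
<?php
// Hypothetical stand-in for word_limiter(), based on the regex approach.
function limit_words_sketch($str, $limit, $end_char = '...')
{
    preg_match('/\s*(?:\S*\s*){' . (int) $limit . '}/', $str, $m);
    $cut = rtrim($m[0]);

    // Only append the end character if something was actually cut off.
    return strlen($m[0]) == strlen($str) ? $cut : $cut . $end_char;
}

$html = 'Read <strong>this important notice</strong> before continuing.';

// Limiting the raw HTML cuts inside the <strong> element and leaves it unclosed.
$broken = limit_words_sketch($html, 3);

// Stripping tags first avoids the dangling tag, at the cost of losing the markup.
$plain = limit_words_sketch(strip_tags($html), 3);

var_dump($broken);   // "Read <strong>this important..."
var_dump($plain);    // "Read this important..."
```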

Posted: 07 May 2007 02:29 AM   [ # 8 ]
Lab Assistant
Total Posts:  144
Joined  09-08-2006

sophistry, that’s a good point about the tags. I believe we need to take that into account or create another method that takes care of it.

It would also have to allow user-defined tags so it’s not limited.

Posted: 07 May 2007 07:09 AM   [ # 9 ]
Lab Assistant
Total Posts:  248
Joined  02-10-2007
sophistry - 06 May 2007 10:44 PM

A problem with the use of array_slice() to optimize is…

PHP4 has no preserve_keys parameter for array_slice() so array_slice() returns a re-indexed array thus destroying the word position data - darn it. That would have been nice.

Good point. I must have overlooked that one. This is quite simple to fix though. Just no speed boost for older PHP versions.

function word_limiter($str, $limit = 100, $end_char = '&#8230;') {

    // The $words array contains all that we'll need to make our word limiter work:
    // - array keys   = numeric position of the word inside the string
    // - array values = the actual word itself
    $words = str_word_count($str, 2);

    // We're done if the original string already contains less words than the limit.
    if (count($words) <= $limit)
        return $str;

    // Chop off all words we don't need anymore.
    // This line is only needed for speed boost.
    // PHP 5.0.2+ only because we need to preserve keys (last parameter set to TRUE).
    // version_compare() is used because comparing version strings with >= is unreliable.
    if (version_compare(phpversion(), '5.0.2', '>='))
        $words = array_slice($words, 0, $limit + 1, TRUE);

    // Let's rebuild our words array so that it consists of this:
    // - array keys   = the word count (starts from 0)
    // - array values = numeric position of the word inside the string
    $words = array_keys($words);

    // Now just take a part out of the original string, easy-peasy-lemon-squeezy.
    // All original whitespace is preserved.
    $str = substr($str, 0, $words[$limit]);

    // Finally, chop off all trailing whitespace...
    $str = rtrim($str);

    // ...and add the end character.
    $str .= $end_char;

    // Done.
    return $str;
}

—edit—

Concerning the html tags, I’d say forget about that. It really would complicate the function a lot. When you use this function you have to know that it just chops off your text at word number x. If you’re concerned about html, you should strip_tags() your string first.

Suppose we did support html, where would it end? Shouldn’t we automatically close any BBcode tags either? What about quotes etc.? In my opinion all this stuff falls outside the scope of the word_limiter() function.

I’m happy with this version of word_limiter(): it preserves the original whitespace and it is faster than the original. No need for more ‘features’. Just my 2 cents.

Posted: 07 May 2007 09:54 AM   [ # 10 ]
Research Assistant
Total Posts:  956
Joined  08-06-2006

Hi Geert,

I do not have a PHP5 install to test your function against the one I wrote last night. I made a similar optimization as the array_slice() with preserve keys for PHP4 using the next() and key() array functions. Let’s not just abandon PHP4 users (like me!).

As long as you think we should target optimizations at PHP versions (and I agree we should) I would like to integrate the two like this:

function word_limiter($str, $limit = 100, $end_char = '&#8230;') {

    // The $words array contains all that we'll need to make our word limiter work:
    // - array keys   = numeric position of the word inside the string
    // - array values = the actual word itself
    $words = str_word_count($str, 2);

    // We're done if the original string already contains less words than the limit.
    if (count($words) <= $limit)
        return $str;

    // Chop off all words we don't need anymore.
    // This is only needed for speed boost.
    // PHP 5.0.2+ uses array_slice() with preserved keys (last parameter set to TRUE).
    // PHP 4 uses next() and key() to get quickly to the array item
    if (version_compare(phpversion(), '5.0.2', '>='))
    {
        $words = array_slice($words, 0, $limit + 1, TRUE);

        // Let's rebuild our words array so that it consists of this:
        // - array keys   = the word count (starts from 0)
        // - array values = numeric position of the word inside the string
        $words = array_keys($words);

        // get the value
        $position_of_word_at_limit = $words[$limit];
    }
    else
    {
        // buzz through the array to get to where we want to be
        $i = 0;
        while ($i++ < $limit)
        {
            next($words);
        }

        // get the key
        $position_of_word_at_limit = key($words);
    }

    // Now just take a part out of the original string, easy-peasy-lemon-squeezy.
    // All original whitespace is preserved.
    $str = substr($str, 0, $position_of_word_at_limit);

    // Finally, chop off all trailing whitespace...
    $str = rtrim($str);

    // ...and add the end character.
    $str .= $end_char;

    // Done.
    return $str;
}

However, I’m still wondering if you have any feedback about the super-fast preg_match version (the one with the small quirk that it counts standalone punctuation as a word)? That function is 10x faster than this one and needs no version checking.

EDIT: Yes, let’s not bloat this function with TAG closing support. It should be a completely separate function if it is needed. I think HTMLtidy may have some tag closing capability. At least it could be used to flag errors.

This code is really nice! It’s been a pleasure working it out with you.

Posted: 07 May 2007 12:28 PM   [ # 11 ]
Lab Assistant
Total Posts:  248
Joined  02-10-2007

Your array buzzer version is even faster than my array_slice() stuff. Actually it is quite logical too if you think about it. By using next() you only move the internal pointer of the array. Of course that has to be faster than doing an array_slice() followed by array_keys(). So, no need to check for the PHP version anymore; just go 100% for your ‘buzzer’ (I love that name) version:

function word_limiter_array_buzzer($str, $limit = 100, $end_char = '&#8230;') {

    // The $words array contains all that we'll need to make our word limiter work:
    // - array keys   = numeric position of the word inside the string
    // - array values = the actual word itself
    $words = str_word_count($str, 2);

    // We're done if the original string already contains less words than the limit.
    if (count($words) <= $limit)
        return $str;

    // Buzz through the array until we arrive at the word where we want to cut the string off.
    $i = 0;
    while ($i++ < $limit)
        next($words);

    // Now just take a part out of the original string, easy-peasy-lemon-squeezy.
    // All original whitespace is preserved.
    $str = substr($str, 0, key($words));

    // Finally, chop off all trailing whitespace...
    $str = rtrim($str);

    // ...and add the end character.
    $str .= $end_char;

    // Done.
    return $str;
}

—edit—

Time differences for 1000 iterations (on a 1000 word string to 100 words) are about 1.3024 seconds versus 1.2733 seconds.

Posted: 07 May 2007 01:07 PM   [ # 12 ]
Lab Assistant
Total Posts:  248
Joined  02-10-2007
sophistry - 07 May 2007 09:54 AM

However, I’m still wondering if you have any feedback about the super-fast preg_match version (the one with the small quirk that it counts standalone punctuation as a word)? That function is 10x faster than this one and needs no version checking.

That regular expression is a stroke of genius, sophistry! I only made a slight change in order to allow for leading whitespace.

Counting standalone punctuation as a word? That is not a quirk. If you use proper spelling you shouldn’t have standalone commas or dots, right? Okay, it could chop the string off at an “&” for example, but I wouldn’t bother about that. This is what happens with the original word_limiter() function as well.

On the other hand, your regex does take numbers into account; the others don’t! That’s a plus. Example: if you would like to chop “I want 1000 dollars now” off after 3 words, your regex would return “I want 1000” while the others would return “I want 1000 dollars” (that is actually four words, and unexpected behavior in my opinion).

You could fix this behavior by simply adding the characters that you would also like to be considered as word characters to the third parameter of the str_word_count() function, but that is a PHP5 only feature and still way slower than the regex version.
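
For reference, this is roughly what the third-parameter fix would look like. It is just a sketch of the idea, assuming PHP 5.1 or later (where the character list parameter is available), and the variable names are made up for this example.

```php
<?php
$str = "I want 1000 dollars now";

// Default behavior: digits are not word characters, so "1000" is skipped.
$default = str_word_count($str, 1);

// Third parameter: extra characters to treat as part of a word.
$with_digits = str_word_count($str, 1, '0123456789');

print_r($default);       // I, want, dollars, now
print_r($with_digits);   // I, want, 1000, dollars, now
```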

function word_limiter_preg_matcher($str, $limit = 100, $end_char = '&#8230;') {

    // I deleted the str_word_count() check that would filter out too short strings. Why?
    // Because str_word_count() counts different than our regex below. Think about numbers for example.

    // Don't bother about empty strings.
    // Get rid of them here because the regex below would match them too.
    if (trim($str) == '')
        return $str;

    // Added the initial \s* in order to make the regex work in case $str starts with whitespace.
    // Without it a string like " test" would be counted for two words instead of one.
    preg_match('/\s*(?:\S*\s*){'. (int) $limit .'}/', $str, $matches);

    // Chop off trailing whitespace and add the end character.
    return rtrim($matches[0]) . $end_char;
}

1000 iterations take about 0.0609 seconds! That is 20 times faster than our word_limiter_array_buzzer() version. Even 70 times faster than the original word_limiter() function. A new record!

One small issue though, currently the end character always gets appended. How to fix that?

Posted: 07 May 2007 01:42 PM   [ # 13 ]
Research Assistant
Total Posts:  735
Joined  10-18-2006

Doesn’t look hard, at this point…

return rtrim($matches[0]) . (($matches[0] != $str) ? $end_char : '');
 Signature 

Once in a while I remember I use Twitter

Posted: 07 May 2007 02:04 PM   [ # 14 ]
Lab Assistant
Total Posts:  248
Joined  02-10-2007
Seppo - 07 May 2007 01:42 PM

Doesn’t look hard, at this point…

return rtrim($matches[0]) . (($matches[0] != $str) ? $end_char : '');

Oh man, so obvious. Sometimes I’m just thinking too hard. You know, I was working on a preg_match_all() variant with the PREG_OFFSET_CAPTURE flag that captures word positions. It worked, but all the speed was gone.
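
That variant is not shown in this thread, so the following is only a guess at what such an approach might look like: collect every word together with its byte offset, then cut the original string where the first unwanted word starts. The function name and the plain three-dot end character are made up for the sketch.

```php
<?php
// Hypothetical reconstruction, not the code that was actually tested.
function limit_words_by_offset($str, $limit = 100, $end_char = '...')
{
    // With PREG_OFFSET_CAPTURE each match is array(word, byte offset).
    preg_match_all('/\S+/', $str, $matches, PREG_OFFSET_CAPTURE);
    $words = $matches[0];

    // Nothing to cut if the string has no more words than the limit.
    if (count($words) <= $limit) {
        return $str;
    }

    // Cut right before word number $limit + 1 and drop trailing whitespace.
    return rtrim(substr($str, 0, $words[$limit][1])) . $end_char;
}

var_dump(limit_words_by_offset("One two\nthree four", 3));
```

Because it has to collect every word in the string before cutting, it is easy to see where the speed would go on long texts.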

Your fix is just what we need, Seppo. I optimized it a tiny bit by using the strlen() function. Comparing the string lengths is a bit faster than comparing the strings themselves.

function word_limiter($str, $limit = 100, $end_char = '&#8230;') {

    // Don't bother about empty strings.
    // Get rid of them here because the regex below would match them too.
    if (trim($str) == '')
        return $str;

    // Added the initial \s* in order to make the regex work in case $str starts with whitespace.
    // Without it a string like " test" would be counted for two words instead of one.
    preg_match('/\s*(?:\S*\s*){'. (int) $limit .'}/', $str, $matches);

    // Only add end character if the string got chopped off.
    if (strlen($matches[0]) == strlen($str))
        $end_char = '';

    // Chop off trailing whitespace and add the end character.
    return rtrim($matches[0]) . $end_char;
}

Will this be the final version? (I hope so, lol!)

Posted: 07 May 2007 02:24 PM   [ # 15 ]
Research Assistant
Total Posts:  956
Joined  08-06-2006

Wow, guys, thanks for the tenacious work here. Those improvements are really great! This has got to be the fastest PHP word_limiter in the world.

I don’t think we are done yet though.

Now it’s time to send it tons of crap (aka should this support unicode or just ascii?) and see what burps up!

I’m not sure if I’m kidding about the unicode. What is CI’s policy on unicode support?
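
As a starting point for that kind of testing, here are a few ASCII-only edge cases run against a copy of the latest version. The copy is renamed word_limiter_sketch() (a name made up here so it cannot clash with the real helper) and uses three plain dots as the end character so the results are easy to compare. Unicode input is deliberately left out of this sketch.

```php
<?php
// Copy of the regex-based version from this thread under a hypothetical name.
function word_limiter_sketch($str, $limit = 100, $end_char = '...')
{
    if (trim($str) == '')
        return $str;

    preg_match('/\s*(?:\S*\s*){' . (int) $limit . '}/', $str, $matches);

    // Only add the end character if the string got chopped off.
    if (strlen($matches[0]) == strlen($str))
        $end_char = '';

    return rtrim($matches[0]) . $end_char;
}

// Empty input comes back untouched.
var_dump(word_limiter_sketch('', 3) === '');

// Leading whitespace is kept and does not count as a word.
var_dump(word_limiter_sketch('  leading space', 1) === '  leading...');

// Fewer words than the limit: no end character is added.
var_dump(word_limiter_sketch('one two', 5) === 'one two');

// Newlines and tabs inside the kept part are preserved.
var_dump(word_limiter_sketch("one\ntwo\tthree four", 3) === "one\ntwo\tthree...");
```

All four checks should print true. Note that trailing whitespace is always trimmed, even when nothing else is cut off.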
