PHP program to get an array of all words from text file
By IncludeHelp Last updated : January 27, 2024
Problem statement
Given a text file, write a PHP code to get an array of all words from the given file.
Getting an array of all words from a text file
To do this, read the file's content with the file_get_contents() function, strip the punctuation with preg_replace(), and then tokenize the content into words with the str_word_count() function. When its second argument is 1, str_word_count() returns an array of all the words found in the string.
PHP program to get an array of all words from a text file
The following program reads the given text file and prints an array of all its words.
<?php
function find_all_word($filename)
{
    // Read the file and store its content in a variable
    $file_data = file_get_contents($filename);

    // Remove punctuation: keep only letters, digits, and whitespace
    $file_data = preg_replace("/[^\p{L}\p{N}\s]/u", "", $file_data);

    // Tokenize the content into words; format 1 makes
    // str_word_count() return the words as an array
    $words_arr = str_word_count($file_data, 1);

    // Return the result
    return $words_arr;
}

// Main code
// Name of the input file
$filename = "file.txt";

// Call the function to get an array of
// all words from the text file
$all_words = find_all_word($filename);

// Print the array
print_r($all_words);
?>
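The program above assumes that file.txt exists and is readable. file_get_contents() returns false when it cannot read the file, so a more defensive variant might check for that first. The following is only a sketch: the function name read_words_or_empty() is hypothetical, and returning an empty array on failure is one possible design choice, not the original author's.

```php
<?php
// Hypothetical wrapper (not part of the original program):
// returns an empty array when the file cannot be read.
function read_words_or_empty(string $filename): array
{
    // Bail out early if the file is missing or unreadable
    if (!is_readable($filename)) {
        return [];
    }

    $file_data = file_get_contents($filename);
    if ($file_data === false) {
        return [];
    }

    // preg_replace() returns null on error (e.g. invalid UTF-8 with /u),
    // so fall back to an empty string in that case
    $file_data = preg_replace('/[^\p{L}\p{N}\s]/u', '', $file_data) ?? '';

    return str_word_count($file_data, 1);
}

print_r(read_words_or_empty("file.txt"));
```

Depending on the application, throwing an exception instead of returning an empty array may make failures easier to notice.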
Output
Contents of file.txt (the input file used by the above program):
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting,
remaining essentially unchanged. It was popularised in the 1960s with the release
of Letraset sheets containing Lorem Ipsum passages,
and more recently with desktop publishing software like
Aldus PageMaker including versions of Lorem Ipsum.
The output of the above program is:
Array
(
[0] => Lorem
[1] => Ipsum
[2] => is
[3] => simply
[4] => dummy
[5] => text
[6] => of
[7] => the
[8] => printing
[9] => and
[10] => typesetting
[11] => industry
[12] => Lorem
[13] => Ipsum
[14] => has
[15] => been
[16] => the
[17] => industrys
[18] => standard
[19] => dummy
[20] => text
[21] => ever
[22] => since
[23] => the
[24] => s
[25] => when
[26] => an
[27] => unknown
[28] => printer
[29] => took
[30] => a
[31] => galley
[32] => of
[33] => type
[34] => and
[35] => scrambled
[36] => it
[37] => to
[38] => make
[39] => a
[40] => type
[41] => specimen
[42] => book
[43] => It
[44] => has
[45] => survived
[46] => not
[47] => only
[48] => five
[49] => centuries
[50] => but
[51] => also
[52] => the
[53] => leap
[54] => into
[55] => electronic
[56] => typesetting
[57] => remaining
[58] => essentially
[59] => unchanged
[60] => It
[61] => was
[62] => popularised
[63] => in
[64] => the
[65] => s
[66] => with
[67] => the
[68] => release
[69] => of
[70] => Letraset
[71] => sheets
[72] => containing
[73] => Lorem
[74] => Ipsum
[75] => passages
[76] => and
[77] => more
[78] => recently
[79] => with
[80] => desktop
[81] => publishing
[82] => software
[83] => like
[84] => Aldus
[85] => PageMaker
[86] => including
[87] => versions
[88] => of
[89] => Lorem
[90] => Ipsum
)
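Note that "1500s" and "1960s" appear in the output as just "s". The regex keeps digits, but str_word_count() treats only letters (plus embedded apostrophes and hyphens) as word characters, so the digits are dropped during tokenization. If digits should be kept, one option is to split the cleaned text on whitespace instead. The sketch below assumes that behavior is wanted; the function name split_words_keep_digits() is hypothetical.

```php
<?php
// Hypothetical alternative tokenizer (not part of the original program):
// keeps tokens such as "1500s" intact by splitting on whitespace.
function split_words_keep_digits(string $text): array
{
    // Same punctuation-stripping regex as the original program
    $clean = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text) ?? '';

    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty tokens
    $words = preg_split('/\s+/u', $clean, -1, PREG_SPLIT_NO_EMPTY);

    return $words ?: [];
}

print_r(split_words_keep_digits("ever since the 1500s, when"));
// Words produced: ever, since, the, 1500s, when
```

Because this version does not rely on str_word_count(), it also keeps words with non-ASCII letters in one piece, which matches the intent of the \p{L} class in the regex.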
Getting an array of unique words
If you want only the unique words in the file, pass the array through the array_unique() function before returning it from the function.
Replace the return statement with the following:
return array_unique($words_arr);
Output:
Array
(
[0] => Lorem
[1] => Ipsum
[2] => is
[3] => simply
[4] => dummy
[5] => text
[6] => of
[7] => the
[8] => printing
[9] => and
[10] => typesetting
[11] => industry
[14] => has
[15] => been
[17] => industrys
[18] => standard
[21] => ever
[22] => since
[24] => s
[25] => when
[26] => an
[27] => unknown
[28] => printer
[29] => took
[30] => a
[31] => galley
[33] => type
[35] => scrambled
[36] => it
[37] => to
[38] => make
[41] => specimen
[42] => book
[43] => It
[45] => survived
[46] => not
[47] => only
[48] => five
[49] => centuries
[50] => but
[51] => also
[53] => leap
[54] => into
[55] => electronic
[57] => remaining
[58] => essentially
[59] => unchanged
[61] => was
[62] => popularised
[63] => in
[66] => with
[68] => release
[70] => Letraset
[71] => sheets
[72] => containing
[75] => passages
[77] => more
[78] => recently
[80] => desktop
[81] => publishing
[82] => software
[83] => like
[84] => Aldus
[85] => PageMaker
[86] => including
[87] => versions
)
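The gaps in the keys above (for example, 11 is followed by 14) occur because array_unique() preserves the key of the first occurrence of each value. If a contiguous, zero-based index is preferred, wrapping the result in array_values() is one way to get it. A small illustration with a made-up word list:

```php
<?php
// Sample data for illustration only
$words = ['Lorem', 'Ipsum', 'is', 'Lorem', 'Ipsum', 'has'];

// array_unique() keeps the first occurrence and its original key
$unique = array_unique($words);       // keys: 0, 1, 2, 5

// array_values() re-indexes the array from zero
$reindexed = array_values($unique);   // keys: 0, 1, 2, 3

print_r($reindexed);
```

In the program, this would correspond to returning array_values(array_unique($words_arr)).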
Ignoring case
If you want to ignore case, convert the file's content to lowercase (or uppercase) before tokenizing it into a word array.
Replace the punctuation-removal statement with the following:
$file_data = strtolower(preg_replace("/[^\p{L}\p{N}\s]/u", "", $file_data));
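Lowercasing is most useful together with array_unique(), since "It" and "it" then count as the same word. The following sketch combines the two steps on a short sample string (the sample text and variable names are for illustration only). It uses strtolower(), which is reliable for ASCII text; for UTF-8 content with non-ASCII letters, mb_strtolower() from the mbstring extension is a common alternative, assuming that extension is available.

```php
<?php
// Sample text for illustration only
$text = "It has survived. It was popularised.";

// Strip punctuation, then lowercase the remaining content
$clean = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text) ?? '';
$lower = strtolower($clean);

// Tokenize, drop duplicates, and re-index from zero
$words = array_values(array_unique(str_word_count($lower, 1)));

print_r($words);
// Words produced: it, has, survived, was, popularised
```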