PHP program to get an array of all words from a text file

By IncludeHelp Last updated : January 27, 2024

Problem statement

Given a text file, write PHP code to get an array of all the words in the file.

Getting an array of all words from a text file

To do this, read the file's content with the file_get_contents() function, strip the punctuation with preg_replace(), and then split the content into words with the str_word_count() function. Passing 1 as the second argument makes str_word_count() return an array of all the words found in the file.
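As a minimal sketch of the tokenizing step on its own, the snippet below calls str_word_count() on a short string instead of a file. The sample text and variable names are assumptions chosen for illustration.

```php
<?php

// Sample text used only for illustration (not taken from file.txt)
$text = "Hello World, hello PHP";

// Without a second argument, str_word_count() returns the number of words
$count = str_word_count($text);

// With 1 as the second argument, it returns an indexed array of the words
$words = str_word_count($text, 1);

echo $count, "\n";
print_r($words);
```

Here $count is 4 and $words holds Hello, World, hello, and PHP. The comma is not a word character, so it does not appear in the result.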

PHP program to get an array of all words from a text file

The following program reads the given text file and prints an array of all its words.

<?php

function find_all_word($filename)
{
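    // Read the file and store the content in a variable
    $file_data = file_get_contents($filename);

    // Stop early if the file could not be read
    if ($file_data === false) {
        return [];
    }

    // Remove punctuation: keep only letters, digits, and whitespace
    $file_data = preg_replace("/[^\p{L}\p{N}\s]/u", "", $file_data);

    // Split the content into words and store them in an array.
    // str_word_count() does not treat digits as word characters,
    // so "1500s" yields only "s".
    $words_arr = str_word_count($file_data, 1);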

    // Return the result
    return $words_arr;
}

// Main code
// Name of the file to read
$filename = "file.txt";

// Call the function to get an array of
// all words from the text file
$all_words = find_all_word($filename);

// Print the array
print_r($all_words);
?>

Output

Content of file.txt (the file read by the above program):

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, 
remaining essentially unchanged. It was popularised in the 1960s with the release
of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like 
Aldus PageMaker including versions of Lorem Ipsum.

The output of the above program is:

Array
(
    [0] => Lorem
    [1] => Ipsum
    [2] => is
    [3] => simply
    [4] => dummy
    [5] => text
    [6] => of
    [7] => the
    [8] => printing
    [9] => and
    [10] => typesetting
    [11] => industry
    [12] => Lorem
    [13] => Ipsum
    [14] => has
    [15] => been
    [16] => the
    [17] => industrys
    [18] => standard
    [19] => dummy
    [20] => text
    [21] => ever
    [22] => since
    [23] => the
    [24] => s
    [25] => when
    [26] => an
    [27] => unknown
    [28] => printer
    [29] => took
    [30] => a
    [31] => galley
    [32] => of
    [33] => type
    [34] => and
    [35] => scrambled
    [36] => it
    [37] => to
    [38] => make
    [39] => a
    [40] => type
    [41] => specimen
    [42] => book
    [43] => It
    [44] => has
    [45] => survived
    [46] => not
    [47] => only
    [48] => five
    [49] => centuries
    [50] => but
    [51] => also
    [52] => the
    [53] => leap
    [54] => into
    [55] => electronic
    [56] => typesetting
    [57] => remaining
    [58] => essentially
    [59] => unchanged
    [60] => It
    [61] => was
    [62] => popularised
    [63] => in
    [64] => the
    [65] => s
    [66] => with
    [67] => the
    [68] => release
    [69] => of
    [70] => Letraset
    [71] => sheets
    [72] => containing
    [73] => Lorem
    [74] => Ipsum
    [75] => passages
    [76] => and
    [77] => more
    [78] => recently
    [79] => with
    [80] => desktop
    [81] => publishing
    [82] => software
    [83] => like
    [84] => Aldus
    [85] => PageMaker
    [86] => including
    [87] => versions
    [88] => of
    [89] => Lorem
    [90] => Ipsum
)
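In the output above, "1500s" and "1960s" are reduced to "s" because str_word_count() does not count digits as part of a word. If the digits should be kept, one possible alternative is to collect runs of letters and digits with preg_match_all(). The sketch below is not part of the original program; the function name find_all_tokens() and the sample string are hypothetical.

```php
<?php

// Hypothetical helper (not part of the original program):
// returns every run of letters and digits as one word.
function find_all_tokens(string $text): array
{
    // \p{L} matches any Unicode letter and \p{N} any Unicode number;
    // the u modifier makes the pattern and the subject be treated as UTF-8.
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);

    // $matches[0] holds the full matches; fall back to an empty array
    // if matching failed (for example, on invalid UTF-8 input).
    return $matches[0] ?? [];
}

// Sample text used only for illustration
$sample = "standard dummy text ever since the 1500s, when";

print_r(find_all_tokens($sample));
```

With this pattern, "1500s" stays as a single word. Note that an apostrophe also ends a word here, so "industry's" would be split into "industry" and "s"; adding the apostrophe to the character class is one way to change that.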

Getting an array of unique words

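To get only the unique words in a file, pass the array to the array_unique() function when returning the result from the function. array_unique() keeps the first occurrence of each word along with its original key, so the keys in the result are not consecutive. The comparison is case-sensitive, so "It" and "it" are treated as different words.

Replace the return statement in the function with the following: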

return array_unique($words_arr);

Output:

Array
(
    [0] => Lorem
    [1] => Ipsum
    [2] => is
    [3] => simply
    [4] => dummy
    [5] => text
    [6] => of
    [7] => the
    [8] => printing
    [9] => and
    [10] => typesetting
    [11] => industry
    [14] => has
    [15] => been
    [17] => industrys
    [18] => standard
    [21] => ever
    [22] => since
    [24] => s
    [25] => when
    [26] => an
    [27] => unknown
    [28] => printer
    [29] => took
    [30] => a
    [31] => galley
    [33] => type
    [35] => scrambled
    [36] => it
    [37] => to
    [38] => make
    [41] => specimen
    [42] => book
    [43] => It
    [45] => survived
    [46] => not
    [47] => only
    [48] => five
    [49] => centuries
    [50] => but
    [51] => also
    [53] => leap
    [54] => into
    [55] => electronic
    [57] => remaining
    [58] => essentially
    [59] => unchanged
    [61] => was
    [62] => popularised
    [63] => in
    [66] => with
    [68] => release
    [70] => Letraset
    [71] => sheets
    [72] => containing
    [75] => passages
    [77] => more
    [78] => recently
    [80] => desktop
    [81] => publishing
    [82] => software
    [83] => like
    [84] => Aldus
    [85] => PageMaker
    [86] => including
    [87] => versions
)
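The gaps in the keys above (for example, 11 is followed by 14) come from array_unique() preserving the original keys. If consecutive keys are needed, one option is to wrap the result in array_values(). The following is a small sketch on a hand-written word list; the sample data is an assumption used only for illustration.

```php
<?php

// Sample word list used only for illustration
$words_arr = ["Lorem", "Ipsum", "is", "Lorem", "Ipsum", "has"];

// array_unique() keeps the first occurrence of each value
// together with its original key (here: 0, 1, 2, 5)
$unique = array_unique($words_arr);

// array_values() re-indexes the array from 0 (here: 0, 1, 2, 3)
$reindexed = array_values($unique);

print_r($unique);
print_r($reindexed);
```

Inside the function, the same idea would read return array_values(array_unique($words_arr));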

Ignoring case

To ignore case, convert the file's content to lowercase or uppercase before splitting it into words.

For example, replace the preg_replace() statement in the function with the following:

$file_data = strtolower(preg_replace("/[^\p{L}\p{N}\s]/u", "", $file_data));
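To see the effect of lowercasing on duplicates, the sketch below combines strtolower(), str_word_count(), and array_unique() on a short string. The sample text is an assumption chosen so that "It" and "it" both occur.

```php
<?php

// Sample text used only for illustration
$text = "It was popularised and it has survived";

// Lowercase first so that "It" and "it" compare as equal
$lower = strtolower($text);

// Split into words, drop duplicates, and re-index the keys from 0
$words  = str_word_count($lower, 1);
$unique = array_values(array_unique($words));

print_r($unique);
```

The result contains "it" only once. strtolower() is only reliable for ASCII letters; for text with accented or non-Latin letters, mb_strtolower() from the mbstring extension may be a better fit, assuming that extension is available.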
