Similar Threads
1. read file character by character using winapi
2. count number of digits, characters, whitespace characters and words in a string
Use a regular expression to count the number of digits, characters,
whitespace characters and words in a string.
So far I have this, and it doesn't work:
use strict;
use warnings;

my $string1 = "hello there my id is 2 104503";
my $string2 = "today is a nice day";

number1($string1);
number1($string2);
number2($string1);
number2($string2);

sub number1
{
    my $string = shift();
    if ($string =~ / \d/ ) {
        print "'$string' has a digit.\n";
    }
    else {
        print "'$string' has no digits.\n";
    }
}

sub number2
{
    my $string = shift();
    if ($string =~ /\w/ ) {
        print "'$string' has a digit.\n";
    }
    else {
        print "'$string' has no digits.\n";
    }
}

$in = <STDIN>;
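A few things stand out in the posted code: `/ \d/` contains a literal space, so it only matches a digit that follows a space; `number2` tests `\w` but still reports digits; `$in` is never declared, which `use strict` rejects at compile time; and both subs only test whether a match exists instead of counting matches. As a rough illustration of the counting idea (in Python rather than Perl), the sketch below counts every match of a pattern. It assumes that "characters" means non-whitespace characters and that a "word" is a run of non-whitespace; the function name `count_classes` is made up for this example.

```python
import re

def count_classes(text):
    """Count matches of four patterns in text (definitions are assumptions)."""
    return {
        "digits": len(re.findall(r"\d", text)),       # each single digit
        "characters": len(re.findall(r"\S", text)),   # each non-whitespace character
        "whitespace": len(re.findall(r"\s", text)),   # each whitespace character
        "words": len(re.findall(r"\S+", text)),       # each run of non-whitespace
    }

for s in ("hello there my id is 2 104503", "today is a nice day"):
    print(s, "->", count_classes(s))
```

The same pattern-per-category approach should carry over to Perl by counting global matches instead of testing a single match.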
3. Checking for invalid characters within a string - VB.Net
4. Check the first 2 characters of string
Hi Group,
How can I check that the first 2 characters of a string are the percent (%)
sign?
string name = strSurname
Regards
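A prefix test is probably all that is needed here. The sketch below is in Python and reads the question as "both of the first two characters are %"; the function name is hypothetical. In .NET, `String.StartsWith` should allow the same check on `strSurname`.

```python
def starts_with_two_percent(name):
    """Return True when the first two characters of name are both '%'."""
    return name.startswith("%%")

print(starts_with_two_percent("%%Smith"))  # True
print(starts_with_two_percent("%Smith"))   # False
```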
5. reading through a text file checking the first character on each line
6. check characters in string
Hi, I have a userform with a textbox where the user will enter an email
address.
What I would like is some code that checks that the text entered
is in a valid email address format (like website forms do).
What I was thinking of was code that:
checks that there are no spaces (or other characters like / etc.) in the
string,
checks that the string contains an @ character,
and checks that after the @ character there is a .
Is anyone able to point me in the right direction?
Thanks.
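The three checks listed above can be sketched as follows. This is Python, not VBA, and it is only a format sanity check under a few assumptions that are not in the post: exactly one @ is allowed, the part before it must not be empty, and the dot may not be the first or last character after the @. The name `looks_like_email` is made up.

```python
import re

def looks_like_email(text):
    """Rough format check following the three rules in the post."""
    # Rule 1: no spaces or slash characters anywhere in the string.
    if re.search(r"[\s/\\]", text):
        return False
    # Rule 2: the string contains an @ (assumption: exactly one, not at the start).
    local, sep, domain = text.partition("@")
    if not sep or not local or "@" in domain:
        return False
    # Rule 3: there is a dot after the @ (assumption: not first or last).
    return "." in domain and not domain.startswith(".") and not domain.endswith(".")

print(looks_like_email("user@example.com"))   # True
print(looks_like_email("user example.com"))   # False
```

This accepts many strings that are not deliverable addresses; it only mirrors what the post asks for.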
7. how to check for similar words in two character string variables
8. how to check for similar words in two character string
I won't comment on fuzzy matching/etc. because I'm not an expert there, and
a search of the L will come up with all sorts of results. I will comment on
the general practice of finding words from var1 in var2.
This will iterate through the words in VAR1 and look for each one in VAR2. It's entirely
literal, and certainly isn't up to the task you ask; not only do you need to
add fuzzy matching, but INDEXW is probably a bit too literal for your needs
regardless, and you almost certainly should exclude trivial strings [&, 'W',
etc.] from your search.
data var1;
  infile datalines truncover;
  input
    @1 obsnum 6.
    @7 var1 $50.
  ;
datalines;
1     CATARACT EXTRACTION WITH IOL-RIGHT
2     CATARACT EXTRACTION WITH IOL-LEFT
3     SPINE THORACO LUMBAR POSTERIOR FUSION SILO
4     SPINE THORACO LUMBAR POSTERIOR FUSION SILO
5     KNEE ARTHROPLASTY TOTAL UNILATERAL
6     PHACOEMULSIFICATION W IOL
7     LEG-LIGATION & STRIPPING VARICOSE VEINS -BILATERAL
8     LEG-LIGATION & STRIPPING VARICOSE VEINS -BILATERAL
9     EYE-EXTRACTION CATARACT IOL
10    EYE-EXTRACTION CATARACT IOL
;;;;
run;
data var2;
  infile datalines truncover;
  input
    @1 obsnum 6.
    @7 var2 $80.
  ;
datalines;
1     Excision total, lens extracapsular phakoemulsification technique w
2     Excision total, lens extracapsular phakoemulsification technique w
3     Installation of external appliance, circulatory system NEC extraco
4     Fusion, spinal vertebrae open posterior approach [posterolateral a
5     Implantation of internal device, knee joint with combined sources
6     Excision total, lens extracapsular phakoemulsification technique w
7     Excision partial, veins of leg NEC without use of tissue open appr
8     Destruction, skin of leg using device NEC [electrocautery]
9     Excision total, lens extracapsular phakoemulsification technique w
10    Excision total, lens extracapsular phakoemulsification technique w
;;;;;;;
run;
data allvars;
  merge var1 var2;
  by obsnum;
run;

data want;
  set allvars;
  found=0;
  format comp_found $100.;
  do _n_ = 1 to countc(trim(compbl(var1)),' -')+1;
    _var1wd= scan(var1,_n_,' -');
    if indexw(upcase(var2),upcase(_var1wd))>0 then do;
      found=found+1;
      comp_found = catx('|',comp_found,_var1wd);
    end;
  end;
run;
If you want to compare each word in var1 to each word in var2 iteratively [not
using indexw], you would do:
data want;
  set allvars;
  found=0;
  format comp_found $100.;
  do _n_ = 1 to countc(trim(compbl(var1)),' -')+1;
    _var1wd= scan(var1,_n_,' -');
    do _t = 1 to countc(trim(compbl(var2)),' -')+1;
      _var2wd= scan(var2,_t,' -');
      * put _var1wd= _var2wd=;
      if upcase(_var1wd) = upcase(_var2wd) then do;
        found=found+1;
        comp_found = catx('|',comp_found,_var1wd);
      end;
    end;
  end;
run;
You can then substitute whatever fuzzy matching you want for the
tenth line [the equality test]. SOUNDEX is generally not very good, from what I
recall of previous discussions, but whichever method floats your boat would
fit in here. I also have no idea what you want to do with these matches, so
I just count them and concatenate them together. Of course, drop the
temporary variables in the real run. Also, SCAN and COUNTC
should be given the appropriate word delimiters if things other than dash and
space are potential word delimiters. If words that include dashes need to
be checked whole [so, IOL-RIGHT], you may have to play with the data some to get
it to behave [since IOL-RIGHT will not be checked in its entirety, just as the
individual words].
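To make the "swap the equality for a fuzzy test" idea concrete, here is a sketch in Python rather than SAS. It is not the method from this thread: it swaps in the similarity ratio from Python's `difflib.SequenceMatcher`, splits on any non-alphanumeric character (so the comma in "total," is dropped), and skips words shorter than three characters as "trivial". The threshold of 0.85 and the function names are arbitrary assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

def split_words(text):
    """Upper-case text and split it on anything that is not a letter or digit."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", text.upper()) if w]

def fuzzy_matches(var1, var2, threshold=0.85):
    """Return the var1 words that resemble at least one var2 word."""
    words2 = split_words(var2)
    found = []
    for w1 in split_words(var1):
        if len(w1) < 3:  # assumption: treat very short strings (&, W) as trivial
            continue
        # This comparison takes the place of the exact equality test.
        if any(SequenceMatcher(None, w1, w2).ratio() >= threshold for w2 in words2):
            found.append(w1)
    return found

var1 = "PHACOEMULSIFICATION W IOL"
var2 = "Excision total, lens extracapsular phakoemulsification technique w"
print("|".join(fuzzy_matches(var1, var2)))
```

On observation 6 this picks up PHACOEMULSIFICATION against phakoemulsification, which an exact comparison misses. The threshold would need tuning on real data.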
Note that this is NOT very efficient; a hash table would be much more
efficient, but I doubt that I know hash well enough to come up with
the solution [though I may well try if I have a bit more time tonight].
Temporary arrays might also be faster, and not assigning the scanned
portions to variables might be faster, though I'm not really sure there and
I doubt it makes much of a difference unless you're doing this on a godawfully
huge dataset.
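For what the hash-table idea might look like, here is a small sketch in Python, where a `set` plays the role of the hash. It is still the literal comparison, with the same two delimiters (space and dash) as the SAS code, so "FUSION," with its comma does not match FUSION. The function names are made up for the example.

```python
import re

def split_words(text):
    """Upper-case text and split it on runs of spaces and dashes."""
    return [w for w in re.split(r"[ -]+", text.upper()) if w]

def common_words(var1, var2):
    """Return the var1 words that also occur as whole words in var2."""
    lookup = set(split_words(var2))  # built once; each membership test is a hash lookup
    return [w for w in split_words(var1) if w in lookup]

var1 = "SPINE THORACO LUMBAR POSTERIOR FUSION SILO"
var2 = "Fusion, spinal vertebrae open posterior approach [posterolateral a"
print("|".join(common_words(var1, var2)))
```

Each var1 word costs one lookup instead of a pass over every var2 word, which is where the saving would come from on long strings.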
-Joe
On Fri, Jul 17, 2009 at 6:01 PM, Cornel Lencar
< XXXX@XXXXX.COM > wrote:
> Hi,
>
> I would like to have the Text Miner application available, but I don't.
>
> I have a dataset with two character string variables. Each variable can
> have from one to many (30-40) individual, distinct words in it. I would
> like to check if any of the words in VARIABLE1 can be found in VARIABLE2.
> It would be nice to see whether more than one of the VARIABLE1 words
> is found in VARIABLE2.
>
> An example set:
>
> Obs VARIABLE1
>
> 1 CATARACT EXTRACTION WITH IOL-RIGHT
> 2 CATARACT EXTRACTION WITH IOL-LEFT
> 3 SPINE THORACO LUMBAR POSTERIOR FUSION SILO
> 4 SPINE THORACO LUMBAR POSTERIOR FUSION SILO
> 5 KNEE ARTHROPLASTY TOTAL UNILATERAL
> 6 PHACOEMULSIFICATION W IOL
> 7 LEG-LIGATION & STRIPPING VARICOSE VEINS -BILATERAL
> 8 LEG-LIGATION & STRIPPING VARICOSE VEINS -BILATERAL
> 9 EYE-EXTRACTION CATARACT IOL
> 10 EYE-EXTRACTION CATARACT IOL
>
> Obs VARIABLE2
>
> 1 Excision total, lens extracapsular phakoemulsification technique w
> 2 Excision total, lens extracapsular phakoemulsification technique w
> 3 Installation of external appliance, circulatory system NEC extraco
> 4 Fusion, spinal vertebrae open posterior approach [posterolateral a
> 5 Implantation of internal device, knee joint with combined sources
> 6 Excision total, lens extracapsular phakoemulsification technique w
> 7 Excision partial, veins of leg NEC without use of tissue open appr
> 8 Destruction, skin of leg using device NEC [electrocautery]
> 9 Excision total, lens extracapsular phakoemulsification technique w
> 10 Excision total, lens extracapsular phakoemulsification technique w
>
> Observations 4, 5, 6, 7, and 8 have some common words between VARIABLE1
> and VARIABLE2, although there are differences in case, in the fact
> that in VAR2 some words are composite, and also some words differ slightly:
> 6 PHACOEMULSIFICATION phakoemulsification
>
> I imagine that the first variable needs to be split into separate words
> and each word needs to be checked against each of the VARIABLE2 words,
> maybe with soundex?
>
> Any suggestions are more than welcome.
>
> Sincerely,
>
> Cornel Lencar
>